MBJean opened 3 years ago
If various `LookupError` exceptions occur when you run pytest, note that you may need to install resources/modules manually on your local machine, for example by running `import nltk` and then `nltk.download('stopwords')`.
When running `pytest`, Python will open up (presumably trying to create data visualizations), but no visualizations are created; the program stalls and must be forcibly terminated. The cause of this issue and its solution are unclear.
When running `pytest`, after all tests pass, there is an error that seems to arise from the packages the product requires. The error looks something like:

```
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "...\gender_analysis\venv\lib\site-packages\clint\packages\colorama\initialise.py", line 17, in reset_all
    AnsiToWin32(orig_stdout).reset_all()
  ...
ValueError: I/O operation on closed file
```
When working through the first code block of the "Proximity Analysis" section in the Quickstart Guide, running the third line (the one defining `analyzer`) produces the following warning:

```
WARNING: The following .txt files were not loaded because they are not your metadata csv:
[...] # a list of names of .txt files
You may want to check that your metadata matches your files to avoid incorrect results.
```
> When running pytest, after passing all tests, there is an error that seems to arise from the packages the product requires...
@joshfeli are you on Windows? I've encountered this error running `pytest` on Windows as well. I believe it has to do with the way the Windows terminal handles output redirection. If so, let's investigate together.
@ryaanahmed Yes, I am on Windows. The same warning displays for every equivalent code block that involves defining some `analyzer` variable. However, I was speaking with someone using a Mac who has encountered the same problem.
@joshfeli, I think @ryaanahmed was referencing the errors you encountered when running `pytest`. Sounds like a great place to do some pair debugging!
The warning you're seeing when initializing an instance of, say, `GenderProximityAnalyzer` ultimately derives from the `_load_documents_and_metadata` method on the `Corpus` class (an instance of which is created when initializing `GenderProximityAnalyzer`), and it is expected. When initializing the `Corpus` class, the user passes in a `file_path` argument pointing to a directory of `.txt` files and a `csv_path` argument pointing to a `.csv` file containing metadata about those files. If the `.csv` doesn't reference all of the `.txt` files in the directory you specify, you receive that warning. It doesn't prevent the successful initialization of `Corpus` or of any class that ultimately initializes a `Corpus`.
When creating a `Corpus` instance directly, you can pass the optional argument `ignore_warnings=True` to suppress those warnings. We currently do not provide a mechanism to do so when initializing `GenderProximityAnalyzer`, so it might be worth thinking about adding one.
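A minimal sketch of what forwarding `ignore_warnings` from the analyzer down to `Corpus` might look like. These are toy stand-ins, not the package's actual implementation; only the `file_path`, `csv_path`, and `ignore_warnings` argument names come from the discussion above.

```python
import warnings

class Corpus:
    """Toy stand-in for the package's Corpus class (illustrative only)."""
    def __init__(self, file_path, csv_path, ignore_warnings=False):
        self.file_path = file_path
        self.csv_path = csv_path
        if not ignore_warnings:
            warnings.warn("some .txt files are not in your metadata csv")

class GenderProximityAnalyzer:
    """Toy analyzer that forwards ignore_warnings down to its Corpus."""
    def __init__(self, file_path, csv_path, ignore_warnings=False):
        self.corpus = Corpus(file_path, csv_path,
                             ignore_warnings=ignore_warnings)

# With ignore_warnings=True, no warning is emitted during initialization.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    GenderProximityAnalyzer("texts/", "meta.csv", ignore_warnings=True)
print(len(caught))  # 0
```

The design choice here is simply to accept the flag at the analyzer level and pass it through unchanged, so the existing `Corpus` behavior stays the single source of truth for the warning.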
Overview
This issue contains preliminary notes on summer UROP work. It condenses many of the collected notes and issues created as the last few groups have picked up work on this project. The goal is to produce a series of three or four self-contained issues that the program can tackle during the summer.
First steps
Check out onboarding and installation documentation for the lab here. We'll spend the first week of the lab getting up to speed with the project and getting all of the required tooling set up.
Project ideas
Improve the interface
Relevant topics: object-oriented programming, interface design, data structures.
The analysis modules `dependency_parsing` and `dunning` do not follow the updated package architecture as outlined in issue #157. Let's bring them up to the standards of the rest of the package and, while we're at it, improve the test coverage on these modules. Consider whether this package would serve as a suitable upgrade for the existing `dependency_parsing` dependency tree implementation.
Improve memory usage
Relevant topics: data structures, algorithms, memory.
We could improve the performance of our various analysis modules. Leaving an assemblage of notes we've gathered over the last few months here until we can sort things out:

- From issue #157: "I'd encourage us to think, here, about processing load! Initializing our Analyzer functions on medium-to-large corpora takes a really long time, so I think we want to do that as little as possible. For example, the `GenderProximityAnalyzer` function needs to be initialized for each part-of-speech set; if I want to look for verbs but have initialized based on adjectives, e.g., I have to initialize twice."
- Mentioned in Slack: "feature request: a progress bar for hefty analyses!"
- Outlined in issue #135: there are two separate kinds of tokenization in the codebase, `nltk.word_tokenize` and the custom implementation in `Document.get_tokenized_text()`. We may want to standardize our implementation, provide better memoization, and generally ensure that all analyses that rely on tokenized texts can retrieve the tokenized text optimally.
- Consider which packages/tools might be of use here (this package, vel sim.).
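One way the memoization mentioned above might look: cache the tokenized text on the instance the first time it is computed, so repeated analyses don't re-tokenize. This is an illustrative sketch, not the package's actual `Document` class; a whitespace split stands in for real tokenization to keep it dependency-free.

```python
from functools import cached_property

class Document:
    """Sketch of memoizing tokenization on a Document-like class
    (illustrative; not the package's actual implementation)."""
    def __init__(self, text):
        self.text = text

    @cached_property
    def tokenized_text(self):
        # Computed on first access, then cached on the instance.
        # A real implementation might call nltk.word_tokenize here.
        return self.text.lower().split()

doc = Document("The quick brown fox")
print(doc.tokenized_text)                        # ['the', 'quick', 'brown', 'fox']
print(doc.tokenized_text is doc.tokenized_text)  # True: cached, not recomputed
```

`functools.cached_property` (Python 3.8+) keeps the caching logic out of every call site, which is one way to ensure all analyses retrieve the tokenized text optimally.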
Improve how we manage texts
Relevant topics: interface design, web scraping, data types.
It's probable that we can improve how we work with our `.txt` files in the package as a whole. We've identified one significant problem, for instance: if a user imports a `.txt` file with no content into a `Corpus` or `Document` instance, many of our analyses fail (notes outlining the problem here). We also implicitly rely on the user constructing meaningful `.csv` metadata files for many of our analysis modules, so at the very least we could help them out by updating our documentation with helpful tips for creating such a file (notes here). It's probably worth spending some time identifying other similar problems relating to how we format and use `.txt` files.
We could also revive a long-lived attempt to allow the user to load in Project Gutenberg texts (issue #41).
Improve the user-facing output
Relevant topics: interface design, data structures, data visualization.
Most of our analysis modules produce Python dictionaries. These are relatively simple to traverse and to convert into other data formats (for instance, `pandas` `DataFrame`s), but they may not be the ideal data format for our users without that additional transformation. It's also likely that our users will want to produce data visualizations based on these analysis modules, much of which is relatively straightforward with something like `pandas` and `matplotlib` (ex. in issue #165). How much can and should we streamline some of that data visualization creation for our users? Which of our analysis modules are particularly suitable for creating visualizations from?
Additionally, there's some inconsistency throughout the package as to what data format we return and when. In many places we return dictionaries that would be better represented as `Counter` instances, for instance. Some of this topic is initially outlined in issue #104.
General issues
- The `frequency` module does not currently allow the user to find average pronoun counts across documents (called out in issue #165). Should we introduce that?
- Could we add a `.summary` on `Corpus`? What would it print to the user?
- Could `help()` on a particular class or method be more useful?
- `Document.word_count` could use a `remove_swords` flag.
- `Document.get_wordcount_counter()` could use a `remove_swords` flag.
- Could we make `Document.words_associated()` and `Document.get_word_windows()` more sophisticated? These methods measure only the occurrence of a word immediately after, or within a window of, the target word in the tokenized text. We're definitely picking up associations across syntactical breaks. While those words undoubtedly have an association with the target word, they're 'less' associated than, say, words that are in an object relationship with the target word.
- For `Document` methods, could we pull those out to the `Corpus` class and organize the return by `Document` label? That way the interface would always be the `Corpus`.
- `Document.update_metadata()` would require us to update any cached returns in the `analysis` modules.
- `Corpus.count_authors_by_gender()` takes a string argument to represent gender. Is that standard throughout the `analysis` modules?
- `Corpus.get_wordcount_counter()` could use a `remove_swords` flag.
- `Corpus.get_field_vals()` could be supplanted by a `.summary` vel sim.
- `Corpus.get_sample_text_passages()` returns a list of tuples of the shape `Tuple[str(document filename), str(sentence)]`. It might be more useful to return a dictionary with the document filename/label (filename minus extension) as the top-level keys; this data structure is very common throughout the `analysis` modules.
- `dunning.dunn_individual_word_by_corpus()` throws a `ZeroDivisionError` if the target word doesn't exist in a `Corpus`.
- Is there a reason the `metadata_visualizations` functions shouldn't be methods on the `Corpus` class?
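One concrete restructuring from the list above, the proposed label-keyed return for `Corpus.get_sample_text_passages()`, might look roughly like this. The helper name and sample data are invented for illustration; only the tuple shape and the "filename minus extension" labeling come from the notes.

```python
from collections import defaultdict
from os.path import splitext

def passages_by_label(passages):
    """Regroup (filename, sentence) tuples into a dict keyed by document
    label (filename minus extension) -- a sketch of the restructuring
    proposed for Corpus.get_sample_text_passages()."""
    grouped = defaultdict(list)
    for filename, sentence in passages:
        grouped[splitext(filename)[0]].append(sentence)
    return dict(grouped)

passages = [("persuasion.txt", "She had been forced into prudence..."),
            ("persuasion.txt", "Anne was to leave them...")]
print(passages_by_label(passages))
# {'persuasion': ['She had been forced into prudence...', 'Anne was to leave them...']}
```

Returning a dict keyed by label would match the data structure already common in the `analysis` modules, making the output easier to join with other per-document results.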