dhmit / gender_analysis

A toolkit for analyzing gendered language across sets of documents

Summer UROP planning #170

Open MBJean opened 3 years ago

MBJean commented 3 years ago

Overview

This issue contains preliminary notes on summer UROP work. It condenses the notes and issues that have accumulated as the last few groups have picked up work on this project. The goal is to produce three or four self-contained issues that the program can tackle over the summer.

First steps

Check out the onboarding and installation documentation for the lab here. We'll spend the first week of the lab getting up to speed with the project and setting up all of the required tooling.

Project ideas

Improve the interface

Relevant topics: object-oriented programming, interface design, data structures.

The analysis modules dependency_parsing and dunning do not follow the updated package architecture outlined in issue #157. Let's bring them up to the standards of the rest of the package and, while we're at it, improve the test coverage of these modules. Consider whether this package would serve as a suitable upgrade for the existing dependency_parsing dependency tree implementation.

Improve memory usage

Relevant topics: data structures, algorithms, memory.

Our various analysis modules could be better optimized and more performant. Here's an assemblage of notes we've gathered over the last few months, left here until we can sort things out:

From issue #157: "I'd encourage us to think, here, about processing load! Initializing our Analyzer functions on medium-to-large corpora takes a really long time, so I think we want to do that as little as possible. For example, the GenderProximityAnalyzer function needs to be initialized for each part-of-speech set; if I want to look for verbs but have initialized based on adjectives, e.g., I have to initialize twice."

Mentioned in Slack: "feature request: a progress bar for hefty analyses!"

Outlined in issue #135: there are two separate kinds of tokenization in the codebase, nltk.word_tokenize and the custom implementation in Document.get_tokenized_text(). We may want to standardize on a single implementation, provide better memoization, and generally ensure that every analysis relying on tokenized text can retrieve it efficiently (see the memoization sketch after these notes).

Consider which packages/tools might be of use here (this package, vel sim.).
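
Picking up the memoization note from issue #135 above, here's a minimal sketch of lazily caching tokenized text on a Document-like class. The class shape and attribute names are illustrative assumptions, not the package's actual API:

import nltk  # word_tokenize requires the 'punkt' resource to be downloaded


class Document:
    def __init__(self, text):
        self.text = text
        self._tokenized_text = None  # cache, populated on first access

    def get_tokenized_text(self):
        # Tokenize lazily and cache the result, so repeated analyses of the
        # same document never re-run nltk.word_tokenize.
        if self._tokenized_text is None:
            self._tokenized_text = nltk.word_tokenize(self.text)
        return self._tokenized_text

Caching on the instance keeps the tokenized text alive exactly as long as the document itself, which also bounds memory usage.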

Improve how we manage texts

Relevant topics: interface design, web scraping, data types.

It's probable that we can improve how we work with our .txt files in the package as a whole. We've identified one significant problem already: if a user imports a .txt file with no content into a Corpus or Document instance, many of our analyses fail (notes outlining the problem here). We also implicitly rely on the user constructing meaningful .csv metadata files for many of our analysis modules, so at the very least we could help them out by updating our documentation with tips for creating such a file (notes here). It's probably worth spending some time identifying other, similar problems in how we format and use .txt files.
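
As a concrete sketch of one possible fix for the empty-file problem, a loader could skip (and flag) files with no content up front; this helper and its warning wording are hypothetical, not current package behavior:

from pathlib import Path


def iter_nonempty_texts(directory):
    # Hypothetical helper: yield (path, text) for each .txt file in the
    # directory, skipping empty files instead of letting analyses fail later.
    for path in sorted(Path(directory).glob('*.txt')):
        text = path.read_text(encoding='utf-8')
        if not text.strip():
            print(f'WARNING: skipping {path.name} because it has no content')
            continue
        yield path, text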

We could also revive a long-standing effort (issue #41) to let users load Project Gutenberg texts.

Improve the user-facing output

Relevant topics: interface design, data structures, data visualization.

Most of our analysis modules produce Python dictionaries. These are relatively simple to traverse and to convert into other data formats (for instance, pandas DataFrames), but without that additional transformation they may not be the ideal format for our users. It's also likely that our users will want to produce data visualizations based on these analyses, much of which is relatively straightforward with something like pandas and matplotlib (ex. in issue #165). How much can and should we streamline that visualization work for our users? Which of our analysis modules are particularly suitable for creating visualizations?
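
To make that transformation concrete, here's a hedged sketch using an invented analyzer result (a nested dict of counts, not real package output); two lines of pandas get a DataFrame, and one more gets a chart:

import pandas as pd
import matplotlib.pyplot as plt

# Invented example output: document -> counts of gender-associated tokens.
results = {
    'austen_persuasion.txt': {'female': 132, 'male': 98},
    'dickens_bleak_house.txt': {'female': 87, 'male': 141},
}

df = pd.DataFrame.from_dict(results, orient='index')
df.plot(kind='bar')
plt.ylabel('token count')
plt.tight_layout()
plt.show()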

Additionally, there's some inconsistency throughout the package as to which data format we return and when. In many places, for instance, we return dictionaries that would be better represented as Counter instances. This topic is initially outlined in issue #104.
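
A small illustration of the dict-versus-Counter point: Counter gives us element-wise merging and most_common() for free, which plain dicts don't (the counts here are made up):

from collections import Counter

chapter_one = Counter({'she': 12, 'he': 7})
chapter_two = Counter({'she': 4, 'he': 6})

# Counters add element-wise; plain dicts would need a manual merge loop.
combined = chapter_one + chapter_two
print(combined.most_common(1))  # [('she', 16)]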

General issues

farooqashar commented 3 years ago

If LookupError exceptions occur when you run pytest, note that you might need to install some resources/modules manually on your local machine, for example by running import nltk and then nltk.download('stopwords') in a Python shell.
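
For reference, the one-time download looks like this; 'stopwords' comes from the comment above, while 'punkt' (the model nltk.word_tokenize relies on) is an educated guess at another resource the suite may need:

import nltk

# Run once per machine; resources land in the NLTK data directory.
nltk.download('stopwords')
nltk.download('punkt')  # assumption: needed by nltk.word_tokenize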

j-aslarus commented 3 years ago

When running pytest, a Python window opens (presumably an attempt to create data visualizations), but no visualizations are created and the run stalls until it is forcibly terminated. The cause of this issue and its solution are unclear.

joshfeli commented 3 years ago

When running pytest, after all tests pass, there is an error that seems to arise from one of the packages the project requires. The error looks something like:

Error in atexit._run_exitfuncs:
Traceback (most recent call last):
    File "...\gender_analysis\venv\lib\site-packages\clint\packages\colorama\initialise.py", line 17, in reset_all
        AnsiToWin32(orig_stdout).reset_all()
    ...
ValueError: I/O operation on closed file

joshfeli commented 3 years ago

While working through the first code block of the "Proximity Analysis" section in the Quickstart Guide, running the third line (which defines analyzer) produces the following warning:

WARNING: The following .txt files were not loaded because they are not in your metadata csv:
[...] # a list of names of .txt files
You may want to check that your metadata matches your files to avoid incorrect results.

ryaanahmed commented 3 years ago

When running pytest, after all tests pass, there is an error that seems to arise from one of the packages the project requires...

@joshfeli are you on Windows? I've encountered this error running pytest on Windows as well. I believe it has to do with the way the Windows terminal handles output redirection. Let's investigate together, if so.

joshfeli commented 3 years ago

@ryaanahmed Yes, I am on Windows. The same warning displays for every equivalent code block that involves defining some analyzer variable.

However, I was speaking with someone using a Mac who has encountered the same problem.

MBJean commented 3 years ago

@joshfeli, I think @ryaanahmed was referencing the errors you encountered when running pytest. Sounds like a great place to do some pair debugging!

The warning you're seeing when initializing an instance of, say, GenderProximityAnalyzer derives ultimately from the _load_documents_and_metadata method on the Corpus class (an instance of which is created when initializing GenderProximityAnalyzer) and is expected. When initializing the Corpus class, the user passes in a file_path argument pointing to a directory with .txt files and a csv_path argument pointing to a .csv file containing metadata about those files. If the .csv doesn't contain references to all of the .txt files in the directory you specify, you receive that warning. It doesn't prevent the successful initialization of Corpus or any class that ultimately initializes a Corpus.

When creating a Corpus instance directly, you can pass the optional argument ignore_warnings=True to suppress those warnings. We currently do not provide a mechanism to do so when initializing GenderProximityAnalyzer, so it might be worth thinking about adding one.
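
In the meantime, suppressing the warning when constructing a Corpus directly might look like the following; the import path and the file paths are assumptions based on this thread, with ignore_warnings being the option described above:

from gender_analysis.corpus import Corpus  # assumed import path

corpus = Corpus(
    'path/to/texts',                  # directory of .txt files
    csv_path='path/to/metadata.csv',  # metadata describing those files
    ignore_warnings=True,             # suppress the metadata warning above
)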