Open samimak37 opened 4 years ago
I think this could be handled fairly easily once #104 is handled as well, as the scope seems somewhat similar between the two.
I'd like to add onto this by suggesting that we remove the `test_corpus` directory entirely, as right now it just contains the first 10 documents from `sample_novels`. We do use this corpus for tests in a few places, but if we add an `ignore_warnings` flag to `_load_documents_and_metadata` and the `Corpus` initializer, we can simply make 2 different csv files, `large_corpus` and `small_corpus`, and initialize based on the csv files rather than all of the documents in the directory.
I'd also want to create a third csv file, `tiny_corpus`, with 4 documents in it, for testing some of the more resource-intensive functions (e.g. `run_adj_analysis`).
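To make the csv idea concrete, here is a rough sketch of selecting documents from a metadata csv instead of from everything in the directory. It is not the code in the repo: `select_documents`, the `filename` and `author` columns, and the example file names are all made up for illustration, and I'm assuming the warning in question fires for files that are in the directory but have no row in the csv.

```python
import csv
import io
import warnings


def select_documents(csv_text, directory_files, ignore_warnings=False):
    """Return the metadata rows for documents listed in the csv.

    Hypothetical helper: files present in the directory but missing from
    the csv are skipped, with a warning unless ignore_warnings is True.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    wanted = {row["filename"] for row in rows}

    # Files in the directory that the csv does not mention.
    skipped = sorted(set(directory_files) - wanted)
    if skipped and not ignore_warnings:
        warnings.warn(f"{len(skipped)} file(s) in the directory have no metadata row")

    # Only keep rows whose file actually exists in the directory.
    return [row for row in rows if row["filename"] in directory_files]


# Assumed example data: a two-row csv over a four-file directory.
tiny_corpus_csv = "filename,author\na.txt,Austen\nb.txt,Bronte\n"
directory = ["a.txt", "b.txt", "c.txt", "d.txt"]

selected = select_documents(tiny_corpus_csv, directory, ignore_warnings=True)
print([row["filename"] for row in selected])
```

With something like this, the same directory can back the large, small, and tiny corpora, and the tests just pick which csv to load.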
I've implemented this in the `tighter_testing` branch but am going to hold off on opening a PR until @ryaanahmed or @samimak37 chime in and say whether or not it's worth doing.
In several functions, we have doctests that are effectively info-dumping a large and complex dictionary. This might be fine for internal tests, but we should simplify the doctests for user readability. For instance:
https://github.com/dhmit/gender_analysis/blob/ee1d41f1201202b9f608de8030c0059f0047d980/gender_analysis/analysis/gender_frequency.py#L266-L271
The dictionary in this output is much too long to serve as a meaningful example for someone who is trying to understand the function. We could probably simplify it by either breaking the output up into separate components or trimming the dictionary down.
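As a rough illustration of what "breaking up the output" could look like, here is a sketch of a doctest that checks a couple of small pieces of the result instead of printing the whole dictionary. `count_gendered_words` and its word list are hypothetical stand-ins, not the function linked above.

```python
from collections import Counter


def count_gendered_words(tokens):
    """Count occurrences of a few gendered pronouns in a list of tokens.

    Hypothetical example function. The doctest inspects individual parts
    of the result rather than dumping the full dictionary:

    >>> counts = count_gendered_words("she said he knew she left".split())
    >>> counts["she"]
    2
    >>> sorted(counts)
    ['he', 'she']
    """
    gendered = {"he", "she"}
    return dict(Counter(token for token in tokens if token in gendered))
```

Each check shows one thing about the return value, so a reader can see the shape of the result without scrolling through dozens of keys. The full dictionary could still be asserted in the regular test suite.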