dhmit / gender_analysis

A toolkit for analyzing gendered language across sets of documents
BSD 3-Clause "New" or "Revised" License

Simplify Doctests #122

Open samimak37 opened 4 years ago

samimak37 commented 4 years ago

In several functions, the doctests effectively dump a large, complex dictionary. This may be fine for internal tests, but we should simplify the doctests so that they are readable for users. For instance:

https://github.com/dhmit/gender_analysis/blob/ee1d41f1201202b9f608de8030c0059f0047d980/gender_analysis/analysis/gender_frequency.py#L266-L271

The output dictionary is far too long to serve as a meaningful example for someone trying to understand the function. We could probably simplify it by splitting the output into separate components or by trimming the dictionary down.
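As a rough illustration of what a trimmed-down doctest could look like, here is a minimal sketch. `pronoun_counts` is a hypothetical stand-in written for this example, not a function from gender_analysis; the point is only that the doctest checks one or two entries and the key set instead of printing the whole dictionary.

```python
from collections import Counter


def pronoun_counts(text):
    """Count a few gendered pronouns in ``text``.

    Hypothetical stand-in for illustration only. The doctest inspects
    individual components of the result rather than the full dictionary:

    >>> counts = pronoun_counts("She said he saw her and she left")
    >>> counts["she"]
    2
    >>> sorted(counts)
    ['he', 'her', 'she']
    """
    pronouns = {"he", "him", "his", "she", "her", "hers"}
    # Naive whitespace tokenization keeps the example short.
    words = text.lower().split()
    return Counter(word for word in words if word in pronouns)
```

A reader can see the shape of the return value from two short checks, and the exhaustive comparison can live in the internal test suite instead.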

samimak37 commented 4 years ago

I think this could be handled fairly easily once #104 is resolved, since the two issues seem similar in scope.

kenalba commented 4 years ago

I'd like to add to this by suggesting that we remove the test_corpus directory entirely, since right now it just contains the first 10 documents from sample_novels. We do use this corpus for tests in a few places, but if we add an ignore_warnings flag to _load_documents_and_metadata and the Corpus initializer, we can simply make two CSV files, large_corpus and small_corpus, and initialize from those files rather than from every document in the directory.

I'd also like to create a third CSV file, tiny_corpus, with 4 documents in it, for testing some of the more resource-intensive functions (e.g. run_adj_analysis).
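To make the CSV-based idea concrete, here is a minimal sketch under stated assumptions. `load_listed_documents` and the `filename` column are hypothetical and written only for this example; this is not the code in the tighter_testing branch, and it is not the real signature of `_load_documents_and_metadata`. It assumes the warning being silenced is the one raised when the directory holds files that the CSV does not list.

```python
import csv
import warnings
from pathlib import Path


def load_listed_documents(directory, csv_path, ignore_warnings=False):
    """Return paths for the documents named in the CSV's ``filename`` column.

    Hypothetical helper for illustration; not part of gender_analysis.
    """
    directory = Path(directory)
    with open(csv_path, newline="", encoding="utf-8") as csv_file:
        listed = [row["filename"] for row in csv.DictReader(csv_file)]

    # Files present on disk that this corpus definition leaves out.
    on_disk = {path.name for path in directory.glob("*.txt")}
    unlisted = sorted(on_disk - set(listed))
    if unlisted and not ignore_warnings:
        warnings.warn(
            f"{len(unlisted)} file(s) in {directory} are not listed in {csv_path}"
        )

    # Only the listed documents become part of the corpus.
    return [directory / name for name in listed]
```

With something along these lines, a single directory of documents could back large_corpus, small_corpus, and tiny_corpus, with each test choosing its CSV and passing `ignore_warnings=True` to keep the output quiet.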

I've implemented this in the tighter_testing branch but am going to hold off on PRing until @ryaanahmed or @samimak37 chime in and say whether or not it's worth doing.