Corpus __init__ -- take dataset locations as args, not a corpus name

ryaanahmed commented 5 years ago

@samimak37 -- @sophiazhi, @meesuekim, and I were brainstorming about what the best way into the module would be, and we realized that we shouldn't always require the user to provide a metadata CSV file. A lot of our analysis tools work fine on a corpus with no metadata.

So, we need a Corpus.__init__() that takes as args:

- a path to a directory where the txt files live OR a path to a pickled corpus (require exactly one of the two)
- optionally, the location of a metadata CSV file
- a bool for whether or not to pickle the corpus on loading (I'm not sure what the default should be yet... on the one hand these are potentially big and slow loads, so pickling by default makes sense, but on the other hand I really don't like the idea of serializing and writing to disk without actually asking the user)

If the caller does not supply the location of a metadata CSV file, we'll construct a metadata dict ourselves, with only a 'filename' key.
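
A rough sketch of what that constructor could look like, just to pin down the behavior described above. This is not the actual gender_analysis code: the parameter names (`text_dir`, `pickle_path`, `csv_path`, `pickle_on_load`), the `documents` attribute, the `corpus.pickle` output filename, and the `False` default for pickling are all placeholders I'm assuming for illustration. It also assumes the metadata CSV has a 'filename' column.

```python
import csv
import pickle
from pathlib import Path


class Corpus:
    """Illustrative sketch only; names and defaults are assumptions."""

    def __init__(self, text_dir=None, pickle_path=None, csv_path=None,
                 pickle_on_load=False):
        # Require exactly one of: a directory of txt files, or a pickled corpus.
        if (text_dir is None) == (pickle_path is None):
            raise ValueError("provide exactly one of text_dir or pickle_path")

        if pickle_path is not None:
            # Load a previously pickled corpus and skip the slow text load.
            with open(pickle_path, "rb") as f:
                self.documents = pickle.load(f)
            return

        text_dir = Path(text_dir)
        if csv_path is None:
            # No metadata CSV: build minimal metadata with only a 'filename' key.
            metadata = [{"filename": p.name}
                        for p in sorted(text_dir.glob("*.txt"))]
        else:
            # Assumes the CSV has a 'filename' column naming each txt file.
            with open(csv_path, newline="", encoding="utf-8") as f:
                metadata = list(csv.DictReader(f))

        self.documents = []
        for row in metadata:
            text = (text_dir / row["filename"]).read_text(encoding="utf-8")
            self.documents.append({**row, "text": text})

        if pickle_on_load:
            # Assumed output location; only written when the caller opts in.
            with open(text_dir / "corpus.pickle", "wb") as f:
                pickle.dump(self.documents, f)
```

With this shape, `Corpus(text_dir="some/dir")` works with no metadata at all, and nothing is written to disk unless the caller explicitly passes `pickle_on_load=True`.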

ryaanahmed commented 5 years ago

Maybe take 'name' as an optional arg...

ryaanahmed commented 5 years ago

done! wahoo.

dhmit / gender_analysis

Corpus __init__ -- take dataset locations as args, not a corpus name #16