dhmit / gender_analysis

A toolkit for analyzing gendered language across sets of documents
BSD 3-Clause "New" or "Revised" License

Handle Epistolary Novels #137

Open kenalba opened 4 years ago

kenalba commented 4 years ago

We could build a method that takes in a Document, detects whether it is an epistolary novel, and, if so, breaks the document up into a dictionary of letters (or a list of Letter objects?). We'll want to programmatically detect the writer of each letter and include that in the metadata.

Ideally, we can programmatically determine metadata for each letter: writer, date, recipient, and so on. That's going to be tricky, but it may be possible. If we combine this functionality with our hypothetical named entity recognition module (to get a character list) and an ML-based gender guesser for each character, we can do some classy stuff.
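To make the idea concrete, here is a minimal sketch of the splitting step. It is not the project's API: `Letter`, `is_epistolary`, and `split_letters` are hypothetical names, and it works on a plain text string rather than a `Document`. It assumes letters are introduced by heading lines such as "LETTER I" or "Letter 12.", that the recipient appears in a "Dear X," salutation, and that a short final line is the writer's signature. Real novels will break all three assumptions often, so treat these as starting heuristics.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Assumed heading convention: a line containing only "LETTER" plus a Roman or Arabic numeral.
HEADING_RE = re.compile(
    r"^[ \t]*LETTER[ \t]+(?:[IVXLCDM]+|\d+)\.?[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)
# Assumed salutation convention: "Dear X," / "Dearest X," / "My dear X,".
SALUTATION_RE = re.compile(
    r"^[ \t]*(?:My dear(?:est)?|Dear(?:est)?)[ \t]+([^,\n]+),",
    re.IGNORECASE | re.MULTILINE,
)


@dataclass
class Letter:
    """Hypothetical container for one letter and its guessed metadata."""
    heading: str
    body: str
    recipient: Optional[str] = None
    writer: Optional[str] = None


def is_epistolary(text: str, min_letters: int = 2) -> bool:
    """Guess that a text is epistolary if it has enough letter headings."""
    return len(HEADING_RE.findall(text)) >= min_letters


def split_letters(text: str) -> List[Letter]:
    """Split text at letter headings and guess recipient and writer."""
    matches = list(HEADING_RE.finditer(text))
    letters = []
    for i, match in enumerate(matches):
        # Each letter runs from its heading to the next heading (or end of text).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[match.end():end].strip()
        salutation = SALUTATION_RE.search(body)
        lines = [ln.strip() for ln in body.splitlines() if ln.strip()]
        # Heuristic: a final line of at most three words is treated as a signature.
        writer = None
        if lines and len(lines[-1].split()) <= 3:
            writer = lines[-1].rstrip(".,")
        letters.append(Letter(
            heading=match.group(0).strip(),
            body=body,
            recipient=salutation.group(1).strip() if salutation else None,
            writer=writer,
        ))
    return letters


sample = """LETTER I

Dear Margaret,
I arrived safely yesterday.
Robert

LETTER II

My dear Robert,
Your news delighted us all.
Margaret
"""

for letter in split_letters(sample):
    print(letter.heading, "| from:", letter.writer, "| to:", letter.recipient)
```

Returning a list of `Letter` objects rather than a dictionary keeps the letters in reading order and leaves room for more fields (date, place) later. Date detection is left out here because date formats vary too much for a single pattern.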

fyang3 commented 4 years ago

A pretty good overview of named entity recognition: https://www.mygreatlearning.com/blog/named-entity-recognition/

Microsoft Azure also has an NLP module for this: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/named-entity-recognition

For detecting gender by name, we could use NLTK or scikit-learn and build our own classifier, so we would need to decide which features we'd like: https://www.geeksforgeeks.org/python-gender-identification-by-name-using-nltk/

This is an example of building a classifier: https://gist.github.com/vinovator/6e5bf1e1bc61687a1e809780c30d6bf6
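As a rough illustration of the "decide what features we'd like" question, here is a small Naive Bayes sketch over name-ending features. It uses only the standard library instead of NLTK or scikit-learn so that it runs on its own. `name_features` and `NaiveBayesNameClassifier` are hypothetical names, and the eight training names are a toy list made up for illustration. A usable guesser would need a large labeled name corpus and would still misjudge many names.

```python
import math
from collections import Counter, defaultdict


def name_features(name):
    """Candidate features: first letter, last letter, last two letters."""
    n = name.strip().lower()
    return {"first_letter": n[:1], "last_letter": n[-1:], "last_two": n[-2:]}


class NaiveBayesNameClassifier:
    """Tiny categorical Naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.label_counts = Counter()
        # label -> Counter of (feature_name, value) pairs seen with that label
        self.feature_counts = defaultdict(Counter)
        # feature_name -> set of all values seen in training
        self.feature_values = defaultdict(set)

    def train(self, labeled_names):
        for name, label in labeled_names:
            self.label_counts[label] += 1
            for fname, value in name_features(name).items():
                self.feature_counts[label][(fname, value)] += 1
                self.feature_values[fname].add(value)

    def classify(self, name):
        total = sum(self.label_counts.values())
        best_label, best_score = None, float("-inf")
        for label, count in self.label_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(count / total)
            for fname, value in name_features(name).items():
                seen = self.feature_counts[label][(fname, value)]
                score += math.log((seen + 1) / (count + len(self.feature_values[fname])))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, for illustration only.
training = [
    ("Anna", "female"), ("Maria", "female"), ("Emma", "female"), ("Julia", "female"),
    ("John", "male"), ("Robert", "male"), ("Edward", "male"), ("William", "male"),
]

clf = NaiveBayesNameClassifier()
clf.train(training)
print(clf.classify("Sophia"), clf.classify("Richard"))
```

Swapping in NLTK or scikit-learn would mostly mean keeping `name_features` and replacing the classifier class, so the feature choice can be discussed independently of the library.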