dhmit / gender_analysis

A toolkit for analyzing gendered language across sets of documents
BSD 3-Clause "New" or "Revised" License

Named entity recognition #113

Open kenalba opened 3 years ago

kenalba commented 3 years ago

A useful feature for analysis that depends less on gendered pronouns would be recognizing proper names. This would allow a user to see the adjectives used to describe a particular character, or a particular family of characters.

A naïve solution might simply search for capitalized words that don't show up in a dictionary, though I suspect we'll need a more robust algorithm to make this usable. Our POS tagger might also get us part of the way there. There are open-source approaches to the problem; spaCy seems like it might be able to do what we want here.
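To make the naïve idea concrete, here is a rough sketch of the capitalization-plus-dictionary heuristic. The `COMMON_WORDS` set and the `naive_name_candidates` function are hypothetical names invented for illustration, and the tiny word set stands in for a real dictionary; none of this is existing code in the toolkit.

```python
import re

# Hypothetical stand-in for a real dictionary word list; in practice this
# would be loaded from a much larger source.
COMMON_WORDS = {
    "the", "a", "was", "and", "she", "he", "handsome", "clever",
    "rich", "said", "to", "her", "father",
}


def naive_name_candidates(text, dictionary=COMMON_WORDS):
    """Return capitalized tokens whose lowercase form is not in the dictionary.

    Tokens are returned in order of first appearance, without duplicates.
    """
    tokens = re.findall(r"[A-Za-z]+", text)
    candidates = []
    for token in tokens:
        is_capitalized = token[0].isupper()
        is_unknown = token.lower() not in dictionary
        if is_capitalized and is_unknown and token not in candidates:
            candidates.append(token)
    return candidates


sample = "Emma Woodhouse was handsome, clever, and rich. She said to her father."
print(naive_name_candidates(sample))
```

The weaknesses show up quickly: any sentence-initial word missing from the dictionary is a false positive, and names that are also ordinary words (Grace, Rose) are missed, which is why something more robust is probably needed.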

fyang3 commented 3 years ago

One issue: a character in a novel might have more than one name. For instance, Emma Woodhouse could also be called Emma. If we just want to identify all the names in a sentence, that should not be difficult; but we do need to account for the fact that a single character can have multiple "identities".

kenalba commented 3 years ago

A very good point! Collapsing multiple 'nicknames' into a single entity is a nontrivial task. If it seems possible, though, we should consider looking into it.

It might mean creating a Character class that has a series of other names associated with it. This approach would pay particular dividends were we to look into, say, fanfiction, where the same character might show up in multiple novels. Worth brainstorming about, for sure.
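As a starting point for that brainstorming, one possible shape for such a class is sketched below. Everything here is an assumption: the field names, the `matches` method, and the `resolve` helper are hypothetical, and the aliases are supplied by hand rather than discovered automatically, which is the hard part.

```python
from dataclasses import dataclass, field


@dataclass
class Character:
    """A character with one canonical name and any number of aliases."""

    name: str
    aliases: set = field(default_factory=set)

    def all_names(self):
        """Return the canonical name together with every alias."""
        return {self.name} | self.aliases

    def matches(self, mention):
        """Return True if the mention is exactly one of this character's names."""
        return mention in self.all_names()


def resolve(mention, characters):
    """Return the first Character whose names include the mention, or None."""
    for character in characters:
        if character.matches(mention):
            return character
    return None


emma = Character("Emma Woodhouse", {"Emma", "Miss Woodhouse"})
knightley = Character("George Knightley", {"Mr. Knightley"})
cast = [emma, knightley]

print(resolve("Emma", cast).name)
print(resolve("Harriet", cast))
```

Exact string matching keeps the sketch simple, but it would not handle ambiguous mentions such as a shared surname; that would need context from the surrounding text.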