dhmit / gender_analysis

A toolkit for analyzing gendered language across sets of documents
BSD 3-Clause "New" or "Revised" License

Keep punctuation and capitalization for get_sample_text_passages #105

Open kenalba opened 4 years ago

kenalba commented 4 years ago

Currently, get_sample_text_passages returns strings built from the tokenized text, so they have been stripped of punctuation and capitalization. That normalization makes sense for searching, but the output should be as true to the raw text as possible.
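One way to get both behaviours is to search on normalized tokens but remember each token's character offsets, then slice the returned passage out of the raw text. The following is only a sketch of that idea: find_raw_passages, its parameters, and the simple token regex are hypothetical and are not part of gender_analysis.

```python
import re


def find_raw_passages(raw_text, phrase, context_words=3):
    """Return passages around each match of `phrase`, sliced from the raw text.

    Hypothetical helper: matching is done on lowercased tokens, but the
    returned strings keep the original punctuation and capitalization.
    """
    # Each entry is (lowercased token, start offset, end offset) in raw_text.
    spans = [
        (m.group().lower(), m.start(), m.end())
        for m in re.finditer(r"[A-Za-z0-9']+", raw_text)
    ]
    target = phrase.lower().split()
    n = len(target)
    passages = []
    for i in range(len(spans) - n + 1):
        if [tok for tok, _, _ in spans[i:i + n]] == target:
            first = max(0, i - context_words)
            last = min(len(spans) - 1, i + n - 1 + context_words)
            # Slice the untouched raw text between the outer token offsets.
            passages.append(raw_text[spans[first][1]:spans[last][2]])
    return passages


raw = 'She said, "Don\'t go!" He left. Then SHE wept.'
passages = find_raw_passages(raw, "she", context_words=2)
```

Here the search for "she" matches both "She" and "SHE", and each returned passage is a verbatim slice of the raw string.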

kenalba commented 4 years ago

https://github.com/dhmit/gender_analysis/pull/111 addresses this, but I don't think it quite fixes the problem. Right now, the Document class uses a method, get_tokenized_text, that 'tokenizes' by looping through the text and literally stripping out every character in an excluded set (!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~). As mentioned in the comment, it does not handle dashes or contractions properly.
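To make the failure concrete, here is a rough stand-in for that strip-then-split approach. It is not the actual Document code: strip_and_split is a hypothetical name, and the sketch assumes the excluded characters are deleted outright and that the excluded set matches Python's string.punctuation, which is what the list quoted above appears to be.

```python
import string

# Assumption: the excluded set is equivalent to string.punctuation.
EXCLUDED = set(string.punctuation)


def strip_and_split(text):
    """Lowercase, delete every excluded character, then split on whitespace."""
    cleaned = "".join(ch for ch in text.lower() if ch not in EXCLUDED)
    return cleaned.split()


tokens = strip_and_split("Don't stop--the well-known Mr. Darcy!")
# "Don't" collapses to "dont", "stop--the" fuses into "stopthe",
# and "well-known" becomes "wellknown".
```

Because the punctuation is thrown away rather than kept as separate tokens, there is nothing left for a detokenizer to put back.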

PR 111 might gesture towards a solution to the problem, but unless we use the Treebank tokenizer (which we don't, right now) I don't think we're going to get reasonable results from its automated detokenizing.

kenalba commented 4 years ago

Some further thoughts on tokenizing: we already use word_tokenize (which uses Punkt to split sentences and then a Treebank-style word tokenizer) in get_pos; it might make sense to use that and store a tokenized version of the text as part of a Document object? Tokenizing takes a good amount of time, though; ideally, I'd think we'd do that work in a thread so that other analysis can take place concurrently. Alternatively, we could use wordpunct_tokenize, which is faster and uses regexes to tokenize. Worth thinking about, but maybe not for the alpha.
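A minimal sketch of the "tokenize once, in the background, and keep the result on the object" idea, using only the standard library. TokenizedDocument and regex_tokenize are hypothetical names, not the real Document API. The regex is the pattern NLTK's WordPunctTokenizer is built on, so it approximates wordpunct_tokenize without needing NLTK installed.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Same pattern as NLTK's WordPunctTokenizer: runs of word characters,
# or runs of punctuation, with whitespace discarded.
_WORDPUNCT = re.compile(r"\w+|[^\w\s]+")


def regex_tokenize(text):
    """Split text into alphanumeric and punctuation tokens."""
    return _WORDPUNCT.findall(text)


class TokenizedDocument:
    """Hypothetical stand-in for a Document that caches its tokens."""

    # One shared worker thread for all documents in this sketch.
    _executor = ThreadPoolExecutor(max_workers=1)

    def __init__(self, text):
        self.text = text
        # Start tokenizing immediately; the constructor does not wait.
        self._future = self._executor.submit(regex_tokenize, text)

    @property
    def tokens(self):
        # Blocks only if tokenization has not finished yet.
        return self._future.result()


doc = TokenizedDocument("Don't stop, Mr. Darcy!")
doc_tokens = doc.tokens
```

Note that this tokenizer keeps punctuation as separate tokens but still splits contractions ("Don", "'", "t"). Also, in CPython a thread will not speed up pure-Python tokenizing itself because of the GIL; it mainly lets other work proceed while waiting, so a process pool might be worth considering if tokenizing turns out to be the bottleneck.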