kenalba opened 4 years ago
https://github.com/dhmit/gender_analysis/pull/111 addresses this, but I don't think it quite fixes the problem. Right now, the Document class uses a method, `get_tokenizedtext`, that 'tokenizes' by looping through the text and literally stripping out every punctuation character in an excluded set (`` !"#$%&'()*+,-./:;<=>?@[\]^`{|}~ ``). As mentioned in the comment, it does not handle dashes or contractions properly.
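For the record, here's a minimal sketch of that stripping approach (not the actual implementation) showing both failure modes:

```python
def naive_tokenize(text):
    """Strip every character in the excluded set, then split on whitespace.

    A minimal sketch of the stripping approach described above --
    not the actual get_tokenizedtext implementation.
    """
    excluded = '!"#$%&\'()*+,-./:;<=>?@[\\]^`{|}~'
    cleaned = ''.join(ch for ch in text if ch not in excluded)
    return cleaned.lower().split()

print(naive_tokenize("Don't stop"))         # ['dont', 'stop'] -- contraction collapsed
print(naive_tokenize("a well-known text"))  # ['a', 'wellknown', 'text'] -- hyphen swallowed
```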
PR 111 might gesture towards a solution, but unless we switch to the Treebank tokenizer (which we don't use right now), I don't think its automated detokenizing will give us reasonable results.
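A quick illustration of the mismatch, assuming NLTK with the punkt data installed; NLTK's `TreebankWordDetokenizer` only knows how to reverse Treebank-style tokens:

```python
from nltk.tokenize import word_tokenize, wordpunct_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer

detok = TreebankWordDetokenizer()

# Treebank-style tokens round-trip reasonably well
# (word_tokenize needs the 'punkt' data package):
tokens = word_tokenize("Don't stop")       # ['Do', "n't", 'stop']
print(detok.detokenize(tokens))            # "Don't stop"

# ...but tokens from a different tokenizer don't:
tokens = wordpunct_tokenize("Don't stop")  # ['Don', "'", 't', 'stop']
print(detok.detokenize(tokens))            # e.g. "Don' t stop" -- not restored
```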
Some further thoughts on tokenizing: we already use `word_tokenize` (which uses punkt's tokenizer) in `get_pos`; it might make sense to use that everywhere and store a tokenized version of the text as part of the Document object. Tokenizing takes a good amount of time, though; ideally, we'd do that work in a thread so that other analysis can take place concurrently (see the sketch below). Alternatively, we could use `wordpunct_tokenize`, which is faster because it tokenizes with regexes. Worth thinking about, but maybe not for the alpha.
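One possible shape for the threaded version, sketched with hypothetical names (the real Document class may look different):

```python
from concurrent.futures import ThreadPoolExecutor

from nltk import word_tokenize  # punkt-based; needs the 'punkt' data package

_tokenizer_pool = ThreadPoolExecutor(max_workers=2)

class Document:
    """Illustrative sketch only -- attribute and method names here are
    hypothetical, not the real gender_analysis Document API."""

    def __init__(self, text):
        self.text = text
        # Start tokenizing in the background so other analysis can
        # proceed while the (slow) tokenizer runs.
        self._tokens_future = _tokenizer_pool.submit(word_tokenize, text)

    @property
    def tokens(self):
        # Blocks only if tokenization hasn't finished yet.
        return self._tokens_future.result()
```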
Currently, `get_sample_text_passages` outputs post-tokenization strings that have been stripped of punctuation and capitalization. While this makes sense for searching, the output should be as true to the raw text as possible.
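One way to get both: run the search against a normalized view of the text, but return slices of the raw text. A hypothetical sketch (`sample_passages` and its signature are mine, not the current function):

```python
import re

def sample_passages(raw_text, query, window=100):
    """Match case-insensitively, but slice passages out of the raw text
    so punctuation and capitalization survive.

    Hypothetical helper -- not the current get_sample_text_passages.
    """
    passages = []
    for match in re.finditer(re.escape(query), raw_text, re.IGNORECASE):
        start = max(0, match.start() - window)
        end = min(len(raw_text), match.end() + window)
        passages.append(raw_text[start:end])
    return passages
```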