Closed keshprad closed 3 years ago
I noticed that tokenizing with word_tokenize also solves an existing problem where, in possessive names (e.g. "Ron's"), the 's' after the apostrophe is sometimes capitalized.
Example: "ron's shOw Is A Big hit!"
Currently, this outputs the following:
Using word_tokenize, "ron's" is taken as a single token, and the output is the following:
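For context, here is a minimal sketch of how a capitalized 's' can arise from plain Python string methods. It assumes the stray capital comes from `str.title()`, which the PR description mentions, and it does not call the truecasing library itself, so it is not the library's actual output.

```python
# Plain str methods only; no truecasing library is involved here.
token = "ron's"

# title() treats the apostrophe as a word boundary and uppercases the next letter.
titled = token.title()

# capitalize() uppercases only the first character and lowercases the rest.
capitalized = token.capitalize()

print(titled)       # Ron'S
print(capitalized)  # Ron's
```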
Thank you. Is it fine if I look at it during the weekend? If you want me to push a release before that, let me know.
Yes, weekend is fine. Thanks!
Can you add some good test cases here so that we catch regressions?
Added test cases. :)
Main Change

- `TrueCaser.py`
  - using `word_tokenize` over `TweetTokenizer()`
  - `TreebankWordDetokenizer()` to join `tokens_true_case` at end
- `TweetTokenizer` takes hashtags (eg: '#music') as 1 token while `word_tokenize` splits the hashtag. Hashtags don't seem to be in the vocabulary, so I think this is fine.

Some small changes

- `TrueCaser.py`: change `first_token_case` to use String `capitalize()`
  - `... return raw.capitalize()`
- `TrueCaser.py`: add String `capitalize()` as an `out_of_vocabulary_token_option`
  - `title()` will capitalize all words in a string, while `capitalize()` is just the first character.
- `TrueCaser.py` method `get_score()`: I've refactored `nominator` to `numerator`
  - `numerator` is a more descriptive variable name as the variable is the "numerator" for the score calcs
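To illustrate the `capitalize()` changes listed above, here is a rough sketch of how an out-of-vocabulary option could choose between `title()` and `capitalize()`. The helper name `apply_oov_option` and its fallback branch are hypothetical and written only for this example. They are not taken from `TrueCaser.py`.

```python
def apply_oov_option(token: str, option: str) -> str:
    """Hypothetical helper: re-case a token that is not in the vocabulary."""
    if option == "title":
        # Uppercases the first letter of every run of letters.
        return token.title()
    if option == "capitalize":
        # Uppercases only the first character and lowercases the rest.
        return token.capitalize()
    # Assumption for this sketch: any other option leaves the token unchanged.
    return token


possessive_title = apply_oov_option("ron's", "title")
possessive_capitalize = apply_oov_option("ron's", "capitalize")
hyphenated_title = apply_oov_option("hip-hop", "title")
hyphenated_capitalize = apply_oov_option("hip-hop", "capitalize")

print(possessive_title, possessive_capitalize)    # Ron'S Ron's
print(hyphenated_title, hyphenated_capitalize)    # Hip-Hop Hip-hop

# The same str method applied to a mixed-case first token.
first_token = "shOw".capitalize()
print(first_token)  # Show
```

Note the trade-off: `capitalize()` avoids the "Ron'S" artifact, but it also stops capitalizing the part after a hyphen or apostrophe in tokens where that may be wanted.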