Use word_tokenize in combo with TreebankWordDetokenizer. Other small changes too

keshprad commented 3 years ago

Main Change

In TrueCaser.py, use word_tokenize instead of TweetTokenizer()
- This allows accurate use of TreebankWordDetokenizer() to join tokens_true_case at the end
- Would appreciate it if someone could check that this doesn't screw anything up elsewhere.
  - TweetTokenizer takes hashtags (eg: '#music') as 1 token, while word_tokenize splits the hashtag into separate tokens. Hashtags don't seem to be in the vocabulary, so I think this is fine.
  - However, I would like to double-check that this doesn't screw up the bigram_backward_score or trigram_score.

Some small changes

In TrueCaser.py change first_token_case to use String capitalize()...
- return raw.capitalize()
- This does the same thing as before, but is more readable
In TrueCaser.py add String capitalize() as an out_of_vocabulary_token_option
- How is this different from title?
  - title() will capitalize all words in a string, while capitalize() is just the first character.
  - eg: when token = "hip-hop"
  - capitalize() -> 'Hip-hop'
  - title() -> 'Hip-Hop'
In TrueCaser.py method get_score(): I've renamed nominator to numerator
- numerator is a more descriptive variable name, as the variable is the "numerator" in the score calculation
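
To make the capitalize() vs title() difference above concrete, here is a minimal sketch using only Python's built-in string methods. The helper name compare_casing is hypothetical and is not part of TrueCaser.py; it only shows how the two methods behave on a single token.

```python
def compare_casing(token: str) -> dict:
    """Return the result of both casing methods for one token.

    Hypothetical helper for illustration only; not part of TrueCaser.py.
    """
    return {
        # capitalize(): first character upper-cased, the rest lower-cased
        "capitalize": token.capitalize(),
        # title(): upper-cases the first letter of every run of letters,
        # so it also fires after '-' and after an apostrophe
        "title": token.title(),
    }


if __name__ == "__main__":
    for token in ["hip-hop", "ron's", "shOw"]:
        print(token, compare_casing(token))
```

Note that capitalize() also lower-cases everything after the first character, so a token like "shOw" becomes "Show" under both methods, while title() additionally capitalizes the letter after a hyphen or apostrophe.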

keshprad commented 3 years ago

I noticed that tokenizing with word_tokenize also solves an existing problem where, for possessive names (eg: "Ron's"), the 's' after the apostrophe is sometimes capitalized.

Example: "ron's shOw Is A Big hit!"

Currently, outputs the following: current