daltonfury42 / truecase

A python true casing utility that restores case information for texts
Apache License 2.0
87 stars 16 forks

Use word_tokenize in combo with TreebankWordDetokenizer. Other small changes too #21

Closed keshprad closed 3 years ago

keshprad commented 3 years ago

Main Change

  1. In TrueCaser.py, use word_tokenize instead of TweetTokenizer()
    • This allows TreebankWordDetokenizer() to join tokens_true_case accurately at the end
    • Would appreciate it if someone could check that this doesn't break anything elsewhere.
      • TweetTokenizer takes a hashtag (e.g. '#music') as one token, while word_tokenize splits the hashtag. Hashtags don't seem to be in the vocabulary, so I think this is fine.
      • However, I would like to double-check that this doesn't affect the bigram_backward_score or trigram_score.
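
To see the tokenizer difference described above, a minimal sketch along these lines may help. It assumes the nltk package is installed, and it uses TreebankWordTokenizer as a stand-in for word_tokenize (an assumption made here so that no Punkt data has to be downloaded; word_tokenize applies similar Treebank-style rules after sentence splitting). The sample text is made up for illustration.

```python
# Requires the nltk package; these classes need no downloaded corpora.
from nltk.tokenize import TreebankWordTokenizer, TweetTokenizer
from nltk.tokenize.treebank import TreebankWordDetokenizer

text = "ron's #music show"  # hypothetical sample input

# TweetTokenizer keeps the hashtag together as one token.
tweet_tokens = TweetTokenizer().tokenize(text)

# Stand-in for word_tokenize (assumption): Treebank-style rules split
# the '#' from the word and split off the possessive clitic.
treebank_tokens = TreebankWordTokenizer().tokenize(text)

print(tweet_tokens)
print(treebank_tokens)

# TreebankWordDetokenizer is designed to reverse Treebank-style splits,
# which is why it pairs naturally with this tokenizer family.
detokenized = TreebankWordDetokenizer().detokenize(treebank_tokens)
print(detokenized)
```

Comparing the two token lists shows which n-gram lookups would see different tokens after the switch, which is the part worth checking against the score calculation.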

Some small changes

  1. In TrueCaser.py change first_token_case to use String capitalize()...
    • return raw.capitalize()
    • This does the same thing as before, but is more readable
  2. In TrueCaser.py, add String capitalize() as an out_of_vocabulary_token_option
    • How is this different from title?
      • title() capitalizes every word in a string, while capitalize() uppercases only the first character (and lowercases the rest).
      • eg: when token = "hip-hop"
      • capitalize() -> 'Hip-hop'
      • title() -> 'Hip-Hop'
  3. In TrueCaser.py, in method get_score(): I've renamed nominator to numerator
    • numerator is a more descriptive variable name, as the variable is the "numerator" in the score calculations
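
The difference between capitalize() and title() mentioned in items 1 and 2 can be checked directly with Python's built-in string methods. This is a small sketch using the "hip-hop" token from the list plus two extra illustrative strings; it does not call truecase itself.

```python
# Compare the two str methods on the token from the discussion.
token = "hip-hop"

capitalized = token.capitalize()  # only the first character is uppercased
titled = token.title()            # each run of letters starts uppercase

print(capitalized)  # Hip-hop
print(titled)       # Hip-Hop

# capitalize() also lowercases everything after the first character.
print("shOw".capitalize())  # Show

# title() treats the apostrophe as a word boundary.
print("ron's".title())       # Ron'S
print("ron's".capitalize())  # Ron's
```

Note that capitalize() lowercases the remainder of the string, so it is not a pure "uppercase the first letter" operation on mixed-case input.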

keshprad commented 3 years ago

I noticed that tokenizing with word_tokenize also solves an existing problem where the 's' after the apostrophe in a possessive name (e.g. "Ron's") is sometimes capitalized.

Example: "ron's shOw Is A Big hit!"

Currently, it outputs the following: [image: current]

Using word_tokenize, "ron's" is split into "ron" and "'s", which TreebankWordDetokenizer joins back together, and the output is as follows: [image: PR #21]

daltonfury42 commented 3 years ago

Thank you. Is it fine if I look at it during the weekend? If you want me to push a release before that, let me know.

keshprad commented 3 years ago

Yes, weekend is fine. Thanks!

daltonfury42 commented 3 years ago

Can you add some good test cases here so that we catch regressions?

keshprad commented 3 years ago

Added test cases. :)