davidmogar / cucco

Text normalization library for Python
MIT License
203 stars 27 forks

Incomplete normalization #43

Closed ptynecki closed 7 years ago

ptynecki commented 7 years ago

Hi guys,

Let's say that I want to normalize this string:

"Protein Recommendations for Bodybuilders: In This Case, More May Indeed Be Better."

Without extra Cucco setup (normalizations) I received:

"Protein Recommendations Bodybuilders Case".

With extra Cucco setup:

normalizations = [
    'remove_extra_whitespaces',
    'remove_accent_marks',
    'remove_stop_words',
    ('replace_hyphens', {'replacement': ''}),
    ('replace_punctuation', {'replacement': ''}),
    ('replace_symbols', {'replacement': ''}),
]

I received:

"Protein Recommendations Bodybuilders Case Better"

My question is: where is the rest of the string?

davidmogar commented 7 years ago

Hi @Katharsis,

First of all, thank you very much for taking the time to report a possible bug. I really appreciate it. This is the way to improve cucco.

I've checked the behavior you describe, and in this case the output is the expected one. In the first execution, the one with the default normalizations, they are applied in this order:

  1. Protein Recommendations for Bodybuilders In This Case More May Indeed Be Better (punctuation replaced).
  2. Protein Recommendations for Bodybuilders In This Case More May Indeed Be Better (extra white spaces removed).
  3. Protein Recommendations for Bodybuilders In This Case More May Indeed Be Better (symbols removed).
  4. Protein Recommendations Bodybuilders Case (stop words removed)

Here you can see the execution:

$ cucco normalize 'Protein Recommendations for Bodybuilders: In This Case, More May Indeed Be Better.'
Protein Recommendations Bodybuilders Case

The normalization you propose would look like this:

  1. Protein Recommendations for Bodybuilders: In This Case, More May Indeed Be Better. (whitespaces removed)
  2. Protein Recommendations for Bodybuilders: In This Case, More May Indeed Be Better. (accent marks removed)
  3. Protein Recommendations Bodybuilders: Case, Better. (stop words removed)
  4. Protein Recommendations Bodybuilders: Case, Better. (hyphens replaced)
  5. Protein Recommendations Bodybuilders Case Better (punctuation replaced)
  6. Protein Recommendations Bodybuilders Case Better (symbols replaced)

$ cat config.yaml
normalizations:
  - remove_extra_whitespaces
  - remove_accent_marks
  - remove_stop_words
  - replace_hyphens
  - replace_punctuation
  - replace_symbols
$ cucco -c config.yaml normalize 'Protein Recommendations for Bodybuilders: In This Case, More May Indeed Be Better.'
Protein Recommendations Bodybuilders Case Better

Note that in the last case I'm omitting the replacement values, since the value you set is already the default.
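The difference between the two runs comes down to ordering: when stop words are removed before punctuation, tokens such as "Better." still carry their trailing period and so no longer match a stop word. The sketch below reproduces that effect with two toy functions. It is not cucco's implementation: the `STOP_WORDS` set is a hypothetical list chosen for this example, and it assumes stop words are matched against whole whitespace-separated tokens.

```python
import string

# Hypothetical stop-word list for illustration only; cucco ships its own lists.
STOP_WORDS = {"for", "in", "this", "more", "may", "indeed", "be", "better"}


def remove_stop_words(text):
    """Drop whitespace-separated tokens that exactly match a stop word."""
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)


def replace_punctuation(text, replacement=""):
    """Replace every ASCII punctuation character with `replacement`."""
    return "".join(replacement if c in string.punctuation else c for c in text)


TEXT = ("Protein Recommendations for Bodybuilders: "
        "In This Case, More May Indeed Be Better.")

# Punctuation first: "Better." becomes "Better" and is then dropped as a stop word.
punctuation_first = remove_stop_words(replace_punctuation(TEXT))

# Stop words first: "Better." does not match "better", so it survives.
stop_words_first = replace_punctuation(remove_stop_words(TEXT))

print(punctuation_first)  # Protein Recommendations Bodybuilders Case
print(stop_words_first)   # Protein Recommendations Bodybuilders Case Better
```

Under these assumptions, moving the stop-word step after the punctuation step in the custom list would give the same result as the default run.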

So, I'm closing this issue as I don't think it is a real bug. If I misunderstood you, or if you think the behavior should be different, please feel free to comment and reopen it.

Happy normalization ;)