davidmogar / cucco

Text normalization library for Python
MIT License

Order of operations #35

Closed JasonCrowe closed 7 years ago

JasonCrowe commented 7 years ago

w = 'Car , 950'
cucco.normalize(w)

The program seems to check for whitespace to remove before removing punctuation. This causes it to return 'Car__950' rather than 'Car_950'.

ETA: added underscores in place of spaces to show the effect.

davidmogar commented 7 years ago

Hi @JasonCrowe :hand:

Thank you for your report. The thing is that this is the expected behavior as it's using the default normalizations. Those are:

DEFAULT_NORMALIZATIONS = [
    'remove_extra_whitespaces',
    'replace_punctuation',
    'replace_symbols',
    'remove_stop_words'
]

But maybe this list should change. What do you think?

JasonCrowe commented 7 years ago

If remove_extra_whitespaces were the last operation in the defaults, wouldn't that fix this issue? Am I understanding that right? If this change isn't appropriate for the package, I'm happy to change my local copy if that will fix it.
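For anyone following along, here is a minimal sketch of why the order matters. It does not call cucco at all: the two functions below are hypothetical stand-ins written with the standard library, assuming that whitespace cleanup collapses runs of whitespace and that punctuation is replaced with an empty string. cucco's real implementations may differ.

```python
import re
import string


# Hypothetical stand-ins for the two normalizations discussed in this issue.
# These are NOT cucco's actual implementations, only an approximation.
def remove_extra_whitespaces(text):
    # Collapse any run of whitespace into a single space and trim the ends.
    return re.sub(r'\s+', ' ', text).strip()


def replace_punctuation(text, replacement=''):
    # Swap every ASCII punctuation character for the replacement string.
    return ''.join(replacement if ch in string.punctuation else ch for ch in text)


def apply_in_order(text, steps):
    # Run each normalization in sequence, feeding the output into the next step.
    for step in steps:
        text = step(text)
    return text


w = 'Car , 950'

# Whitespace cleanup first: removing the comma afterwards leaves two spaces.
whitespace_first = apply_in_order(w, [remove_extra_whitespaces, replace_punctuation])

# Whitespace cleanup last: the double space left by the comma is collapsed.
whitespace_last = apply_in_order(w, [replace_punctuation, remove_extra_whitespaces])

print(repr(whitespace_first))  # 'Car  950'
print(repr(whitespace_last))   # 'Car 950'
```

Under those assumptions, running the whitespace step last gives the single-space result described in the report.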

davidmogar commented 7 years ago

Hi @JasonCrowe,

Yeah, the result would be the one you expect. But again, not really an issue.

Having said this, I kind of agree with you, so I will move remove_extra_whitespaces down in the next version. There is no date for it yet because it is going to be a major release (it will include a CLI), but hopefully it will be out in less than a week.

Thanks again for your comments ;)