chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io

Apostrophes #259

Closed BernierCR closed 5 years ago

BernierCR commented 5 years ago

Hello. I would like to request a little refactoring.

I am glad you have code that fixes apostrophes. This is a very common problem. However, it's buried deep in text_utils.clean_terms(). I don't want to call that function, since it doesn't seem as clean or stable.

I would like this functionality to be moved to textacy.preprocessing.normalize.normalize_apostrophes(sentence)

Also, it would be nice if there was a wrapper function around all the new preprocessing functions. Then you could do your cleaning with one line of code, instead of 15. People could just pass in flags to control which functionality they want. For convenience, add a lowercase option to this wrapper.

If you do these things, it would restore some of the functionality lost during the change to 0.8, without hurting any of the benefits.

Thank you very much.

bdewilde commented 5 years ago

Hi @BernierCR, could you explain, with examples, how you'd like to "fix" apostrophes? I have an okay implementation in the function you mentioned, specifically in the context of cleaning up terms as output by, say, a keyterm extraction algorithm. But use cases and desired functionality for text pre-processing may be different. I'm happy to see about adding this sort of functionality.

I deliberately removed the wrapper function for the various text pre-processing functions because it didn't provide a compelling advantage over applying the desired functions sequentially, it wasn't set up to pass in non-default arguments to the underlying functions, and as the number of pre-processing functions increased, it got more complex and less user-friendly. I stand by that decision.
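
For readers wondering what "applying the desired functions sequentially" might look like in one line, here is a minimal sketch using only the standard library. The names make_pipeline, collapse_whitespace, and straighten_apostrophes are hypothetical stand-ins for illustration, not textacy functions. Any function that takes a string and returns a string could presumably be dropped into the same chain, and non-default arguments could be bound beforehand with functools.partial.

```python
import re
from functools import reduce
from typing import Callable


def make_pipeline(*funcs: Callable[[str], str]) -> Callable[[str], str]:
    """Return a function that applies each callable in order, left to right."""
    def run(text: str) -> str:
        return reduce(lambda acc, func: func(acc), funcs, text)
    return run


# Hypothetical stand-ins for library preprocessing functions.
def collapse_whitespace(text: str) -> str:
    # Replace any run of whitespace with one space and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


def straighten_apostrophes(text: str) -> str:
    # Map the right single quotation mark (U+2019) to an ASCII apostrophe.
    return text.replace("\u2019", "'")


# Lowercasing is just another step in the chain, not a special flag.
clean = make_pipeline(straighten_apostrophes, collapse_whitespace, str.lower)
print(clean("  What\u2019s   THAT function? "))
```

Each step stays independently configurable, which is the trade-off described above: the caller lists the steps explicitly instead of toggling flags on a wrapper.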

bdewilde commented 5 years ago

Hi again, just following up to say that I think the preprocessing.normalize_quotation_marks() function should take care of normalizing apostrophes, since they are often represented as a single right quotation mark.

>>> import textacy.preprocessing
>>> textacy.preprocessing.normalize_quotation_marks("What’s that function?")
"What's that function?"
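
To illustrate the idea behind that call without relying on the library, here is a rough standard-library sketch of quote normalization. It is not textacy's implementation. The mapping table and the name straighten_quotes are assumptions made for this example, and the set of characters a real normalizer handles may well be larger.

```python
# Assumed mapping: curly single and double quotes to their ASCII equivalents.
QUOTE_TABLE = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark, often used as an apostrophe
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
})


def straighten_quotes(text: str) -> str:
    """Hypothetical helper: replace curly quotes with straight ASCII quotes."""
    return text.translate(QUOTE_TABLE)


print(straighten_quotes("What\u2019s that function?"))
```

Because an apostrophe typed as U+2019 is the same character as a closing single quote, a single translation table covers both cases.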