alteryx / nlp_primitives

Natural Language Processing primitives for Featuretools
https://blog.featurelabs.com/natural-language-processing-featuretools/
BSD 3-Clause "New" or "Revised" License

Support Unicode #185

Open sbadithe opened 2 years ago

sbadithe commented 2 years ago

As a user, I wish NLP Primitives could handle Unicode text.

Currently, Unicode text is not correctly handled by regexes in nlp_primitives.

For example, TitleWordCount does not recognize Àbc as a title word, although it does recognize Abc.
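A minimal sketch of how this kind of mismatch can arise. The ASCII-only pattern below is an assumption made for illustration, not the actual regex used by nlp_primitives, and both helper functions are hypothetical. The Unicode-aware variant relies on Python's built-in `str.istitle()`, which uses Unicode case properties.

```python
import re

# Assumed ASCII-only pattern (illustrative, not taken from nlp_primitives):
# an ASCII capital followed by one or more ASCII lowercase letters.
ASCII_TITLE = re.compile(r"\b[A-Z][a-z]+\b")


def count_title_words_ascii(text):
    """Hypothetical helper: count title words using the ASCII-only pattern."""
    return len(ASCII_TITLE.findall(text))


def count_title_words_unicode(text):
    """Hypothetical helper: count whitespace-separated words that are title-cased
    according to Unicode case properties."""
    return sum(1 for word in text.split() if word.istitle())


accented = "\u00c0bc"  # "Àbc" with a precomposed capital A-grave
plain = "Abc"

print(count_title_words_ascii(plain), count_title_words_ascii(accented))
print(count_title_words_unicode(plain), count_title_words_unicode(accented))
```

Under these assumptions, the ASCII-only counter misses the accented word, while the `str.istitle()` version counts both. Note that `str.istitle()` treats a decomposed form (a plain `A` followed by a combining accent) differently, so normalizing input first, for example with `unicodedata.normalize("NFC", text)`, may be worth considering.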

gsheni commented 2 years ago

@sbadithe Is it possible to make a pytest fixture and have it used by all the NL primitives? That way, if we add more NL primitives in the future, we can make sure they support Unicode.
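One possible shape for such a fixture, as a rough sketch only. The sample strings, the fixture name `unicode_text`, and the stand-in function `count_title_words` are all hypothetical; a real test would call the actual primitives instead of the stand-in.

```python
import pytest

# Hypothetical sample strings; each contains exactly one title-cased word.
UNICODE_SAMPLES = [
    "\u00c0bc d\u00e9f",   # "Àbc déf"
    "\u00d1and\u00fa",     # "Ñandú"
    "Stra\u00dfe",         # "Straße"
]


@pytest.fixture
def unicode_text():
    """Shared Unicode inputs that any primitive test can request by name."""
    return list(UNICODE_SAMPLES)


def count_title_words(text):
    """Stand-in for a primitive under test (hypothetical, not the library's code)."""
    return sum(1 for word in text.split() if word.istitle())


def test_title_words_handle_unicode(unicode_text):
    # Each sample is expected to contain one title word.
    counts = [count_title_words(text) for text in unicode_text]
    assert counts == [1, 1, 1]
```

Placing the fixture in a shared `conftest.py` would make it available to every test module, so tests for new primitives could request `unicode_text` without extra imports.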