chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io
Other
2.21k stars 249 forks source link

call to preprocess.preprocess_text returns TypeError (0.7.0) #243

Closed gryBox closed 5 years ago

gryBox commented 5 years ago

steps to reproduce

some_string = "'A chemical combination brought about by the action of light, as in the formation of carbohydrates in living plants from the carbon di-oxid and water of the air under the influence of sunlight."

Scenario 1

import textacy
textacy.preprocess.preprocess_text(some_string , 
                                       fix_unicode=True, 
                                       lowercase=False, 
                                       no_urls=False, 
                                       no_emails=False, 
                                       no_phone_numbers=False, no_numbers=False, 
                                       no_currency_symbols=False, no_punct=False, 
                                       no_contractions=False, 
                                       no_accents=False)

Result:

~/anaconda3/envs/py36-ml/lib/python3.6/site-packages/textacy/preprocess.py in preprocess_text(text, fix_unicode, lowercase, no_urls, no_emails, no_phone_numbers, no_numbers, no_currency_symbols, no_punct, no_contractions, no_accents)
    246         text = text.lower()
    247     # always normalize whitespace; treat linebreaks separately from spacing
--> 248     text = normalize_whitespace(text)
    249 
    250     return text

~/anaconda3/envs/py36-ml/lib/python3.6/site-packages/textacy/preprocess.py in normalize_whitespace(text)
     39     """
     40     return constants.RE_NONBREAKING_SPACE.sub(
---> 41         " ", constants.RE_LINEBREAK.sub(r"\n", text)
     42     ).strip()
     43 

TypeError: expected string or bytes-like object

Should fix_unicode be removed since it is no longer supported by textacy directly?

Scenario 2 (all false)

import textacy
textacy.preprocess.preprocess_text(some_string , 
                                       fix_unicode=False, 
                                       lowercase=False, 
                                       no_urls=False, 
                                       no_emails=False, 
                                       no_phone_numbers=False, no_numbers=False, 
                                       no_currency_symbols=False, no_punct=False, 
                                       no_contractions=False, 
                                       no_accents=False)

Result:
'A chemical combination brought about by the action of light, as in the formation of carbohydrates in living plants from the carbon di-oxid and water of the air under the influence of sunlight.'

expected vs. actual behavior

"'a chemical combination brought about by the action of light as in the formation of carbohydrates in living plants from the carbon di oxid and water of the air under the influence of sunlight"

I know preprocess worked in 0.6.x

environment

bdewilde commented 5 years ago

Oof, thanks for the heads-up. When I scrapped fix_bad_unicode(), I accidentally wrote return NotImplementedError rather than raise NotImplementedError — will fix.

In this function, whitespace is always normalized, even if everything else is False. Is that surprising/undesirable behavior?

gryBox commented 5 years ago

IMO: It should be explicit. i.e fix_unicode=False... If it;s worth the discussion for you, my question is what is the rational the other way?

bdewilde commented 5 years ago

I'm not sure I follow... Normally, I'd mark a function as deprecated then leave it as-is for a bit, but since I yanked this function out suddenly, I left it in (albeit with a bug) so that folks would see the NotImplementedError and message explaining why the functionality is gone and how to reproduce it on their own.

gryBox commented 5 years ago

Yeah - I misunderstood your question. Still do. Can you rip it out completely? fix_unicode is just not part of textacy anymore. I would leave the Warning:

image