Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

bug/ contains_english_word catches English words also found in non-English languages #1403

Open shreyanid opened 9 months ago

shreyanid commented 9 months ago

Describe the bug The function contains_english_word(text) in text_type.py checks the input text against a list of English words to determine whether the text contains an English word. However, words that exist in both English and other languages (e.g. "no" in Spanish) also match, so the function returns True for non-English text and checks like

if language == "en" and language_checks and not contains_english_word(text):

in is_possible_narrative_text evaluate to False, so the branch is skipped when it should be entered.

To Reproduce Example:

narrative_text = "Hola, ¿cómo estás? No, no hablo inglés."

text_type.is_possible_narrative_text(narrative_text, language="en") # should be False, IS TRUE
text_type.is_possible_narrative_text(narrative_text, language="es") # should be True, is True

Expected behavior The function should match only English words that appear in English text. The mere presence of a string that happens to be an English word should not match when the text is in another language and the word is unrelated.
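The false positive can be illustrated with a minimal sketch. This is not the library's implementation: ENGLISH_WORDS is a tiny hypothetical stand-in for the real word list, and contains_english_word_sketch and rejected_as_non_english are illustrative names that only mirror the shape of the check quoted above.

```python
import re

# Hypothetical stand-in for the English word list (assumption: the real list is far larger).
ENGLISH_WORDS = {"no", "the", "is", "hello", "how", "are", "you"}


def contains_english_word_sketch(text: str) -> bool:
    """Return True if any alphabetic token in text is in the word list."""
    # [^\W\d_]+ matches runs of Unicode letters, so accented words stay whole.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return any(token in ENGLISH_WORDS for token in tokens)


def rejected_as_non_english(text: str, language: str = "en",
                            language_checks: bool = True) -> bool:
    """Mirror the guard from the issue: True means the text is rejected."""
    return (language == "en" and language_checks
            and not contains_english_word_sketch(text))


spanish = "Hola, ¿cómo estás? No, no hablo inglés."

# "no" is both a Spanish and an English word, so the membership test matches
# and the guard does not reject the Spanish sentence.
print(contains_english_word_sketch(spanish))
print(rejected_as_non_english(spanish))
```

Because the check is plain word-list membership, a single shared token such as "no" is enough to make the whole sentence look English. Removing that token (for example "Hola, ¿cómo estás?") makes the guard reject the text as intended.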

shreyanid commented 9 months ago

We expect this functionality to change significantly with the introduction of langdetect for document language detection.

MthwRobinson commented 1 month ago

We can close this now since contains_english_word is no longer used. Opened #3007 to remove the unused code path.