Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

bug/element type for non-English languages #3044

Open cm-halfspace opened 1 month ago

cm-halfspace commented 1 month ago

Describe the bug

When I partition a Danish .docx file I notice some weird classifications of the element types.

I think this happens because the languages list is not being passed along in _parse_paragraph_text_for_element_type, e.g. in the call is_possible_narrative_text(text).

Looking at the definition of is_possible_narrative_text, a quick temporary fix would be to at least take language_checks into account on line 90, so that it becomes:

if "eng" in languages and language_checks and (sentence_count(text, 3) < 2) and (not contains_verb(text)):

To Reproduce

from unstructured.partition.text_type import is_possible_narrative_text
text = "Dette er et eksempel på en kort sætning."
is_possible_narrative_text(text)

which returns False right now. With the above quick-fix, it would return True as expected.
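To make the difference concrete, here is a small self-contained sketch of the condition before and after the quick-fix. It does not import unstructured: toy_contains_verb and toy_sentence_count are hypothetical stand-ins for contains_verb and sentence_count, and the defaults (languages containing "eng", language_checks off) are my assumption about the current signature.

```python
import re

# Toy stand-in for an English-only verb detector (NOT the library's contains_verb).
ENGLISH_VERBS = {"is", "are", "was", "were", "has", "have"}


def toy_contains_verb(text: str) -> bool:
    """Return True if any word is in the tiny English verb list above."""
    return any(word.strip(".,!?").lower() in ENGLISH_VERBS for word in text.split())


def toy_sentence_count(text: str) -> int:
    """Count non-empty chunks separated by sentence-ending punctuation."""
    return sum(1 for part in re.split(r"[.!?]+", text) if part.strip())


def rejected_current(text: str, languages=("eng",)) -> bool:
    """Current condition: the English verb check runs whenever 'eng' is listed."""
    return "eng" in languages and toy_sentence_count(text) < 2 and not toy_contains_verb(text)


def rejected_proposed(text: str, languages=("eng",), language_checks: bool = False) -> bool:
    """Proposed condition: the verb check additionally requires language_checks."""
    return (
        "eng" in languages
        and language_checks
        and toy_sentence_count(text) < 2
        and not toy_contains_verb(text)
    )


text = "Dette er et eksempel på en kort sætning."
print(rejected_current(text))   # Danish sentence fails the English verb check
print(rejected_proposed(text))  # check is skipped unless language_checks is set
```

With the stand-ins above, the Danish sentence is rejected by the first condition and passes the second, which mirrors the False/True behaviour described in this report.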

MthwRobinson commented 1 month ago

Hi @cm-halfspace - thanks for reporting this! We'll look at this as soon as we can, or happy to review if you want to open a PR with your suggested change.