Looking at the definition of is_possible_narrative_text, a quick temporary fix would be to at least take language_checks into account on line 90, so that the condition becomes:
if "eng" in languages and language_checks and (sentence_count(text, 3) < 2) and (not contains_verb(text)):
To Reproduce
from unstructured.partition.text_type import is_possible_narrative_text
text = "Dette er et eksempel på en kort sætning."
is_possible_narrative_text(text)
This currently returns False. With the quick fix above, it would return True as expected.
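To make the effect of the proposed condition easier to see, here is a minimal, self-contained sketch that does not import unstructured. The helpers contains_verb and sentence_count below are hypothetical toy stand-ins (the real ones rely on NLP tooling), and the default arguments languages=("eng",) and language_checks=False are assumptions about how the function ends up being called, not something confirmed here. Only the guard condition itself is taken from the suggestion above.

```python
# Toy stand-ins for illustration only; not the unstructured implementation.
ENGLISH_VERBS = {"is", "are", "was", "were", "has", "have", "runs", "returns"}


def contains_verb(text: str) -> bool:
    """Hypothetical stub: recognises only a handful of English verbs."""
    return any(word.strip(".,!?").lower() in ENGLISH_VERBS for word in text.split())


def sentence_count(text: str, min_length: int) -> int:
    """Hypothetical stub: counts period-separated chunks with at least min_length words."""
    return sum(1 for chunk in text.split(".") if len(chunk.split()) >= min_length)


def narrative_current(text, languages=("eng",), language_checks=False):
    # Guard as reported: the English verb check runs even when language_checks is off.
    if "eng" in languages and sentence_count(text, 3) < 2 and not contains_verb(text):
        return False
    return True


def narrative_proposed(text, languages=("eng",), language_checks=False):
    # Guard with the suggested fix: the verb check only runs when language_checks is set.
    if (
        "eng" in languages
        and language_checks
        and sentence_count(text, 3) < 2
        and not contains_verb(text)
    ):
        return False
    return True


danish = "Dette er et eksempel på en kort sætning."
print(narrative_current(danish))   # False: no English verb found in a single short sentence
print(narrative_proposed(danish))  # True: the English-only check is skipped
```

Note that this only sidesteps the problem: with language_checks enabled and the languages list still defaulting to English, Danish text would be rejected again, which is why it is only a temporary fix.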
Hi @cm-halfspace - thanks for reporting this! We'll look at this as soon as we can, or happy to review if you want to open a PR with your suggested change.
Describe the bug
When I partition a Danish .docx file I notice some weird classifications of the element types. I think this is related to the fact that the languages list is not being set in _parse_paragraph_text_for_element_type, e.g. in is_possible_narrative_text(text).