Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

bug/ingest language should not default to 'eng' but None #1715

Closed rbiseck3 closed 11 months ago

rbiseck3 commented 11 months ago

Describe the bug
Defaulting the language to 'eng' injects information that might not be accurate and breaks the language detection library being used. The default should instead be None, so that the library is left to detect the language itself.

To Reproduce
Running the local connector on the example-docs/language-docs/UDHR_first_article_all.txt file reports only English as the detected language.

Expected behavior
Running the local connector on example-docs/language-docs/UDHR_first_article_all.txt should result in ['ind', 'est'].
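To illustrate the difference between the two defaults, here is a minimal sketch. It does not use the actual unstructured ingest code: `resolve_languages`, `toy_detector`, and the sample text are all hypothetical names made up for this example, and the detector is a toy stand-in for a real language detection library.

```python
from typing import Callable, List, Optional


def toy_detector(text: str) -> List[str]:
    """Hypothetical stand-in for a real language detection library.

    It only looks for one marker word per language, which is enough
    to show the control flow but is not real detection.
    """
    markers = {"ind": "orang", "est": "inimesed"}
    lowered = text.lower()
    return [code for code, word in markers.items() if word in lowered]


def resolve_languages(
    text: str,
    languages: Optional[List[str]] = None,
    detector: Callable[[str], List[str]] = toy_detector,
) -> List[str]:
    """Return caller-supplied languages, or detect them when none are given."""
    if languages is not None:
        # An explicit value (including a hard-coded 'eng' default) is
        # trusted as-is, so the detector never runs.
        return languages
    # None means "unknown": defer to the detector.
    return detector(text)


# Illustrative sample text only, not the contents of the example file.
sample = "Semua orang dilahirkan merdeka. Kõik inimesed sünnivad vabadena."

# Simulates a connector that defaults the language to 'eng'.
with_eng_default = resolve_languages(sample, languages=["eng"])

# Simulates a connector that defaults the language to None.
with_none_default = resolve_languages(sample)

print(with_eng_default)
print(with_none_default)
```

In this sketch, passing `["eng"]` skips detection entirely, while passing nothing lets the detector report the languages it finds. That is the behavior the issue asks for when the default is None.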

Coniferish commented 11 months ago

Commenting to track this