(PDF) How to let `partition_pdf` and `partition_via_api` detect automatically language(s) of a PDF?

Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

https://www.unstructured.io/

Apache License 2.0


(PDF) How to let `partition_pdf` and `partition_via_api` detect automatically language(s) of a PDF? #2288

Open piegu opened 9 months ago

piegu commented 9 months ago

In the file `lang.py`, I see that the library `langdetect` is used.

The same file defines a function `detect_languages()`, but it looks like `partition_pdf` and `partition_via_api` do not use it when the input is a PDF.

If that is true, why don't `partition_pdf` and `partition_via_api` use it to automatically detect the languages of the PDF?

Because of that, we have to manually list the languages of the PDF in the `languages` parameter.

Did I miss something?
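As a possible workaround until detection is automatic, here is a minimal sketch that guesses the languages from a text sample with `langdetect` and converts them into a list for the `languages` parameter. The helper names (`to_ocr_codes`, `guess_languages`) and the mapping table are hypothetical and not part of `unstructured`. The sketch also assumes that `languages` accepts Tesseract-style codes such as `eng` or `chi_sim`, so please check that against the version you use.

```python
from typing import Callable, Iterable, List, Optional

# Illustrative (not exhaustive) mapping from langdetect codes to
# Tesseract-style codes. This table is an assumption for the sketch;
# extend it for the languages you expect in your PDFs.
LANGDETECT_TO_OCR = {
    "en": "eng",
    "zh-cn": "chi_sim",
    "zh-tw": "chi_tra",
    "fr": "fra",
    "de": "deu",
    "es": "spa",
    "pt": "por",
}


def to_ocr_codes(detected: Iterable[str], default: str = "eng") -> List[str]:
    """Map detected codes to OCR codes, dropping unknowns and duplicates."""
    codes: List[str] = []
    for code in detected:
        mapped = LANGDETECT_TO_OCR.get(code)
        if mapped and mapped not in codes:
            codes.append(mapped)
    # Fall back to a default so the result is never an empty list.
    return codes or [default]


def guess_languages(
    sample_text: str,
    detector: Optional[Callable[[str], List[str]]] = None,
) -> List[str]:
    """Guess OCR language codes for a text sample.

    `detector` is injectable for testing; by default it uses langdetect,
    which is imported lazily so this module loads without it installed.
    """
    if detector is None:
        from langdetect import detect_langs

        def detector(text: str) -> List[str]:
            # detect_langs returns candidates ordered by probability.
            return [candidate.lang for candidate in detect_langs(text)]

    return to_ocr_codes(detector(sample_text))


# Usage sketch (needs unstructured and langdetect installed; "doc.pdf" is
# a placeholder path):
#
# from unstructured.partition.pdf import partition_pdf
# fast = partition_pdf(filename="doc.pdf", strategy="fast")
# sample = "\n".join(el.text for el in fast)
# elements = partition_pdf(
#     filename="doc.pdf",
#     strategy="hi_res",
#     languages=guess_languages(sample),
# )
```

Note that the first pass with `strategy="fast"` only yields a usable sample when the PDF has an extractable text layer. For scanned PDFs there is no text to sample before OCR, so the languages would still have to be supplied by hand.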

huangpan2507 commented 2 months ago

+1, good question. I ran into the same problem when processing a PDF that contains text in two languages: it can process the English text, but not the Chinese text.