Text extraction from PDF files fails with character corruption, preventing proper content retrieval.
To Reproduce
1. Run the embedding command on multiple PDF files that contain Japanese text.
2. Check the logs. You will see the message INFO:unstructured:PDF text extraction failed, skip text extraction...
3. Inspect the extracted content. Parts of it are corrupted.
Note: The INFO:unstructured:PDF text extraction failed, skip text extraction... message is logged when multiple threads attempt to download NLTK data at the same time.
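A possible workaround, sketched under the assumption that the concurrent download is the trigger: serialize the download behind a lock so it runs only once per process, before or during the threaded PDF processing. The helper name `ensure_resource` and the resource name in the usage comment are hypothetical and not part of unstructured or nltk; the download callable is passed in so the sketch does not depend on how the download is actually triggered.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Guards the download so only one thread performs it at a time.
_download_lock = threading.Lock()
# Names of resources that have already been fetched in this process.
_downloaded = set()


def ensure_resource(name, download):
    """Call download(name) at most once per resource, even under concurrent calls.

    `download` is any callable that fetches the resource. With nltk installed,
    `nltk.download` could be passed here (assumption: the resource name below
    is the one your pipeline needs).
    """
    with _download_lock:
        if name not in _downloaded:
            download(name)
            _downloaded.add(name)


def run_demo(workers=8, tasks=32):
    """Simulate many worker threads requesting the same resource."""
    calls = []

    def fake_download(name):
        # Stand-in for the real download; records each invocation.
        calls.append(name)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda _: ensure_resource("demo-resource", fake_download),
                      range(tasks)))
    return calls


# Hypothetical real-world usage (not executed here):
#   import nltk
#   ensure_resource("punkt", nltk.download)
```

Calling the helper once in the main thread before starting the worker pool is the simplest variant; the lock only matters if the first call can happen inside the workers.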