[BUG] PDF Text Extraction Fails with Character Corruption

Describe the bug

Text extraction from PDF files fails with character corruption, preventing proper content retrieval.

To Reproduce

1. Run the embedding command, specifying multiple PDF files that contain Japanese text.
2. Check the logs. You will see the message INFO:unstructured:PDF text extraction failed, skip text extraction...
3. Inspect the extracted content. Parts of it are corrupted.

Note: The INFO:unstructured:PDF text extraction failed, skip text extraction... message is observed when multiple threads attempt to download NLTK data simultaneously.
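
If the concurrent download is indeed the trigger, one possible mitigation is to serialize it so that only one thread ever performs the download. The following is a minimal sketch under that assumption, not the project's actual fix. The helper name ensure_resource, the injected download callable, and the resource name "punkt" are all hypothetical and used only for illustration.

```python
import threading

# Guards the download so concurrent workers cannot run it at the same time.
_download_lock = threading.Lock()
# Names of resources that this process has already downloaded successfully.
_downloaded = set()


def ensure_resource(name, download):
    """Call download(name) at most once per process for each name.

    `download` is any callable that fetches the resource. If it raises,
    the name is not recorded, so a later call will try again.
    """
    with _download_lock:
        if name in _downloaded:
            return
        download(name)
        _downloaded.add(name)


# Hypothetical usage before (or inside) each worker thread:
#
#   import nltk
#   ensure_resource("punkt", lambda n: nltk.download(n, quiet=True))
```

Calling the helper once in the main thread, before any worker threads start, would have the same effect and avoids the lock entirely. Whether this removes the character corruption still needs to be confirmed against the failing PDFs.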

aws-samples / bedrock-claude-chat

[BUG] PDF Text Extraction Fails with Character Corruption #413
