aws-samples / bedrock-claude-chat

AWS-native chatbot using Bedrock + Claude (+Mistral)
MIT No Attribution

[BUG] PDF Text Extraction Fails with Character Corruption #413

Closed by statefb 3 days ago

statefb commented 3 days ago

Describe the bug

Text extraction from PDF files fails with character corruption, preventing proper content retrieval.

To Reproduce

  1. Run the embedding command, specifying multiple PDF files that contain Japanese text
  2. Check the logs, and you will see the message INFO:unstructured:PDF text extraction failed, skip text extraction...
  3. Inspect the extracted content, and you will find parts of it are corrupted

Note: The INFO:unstructured:PDF text extraction failed, skip text extraction... message is logged when multiple threads attempt to download nltk resources simultaneously.
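One possible way to avoid the concurrent download described in the note is to guard it so that only one thread ever performs it. The sketch below is illustrative only and is not taken from this repository: `OnceDownloader`, `embed_all`, and the `download` callable are hypothetical names, and the actual download call (for example `nltk.download(...)` with whichever resource the extraction needs) is an assumption that would have to be filled in.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class OnceDownloader:
    """Run a download callable at most once, even when called from many threads."""

    def __init__(self, download):
        # `download` is a hypothetical zero-argument callable, e.g. a wrapper
        # around nltk.download(...) for the required resource (assumption).
        self._download = download
        self._lock = threading.Lock()
        self._done = False

    def ensure(self):
        # Fast path: skip the lock once the download has completed.
        if self._done:
            return
        with self._lock:
            # Re-check inside the lock so that only the first thread downloads.
            if not self._done:
                self._download()
                self._done = True


def embed_all(paths, downloader, max_workers=4):
    """Process each path in a worker thread after ensuring the download ran."""

    def work(path):
        downloader.ensure()
        # Placeholder for the real per-file PDF extraction and embedding.
        return path

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(work, paths))
```

An even simpler variant would be to call the download once in the main thread before any worker starts, which removes the need for the lock entirely.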