aws-samples / bedrock-claude-chat

AWS-native chatbot using Bedrock + Claude (+Mistral)
MIT No Attribution
693 stars, 237 forks

Partition pdf #301

Closed fsatsuki closed 1 month ago

fsatsuki commented 1 month ago

Issue #, if available: #297

Description of changes: Until now, PDFs were analyzed with unstructured.partition.auto. This change makes it possible to select unstructured.partition.pdf, which is capable of more detailed structural analysis. To enable unstructured.partition.pdf, set "enable_partition_pdf": true in cdk.json.
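As a rough illustration of the switch described above, here is a minimal Python sketch. It is not the code from this pull request: the helper names `load_partition_flag` and `partition_document` are hypothetical, and placing the flag under the standard CDK "context" key of cdk.json is an assumption. The `strategy="hi_res"` argument is also an assumption, chosen to match the hi-res mode mentioned in this thread.

```python
import json


def load_partition_flag(cdk_json_text: str) -> bool:
    """Return the enable_partition_pdf flag from cdk.json text (default: False)."""
    config = json.loads(cdk_json_text)
    # Assumption: the flag sits under the standard CDK "context" key.
    return bool(config.get("context", {}).get("enable_partition_pdf", False))


def partition_document(path: str, enable_partition_pdf: bool):
    """Partition a file, using partition_pdf for PDFs only when the flag is on."""
    if enable_partition_pdf and path.lower().endswith(".pdf"):
        # Imported lazily so the slower PDF stack is loaded only when needed.
        from unstructured.partition.pdf import partition_pdf

        # Assumption: hi-res strategy, for detailed structural analysis.
        return partition_pdf(filename=path, strategy="hi_res")

    from unstructured.partition.auto import partition

    return partition(filename=path)
```

With the flag off (the default in this sketch), every file still goes through unstructured.partition.auto, so existing behavior is unchanged unless the option is set explicitly.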

unstructured.partition.pdf takes a long time to run. To shorten processing time, this change implements parallel processing across multiple processes and makes the embedding container size configurable in cdk.json.
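The multi-process approach can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the implementation in this pull request: `partition_one`, `worker_count`, and `partition_many` are hypothetical names, and `cpu_limit` stands in for whatever CPU allocation the embedding container is given.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def partition_one(path: str):
    """Partition a single PDF in a worker process and return its element texts."""
    from unstructured.partition.pdf import partition_pdf

    # Assumption: hi-res strategy, as discussed in this thread.
    elements = partition_pdf(filename=path, strategy="hi_res")
    # Return plain strings so only text is sent back to the parent process.
    return path, [str(element) for element in elements]


def worker_count(n_files: int, cpu_limit=None) -> int:
    """Use at most one process per file and per available CPU, and at least one."""
    cpus = cpu_limit or os.cpu_count() or 1
    return max(1, min(n_files, cpus))


def partition_many(paths, cpu_limit=None) -> dict:
    """Partition several PDFs in parallel; returns {path: [element text, ...]}."""
    if not paths:
        return {}
    with ProcessPoolExecutor(max_workers=worker_count(len(paths), cpu_limit)) as pool:
        return dict(pool.map(partition_one, paths))
```

Processes rather than threads are used here because PDF partitioning is largely CPU-bound, which is also why a larger container can be expected to shorten the total time.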

When embedding 30 PDFs of 15 to 150 pages each, it takes 7 minutes with unstructured.partition.auto and 61 minutes with unstructured.partition.pdf.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

statefb commented 1 month ago

Memo: comparison with hi-res mode (partition.pdf) enabled, using multiprocessing

When embedding 30 PDFs of 15 to 150 pages each, it takes 7 minutes with unstructured.partition.auto and 61 minutes with unstructured.partition.pdf.

image

statefb commented 1 month ago

LGTM!