Closed fsatsuki closed 1 month ago
Memo: comparison with hi-res mode (partition.pdf) enabled, using multiprocessing
When embedding 30 PDFs of 15 to 150 pages each, it takes 7 minutes with unstructured.partition.auto and 61 minutes with unstructured.partition.pdf.
LGTM!
Issue #, if available: 297
Description of changes: Until now, PDFs were analyzed with unstructured.partition.auto. With this change, unstructured.partition.pdf, which is capable of more detailed structural analysis, can be selected instead. To enable unstructured.partition.pdf, set "enable_partition_pdf": true in cdk.json.
unstructured.partition.pdf takes much longer to run. To shorten processing time, this change implements parallel processing with multiple processes and makes the container size used for embedding configurable in cdk.json.
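As an illustration of the multi-process approach described above, here is a minimal sketch. It is not the code in this pull request. The names partition_one and partition_many are hypothetical, and using strategy="hi_res" with partition_pdf is an assumption based on the "hi-res mode" memo. The unstructured imports are deferred so that the module loads even where the library is not installed.

```python
from concurrent.futures import Executor, ProcessPoolExecutor
from typing import Callable, Dict, Iterable, List, Optional, Type


def partition_one(path: str, use_partition_pdf: bool = False) -> List[str]:
    """Partition a single file and return the text of each element.

    Hypothetical helper: the flag mirrors "enable_partition_pdf" in cdk.json.
    """
    if use_partition_pdf:
        # Assumption: hi-res mode corresponds to strategy="hi_res".
        from unstructured.partition.pdf import partition_pdf

        elements = partition_pdf(filename=path, strategy="hi_res")
    else:
        from unstructured.partition.auto import partition

        elements = partition(filename=path)
    return [str(element) for element in elements]


def partition_many(
    paths: Iterable[str],
    worker: Callable[[str], List[str]] = partition_one,
    max_workers: Optional[int] = None,
    executor_cls: Type[Executor] = ProcessPoolExecutor,
) -> Dict[str, List[str]]:
    """Run `worker` over every path in a pool of workers.

    With the default ProcessPoolExecutor, each file is handled in a separate
    process, so CPU-heavy partitioning can use several cores at once.
    """
    paths = list(paths)
    with executor_cls(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so zip pairs them correctly.
        return dict(zip(paths, pool.map(worker, paths)))


if __name__ == "__main__":
    import sys
    from functools import partial

    # Example: python this_file.py a.pdf b.pdf
    results = partition_many(
        sys.argv[1:],
        worker=partial(partition_one, use_partition_pdf=True),
        max_workers=4,
    )
    for path, texts in results.items():
        print(path, len(texts))
```

The number of useful workers is bounded by the CPU and memory of the container, which is presumably why the container size is made configurable alongside the parallelism.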
When embedding 30 PDFs of 15 to 150 pages each, it takes 7 minutes with unstructured.partition.auto and 61 minutes with unstructured.partition.pdf.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.