Closed: ttruong-gilead closed this issue 8 months ago.
Hi!
Make sure to use start_document_analysis with save_image=False, otherwise Textractor pre-loads an image of every page for visualizations, which you likely do not care about. Hopefully that is the issue.
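Something like this (an untested sketch, assuming an S3-hosted PDF and the TABLES feature; the profile name and S3 path are placeholders, adjust to your setup):

```python
from textractor import Textractor
from textractor.data.constants import TextractFeatures

extractor = Textractor(profile_name="default")  # hypothetical AWS profile

# save_image=False skips rasterizing and holding an image of every page,
# which is what inflates memory on large PDFs.
document = extractor.start_document_analysis(
    file_source="s3://my-bucket/my-document.pdf",  # hypothetical S3 path
    features=[TextractFeatures.TABLES],
    save_image=False,
)
```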
@Belval thank you, indeed that's the issue. However, we do need page.image to save the table as a PNG or XLSX. Is there any way to still do this without the memory explosion? Maybe cache to disk?
You don't need to save the images to get the dimensions; you can use page.height and page.width.
As for why save_image=True is the default: originally it was to match analyze_document, and now it's kept for backward compatibility, but I think I might add a warning for your use case.
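For example (a sketch that continues from the start_document_analysis call above; to_excel exists on Table in recent Textractor releases, but double-check it against 1.7.4):

```python
# `document` is the result of the start_document_analysis call above.

# Page dimensions are available without loading any page images.
for page in document.pages:
    print(page.width, page.height)

# The parsed table structure does not depend on page.image, so exporting
# to XLSX should work with save_image=False. "table.xlsx" is an
# illustrative output path.
if document.tables:
    document.tables[0].to_excel("table.xlsx")
```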
@Belval I'm closing this issue, but I still think there should be a disk cache mechanism for very large files instead of holding every page image in memory.
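As a rough sketch of the idea, using pdf2image (which, as far as I can tell, is what Textractor uses to render local PDFs); the DPI, paths, and naming scheme here are illustrative assumptions:

```python
import os
from pdf2image import convert_from_path, pdfinfo_from_path

pdf_path = "large.pdf"    # hypothetical input
cache_dir = "page_cache"  # hypothetical cache location
os.makedirs(cache_dir, exist_ok=True)

# Rendering one single-page range at a time keeps peak memory near the
# size of a single page image instead of all ~1000 pages at once.
page_count = pdfinfo_from_path(pdf_path)["Pages"]
for page_number in range(1, page_count + 1):
    image = convert_from_path(
        pdf_path, dpi=150, first_page=page_number, last_page=page_number
    )[0]
    image.save(os.path.join(cache_dir, f"page_{page_number:04d}.png"))
```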
amazon-textract-caller==0.2.2
amazon-textract-response-parser==1.0.2
amazon-textract-textractor==1.7.4
Why is start_document_analysis() using so much memory? This is for PDFs of under 1000 pages. Even with 32 GB of memory, my container gets killed because memory is exhausted. Is there a leak? It happens both locally and on ECS.
I can process the same PDFs with the boto3 library directly with no issues.
Usage:
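Roughly the following (a representative sketch, since save_image was left at its default; the profile, bucket path, and features are placeholders):

```python
from textractor import Textractor
from textractor.data.constants import TextractFeatures

extractor = Textractor(profile_name="default")  # hypothetical AWS profile

# save_image defaults to True, so Textractor keeps a rendered image of
# every page in memory; on a ~1000-page PDF this exhausts 32 GB.
document = extractor.start_document_analysis(
    file_source="s3://my-bucket/large.pdf",  # hypothetical S3 path
    features=[TextractFeatures.TABLES],
)

for page in document.pages:
    for table in page.tables:
        print(table.to_pandas())
```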