aws-samples / amazon-textract-textractor

Analyze documents with Amazon Textract and generate output in multiple formats.
Apache License 2.0

start_document_analysis high memory usage #316

Closed: ttruong-gilead closed this 8 months ago

ttruong-gilead commented 8 months ago

amazon-textract-caller==0.2.2
amazon-textract-response-parser==1.0.2
amazon-textract-textractor==1.7.4

Why is start_document_analysis() using so much memory? This is for PDFs of fewer than 1000 pages. Even with 32 GB of memory, my container gets killed because memory is exhausted. Is there a leak? It happens both locally and on ECS.

I use the boto3 library directly on the same PDFs with no issues.
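For reference, the direct boto3 call looks roughly like this (a sketch; the bucket and document name are placeholders):

    import boto3

    textract = boto3.client("textract")
    response = textract.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": "my-bucket", "Name": "trial.pdf"}},
        FeatureTypes=["LAYOUT", "TABLES"],
    )
    job_id = response["JobId"]  # results are fetched later via get_document_analysis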

Usage:

    from textractor import Textractor
    from textractor.data.constants import TextractFeatures

    extractor = Textractor(profile_name="default")  # profile name is an example
    document = extractor.start_document_analysis(
        file_source=TRIAL_DOCUMENT_S3_URI,
        features=[TextractFeatures.LAYOUT, TextractFeatures.TABLES],
        s3_output_path=<some_s3_path>,
    )
Belval commented 8 months ago

Hi!

Make sure to call start_document_analysis with save_image=False; otherwise Textractor pre-loads the page images for visualizations, which you likely do not care about.

Hopefully that is the issue.
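Concretely, the call from the issue becomes something like this (same placeholders as above):

    document = extractor.start_document_analysis(
        file_source=TRIAL_DOCUMENT_S3_URI,
        features=[TextractFeatures.LAYOUT, TextractFeatures.TABLES],
        s3_output_path=<some_s3_path>,
        save_image=False,  # do not rasterize every page into memory
    )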

ttruong-gilead commented 8 months ago

@Belval thank you, indeed that was the issue. However, we do need page.image to save each table as a PNG or XLSX file. Is there any way to still do this without the memory explosion? Maybe cache the images to disk?

Belval commented 8 months ago

You don't need to save the images to get the dimensions; you can use page.height and page.width.

As for why it's the default: originally it was to match analyze_document, and now it's kept for backward compatibility, but I think I might add a warning for your use case.
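For the XLSX part specifically, the table export does not depend on page images, so something like this should work with save_image=False (a sketch; Table.to_excel is part of Textractor's table API, and the output file names here are made up):

    document = extractor.start_document_analysis(
        file_source=TRIAL_DOCUMENT_S3_URI,
        features=[TextractFeatures.LAYOUT, TextractFeatures.TABLES],
        save_image=False,
    )
    for page in document.pages:
        print(page.width, page.height)  # dimensions available without page.image
        for i, table in enumerate(page.tables):
            table.to_excel(f"page{page.page_num}_table{i}.xlsx")

Exporting a table as a PNG would still require the page image, since that rendering is drawn on top of the rasterized page.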

ttruong-gilead commented 8 months ago

@Belval I'm closing this issue, but I still think there should be a disk-cache mechanism for the images instead of holding them all in memory for very large files.