aws-samples / amazon-textract-textractor

Analyze documents with Amazon Textract and generate output in multiple formats.
Apache License 2.0

Empty expense_documents on analyze_expense #370

Closed arsher-b closed 5 months ago

arsher-b commented 5 months ago

analyze_expense returns 0 expense_documents even though the image is a receipt. Here's the receipt used: 2405260266_2

[image]

Output: [image]

Can someone fix this so that the receipt is still parsed as line items inside the Expense Documents instead of as Lines?

athewsey commented 5 months ago

Verified I can reproduce this - the document works in the Textract console and the API response contains a non-empty ExpenseDocuments.

The problem appears to be this line (with a FIXME), which specifically ignores any ExpenseDocuments entry with no SummaryFields. The example doc contains no summary fields, only line items.

Presumably, this is because Textractor needs to determine the page number of the expense document immediately afterwards, even though the Textract API response structure does not specifically tie an ExpenseDocument to one Page and appears to be designed to support multi-page invoices. Currently it does this by looking for a PageNumber annotation on the first summary field.

As a minimally-impacting fix, I think it should also be possible to try fetching this information via:

textract_json["ExpenseDocuments"][0]["LineItemGroups"][0]["LineItems"][0]["LineItemExpenseFields"][0]["PageNumber"]

...But it may also be worth considering whether Textractor should continue framing ExpenseDocuments as sitting within a single page, rather than potentially spanning multiple pages?
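
For illustration, the fallback lookup suggested above could be sketched roughly as follows. This is only a minimal sketch working on the raw Textract JSON, not the actual Textractor code or the fix that was eventually shipped; the helper name `expense_page_number` and the trimmed sample response are hypothetical.

```python
from typing import Any, Dict, Optional


def expense_page_number(expense_doc: Dict[str, Any]) -> Optional[int]:
    """Return the first PageNumber found in one ExpenseDocuments entry, or None.

    Hypothetical helper: tries the summary fields first, then falls back
    to the line item fields when no summary field carries a PageNumber.
    """
    # First choice: a PageNumber annotation on a summary field.
    for field in expense_doc.get("SummaryFields", []):
        if "PageNumber" in field:
            return field["PageNumber"]

    # Fallback: LineItemGroups -> LineItems -> LineItemExpenseFields.
    for group in expense_doc.get("LineItemGroups", []):
        for item in group.get("LineItems", []):
            for field in item.get("LineItemExpenseFields", []):
                if "PageNumber" in field:
                    return field["PageNumber"]

    # Nothing in this entry carries a page number.
    return None


# Hypothetical, heavily trimmed response: no summary fields, only line items.
textract_json = {
    "ExpenseDocuments": [
        {
            "SummaryFields": [],
            "LineItemGroups": [
                {"LineItems": [{"LineItemExpenseFields": [{"PageNumber": 1}]}]}
            ],
        }
    ]
}

print(expense_page_number(textract_json["ExpenseDocuments"][0]))
```

With a lookup like this, an entry that has only line items would no longer need to be skipped just because its summary fields are empty. It still assumes one page per ExpenseDocument, so it does not address the multi-page question.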

athewsey commented 5 months ago

Hi @arsher-b, we believe this should now be solved with the v1.8.0 release. It'd be great if you could help confirm?

arsher-b commented 5 months ago

Hello Sir @athewsey, I've confirmed that the issue is resolved with the v1.8.0 release. Thanks!