aws-samples / amazon-textract-textractor

Analyze documents with Amazon Textract and generate output in multiple formats.
Apache License 2.0
382 stars 140 forks

issue with ordering in extractions, markdown and gettext methods #388

Open red-sky17 opened 3 weeks ago

red-sky17 commented 3 weeks ago

The attached input document contains text, then a table, followed by more text. We want the extracted text file to follow the same order as the input PDF.

input_page

I tried extraction using different methods:

For 1.) and 2.), this is the code I am using:

    textract_json = extractor.start_document_analysis(
        file_source="s3://s3sagemakerbucket/textract_analysis/12382593_bnp_credit_facility_20m.pdf",
        features=[TextractFeatures.LAYOUT, TextractFeatures.TABLES],
        save_image=False,
    )
    response_textract_async = extractor.get_result(
        job_id=textract_json.job_id,
        api=Textract_API.ANALYZE,
    )
    markdown_text = response_textract_async.to_markdown()

1.) .to_markdown() method

using_markdown_method

The issue here is that the two tables end up at the bottom.

2.) .get_text() method

using_gettext_method

In this case as well, the two tables are at the bottom, and, as we know, without the config parameter we won't get markdown output.

Now the third is interesting. The code used for this is:

    from textractcaller.t_call import call_textract, Textract_Features
    from textractprettyprinter.t_pretty_print import get_text_from_layout_json

    textract_json = call_textract(
        input_document="s3://s3sagemakerbucket/textract_analysis/12382593_bnp_credit_facility_20m.pdf",
        features=[Textract_Features.LAYOUT, Textract_Features.TABLES],
    )

3.) get_text_from_layout_json(textract_json=textract_json)

I also tried get_text_from_layout_json(textract_json=textract_json, generate_markdown=True). In both cases I get the same output.

using_gettextfromlayout_1 using_gettextfromlayout_2

The issue with this method, as you can see, is that the data is repeated twice, and no markdown formatting is present.

@Belval or anyone, can you please suggest whether there is anything we can do to prevent this and get the text in the correct order, as it appears in the PDF file?

Thanks.

red-sky17 commented 3 weeks ago

Please also look at the output for the attached PDF, where the same issue occurs: on the 1st page the tables are printed at the bottom, and the 2nd page is affected as well.

Egypt_EG01_Credit Agricole.pdf

This is the 2nd page: second_pdf_usingmarkdown

complete text file: Egypt_EG01_Credit Agricole_using_markdown.txt

red-sky17 commented 3 weeks ago

In contrast, the ordering is correct in this text file, which was extracted using get_text_from_layout_json(textract_json=textract_json). The remaining issue is the same as the one described in the first comment (3.).

text file for reference:

Egypt_EG01_Credit Agricole_using_gettextfromlayout_json.txt

I wonder whether this is a bug in the .to_markdown() and .get_text() methods, because with get_text_from_layout_json() we get the output in the correct order.

Ultimately, the goal is to get the same extraction as with get_text_from_layout_json(), but with markdown table borders and no duplication.

So I believe it would be better to get the extraction right using the .to_markdown() method alone, because it already produces markdown table borders and the only issue is the ordering. That could probably be debugged by comparing how get_text_from_layout_json() and to_markdown() traverse the JSON dict.
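To make that comparison concrete, here is a minimal sketch of walking the raw Textract JSON in layout order. It is not the code of either library function. It assumes the standard `Blocks` structure (`Id`, `BlockType`, `Page`, `Relationships` with `CHILD` ids, `Text` on `LINE` blocks) and assumes the `LAYOUT_*` blocks appear in the response in reading order. The function name `layout_reading_order` is hypothetical.

```python
from collections import defaultdict


def layout_reading_order(response):
    """Return {page: [(layout_type, text), ...]} in the order LAYOUT blocks appear.

    Illustrative helper only; assumes LAYOUT_* blocks are listed in reading order.
    """
    blocks = response.get("Blocks", [])
    by_id = {b["Id"]: b for b in blocks}

    def child_ids(block):
        for rel in block.get("Relationships") or []:
            if rel.get("Type") == "CHILD":
                yield from rel.get("Ids", [])

    def collect_text(block):
        # LINE blocks carry text directly; any other block is walked recursively.
        if block.get("BlockType") == "LINE":
            return [block.get("Text", "")]
        parts = []
        for cid in child_ids(block):
            child = by_id.get(cid)
            if child is not None:
                parts.extend(collect_text(child))
        return parts

    # Layout blocks nested in another layout block (for example LAYOUT_TEXT
    # inside LAYOUT_LIST) are emitted through their parent only.
    nested = set()
    for b in blocks:
        if b.get("BlockType", "").startswith("LAYOUT_"):
            for cid in child_ids(b):
                if by_id.get(cid, {}).get("BlockType", "").startswith("LAYOUT_"):
                    nested.add(cid)

    pages = defaultdict(list)
    for b in blocks:
        if b.get("BlockType", "").startswith("LAYOUT_") and b["Id"] not in nested:
            pages[b.get("Page", 1)].append((b["BlockType"], " ".join(collect_text(b))))
    return dict(pages)
```

Running this on the same JSON and comparing its per-page sequence with the .to_markdown() output should show on which pages the table moves relative to the surrounding text.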

Belval commented 3 weeks ago

I will test it first but this looks like a known issue that happens when the LAYOUT predictions do not match the TABLE predictions, causing the reading order to be wrong.
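One rough way to look for that kind of mismatch in the raw JSON is to check, page by page, whether each `TABLE` block is covered by a `LAYOUT_TABLE` block. This is a sketch under assumptions, not what the library does internally: it relies only on the standard `Geometry.BoundingBox` fields (`Left`, `Top`, `Width`, `Height`), and the helper names and the `min_overlap` threshold of 0.5 are hypothetical.

```python
def bbox(block):
    """Return a block's bounding box as (x0, y0, x1, y1) in page-relative units."""
    box = block["Geometry"]["BoundingBox"]
    return (box["Left"], box["Top"],
            box["Left"] + box["Width"], box["Top"] + box["Height"])


def overlap_ratio(inner, outer):
    """Fraction of inner's area that lies inside outer."""
    w = min(inner[2], outer[2]) - max(inner[0], outer[0])
    h = min(inner[3], outer[3]) - max(inner[1], outer[1])
    if w <= 0 or h <= 0:
        return 0.0
    area = (inner[2] - inner[0]) * (inner[3] - inner[1])
    return (w * h) / area if area > 0 else 0.0


def tables_without_layout(response, min_overlap=0.5):
    """List (page, table_id, best_overlap) for TABLE blocks with no matching LAYOUT_TABLE."""
    blocks = response.get("Blocks", [])
    layout_tables = [b for b in blocks if b.get("BlockType") == "LAYOUT_TABLE"]
    unmatched = []
    for table in blocks:
        if table.get("BlockType") != "TABLE":
            continue
        page = table.get("Page", 1)
        # Best coverage of this table by any LAYOUT_TABLE on the same page.
        best = max(
            (overlap_ratio(bbox(table), bbox(lt))
             for lt in layout_tables if lt.get("Page", 1) == page),
            default=0.0,
        )
        if best < min_overlap:
            unmatched.append((page, table["Id"], best))
    return unmatched
```

Any table reported here sits on a page where the two predictions disagree, which makes those pages good candidates to inspect for misplaced tables.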

Belval commented 3 weeks ago

What version of amazon-textract-textractor are you using? With 1.8.2 I get:

Page 2 of 10

Schneider Electric South East Asia (HQ) Pte. Ltd. Schneider Electric Overseas Asia Pte Ltd Schneider Electric Singapore Pte. Ltd. Schneider Electric IT Singapore Pte. Ltd. (formerly known as MGE Asia Pte Ltd) Schneider Electric IT Logistics Asia Pacific Pte. Ltd. Schneider Electric Logistics Asia Pte Ltd Schneider Electric Systems Singapore Pte. Ltd. (formerly known as Invensys Process Systems (S) Pte. Ltd.) 1 March 2017 

Previous Facility Letters. In the event that this Facility Letter is not accepted or lapses and is not extended by the Bank, the terms and conditions in the Previous Facility Letters shall continue to apply, save for any revision or amendments to the Interest Rate and any reduction in the amount of the Lines of Credit as stated herein. 

## A. LINE(S) OF CREDIT 

| AMOUNT          | TYPE                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
|-----------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| SGD20,000,000/- | Multi-currency Banker's Guarantee [including but not limited to Performance Guarantee or Payment Guarantee (for up to 60 months or such other tenor as may be agreed by the Bank from time to time) or to finance any other transactions acceptable to the Bank on a case-by-case subject to such conditions as may be determined by the Bank in its sole and absolute discretion] and/or Sight & Usance Letters of Credit (for up to 12 months) (with/ without control of goods) and/or Shipping Guarantee & Acceptance Under Usance Letters of Credit. |

## 1. PURPOSE 

The Facilities shall be used solely to finance the Borrower's working capital requirements. However, without prejudice to the Borrower's obligations, the Bank shall not be obliged to check that the Borrower does so or that the Facilities or any part thereof is utilized in such a manner. 

## 2. INTEREST RATE/COMMISSION/FEE 

(a) Commission on Banker's Guarantee shall be calculated on the face amount of the Banker's Guarantee for the period from the date of issuance upto the expiry date of the Banker's Guarantee, payable upfront as follows :- 

(b) Non-refundable Commission / Interest on the Trade Facilities shall be payable at the following rates and in the following manner:- 
(i) Letters of Credit 0.125% per month, minimum 2 months 

| Tenor                    | Commission    |
|--------------------------|---------------|
| Less than 3 years,       | 0.2%pa        |
| 3 years and upto 5 years | 0.25%pa       |

Which does not match what you are reporting.

red-sky17 commented 3 weeks ago

@Belval, I am attaching the input PDF. When tested on the single page I attached in the first comment (the one you tested), it gives the same output you got, but when tested on the whole PDF, that is when I face the issue.

I am using amazon-textract-textractor version 1.8.2

this_pdf.pdf

Belval commented 3 weeks ago

Thank you for clarifying and sharing the file, I will attempt to reproduce the issue.

red-sky17 commented 1 day ago

Hello @Belval, were you able to reproduce this issue?