aws-samples / amazon-textract-textractor

Analyze documents with Amazon Textract and generate output in multiple formats.
Apache License 2.0
381 stars 138 forks

Layout Linearization Duplicates text and Relegates Tables to the End #297

Open kostabasis opened 7 months ago

kostabasis commented 7 months ago

If you extract both LAYOUT and TABLES, the tables are for some reason printed at the end of the output rather than being linearized in place.

Related issue: https://github.com/aws-samples/amazon-textract-textractor/issues/274

My code:

```python
from textractor import Textractor
from textractor.data.constants import TextractFeatures
from textractor.data.text_linearization_config import TextLinearizationConfig
# Assumed source of get_text_from_layout_json: the amazon-textract-prettyprinter package.
from textractprettyprinter.t_pretty_print import get_text_from_layout_json

extractor = Textractor(profile_name="default")

# png_path is the path to the input image.
document = extractor.analyze_document(
    file_source=png_path,
    features=[TextractFeatures.LAYOUT, TextractFeatures.TABLES, TextractFeatures.SIGNATURES],
    save_image=True,
)

config = TextLinearizationConfig(
    title_prefix="# ",
    section_header_prefix="## ",
    add_prefixes_and_suffixes_in_text=True,
    table_tabulate_format="fancy_grid".lower(),
    table_remove_column_headers=True,
)

extracted_text = document.get_text(config=config)
print(get_text_from_layout_json(textract_json=document.response, generate_markdown=True)[1])
```

Belval commented 7 months ago

Can you share an asset that exhibits this behavior? It should be addressed by https://github.com/aws-samples/amazon-textract-textractor/pull/298/ but I would like to make sure before deploying it.

kostabasis commented 7 months ago

Unfortunately I'm working with sensitive documents, and my boss told me we cannot share them. We re-tested and it seems to work now for the majority of documents. Only some are buggy, and we can't reproduce the problem on anything that we can share.

However, I can give some context. I suspect it's because the buggy documents look almost like two-column documents: they are composed of a list of labels, with the corresponding information to the right of each label. Sometimes this information is a table (for example, a table of rents over time). Like so (_ represents a space, for formatting):

  1. LEASE
     a. Individual leasing_Harry Potter
     b. Date__01/01/01
     c. Rents _____
     __Months _Amount
     __07-09/1900_$25
     __09-12/1900_$25
     __01-12/1902_$35
     __01-12/1903_$40
     __01-12/1904_$50
     d. Landlord_____Tom Riddle
     e. Terms. __
    • ... ...

In this example, the months table is parsed correctly, but in the linearized layout it is relegated to the end of the document. This document is also somewhat like a form, but enabling the FORMS feature messes up the output as well, and it doesn't detect that the table is associated with the "Rents" key. Besides, it's more of a list, and there is really no need for key-value pairs here. Any suggestions are welcome if this is more of a usage error than a library shortcoming.

Belval commented 7 months ago

That's fine. To test locally, you can do:

  1. git clone git@github.com:aws-samples/amazon-textract-textractor.git
  2. cd amazon-textract-textractor
  3. git checkout version-1.7.0
  4. pip install -e .
  5. Test

When testing, you may want to visualize the results with `document.pages[0].layouts.visualize().save("out.png")`. What often happens when tables are relegated to the end is that they overlap with a larger layout element such as a LAYOUT_KEY_VALUES, which "pushes" the table lower in the linearized text.
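To make the overlap idea concrete, here is a minimal sketch that flags tables whose bounding box sits mostly inside a larger layout element. It does not use the Textractor API: the `Box` class, `overlap_ratio`, `swallowed_tables`, the 0.5 threshold, and the coordinates are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned box in normalized page coordinates (hypothetical stand-in)."""
    name: str
    x: float
    y: float
    width: float
    height: float


def overlap_ratio(inner: Box, outer: Box) -> float:
    """Return the fraction of inner's area that lies inside outer."""
    ix = max(0.0, min(inner.x + inner.width, outer.x + outer.width) - max(inner.x, outer.x))
    iy = max(0.0, min(inner.y + inner.height, outer.y + outer.height) - max(inner.y, outer.y))
    area = inner.width * inner.height
    return (ix * iy) / area if area > 0 else 0.0


def swallowed_tables(tables, layouts, threshold=0.5):
    """List (table, layout) name pairs where the layout covers most of the table."""
    return [
        (table.name, layout.name)
        for table in tables
        for layout in layouts
        if overlap_ratio(table, layout) >= threshold
    ]


# Made-up coordinates: a key-value block that spans most of the page,
# a table inside it, and an unrelated text block near the bottom.
key_values = Box("LAYOUT_KEY_VALUES", 0.05, 0.10, 0.90, 0.60)
rents_table = Box("rents table", 0.30, 0.25, 0.40, 0.20)
footer = Box("LAYOUT_TEXT", 0.05, 0.80, 0.90, 0.10)

pairs = swallowed_tables([rents_table], [key_values, footer])
print(pairs)
```

A table that shows up in `pairs` is a candidate for the "pushed lower" behaviour; the visualization above is still the more reliable way to confirm it on a real page.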

If you find an asset that you can share, feel free to send it directly to belvae [AT] amazon.com.

Hope that helps!

kostabasis commented 7 months ago

Awesome, thanks for all the help!

stevehodgkiss commented 6 months ago

I'm also seeing an issue like this using amazon-textract-textractor 1.7.2.

I noticed that `closest_reading_order_distance` in `Page#get_text_and_words` isn't being set anywhere, so the if statement checking `closest_reading_order_distance is None` is always true. As a result, the unsorted layouts (which have a reading order of -1) are inserted after the same layout on each iteration, so the last element is always the same one. https://github.com/aws-samples/amazon-textract-textractor/blob/5ea39f8e1621836d0d357666d651aa88630dbbcb/textractor/entities/page.py#L159-L179

I tried setting `closest_reading_order_distance = dist` inside the if statement after L173 (I'm assuming that was the intention), but that breaks the reading order pretty much everywhere AFAICT.
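The pattern can be reproduced in a toy form. The sketch below is not the library's code: `insert_unsorted`, the `(label, position)` tuples, and the one-dimensional distance are hypothetical simplifications. It only shows why a running minimum that is never updated makes every unsorted item land after the last candidate.

```python
def insert_unsorted(sorted_items, unsorted_items, track_minimum):
    """Insert each unsorted item after the sorted item nearest to it.

    Items are (label, position) tuples. With track_minimum=False the
    running minimum is never updated, so the `is None` check is always
    true and the last candidate always wins.
    """
    result = list(sorted_items)
    for item in unsorted_items:
        closest_index = None
        closest_distance = None
        for index, candidate in enumerate(result):
            dist = abs(candidate[1] - item[1])
            if closest_distance is None or dist < closest_distance:
                closest_index = index
                if track_minimum:
                    closest_distance = dist
        if closest_index is None:
            # Nothing to anchor to: append at the end.
            result.append(item)
        else:
            result.insert(closest_index + 1, item)
    return [label for label, _ in result]


sorted_items = [("title", 0.1), ("paragraph", 0.3), ("footer", 0.9)]
unsorted_items = [("table", 0.35)]

buggy = insert_unsorted(sorted_items, unsorted_items, track_minimum=False)
fixed = insert_unsorted(sorted_items, unsorted_items, track_minimum=True)
print(buggy)  # ['title', 'paragraph', 'footer', 'table']
print(fixed)  # ['title', 'paragraph', 'table', 'footer']
```

In this toy version, tracking the minimum is enough. In the actual library the equivalent change did not give a usable reading order, so the real distance computation or ordering logic presumably involves more than this sketch captures.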

Running `get_layout_csv_from_trp2` with the same Textract response results in the correct reading order (but in CSV format).

Belval commented 6 months ago

Hi @stevehodgkiss, what Textract features are used to get your Textract response? The code you shared does seem to have the bug you described, but I am not convinced that this is the cause of the behaviour that you are describing.

Could you possibly provide the response itself and the original asset(s)? Reproducing the issue helps with troubleshooting.

stevehodgkiss commented 6 months ago

Hi @Belval, I'm running `start_document_analysis` with "QUERIES", "SIGNATURES", "LAYOUT", "TABLES", "FORMS". I'll send you the specific page and response that cause the issue by email. I believe it could also be related to the way Textract itself is generating the response (there's a layout element within a list that appears larger than it should be AFAICT), but it's interesting that `get_layout_csv_from_trp2` can correctly linearize it.

kostabasis commented 5 months ago

I had the same experience. `get_layout_csv_from_trp2` on my document gave a correctly linearized CSV.