Open kostabasis opened 7 months ago
Can you share an asset that exhibits this behavior? It should be addressed by https://github.com/aws-samples/amazon-textract-textractor/pull/298/ but I would like to make sure before deploying it.
Unfortunately I'm working with sensitive documents, and my boss told me we cannot share them. We re-tested and it seems that it now works for the majority of documents. Only some are buggy, and we can't seem to repro on anything that we can share. However, I can give some context. I suspect it's because the buggy documents almost look like two-column documents: they are composed of a list of labels, to the right of which is the corresponding information. Sometimes this information is a table (for example, a table of rents over time). Like so (`_` represents a space, for formatting):
In this example, the months table is parsed correctly, but in the linearized layout it is relegated to the end of the document. This document is also kind of like a form, but enabling the form feature messes up the output as well, and it doesn't detect that the table is associated with the "Rents" key. Besides, it's more of a list and there is really no need for key-value pairs here. Any suggestions are welcome, if this is more of a usage error than a library shortcoming.
That's fine, to test locally you can do:
git clone git@github.com:aws-samples/amazon-textract-textractor.git
cd amazon-textract-textractor
git checkout version-1.7.0
pip install -e .
When testing, you may want to visualize the results with `document.pages[0].layouts.visualize().save("out.png")`. What often happens when tables are relegated to the end is that they overlap with a larger layout element such as a `LAYOUT_KEY_VALUE`, which "pushes" the table lower in the linearized text.
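As a complement to the visualization, the overlap described above can also be checked numerically. The snippet below is only a rough sketch: `Box`, `overlap_ratio`, and `find_swallowed_tables` are hypothetical helpers that are not part of textractor, and the coordinates are made-up normalized values. It assumes you have already copied each element's bounding box (x, y, width, height) out of your own document.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Box:
    """Axis-aligned box in normalized page coordinates (hypothetical helper)."""
    name: str
    x: float
    y: float
    width: float
    height: float


def overlap_ratio(inner: Box, outer: Box) -> float:
    """Return the fraction of `inner`'s area that lies inside `outer`."""
    x_overlap = max(0.0, min(inner.x + inner.width, outer.x + outer.width) - max(inner.x, outer.x))
    y_overlap = max(0.0, min(inner.y + inner.height, outer.y + outer.height) - max(inner.y, outer.y))
    area = inner.width * inner.height
    return (x_overlap * y_overlap) / area if area > 0 else 0.0


def find_swallowed_tables(tables: List[Box], layouts: List[Box], threshold: float = 0.5) -> List[Tuple[str, str]]:
    """List (table, layout) name pairs where most of the table sits inside the layout."""
    return [
        (table.name, layout.name)
        for table in tables
        for layout in layouts
        if overlap_ratio(table, layout) >= threshold
    ]


# Made-up example: a rents table sitting inside a large key-value region.
rents_table = Box("rents_table", x=0.5, y=0.3, width=0.4, height=0.2)
key_values = Box("key_values", x=0.05, y=0.1, width=0.9, height=0.6)
title = Box("title", x=0.05, y=0.02, width=0.9, height=0.05)

print(find_swallowed_tables([rents_table], [key_values, title]))
```

Any pair this reports is a candidate for the "pushed to the end" behaviour, though the 0.5 threshold is an arbitrary starting point rather than a value the library uses.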
If you find an asset that you can share, feel free to send it directly to belvae [AT] amazon.com.
Hope that helps!
Awesome, thanks for all the help!
I'm also seeing an issue like this using amazon-textract-textractor 1.7.2.
I noticed that `closest_reading_order_distance` in `Page#get_text_and_words` isn't being set anywhere, so the `if` statement checking `closest_reading_order_distance is None` is always true. This results in the unsorted layouts being inserted after the same layout on each iteration (because the unsorted layouts being inserted have a reading order of `-1`), so the last element is always the same one. https://github.com/aws-samples/amazon-textract-textractor/blob/5ea39f8e1621836d0d357666d651aa88630dbbcb/textractor/entities/page.py#L159-L179
I tried setting `closest_reading_order_distance = dist` inside the `if` statement after L173 (I'm assuming that was the intention), but that breaks the reading order pretty much everywhere AFAICT.
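To make the pattern under discussion concrete, here is a minimal, self-contained sketch of a "track the closest distance" loop. It is not the library's code: the `Layout` class, the `place_unsorted` function, and the use of vertical position as the distance measure are all assumptions made for illustration. The key line is the update of `closest_distance`, without which the `is None` check would stay true on every iteration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Layout:
    """Hypothetical stand-in for a layout element."""
    name: str
    y: float            # vertical centre, normalized to [0, 1]
    reading_order: int  # -1 means "no reading order assigned"


def place_unsorted(sorted_layouts: List[Layout], unsorted_layouts: List[Layout]) -> List[Layout]:
    """Insert each unsorted layout next to the vertically closest placed layout."""
    result = list(sorted_layouts)
    for layout in sorted(unsorted_layouts, key=lambda item: item.y):
        closest_index: Optional[int] = None
        closest_distance: Optional[float] = None
        for index, candidate in enumerate(result):
            dist = abs(candidate.y - layout.y)
            if closest_distance is None or dist < closest_distance:
                # Without this assignment the condition above is always true,
                # so the last candidate would always win.
                closest_distance = dist
                closest_index = index
        if closest_index is None:
            # Nothing placed yet, so the layout simply goes at the end.
            result.append(layout)
        elif layout.y < result[closest_index].y:
            # The layout sits above its closest neighbour: insert before it.
            result.insert(closest_index, layout)
        else:
            # Otherwise insert right after the closest neighbour.
            result.insert(closest_index + 1, layout)
    return result


placed = [Layout("title", 0.1, 0), Layout("rents_label", 0.3, 1), Layout("footer", 0.9, 2)]
loose = [Layout("rents_table", 0.35, -1)]
print([item.name for item in place_unsorted(placed, loose)])
```

This only illustrates the minimum-tracking idea on toy data. As noted above, applying the equivalent change to the real method did not produce a good reading order, so the actual fix likely involves more than this one assignment.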
Running get_layout_csv_from_trp2 with the same textract response results in the correct reading order (but in CSV format).
Hi @stevehodgkiss, what Textract features are used to get your Textract response? The code you shared does seem to have the bug you described, but I am not convinced that this is the cause of the behaviour that you are describing.
Could you possibly provide the response itself and the original asset(s)? Reproducing the issue helps with troubleshooting.
Hi @Belval, I'm running `start_document_analysis` with `"QUERIES", "SIGNATURES", "LAYOUT", "TABLES", "FORMS"`. I'll send you the specific page and response that cause the issue by email. I believe it could also be related to the way Textract itself is generating the response (there's a layout element within a list that appears larger than it should be AFAICT), but it's interesting that `get_layout_csv_from_trp2` can correctly linearize it.
I had the same experience. `get_layout_csv_from_trp2` on my document gave a correctly linearized CSV.
If you extract both LAYOUT and TABLES, the tables are for some reason printed at the end of the output rather than linearized correctly. Related issue: https://github.com/aws-samples/amazon-textract-textractor/issues/274

My code:

```python
from textractor import Textractor
from textractor.data.constants import TextractFeatures
from textractor.data.text_linearization_config import TextLinearizationConfig
# Assumed import: get_text_from_layout_json ships with amazon-textract-prettyprinter.
from textractprettyprinter.t_pretty_print import get_text_from_layout_json

extractor = Textractor(profile_name="default")

# png_path is the path to the input image, defined elsewhere.
document = extractor.analyze_document(
    file_source=png_path,
    features=[TextractFeatures.LAYOUT, TextractFeatures.TABLES, TextractFeatures.SIGNATURES],
    save_image=True,
)

config = TextLinearizationConfig(
    title_prefix="# ",
    section_header_prefix="## ",
    add_prefixes_and_suffixes_in_text=True,
    table_tabulate_format="fancy_grid",
    table_remove_column_headers=True,
)

# Textractor's own linearization (tables end up at the end of the output).
extracted_text = document.get_text(config=config)

# The prettyprinter's linearization of page 1 of the same response, for comparison.
print(get_text_from_layout_json(textract_json=document.response, generate_markdown=True)[1])
```