access extra functionality like in amazon-textract-response-parser

aws-samples / amazon-textract-textractor

Analyze documents with Amazon Textract and generate output in multiple formats.

Apache License 2.0

389 stars, 142 forks

access extra functionality like in amazon-textract-response-parser #179

Open bvbg1 opened 1 year ago

bvbg1 commented 1 year ago

In https://github.com/aws-samples/amazon-textract-response-parser/blob/master/src-python/README.md I can see several features that I'd like to access from amazon-textract-textractor.

Specifically:

Order blocks (WORDS, LINES, TABLE, KEY_VALUE_SET) by geometry y-axis
Page orientation in degrees
Merge or link tables across pages
Add OCR confidence score to KEY and VALUE
Getting the table headers (not mentioned in amazon-textract-response-parser, but available in Textract)

Is this possible?

bvbg1 commented 1 year ago

Getting the table headers with textractor does not seem possible? After calling document.to_trp2() (as documented in https://aws-samples.github.io/amazon-textract-textractor/notebooks/interfacing_with_trp2.html#Getting-the-trp2-document) I get:

AttributeError: 'Table' object has no attribute 'get_header_field_names'

I can only get it by making a second call to Textract and then using trp2: textract_json = call_textract(input_document=documentName, features=[Textract_Features.TABLES])

Is there another way?

Edit, also mentioned here: https://github.com/aws-samples/amazon-textract-code-samples/issues/38

ThomasDelteil commented 1 year ago

Blocks should be ordered by geometry y-axis by default; if that's not the case somewhere, please file a bug
We could add a helper function on the page document for that
Table merging will require further work, one workaround would be to do it in pandas where it is easier
You can already iterate through the words of the .key and .value elements and get the confidence score; I'll add an ocr_confidence property helper if that's useful to you
Table headers will be added: available at the cell level through the .is_column_header property, and shown in the table visualization
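
To illustrate the confidence point above, here is a minimal sketch of averaging per-word OCR confidence for a key and for its value. It assumes each word object exposes a numeric `.confidence` attribute; the `Word` class and the `mean_confidence` helper are hypothetical stand-ins for illustration, not textractor's own API.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Iterable, Optional


@dataclass
class Word:
    """Hypothetical stand-in for an OCR word; only the fields used below."""
    text: str
    confidence: float


def mean_confidence(words: Iterable[Word]) -> Optional[float]:
    """Average the confidence of the given words, or None if there are none."""
    scores = [w.confidence for w in words]
    return mean(scores) if scores else None


# Made-up words standing in for the words of a key and of its value
key_words = [Word("Invoice", 0.99), Word("Date", 0.97)]
value_words = [Word("2023-03-01", 0.90)]

key_conf = mean_confidence(key_words)
value_conf = mean_confidence(value_words)
```

With a real document, the same helper would presumably be applied to the words reached through each key-value pair's .key and .value elements.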

bvbg1 commented 1 year ago

Blocks should be ordered by geometry y-axis by default; if that's not the case somewhere, please file a bug

Thanks for clarifying.

We could add a helper function on the page document for that

Table merging will require further work, one workaround would be to do it in pandas where it is easier

Not sure what you mean by "do it in pandas"? Can you please provide an example?

You can already iterate through the words of the .key and .value elements and get the confidence score; I'll add an ocr_confidence property helper if that's useful to you

Yes please!

Table headers will be added: available at the cell level through the .is_column_header property, and shown in the table visualization

Do you have an ETA for this?

ThomasDelteil commented 1 year ago

Table merging in pandas:

multipage.pdf

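import pandas as pd
from textractor import Textractor
from textractor.data.constants import TextractFeatures

extractor = Textractor(profile_name="default")

# Asynchronous analysis of the multi-page PDF, requesting table extraction
document1 = extractor.start_document_analysis(
    file_source='./multipage.pdf',
    features=[TextractFeatures.TABLES],
    s3_upload_path='s3://textractor-tests/debug/',
    s3_output_path='s3://textractor-tests/debug/',
    save_image=True
)
# The first table carries the header row; reuse its column names for the second
df1 = document1.tables[0].to_pandas(use_columns=True)
df2 = document1.tables[1].to_pandas()
df2.columns = df1.columns
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
merged = pd.concat([df1, df2], ignore_index=True)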

ThomasDelteil commented 1 year ago

Column header visualization at the table level and .is_column_header have been added at the cell level. .ocr_confidence has been added to the KV element.
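
As a rough sketch of how the cell-level flag could be used to recover header names: collect the cells marked as column headers and order them left to right. Only `.is_column_header` comes from the comment above; the `Cell` class, its `col_index` field, and the `header_names` helper are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cell:
    """Hypothetical stand-in for a table cell."""
    text: str
    col_index: int
    is_column_header: bool = False


def header_names(cells: List[Cell]) -> List[str]:
    """Return the text of the header cells, ordered left to right."""
    headers = [c for c in cells if c.is_column_header]
    return [c.text for c in sorted(headers, key=lambda c: c.col_index)]


# Made-up cells, deliberately listed out of column order
cells = [
    Cell("Amount", 2, True),
    Cell("Item", 1, True),
    Cell("Widget", 1),
    Cell("10.00", 2),
]
names = header_names(cells)
```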

bvbg1 commented 1 year ago

Table merging in pandas:

Thanks but this is merging the tables without any "checks".

What I meant was replicating the heuristic/AI checking part as well, as described in https://aws.amazon.com/blogs/machine-learning/postprocessing-with-amazon-textract-multi-page-table-handling/ and https://github.com/aws-samples/amazon-textract-response-parser/blob/master/src-python/trp/t_pipeline.py#L119
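
To make "checks" concrete, here is a deliberately simplified sketch of a guarded merge. It is not the logic in t_pipeline.py or the blog post; it only assumes two plausible conditions (equal column counts, and a repeated header row on the continuation page). Tables are plain lists of rows with the header first, and `merge_if_compatible` is a hypothetical helper name.

```python
from typing import List, Optional

Row = List[str]


def merge_if_compatible(first: List[Row], second: List[Row]) -> Optional[List[Row]]:
    """Merge two tables (header row first) only if they look like one table.

    Returns None when either table is empty or the column counts differ.
    """
    if not first or not second:
        return None
    width = len(first[0])
    if any(len(row) != width for row in first + second):
        return None
    # If the continuation repeats the header row, drop the repeat.
    body = second[1:] if second[0] == first[0] else second
    return first + body


page1 = [["Item", "Amount"], ["Widget", "10.00"]]
page2 = [["Item", "Amount"], ["Gadget", "5.00"]]
page3 = [["Gadget", "5.00", "EUR"]]

merged = merge_if_compatible(page1, page2)    # header repeat is dropped
rejected = merge_if_compatible(page1, page3)  # column counts differ
```

A fuller heuristic would likely also compare column positions or widths across the page break, which this sketch leaves out.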

bvbg1 commented 1 year ago

@ThomasDelteil Any updates on this?

bvbg1 commented 1 year ago

Just for tracking, it would be useful if these were added as well:

Page orientation in degrees
Merge or link tables across pages
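
For the orientation item, one possible approach is sketched here, under the assumption that each line's polygon lists its corners starting at the top-left of the text and proceeding clockwise: take the angle of the top edge of each line and use the median over the page. The function names and the `width`/`height` scaling parameters are illustrative, not part of any library.

```python
import math
from statistics import median
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def baseline_angle(polygon: Sequence[Point], width: float = 1.0, height: float = 1.0) -> float:
    """Angle in degrees of the edge from the first to the second polygon point.

    Coordinates are assumed normalized to the page, so they are scaled by the
    page width and height. With y pointing down, positive means clockwise.
    """
    (x0, y0), (x1, y1) = polygon[0], polygon[1]
    return math.degrees(math.atan2((y1 - y0) * height, (x1 - x0) * width))


def page_orientation(polygons: List[Sequence[Point]], width: float = 1.0, height: float = 1.0) -> float:
    """Median top-edge angle over all line polygons on a page."""
    return median(baseline_angle(p, width, height) for p in polygons)


# Made-up polygons: one upright line and one rotated a quarter turn
upright = [(0.1, 0.1), (0.5, 0.1), (0.5, 0.2), (0.1, 0.2)]
rotated = [(0.5, 0.1), (0.5, 0.5), (0.4, 0.5), (0.4, 0.1)]

angle_upright = page_orientation([upright])
angle_rotated = page_orientation([rotated])
```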

bvbg1 commented 1 year ago

Are there any updates?

dannellyz commented 1 year ago

I would also be curious about support for multi-page tables. Since this is available in trp, why would it not be backwards compatible?

bvbg1 commented 1 year ago

Is there an ETA for the remaining features?