aws-samples / amazon-textract-textractor

Analyze documents with Amazon Textract and generate output in multiple formats.
Apache License 2.0
389 stars 142 forks

access extra functionality like in amazon-textract-response-parser #179

Open bvbg1 opened 1 year ago

bvbg1 commented 1 year ago

In https://github.com/aws-samples/amazon-textract-response-parser/blob/master/src-python/README.md I can see several features that I'd like to access from amazon-textract-textractor.

Specifically:

Is this possible?

bvbg1 commented 1 year ago

Getting the table headers with textractor does not seem possible? After calling document.to_trp2() (as documented in https://aws-samples.github.io/amazon-textract-textractor/notebooks/interfacing_with_trp2.html#Getting-the-trp2-document) I get:

AttributeError: 'Table' object has no attribute 'get_header_field_names'

I can only get it with a second call to Textract and then using trp2: textract_json = call_textract(input_document=documentName, features = [Textract_Features.TABLES])

Is there another way?

Edit, also mentioned here: https://github.com/aws-samples/amazon-textract-code-samples/issues/38

ThomasDelteil commented 1 year ago
bvbg1 commented 1 year ago
  • Blocks should be ordered by geometry y-axis by default; if that's not the case somewhere, please file a bug

Thanks for clarifying.

  • We could add a helper function on the page document for that

+1

  • Table merging will require further work, one workaround would be to do it in pandas where it is easier

Not sure what you mean by "do it in pandas"? Can you please provide an example?

  • You can already iterate through the words of the .key and .value elements and get the confidence score; I'll add an ocr_confidence property helper if that's useful to you

Yes please!

  • Table headers will be added, made available at the cell level through the .is_column_header property, and shown in the table visualization

Do you have an ETA for this?
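
For reference, the "iterate through the words and get the confidence score" approach mentioned above could look roughly like the sketch below. It is a minimal illustration, not textractor's implementation: `Word` is a hypothetical stand-in class, `confidence_summary` is a made-up helper name, and the scores are invented. With real textractor objects you would pass the words of the `.key` and `.value` elements instead, assuming each word exposes a `confidence` attribute.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Word:
    """Hypothetical stand-in for an OCR word (not the textractor class)."""
    text: str
    confidence: float


def confidence_summary(key_words, value_words):
    """Return mean and minimum confidence over all words of a key-value pair.

    Returns None when there are no words at all.
    """
    scores = [w.confidence for w in list(key_words) + list(value_words)]
    if not scores:
        return None
    return {"mean": mean(scores), "min": min(scores)}


# Invented example data; real scores come from the Textract response.
key = [Word("Invoice", 0.99), Word("number:", 0.97)]
value = [Word("INV-0042", 0.89)]
summary = confidence_summary(key, value)
```

Reporting the minimum alongside the mean is a design choice: a single badly recognised word can be hidden by the average but still make the extracted value unusable.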

ThomasDelteil commented 1 year ago

Table merging in pandas:

multipage.pdf

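import pandas as pd
from textractor import Textractor
from textractor.data.constants import TextractFeatures

extractor = Textractor(profile_name="default")

# Asynchronous analysis of a multi-page PDF; the file is uploaded to S3 first.
document1 = extractor.start_document_analysis(
    file_source='./multipage.pdf',
    features=[TextractFeatures.TABLES],
    s3_upload_path='s3://textractor-tests/debug/',
    s3_output_path='s3://textractor-tests/debug/',
    save_image=True
)
# First table: use its header row as the DataFrame column names.
df1 = document1.tables[0].to_pandas(use_columns=True)
# Second table (assumed to continue the first): reuse the same column names.
df2 = document1.tables[1].to_pandas()
df2.columns = df1.columns
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
merged = pd.concat([df1, df2], ignore_index=True)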

ThomasDelteil commented 1 year ago

Column header visualization at the table level and .is_column_header have been added at the cell level. .ocr_confidence has been added to the KV element.
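
As an illustration of how a cell-level `.is_column_header` flag can be used to recover header names, here is a small sketch. The `Cell` class is a hypothetical stand-in and `column_headers` is a made-up helper. The attribute names `row_index`, `col_index` and `text` are assumptions for the example; check them against the actual cell objects returned by textractor before relying on them.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """Hypothetical stand-in for a table cell (not the textractor class)."""
    row_index: int
    col_index: int
    text: str
    is_column_header: bool = False


def column_headers(cells):
    """Return the text of all cells flagged as column headers, in reading order."""
    header_cells = [c for c in cells if c.is_column_header]
    header_cells.sort(key=lambda c: (c.row_index, c.col_index))
    return [c.text for c in header_cells]


# Invented example: a 2x2 table whose first row is the header.
cells = [
    Cell(1, 2, "Qty", is_column_header=True),
    Cell(1, 1, "Item", is_column_header=True),
    Cell(2, 1, "Widget"),
    Cell(2, 2, "3"),
]
headers = column_headers(cells)
```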

bvbg1 commented 1 year ago

Table merging in pandas:

Thanks, but this merges the tables without any checks.

What I meant was replicating the heuristic/AI checks as well, as described in:
https://aws.amazon.com/blogs/machine-learning/postprocessing-with-amazon-textract-multi-page-table-handling/
https://github.com/aws-samples/amazon-textract-response-parser/blob/master/src-python/trp/t_pipeline.py#L119
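
To make the request concrete, a check of this kind might be sketched as follows. This is a simplified illustration and not the logic in `t_pipeline.py`: `TableSummary`, `looks_like_continuation` and the 0.02 tolerance are all hypothetical, and the only signals used are consecutive pages, matching column counts, and similar left/right table edges (expressed as fractions of the page width).

```python
from dataclasses import dataclass


@dataclass
class TableSummary:
    """Hypothetical per-table metadata used only for this sketch."""
    page: int        # 1-based page number
    n_columns: int   # number of columns in the table
    left: float      # left edge, as a fraction of page width
    right: float     # right edge, as a fraction of page width


def looks_like_continuation(prev, nxt, tolerance=0.02):
    """Guess whether `nxt` continues `prev` across a page break."""
    # The candidate must sit on the page directly after the first table.
    if nxt.page != prev.page + 1:
        return False
    # A continued table keeps the same number of columns.
    if nxt.n_columns != prev.n_columns:
        return False
    # Both edges should line up within the tolerance.
    return (abs(nxt.left - prev.left) <= tolerance
            and abs(nxt.right - prev.right) <= tolerance)


# Invented example data.
first = TableSummary(page=1, n_columns=4, left=0.10, right=0.90)
second = TableSummary(page=2, n_columns=4, left=0.11, right=0.90)
other = TableSummary(page=2, n_columns=3, left=0.10, right=0.90)

merge_second = looks_like_continuation(first, second)
merge_other = looks_like_continuation(first, other)
```

Only when such a check passes would the two DataFrames be concatenated; otherwise the tables stay separate.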

bvbg1 commented 1 year ago

@ThomasDelteil Any updates on this?

bvbg1 commented 1 year ago

Just for tracking, it would be useful if those got added as well:

bvbg1 commented 1 year ago

Are there any updates?

dannellyz commented 1 year ago

I would also be curious about support for multi-page tables. Since this is available in trp, why would it not be backwards compatible?

bvbg1 commented 1 year ago

Is there an ETA for the remaining features?