Open bvbg1 opened 1 year ago
Getting the table headers with textractor does not seem possible? After calling document.to_trp2() (as documented in https://aws-samples.github.io/amazon-textract-textractor/notebooks/interfacing_with_trp2.html#Getting-the-trp2-document) I get:
AttributeError: 'Table' object has no attribute 'get_header_field_names'
I can only get it with a second call to Textract and then using trp2: textract_json = call_textract(input_document=documentName, features = [Textract_Features.TABLES])
Is there another way?
Edit, also mentioned here: https://github.com/aws-samples/amazon-textract-code-samples/issues/38
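For anyone who only has the raw Textract JSON at hand, here is a minimal sketch of reading the column headers straight from the response without going through trp2. It assumes the response was produced with the TABLES feature and that header cells carry `COLUMN_HEADER` in their `EntityTypes`. The helper names (`child_ids`, `column_headers`) and the `SAMPLE` data are made up for illustration and are not part of textractor.

```python
from typing import Any, Dict, List


def child_ids(block: Dict[str, Any]) -> List[str]:
    """Collect the Ids of all CHILD relationships of a block."""
    return [
        block_id
        for rel in block.get("Relationships", [])
        if rel["Type"] == "CHILD"
        for block_id in rel["Ids"]
    ]


def column_headers(response: Dict[str, Any], table_index: int = 0) -> List[str]:
    """Return the text of the header cells of one table, left to right."""
    by_id = {b["Id"]: b for b in response["Blocks"]}
    tables = [b for b in response["Blocks"] if b["BlockType"] == "TABLE"]
    cells = [by_id[i] for i in child_ids(tables[table_index])]
    # Keep only the cells that Textract tagged as column headers
    headers = [
        c for c in cells
        if c["BlockType"] == "CELL" and "COLUMN_HEADER" in c.get("EntityTypes", [])
    ]
    headers.sort(key=lambda c: (c["RowIndex"], c["ColumnIndex"]))
    return [
        " ".join(
            by_id[i]["Text"] for i in child_ids(c) if by_id[i]["BlockType"] == "WORD"
        )
        for c in headers
    ]


# Tiny hand-written response fragment (hypothetical data, not real output)
SAMPLE = {
    "Blocks": [
        {"Id": "t1", "BlockType": "TABLE",
         "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3"]}]},
        {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
         "EntityTypes": ["COLUMN_HEADER"],
         "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
        {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
         "EntityTypes": ["COLUMN_HEADER"],
         "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3"]}]},
        {"Id": "c3", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1,
         "Relationships": [{"Type": "CHILD", "Ids": ["w4"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "Item"},
        {"Id": "w2", "BlockType": "WORD", "Text": "Unit"},
        {"Id": "w3", "BlockType": "WORD", "Text": "price"},
        {"Id": "w4", "BlockType": "WORD", "Text": "Pen"},
    ]
}

print(column_headers(SAMPLE))
```

This only looks at the first header row layout Textract reports; merged header cells would need extra handling.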
- Blocks should be ordered by geometry y-axis by default, if that's not the case somewhere please file a bug
Thanks for clarifying.
- We could add a helper function on the page document for that
+1
- Table merging will require further work, one workaround would be to do it in pandas where it is easier
Not sure what you mean by "do it in pandas"? Can you please provide an example?
- You can already iterate through the words of the .key and .value elements and get the confidence score, I'll add an ocr_confidence property helper if that's useful to you
Yes please!
- Table header will be added and be available at the cell level in the .is_column_header property and added to the table visualization
Do you have an ETA for this?
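On the key/value confidence point above, here is a rough sketch of the "iterate through the words" idea on the raw Textract JSON. It assumes the FORMS feature was requested, so the response contains KEY_VALUE_SET blocks whose WORD children each carry a `Confidence` value. The function names and the `SAMPLE_FORM` data are hypothetical, and a plain average is only one possible aggregation, so it may not match what an `ocr_confidence` helper would report.

```python
from statistics import mean
from typing import Any, Dict, List, Optional, Tuple


def related_ids(block: Dict[str, Any], rel_type: str) -> List[str]:
    """Ids referenced by the relationships of the given type."""
    return [
        i
        for rel in block.get("Relationships", [])
        if rel["Type"] == rel_type
        for i in rel["Ids"]
    ]


def text_and_confidence(
    blocks: List[Dict[str, Any]], by_id: Dict[str, Dict[str, Any]]
) -> Tuple[str, Optional[float]]:
    """Join the WORD children of the blocks and average their confidence."""
    words = [
        by_id[i]
        for b in blocks
        for i in related_ids(b, "CHILD")
        if by_id[i]["BlockType"] == "WORD"
    ]
    text = " ".join(w["Text"] for w in words)
    conf = mean(w["Confidence"] for w in words) if words else None
    return text, conf


def key_value_confidences(response: Dict[str, Any]):
    """Return (key text, key confidence, value text, value confidence) tuples."""
    by_id = {b["Id"]: b for b in response["Blocks"]}
    results = []
    for block in response["Blocks"]:
        is_key = block["BlockType"] == "KEY_VALUE_SET" and "KEY" in block.get(
            "EntityTypes", []
        )
        if not is_key:
            continue
        values = [by_id[i] for i in related_ids(block, "VALUE")]
        key_text, key_conf = text_and_confidence([block], by_id)
        value_text, value_conf = text_and_confidence(values, by_id)
        results.append((key_text, key_conf, value_text, value_conf))
    return results


# Hand-written fragment for illustration only (not real Textract output)
SAMPLE_FORM = {
    "Blocks": [
        {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
         "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                           {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
        {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
         "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "Full", "Confidence": 90.0},
        {"Id": "w2", "BlockType": "WORD", "Text": "name", "Confidence": 80.0},
        {"Id": "w3", "BlockType": "WORD", "Text": "Jane", "Confidence": 99.0},
    ]
}

print(key_value_confidences(SAMPLE_FORM))
```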
Column header visualization at the table level and .is_column_header have been added at the cell level.
.ocr_confidence has been added to the KV element.
Table merging in pandas:
import pandas as pd
from textractor import Textractor
from textractor.data.constants import TextractFeatures

extractor = Textractor(profile_name="default")
document1 = extractor.start_document_analysis(
    file_source='./multipage.pdf',
    features=[TextractFeatures.TABLES],
    s3_upload_path='s3://textractor-tests/debug/',
    s3_output_path='s3://textractor-tests/debug/',
    save_image=True
)
# The first table supplies the column names; the second is assumed to continue it
df1 = document1.tables[0].to_pandas(use_columns=True)
df2 = document1.tables[1].to_pandas()
df2.columns = df1.columns
# DataFrame.append was removed in pandas 2.0, so concatenate instead
merged = pd.concat([df1, df2])
Thanks, but this merges the tables without any "checks".
What I meant is to replicate the heuristic/AI checking part as well, as described in: https://aws.amazon.com/blogs/machine-learning/postprocessing-with-amazon-textract-multi-page-table-handling/ and https://github.com/aws-samples/amazon-textract-response-parser/blob/master/src-python/trp/t_pipeline.py#L119
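To make the request concrete, here is a simplified sketch of what merging "with checks" could look like. It is not the trp implementation, and the actual heuristics there may differ. It only merges a table into the previous one when the pages are consecutive, the column counts match, and the continuation either has no header or repeats the same header. `PageTable`, `should_merge` and `merge_continuations` are hypothetical names, and the tables are plain Python lists so the sketch runs without AWS access.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PageTable:
    page: int                     # 1-based page number of the table
    header: Optional[List[str]]   # None when no header row was detected
    rows: List[List[str]]         # body rows, header excluded


def width(table: PageTable) -> int:
    """Number of columns, taken from the header or the first body row."""
    if table.header is not None:
        return len(table.header)
    return len(table.rows[0]) if table.rows else 0


def should_merge(prev: PageTable, nxt: PageTable) -> bool:
    """Heuristic: is `nxt` a continuation of `prev` on the following page?"""
    if nxt.page != prev.page + 1:
        return False
    if width(prev) == 0 or width(prev) != width(nxt):
        return False
    # A continuation either has no header or repeats the previous one
    return nxt.header is None or nxt.header == prev.header


def merge_continuations(tables: List[PageTable]) -> List[PageTable]:
    """Fold each continuation table into the table it continues."""
    merged: List[PageTable] = []
    for table in tables:
        if merged and should_merge(merged[-1], table):
            last = merged[-1]
            # Track the latest page so a table can span three or more pages
            merged[-1] = PageTable(table.page, last.header, last.rows + table.rows)
        else:
            merged.append(table)
    return merged


tables = [
    PageTable(1, ["Item", "Price"], [["Pen", "1.00"], ["Ink", "3.50"]]),
    PageTable(2, None, [["Paper", "2.25"]]),            # continuation, no header
    PageTable(3, ["Name", "Role", "Team"], [["Ana", "Dev", "Core"]]),  # new table
]
result = merge_continuations(tables)
print(len(result), len(result[0].rows))
```

Geometry checks (for example, whether the first table ends near the bottom of its page) are left out here and would be a natural next step.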
@ThomasDelteil Any updates on this?
Just for tracking, it would be useful if those got added as well:
Are there any updates?
I would also be curious about support for multi-page tables. Since this is available in trp, why would it not be backwards compatible?
Is there an ETA for the remaining features?
In https://github.com/aws-samples/amazon-textract-response-parser/blob/master/src-python/README.md I can see several features that I'd like to access from amazon-textract-textractor.
Specifically:
Is this possible?