Open GabrielOttdeMedeiros opened 11 months ago
Hi and thanks for raising this,
I see a couple of challenges with this proposal:
I think more viable options would be:
@athewsey The code I used to generate the python function above can be found under this link: https://github.com/aws-samples/amazon-textract-textractor/blob/e5051e53c062f8af60ec5fa9445affb0c7485f7b/textractor/entities/table.py#L460
I was thinking the to_pandas() function could take a few parameters that allow the user to select which portion of the extracted data to retrieve from the document.table variable.
As an example:
def to_pandas(value: str, agg: str):
value: This would allow the user to choose between confidence scores, the actual values, or the bounding boxes for each cell in the table.
agg: This would allow the user to choose whether to aggregate the confidence scores by character or by word.
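To make the idea concrete, here is a rough mock of how the two parameters could behave. It is only a sketch and does not use the real Textractor classes: `Word`, `Cell`, `cell_confidence` and `table_to_grid` are hypothetical names made up for illustration, and I am assuming confidence is available per word, so "char" aggregation is approximated by weighting each word's confidence by its length. It returns a plain list of rows, which could then be passed to `pandas.DataFrame(...)`.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Word:
    """Hypothetical stand-in for a recognized word."""
    text: str
    confidence: float  # assumed to be on a 0-100 scale


@dataclass
class Cell:
    """Hypothetical stand-in for a table cell (1-based row/col indices)."""
    row: int
    col: int
    words: List[Word]
    bbox: Tuple[float, float, float, float]  # assumed (x, y, width, height)


def cell_confidence(cell: Cell, agg: str = "word") -> Optional[float]:
    """Aggregate word confidences for one cell.

    agg="word": plain mean over words.
    agg="char": mean weighted by word length (an assumption, since only
    word-level scores are modeled here).
    """
    if not cell.words:
        return None
    if agg == "word":
        return sum(w.confidence for w in cell.words) / len(cell.words)
    if agg == "char":
        total_chars = sum(len(w.text) for w in cell.words)
        if total_chars == 0:
            return None
        weighted = sum(w.confidence * len(w.text) for w in cell.words)
        return weighted / total_chars
    raise ValueError(f"unknown agg: {agg!r}")


def table_to_grid(cells: List[Cell], value: str = "text", agg: str = "word"):
    """Build a row-major grid holding text, confidence, or bbox per cell."""
    if not cells:
        return []
    n_rows = max(c.row for c in cells)
    n_cols = max(c.col for c in cells)
    grid = [[None] * n_cols for _ in range(n_rows)]
    for c in cells:
        if value == "text":
            entry = " ".join(w.text for w in c.words)
        elif value == "confidence":
            entry = cell_confidence(c, agg)
        elif value == "bbox":
            entry = c.bbox
        else:
            raise ValueError(f"unknown value: {value!r}")
        grid[c.row - 1][c.col - 1] = entry
    return grid
```

With something like this, `table_to_grid(cells, value="confidence", agg="word")` would give one score per cell in the same layout as the table, so the text grid and the confidence grid line up cell for cell.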
The same concept could be applied for the to_excel and to_text functions. I can write a simple mock code that achieves this if you would like to see what I have in mind.
Thanks for looking into my suggestion!
Hi all, I have just recently started working with Textract and I think a simple feature could be added. I was interested in getting the confidence scores directly in a DataFrame, which took me a while to figure out, so I wrote the following script to do it:
It is indeed a very simple script, but the library could provide a function, or an option on an existing function, that extracts these scores, and the documentation could cover it. It took me a while to figure out, even though I believe this should be a prominent feature, given that corporations building projects with tools like Textract will be concerned about the accuracy of the results.
Yes, it is accurate, but being able to easily display these confidence scores is a feature that more people are likely looking for.