Get confidence scores Data Frame

aws-samples / amazon-textract-response-parser

Parse JSON response of Amazon Textract

Apache License 2.0

Get confidence scores Data Frame #159

Open GabrielOttdeMedeiros opened 11 months ago

GabrielOttdeMedeiros commented 11 months ago

Hi all, I recently started working with Textract and I think a simple feature could be added. This took me a while to figure out... I was interested in getting the confidence scores directly in a DataFrame, so I wrote the following script:

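import pandas as pd

def get_scores_table(document):
    # `document` must expose row_count, column_count and table_cells
    # (e.g. a Textractor Table object).
    table = [["" for _ in range(document.column_count)] for _ in range(document.row_count)]

    for cell in document.table_cells:
        # Cell indices are 1-based; store each word's confidence, space-separated.
        table[cell.row_index - 1][cell.col_index - 1] = " ".join(str(w.confidence) for w in cell.words)

    return pd.DataFrame(table)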

It is indeed a very simple script, but the library or its documentation could offer a function, or an option on an existing function, that extracts these scores. It took me a while to figure out, even though I believe this should be a prominent feature in the documentation, given that companies building projects with tools like Textract will be concerned about the accuracy of the results.

Yes, Textract is accurate, but being able to easily display these confidence scores is a feature that more people are probably looking for.

athewsey commented 1 month ago

Hi and thanks for raising this,

I see a couple of challenges with this proposal:

We try to keep TRP dependencies as lean as possible (currently just boto3 and marshmallow), so requiring Pandas would be a significant decision
I've seen similar requests in the past, but users always want slightly different information in the table 😅 It seems hard to find a general API that would satisfy everybody with this kind of ask

I think more viable options would be:

Do you see ways we could clarify the docs, to help users build this kind of transform faster?
Maybe we could link to formal code examples somewhere, showing common use-cases? (Actually @Belval do we have anything like this for the Python library already? For JS it's in this repo already)
If there are any specific parts of the TRP API you found confusing while trying to build this out, maybe we could have a look at those? Would be great to hear feedback on what you tried first / found difficult to adjust to
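One dependency-free shape for such a documented example might be a helper that takes a per-cell callable, so each user picks the information they want. The sketch below is only an illustration and not part of TRP: `rows_to_grid` and the `SimpleNamespace` stand-ins are hypothetical, and the code assumes only that a table exposes `rows`, each row exposes `cells`, and each cell has `text` and `confidence` attributes.

```python
from types import SimpleNamespace
from typing import Any, Callable, List


def rows_to_grid(table: Any, extract: Callable[[Any], Any]) -> List[List[Any]]:
    """Apply `extract` to every cell and return a plain list-of-lists grid."""
    return [[extract(cell) for cell in row.cells] for row in table.rows]


# Stand-in objects for illustration only; a real parsed table would be
# passed in here instead (assumed to have the same attribute names).
def _cell(text: str, confidence: float) -> SimpleNamespace:
    return SimpleNamespace(text=text, confidence=confidence)


demo_table = SimpleNamespace(rows=[
    SimpleNamespace(cells=[_cell("Item", 99.1), _cell("Price", 98.7)]),
    SimpleNamespace(cells=[_cell("Tea", 97.5), _cell("4.50", 88.2)]),
])

# The caller decides what goes into each grid position.
text_grid = rows_to_grid(demo_table, lambda c: c.text)
confidence_grid = rows_to_grid(demo_table, lambda c: c.confidence)
```

Users who already depend on Pandas could pass either grid to `pandas.DataFrame(...)`, which would leave the library itself without that requirement.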

GabrielOttdeMedeiros commented 1 month ago

@athewsey The code I based the Python function above on can be found at this link: https://github.com/aws-samples/amazon-textract-textractor/blob/e5051e53c062f8af60ec5fa9445affb0c7485f7b/textractor/entities/table.py#L460

I was thinking the to_pandas() function could take a few parameters that allow the user to select which part of the extracted content to retrieve from the document.table variable.

As an example:

def to_pandas(value: str, agg: str):

value: This would allow the user to choose between confidence scores, the actual values, or the bounding boxes for each cell in the table.
agg: This would allow the user to choose whether to aggregate the confidence scores by character or by word.
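Purely as an illustration of the idea, here is a minimal sketch of how such parameters might behave. Everything in it is an assumption: `Word`, `Cell`, and `table_grid` are hypothetical stand-ins rather than Textractor classes, bounding boxes are left out, and the `agg` options shown ("word" to list each word's score, "mean" to average them per cell) are chosen only for the example. It returns a plain grid so that it runs without Pandas.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List


@dataclass
class Word:  # hypothetical stand-in for a recognized word
    text: str
    confidence: float


@dataclass
class Cell:  # hypothetical stand-in; indices are 1-based as in the script above
    row_index: int
    col_index: int
    words: List[Word] = field(default_factory=list)


def table_grid(cells, row_count, column_count, value="text", agg="word"):
    """Build a row-major grid holding each cell's text or confidence scores."""
    if value not in ("text", "confidence"):
        raise ValueError(f"unsupported value: {value!r}")
    if agg not in ("word", "mean"):
        raise ValueError(f"unsupported agg: {agg!r}")

    grid = [["" for _ in range(column_count)] for _ in range(row_count)]
    for cell in cells:
        if value == "text":
            entry = " ".join(w.text for w in cell.words)
        elif agg == "word":
            # One score per word, space-separated, like the original script.
            entry = " ".join(str(w.confidence) for w in cell.words)
        else:
            # A single averaged score per cell; empty cells stay blank.
            entry = mean(w.confidence for w in cell.words) if cell.words else ""
        grid[cell.row_index - 1][cell.col_index - 1] = entry
    return grid


demo_cells = [
    Cell(1, 1, [Word("Total", 99.0)]),
    Cell(1, 2, [Word("12", 90.0), Word("USD", 80.0)]),
]
texts = table_grid(demo_cells, 1, 2)
per_word = table_grid(demo_cells, 1, 2, value="confidence")
per_cell = table_grid(demo_cells, 1, 2, value="confidence", agg="mean")
```

Wrapping any of these grids in `pd.DataFrame(...)` would give the table form described above.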

The same concept could be applied to the to_excel and to_text functions. I can write a simple mock-up that achieves this if you would like to see what I have in mind.

Thanks for looking into my suggestion!