aws-samples / amazon-textract-transformer-pipeline

Post-process Amazon Textract results with Hugging Face transformer models for document understanding
MIT No Attribution

Update focus_acc computation to discard invalid examples from the average #15

Closed vprecup closed 2 years ago

vprecup commented 2 years ago

Description of changes: While working with the solution and delving into the details of how the focus_acc metric is computed for model validation, I realised that there are situations when examples do not contain any focus tokens, but are still "captured" in this metric.

Although these examples are excluded from the focus token sums (i.e. `focus_acc_by_example` - line 451 of `ner.py`), when `focus_acc_by_example` is averaged into `focus_acc`, the total number of validation examples is used (`n_examples = probs_raw.shape[0]`) instead of the count of `n_focus_tokens_by_example` elements that are not 0.

By the nature of `focus_acc`, I understand that this metric only accounts for tokens with a non-default label or prediction, so I concluded that it should not take into account examples where there are no such tokens. Hence I am proposing this change. @athewsey, I look forward to hearing your point of view on this.

Additionally, the PR introduces the `n_focus_examples` metric, which will be captured in CloudWatch. A minimal sketch of the idea is shown below.

Testing done: ran the solution with and without the change, and compared the resulting metrics in CloudWatch.


By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

athewsey commented 2 years ago

Hi & thanks for this!

I agree the change makes sense: it makes the metric more fairly comparable between datasets where the proportion of zero-entity examples differs.

Any div/0 risk would seem to require a user to provide a validation dataset with 0 examples of any entities - in which case things breaking would hopefully be obvious, and maybe even beneficial in helping the user notice the issue.

I also like surfacing the number of "focus" examples to the user via the extra metric; in some cases it may help people understand how accuracy interacts with the model's propensity to predict all-"other" in producing the final score.

Was going to ask why not `n_focus_examples = (n_focus_tokens_by_example != 0).sum()`, but from a quick `timeit` test on a small dummy array, it seems like your slice-and-shape method is a fair bit faster? 🤯

So all looks good to me & happy to merge 😁

vprecup commented 2 years ago

Excellent! Thanks for the feedback, @athewsey! Also, thanks for providing this super nice solution. I've learned a lot about the SageMaker & HuggingFace ecosystems thanks to it.