aws-samples / amazon-textract-textractor

Analyze documents with Amazon Textract and generate output in multiple formats.
Apache License 2.0

Visualizing words with search_words shows wrong results #196

Open Belval opened 1 year ago

Belval commented 1 year ago

The search example in the documentation does not produce the expected output: https://aws-samples.github.io/amazon-textract-textractor/notebooks/visualizing_results.html#Visualizing-the-result-of-a-search

Expected: image

Result: image

This occurs when torch is not installed. It might also occur when torch is installed.

schorndorfer commented 1 year ago

The word/line similarity code is definitely buggy. It would be nice to be able to pass in a distance metric, e.g. one from the textdistance package: https://pypi.org/project/textdistance/
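To make the suggestion concrete, here is a rough sketch of what a pluggable metric could look like. This is not Textractor's actual code or API: the `search_words` function below, its `metric` and `threshold` parameters, and both similarity helpers are hypothetical and use only the standard library. The idea is that any callable taking two strings and returning a similarity in [0, 1] could be passed in, which should also cover normalized similarity functions from a package such as textdistance.

```python
from difflib import SequenceMatcher
from typing import Callable, Iterable, List, Tuple

# Hypothetical type: any function mapping two strings to a similarity in [0, 1].
Metric = Callable[[str, str], float]


def ratio_similarity(a: str, b: str) -> float:
    """Similarity based on difflib's matching-blocks ratio."""
    return SequenceMatcher(None, a, b).ratio()


def levenshtein_similarity(a: str, b: str) -> float:
    """1 - (edit distance / length of the longer string)."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))


def search_words(
    words: Iterable[str],
    keyword: str,
    metric: Metric = levenshtein_similarity,
    threshold: float = 0.8,
) -> List[Tuple[float, str]]:
    """Hypothetical stand-in for a word search with a caller-supplied metric.

    Returns (score, word) pairs at or above the threshold, best match first.
    Comparison is case-insensitive.
    """
    keyword = keyword.lower()
    scored = [(metric(word.lower(), keyword), word) for word in words]
    hits = [pair for pair in scored if pair[0] >= threshold]
    return sorted(hits, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    sample = ["Table", "Tables", "Total", "Invoice"]
    for score, word in search_words(sample, "table"):
        print(f"{word}: {score:.2f}")
    # Swapping the metric only requires passing a different callable.
    for score, word in search_words(sample, "table", metric=ratio_similarity):
        print(f"{word}: {score:.2f}")
```

With a design like this, the default behaviour stays deterministic and dependency-free, so results would not change depending on whether an optional package such as torch happens to be installed.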