aws-samples / amazon-textract-textractor

Analyze documents with Amazon Textract and generate output in multiple formats.
Apache License 2.0
408 stars 145 forks

Is search_words() broken? #371

Open ttruong-gilead opened 5 months ago

ttruong-gilead commented 5 months ago

amazon-textract-textractor==1.7.9

document.search_words(keyword="Tom Brady") or page.search_words(keyword="Frank")

doesn't work as expected. It returns a list of random letters or words that are not even close to the keyword. I tried adjusting similarity_threshold, to no avail.

RandyLef1 commented 3 months ago

same here

jdan98 commented 4 days ago

document.search_words is actually broken. The issue comes from the following statement in _search_words_with_similarity in page.py:

    similarity = (
        similarity
        if similarity_metric == SimilarityMetric.COSINE
        else -(similarity)
    )

LEVENSHTEIN is a value in [0, 1]. COSINE is a value in [-1, 1], but -1.0 represents an inverse correlation and is therefore not compatible with the expected result, so its lower bound must be set to 0.
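To illustrate the point above, here is a minimal standalone sketch of what the quoted statement does to a score that lies in [0, 1]. It does not import textractor: the SimilarityMetric enum below is a stand-in with only the two members named in this thread, and adjust_current is a hypothetical helper that copies the quoted expression.

```python
from enum import Enum


class SimilarityMetric(Enum):
    # Stand-in for the library's enum; only the members named in this thread.
    COSINE = "cosine"
    LEVENSHTEIN = "levenshtein"


def adjust_current(similarity, similarity_metric):
    # Hypothetical helper reproducing the statement quoted from page.py.
    return (
        similarity
        if similarity_metric == SimilarityMetric.COSINE
        else -(similarity)
    )


# Assumed example scores in [0, 1], as described for LEVENSHTEIN above.
scores = [0.0, 0.25, 1.0]
negated = [adjust_current(s, SimilarityMetric.LEVENSHTEIN) for s in scores]
print(negated)  # [-0.0, -0.25, -1.0]
```

Under these assumptions every non-COSINE score ends up at or below zero, and a score of 0.0 becomes -0.0.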

Currently the code returns a list of -0.0 values for LEVENSHTEIN. To fix this bug, replace that statement with:

    if similarity_metric == SimilarityMetric.COSINE:
        # Clamp negative cosine values to 0.0.
        similarity = 0.0 if similarity < 0.0 else similarity
    else:
        # Leave other metrics unchanged (no sign flip).
        similarity = similarity
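The proposed change can be tried in isolation with a small sketch like the one below. The SimilarityMetric enum here is a stand-in for the library's, and clamp_similarity is a hypothetical name used only for illustration. Whether this change alone makes search_words() return sensible matches is not verified here.

```python
from enum import Enum


class SimilarityMetric(Enum):
    # Stand-in for the library's enum; only the members named in this thread.
    COSINE = "cosine"
    LEVENSHTEIN = "levenshtein"


def clamp_similarity(similarity: float, similarity_metric: SimilarityMetric) -> float:
    # Hypothetical helper wrapping the proposed replacement logic.
    if similarity_metric == SimilarityMetric.COSINE:
        # Negative cosine values are raised to 0.0.
        return 0.0 if similarity < 0.0 else similarity
    # Other metrics pass through without negation.
    return similarity


print(clamp_similarity(-0.4, SimilarityMetric.COSINE))      # 0.0
print(clamp_similarity(0.7, SimilarityMetric.COSINE))       # 0.7
print(clamp_similarity(0.5, SimilarityMetric.LEVENSHTEIN))  # 0.5
```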