Open tb102122 opened 2 years ago
@schadem I would suggest extending the function call to accept an AreaSelection so that it can be passed into the call self.ocrdb.select_text(textract_doc_uuid=self.textract_doc_uuid, text=make_alphanum_and_lower_for_non_numbers(phrase)).
In the for loop, I would remove line 1117.
Let me know if this is correct, then I will work on the PR.
blast from the past...
The find_phrase_in_lines
https://github.com/aws-samples/amazon-textract-textractor/blob/4b1e55426fc7fa623afcf210a2e3f5b51edc614c/tpipelinegeofinder/textractgeofinder/tgeofinder.py#L841
was my first implementation to find a phrase and has essentially been replaced by find_phrase_on_page
https://github.com/aws-samples/amazon-textract-textractor/blob/4b1e55426fc7fa623afcf210a2e3f5b51edc614c/tpipelinegeofinder/textractgeofinder/tgeofinder.py#L769
I see find_intersect_value still uses the "lines" one like here, but I think that can be replaced with the phrases one
Tests use the lines method as well.
Essentially, the "lines" method iterates over the trp object to find a match, whereas find_phrase_on_page uses the in-memory SQLite database. Unless you find a good use for the lines method, I would recommend removing it. The 'area'-related methods all go back to the DB anyway.
@tb102122 Thoughts?
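To make the difference between the two lookup styles concrete, here is a minimal sketch. It is not the library's code: the LINES data, the ocr_line table, and both function names are hypothetical and chosen only for illustration. It contrasts walking every line in Python with querying an in-memory SQLite table.

```python
import sqlite3

# Hypothetical OCR lines as (page number, text); not the library's real schema.
LINES = [
    (1, "invoice number"),
    (1, "total amount"),
    (2, "total amount"),
]


def find_by_iteration(phrase):
    """Walk every line in Python, similar in spirit to the "lines" method."""
    return [(page, text) for page, text in LINES if text == phrase]


def build_db():
    """Load the same lines into an in-memory SQLite table (assumed layout)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ocr_line (page INTEGER, text TEXT)")
    conn.executemany("INSERT INTO ocr_line VALUES (?, ?)", LINES)
    return conn


def find_by_query(conn, phrase):
    """Let SQLite do the matching, similar in spirit to find_phrase_on_page."""
    rows = conn.execute(
        "SELECT page, text FROM ocr_line WHERE text = ? ORDER BY page",
        (phrase,),
    )
    return rows.fetchall()
```

Both functions return the same rows for the same phrase, which is why keeping only the database-backed path should not lose functionality, assuming the real methods behave comparably.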
@schadem Yes, that sounds like a good approach. I have added a deprecation warning for now so that we don't introduce breaking changes for other users.
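A deprecation warning of this kind could look roughly like the sketch below. This is not the actual change from the PR: the two functions are free-standing stand-ins (in the library they are methods), their bodies are placeholders, and the warning message is an assumption. Only warnings.warn and DeprecationWarning are standard Python.

```python
import warnings


def find_phrase_on_page(phrase):
    """Placeholder for the preferred method; returns an empty list here."""
    return []


def find_phrase_in_lines(phrase):
    """Deprecated stand-in that warns and forwards to find_phrase_on_page."""
    warnings.warn(
        "find_phrase_in_lines is deprecated; use find_phrase_on_page instead",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller's line
    )
    return find_phrase_on_page(phrase)
```

Existing callers keep working and only see a warning, so the old method can be removed in a later release.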
@schadem I found one scenario that does not work for the function "find_phrase_on_page". If you are looking for a phrase like "Seite 1 und 2 der Kalkulation", the result is not returned correctly. As far as I can see, this happens due to the string cleaning in line 786. At least, the line search does not find it, while the search via the words works.
My suggestion would be to add a flag "clean phrase" with default True; when it is False, we pass in the phrase without cleaning, just in lower case. What do you think?
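The proposed flag could be sketched as follows. The helper clean_for_search is a hypothetical stand-in for the library's cleaning step (its real behavior may differ), and normalize_phrase is an illustrative name, not an existing function.

```python
import re


def clean_for_search(text):
    """Hypothetical cleaning: lower-case and drop non-alphanumeric characters."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def normalize_phrase(phrase, clean_phrase=True):
    """Return the search key for a phrase.

    With clean_phrase=True (the default) the phrase is cleaned as before;
    with clean_phrase=False it is only lower-cased, so spaces survive.
    """
    if clean_phrase:
        return clean_for_search(phrase)
    return phrase.lower()
```

Under these assumptions, the default call turns "Seite 1 und 2 der Kalkulation" into a single run of characters, while clean_phrase=False keeps the word boundaries intact. Defaulting to True leaves current callers unaffected.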
Interesting. Do you have a sample page or Textract JSON for the "Seite 1 und 2 der Kalkulation"?
I can only share a very stripped-down version, if that works for you, since the original documents contain a lot of PII details. Let me know if that helps.
Thx. Any example I can build a unit test for helps.
@schadem Sorry, it took a bit longer to get the version without PII details. Sample_redacted.pdf
The page number is overwritten if you pass it to the function within the for loop. In addition, the page number is not considered as a search criterion.
Source code snippet from line 1091ff