aws-samples / amazon-textract-textractor

Analyze documents with Amazon Textract and generate output in multiple formats.
Apache License 2.0

page number is overwritten in function find_phrase_in_lines #94

Open tb102122 opened 2 years ago

tb102122 commented 2 years ago

The page_number argument passed to the function is overwritten inside the for loop. In addition, the page number is not used as a search criterion.

Source code snippet from line 1091 onward:


def find_phrase_in_lines(
    self, phrase: str, min_textdistance=0.6, page_number: int = 1
) -> List[TWord]:
    """
    phrase = words separated by space char
    """
    # first check if we already did find this phrase and stored it in the DB
    # TODO: Problem: it will not find Current: when the phrase has current and there are other current values in the document without :
    if not phrase:
        raise ValueError(f"no valid phrase: '{phrase}'")
    phrase_words = phrase.split(" ")
    if len(phrase_words) < 1:
        raise ValueError(f"no valid phrase: '{phrase}'")
    # TODO: check for page_number impl
    found_phrases: "list[TWord]" = self.ocrdb.select_text(
        textract_doc_uuid=self.textract_doc_uuid,
        text=make_alphanum_and_lower_for_non_numbers(phrase),
    )
    print("after ocrdb.select_text")
    if found_phrases:
        print("phrases found")
        return found_phrases

    alphanum_regex = re.compile(r"[\W_]+")
    # find phrase (words that follow each other) in trp lines
    for page in self.doc.pages:
        page_number = 1
        for line in page.lines:
......
tb102122 commented 2 years ago

@schadem I would suggest extending the function to accept an AreaSelection so that it can be passed into the call self.ocrdb.select_text(textract_doc_uuid=self.textract_doc_uuid, text=make_alphanum_and_lower_for_non_numbers(phrase)). In the for loop, I would remove line 1117 (the page_number = 1 assignment).

Let me know if this is correct, then I will work on the PR.
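
To illustrate the loop change, here is a minimal sketch. It assumes simplified stand-ins: Page, Line and find_lines_with_phrase are hypothetical names for this example and are not the textractgeofinder API. It only shows the idea of deriving the running page index with enumerate() instead of reassigning the argument, and of using page_number as a filter.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Line:
    # Hypothetical stand-in for a line of recognized text
    text: str


@dataclass
class Page:
    # Hypothetical stand-in for a page holding its lines
    lines: List[Line] = field(default_factory=list)


def find_lines_with_phrase(
    pages: List[Page], phrase: str, page_number: Optional[int] = None
) -> List[Tuple[int, str]]:
    """Return (page_number, line_text) pairs for lines containing phrase.

    page_number is 1-based; None searches every page.
    """
    needle = phrase.lower()
    matches = []
    # enumerate() supplies the running page index, so the page_number
    # argument is never reassigned inside the loop.
    for current_page, page in enumerate(pages, start=1):
        if page_number is not None and current_page != page_number:
            continue  # page_number acts as a search criterion
        for line in page.lines:
            if needle in line.text.lower():
                matches.append((current_page, line.text))
    return matches


pages = [
    Page([Line("Invoice total: 100")]),
    Page([Line("Current: 5 A"), Line("Invoice total: 200")]),
]
print(find_lines_with_phrase(pages, "invoice total", page_number=2))
# → [(2, 'Invoice total: 200')]
```

With page_number=None the same call returns the matches from both pages.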

schadem commented 2 years ago

blast from the past...

The find_phrase_in_lines https://github.com/aws-samples/amazon-textract-textractor/blob/4b1e55426fc7fa623afcf210a2e3f5b51edc614c/tpipelinegeofinder/textractgeofinder/tgeofinder.py#L841

was my first implementation to find a phrase and essentially is replaced by find_phrase_on_page https://github.com/aws-samples/amazon-textract-textractor/blob/4b1e55426fc7fa623afcf210a2e3f5b51edc614c/tpipelinegeofinder/textractgeofinder/tgeofinder.py#L769

I see find_intersect_value still uses the "lines" one like here, but I think that can be replaced with the phrases one

https://github.com/aws-samples/amazon-textract-textractor/blob/4b1e55426fc7fa623afcf210a2e3f5b51edc614c/tpipelinegeofinder/textractgeofinder/tgeofinder.py#L320

Tests use the lines method as well.

Essentially, the "lines" method iterates over the trp object to find a match, whereas "find_phrase_on_page" uses the in-memory SQLite database. Unless you find a good use for the lines method, I would recommend removing it. The 'area' related methods all go back to the DB anyway.
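
The difference between the two approaches can be sketched roughly as follows. This is only an illustration under assumptions: the word table, its columns, and the helper names scan_objects and query_db are made up for this example and do not reflect the actual ocrdb schema.

```python
import sqlite3

# Hypothetical data: (page_number, normalized word text)
words = [(1, "invoice"), (1, "total"), (2, "current"), (2, "total")]


def scan_objects(rows, text, page_number):
    """Object-iteration style: walk every entry and compare in Python."""
    return [(p, t) for p, t in rows if p == page_number and t == text]


# DB style: load the words once into an in-memory SQLite table ...
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE word (page_number INTEGER, text TEXT)")
conn.executemany("INSERT INTO word VALUES (?, ?)", words)


def query_db(connection, text, page_number):
    """... and let SQL filter by text and page in a single query."""
    cursor = connection.execute(
        "SELECT page_number, text FROM word WHERE text = ? AND page_number = ?",
        (text, page_number),
    )
    return cursor.fetchall()


print(scan_objects(words, "total", 2))
print(query_db(conn, "total", 2))
```

Both calls return the same single match on page 2; the DB variant makes it straightforward to add further criteria such as an area restriction to the WHERE clause.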

@tb102122 Thoughts?

tb102122 commented 2 years ago

@schadem Yes, that sounds like a good approach. I have added a deprecation warning for now so that we don't introduce breaking changes for other users.
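
For reference, a deprecation warning of this kind could look roughly like the sketch below. GeoFinderSketch is a hypothetical stand-in class, the method bodies are placeholders, and delegating to find_phrase_on_page is an assumption for illustration, not necessarily what the actual change does.

```python
import warnings


class GeoFinderSketch:
    """Hypothetical stand-in for the class that owns both methods."""

    def find_phrase_on_page(self, phrase, page_number=1):
        # Placeholder for the DB-backed lookup; returns no matches here.
        return []

    def find_phrase_in_lines(self, phrase, page_number=1):
        # Warn callers but keep the method working, so existing code
        # does not break while users migrate.
        warnings.warn(
            "find_phrase_in_lines is deprecated; use find_phrase_on_page instead",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller's line
        )
        # Assumption: delegate to the newer method.
        return self.find_phrase_on_page(phrase, page_number=page_number)
```

Note that Python hides DeprecationWarning by default outside of `__main__` and test runners, so callers may need `-W default` or `warnings.simplefilter("default")` to see it.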

tb102122 commented 2 years ago

@schadem I found one scenario that does not work with the function "find_phrase_on_page". If you are looking for a phrase like "Seite 1 und 2 der Kalkulation", the result is not returned correctly. As far as I can see, this happens because of the string cleaning in line 786. At least, the line search does not find it, while the search via the words works.

https://github.com/aws-samples/amazon-textract-textractor/blob/4b1e55426fc7fa623afcf210a2e3f5b51edc614c/tpipelinegeofinder/textractgeofinder/tgeofinder.py#L783-L788

My suggestion would be to add a flag "clean phrase" with default True; when it is False, we pass in the phrase without cleaning, just in lower case. What do you think?
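
A rough sketch of what I mean by the flag. normalize_phrase and the clean_phrase parameter are hypothetical names, and the cleaning shown here (the [\W_]+ regex from the snippet above) is an assumption; the real cleaning around line 786 may behave differently.

```python
import re

# Same pattern as alphanum_regex in the snippet above:
# matches runs of non-word characters and underscores.
_NON_ALNUM = re.compile(r"[\W_]+")


def normalize_phrase(phrase: str, clean_phrase: bool = True) -> str:
    """Lower-case the phrase; optionally strip non-alphanumeric characters."""
    lowered = phrase.lower()
    if not clean_phrase:
        # Keep spaces and punctuation; only the case is normalized.
        return lowered
    return _NON_ALNUM.sub("", lowered)


print(normalize_phrase("Seite 1 und 2 der Kalkulation"))
# → seite1und2derkalkulation
print(normalize_phrase("Seite 1 und 2 der Kalkulation", clean_phrase=False))
# → seite 1 und 2 der kalkulation
```

With the default, existing callers keep the current behaviour; only callers that pass clean_phrase=False get the uncleaned, lower-cased phrase.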

schadem commented 2 years ago

Interesting. Do you have a sample page or Textract JSON for the "Seite 1 und 2 der Kalkulation"?

tb102122 commented 2 years ago

I can only share a very stripped-down version, if that works for you, since the original documents contain a lot of PII. Let me know if that helps.

schadem commented 2 years ago

Thx. Any example I can build a unit test from helps.

tb102122 commented 2 years ago

@schadem Sorry, it took a bit longer to get the version without PII. Sample_redacted.pdf