Praneet9 / Representation-Learning-for-Information-Extraction

PyTorch implementation of the Google Research paper Representation Learning for Information Extraction from Form-like Documents.
Apache License 2.0
102 stars 29 forks

how to work on address candidates? #39

Open panwar2001 opened 10 months ago

panwar2001 commented 10 months ago

@Praneet9 The NER model used to extract address candidates has accuracy issues and struggles to separate multiple addresses. Do you know of a way to train a model for this, for example a BERT-like model? An address may sit on a single line or span multiple lines, and it can appear anywhere in the invoice. US and similar addresses follow a regular format, so a simple regex can handle them, but what about Indian addresses? Is there a way to train on invoice text so that the model picks out address words from their context rather than from the words themselves? For instance, could a self-attention or LSTM model learn address words from their contextual representation relative to the surrounding words? The model must not rely on the address words themselves, because they can vary in millions of ways.

Praneet9 commented 10 months ago

I agree, @panwar2001. It's a very difficult problem to solve and definitely not straightforward. With respect to this model, using diverse data may be the only solution. I would suggest using data where the spatial layout matters more (blocks of text, etc.). This cannot be learned easily from text alone. You can also use heuristic rules that separate the text into blocks and, in doing so, remove the extra text from the address (if it is predicted partially).
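A minimal sketch of the block heuristic described above, assuming OCR output is available as word boxes. The `Word` class, the gap thresholds, and the `snap_to_block` helper are hypothetical names for illustration and are not part of this repository. Words are clustered into blocks by spatial proximity, and a partial prediction is then snapped to the block that contains most of its words.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    """One OCR word with its bounding box (hypothetical structure)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def gap(a: Word, b: Word) -> Tuple[float, float]:
    """Horizontal and vertical gap between two boxes (0 if they overlap on that axis)."""
    dx = max(a.x0 - b.x1, b.x0 - a.x1, 0)
    dy = max(a.y0 - b.y1, b.y0 - a.y1, 0)
    return dx, dy


def group_into_blocks(words: List[Word], max_dx: float = 40,
                      max_dy: float = 12) -> List[List[Word]]:
    """Union-find clustering: two words share a block if both gaps are small.

    The thresholds are assumptions and depend on scan resolution and font size.
    """
    parent = list(range(len(words)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            dx, dy = gap(words[i], words[j])
            if dx <= max_dx and dy <= max_dy:
                parent[find(i)] = find(j)

    clusters = {}
    for i, w in enumerate(words):
        clusters.setdefault(find(i), []).append(w)

    # Rough reading order; assumes words on one line share the same y0.
    blocks = [sorted(b, key=lambda w: (w.y0, w.x0)) for b in clusters.values()]
    return sorted(blocks, key=lambda b: (min(w.y0 for w in b), min(w.x0 for w in b)))


def block_text(block: List[Word]) -> str:
    return " ".join(w.text for w in block)


def snap_to_block(predicted: List[Word], blocks: List[List[Word]]) -> List[Word]:
    """Return the block that contains the most predicted words.

    This drops stray words from other blocks and restores words the model missed.
    """
    return max(blocks, key=lambda b: sum(1 for w in b if w in predicted))
```

For example, if the model predicts "Pune 411001 Total" and the first two words belong to an address block while "Total" belongs to a different block, `snap_to_block` returns the full address block and drops "Total". A quadratic pairwise comparison is fine for a single invoice page; a larger document would need a spatial index.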

panwar2001 commented 10 months ago

Yes, I observed that an address in an invoice always occurs in a clustered block of text or on a single line, with whitespace around it, so I used OpenCV to extract every block of text. There are other options too, such as document layout analysis with LayoutLMv3, the Document Image Transformer, or R-CNN, although OpenCV works fine for grouping text. Each block then needs some preprocessing, such as removing stop words and words that are common in invoices but irrelevant to addresses (for example: total, subtotal, date). After that, the problem is deciding which blocks contain an address. I also sometimes run into the case where two addresses fall in the same block. The solution I have found so far is to keep a hardcoded list of cities and states and search each block for them, or to run an NER algorithm to detect any sort of address within the block of text.
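A rough sketch of the filtering and lookup steps described above, assuming each block is already available as a plain string. The word lists are deliberately tiny placeholders, and the six-digit PIN code check is an extra signal that was not mentioned in this thread. All function names are hypothetical.

```python
import re
from typing import Iterable, List, Tuple

# Placeholder lists; real ones would be far larger and domain specific.
INVOICE_NOISE = {"total", "subtotal", "date", "invoice", "qty", "amount"}
GAZETTEER = {"pune", "mumbai", "delhi", "bengaluru", "maharashtra", "karnataka"}

# Assumption: an Indian PIN code is written as six consecutive digits.
# This also matches any other six-digit number, such as an invoice number.
PIN_CODE = re.compile(r"\b\d{6}\b")


def tokens(text: str) -> List[str]:
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def strip_noise(text: str) -> List[str]:
    """Drop words that are common in invoices but irrelevant to addresses."""
    return [t for t in tokens(text) if t not in INVOICE_NOISE]


def address_score(text: str) -> int:
    """Count gazetteer hits plus PIN-code-like numbers in one block."""
    hits = sum(1 for t in strip_noise(text) if t in GAZETTEER)
    return hits + len(PIN_CODE.findall(text))


def rank_blocks(blocks: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (score, block) pairs with a positive score, best first."""
    scored = [(address_score(b), b) for b in blocks]
    return sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])


def likely_multiple_addresses(text: str) -> bool:
    """Crude hint that a block holds more than one address: several PIN codes."""
    return len(PIN_CODE.findall(text)) > 1
```

A block flagged by `likely_multiple_addresses` could be split after each PIN code, though that only helps when every address in the block actually ends with one. The score is a candidate filter, not a classifier, so blocks that pass it would still go through the NER step.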