NielsRogge / Transformers-Tutorials

This repository contains demos I made with the Transformers library by HuggingFace.
MIT License
8.45k stars 1.32k forks

LayoutLMv3 - discard tokens with same meaning from inference #388

Open nk-alex opened 4 months ago

nk-alex commented 4 months ago

I’m using my own version of the SROIE dataset for a token classification problem with LayoutLMv3. I ran into a scenario where some information is repeated throughout the document. Let’s say the seller name is “abc def ghi” and this name appears three times in the document.

Character recognition (OCR) is a big part of this problem, and sometimes it is not accurate enough. So, in this specific scenario, I get this output from the inference:

… “SellerName”: [“abc”, “def”, “ghi”, “abo”, “def”, “obc”, “dof”, “ghi”] …

So, basically, the inference is correct. The document contains the seller name three times, and all those tokens are part of the seller name.

[“abc”, “def”, “ghi”...] is the first occurrence of the seller name in the document. [...“abo”, “def”...] is the second occurrence; in this case, OCR did not recognize the last part of the seller name. [...“obc”, “dof”, “ghi”] is the third occurrence.

As seen above, I have a lot of information with the same meaning:

“abc” has the same meaning as “abo” and “obc”. “def” has the same meaning as “dof”. “ghi” is always recognized as “ghi”.

My question is the following: how can I keep just one value for each group of tokens with the same meaning? I would like to keep only “SellerName”: [“abc”, “def”, “ghi”]. Since all the tokens returned by inference are labeled as "SellerName", I don't see a clear path for this task.
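
To make the goal concrete, here is a minimal sketch of one possible post-processing step: group tokens whose spelling is close enough, then keep the most frequent spelling of each group. It is not part of LayoutLMv3 or Transformers. The function name `dedupe_similar_tokens` and the `threshold` value of 0.6 are assumptions chosen for this example, and the similarity measure is the standard library's `difflib.SequenceMatcher`.

```python
from collections import Counter
from difflib import SequenceMatcher


def dedupe_similar_tokens(tokens, threshold=0.6):
    """Group near-identical tokens and keep one spelling per group.

    Hypothetical helper: `threshold` is an assumed similarity cutoff
    (0.0 to 1.0) that would need tuning on real OCR output.
    """
    clusters = []  # each cluster is a list; its first token acts as the anchor
    for token in tokens:
        best_cluster, best_score = None, 0.0
        for cluster in clusters:
            # Compare against the anchor (first token seen in the cluster).
            score = SequenceMatcher(None, cluster[0], token).ratio()
            if score > best_score:
                best_cluster, best_score = cluster, score
        if best_cluster is not None and best_score >= threshold:
            best_cluster.append(token)
        else:
            clusters.append([token])
    # Keep the most frequent spelling; ties go to the spelling seen first.
    return [Counter(cluster).most_common(1)[0][0] for cluster in clusters]


tokens = ["abc", "def", "ghi", "abo", "def", "obc", "dof", "ghi"]
print(dedupe_similar_tokens(tokens))  # ['abc', 'def', 'ghi']
```

This relies only on string similarity, so it would also merge two genuinely different words that happen to look alike, and the result depends on the first occurrence being a reasonable anchor. Using the bounding boxes from the OCR step to separate the repeated occurrences might be more robust, but that is outside this sketch.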