Ucas-HaoranWei / GOT-OCR2.0

Official code implementation of General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model

Inconsistent text inference output with plain text #225

Open ep0p opened 5 days ago

ep0p commented 5 days ago

I'm encountering an issue when using GOT for plain-text inference. The output is not consistent: sometimes it detects the text correctly, but other times it introduces spaces between letters, creating nonsense words:


For example:

Correct output: This is a well-detected text.
Incorrect output: Th i s i s a tex t wi thspa c e s.

This inconsistency becomes particularly problematic when processing PDFs with multiple pages. Even if most pages are processed correctly, a couple of pages might have this spacing issue, which disrupts the results.

I can't figure out why this happens or how to enforce a consistent format so that only the "good" text format is produced.

paulgekeler commented 3 days ago

If the model predictions are a little off, perhaps because your PDFs (format, content, and so on) deviate to some degree from the training material, this is not uncommon and nothing to worry about. The fonts or spacing may differ enough to be harder for the model to parse correctly. In that case I'd suggest post-processing the predictions yourself, using an NLP package to detect word boundaries (an idea from here) and removing the faulty spacing within those boundaries. Alternatively, you could fine-tune the model on your data; if it's just a spacing issue, that should be resolved quickly. I also notice the text is French and appears to be a legislative text or conference protocol, which might also contribute to the problem...
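The post-processing idea above can be sketched as a small dictionary-based re-segmentation: strip all spaces, then recover word boundaries by dynamic programming against a known word list. This is an illustrative sketch, not part of GOT-OCR2.0; `repair_spacing` is a hypothetical helper name, and for French text you would need to supply a French lexicon (packages like `wordsegment` or `wordninja` do something similar for English):

```python
def repair_spacing(line, lexicon):
    """Collapse spurious intra-word spaces: remove every space, then
    re-segment the character stream against a word list, preferring
    the segmentation with the fewest words (dynamic programming)."""
    chars = line.replace(" ", "")
    n = len(chars)
    best = [None] * (n + 1)  # best[i] = (word_count, words) covering chars[:i]
    best[0] = (0, [])
    max_len = max((len(w) for w in lexicon), default=0)
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            word = chars[j:i]
            if best[j] is not None and word.lower() in lexicon:
                cand = (best[j][0] + 1, best[j][1] + [word])
                if best[i] is None or cand[0] < best[i][0]:
                    best[i] = cand
    if best[n] is None:
        return line  # could not segment; keep the original line untouched
    return " ".join(best[n][1])
```

For example, `repair_spacing("Th i s i s a tex t", {"this", "is", "a", "text"})` returns `"This is a text"`. Lines containing punctuation would need tokenizing first; unknown words fall back to the original line unchanged.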

ep0p commented 3 days ago

Hi @paulgekeler,

Indeed, when using the fine-tuned version this issue no longer exists. However, it is replaced by entire pages being ignored, with not a single word on them recognised.

Do you have any idea whether GOT can handle images that might be skewed? Another guess is that it's the noise and I need to fine-tune with noisy images; I'll look into that.

PS: all my documents are French legal documents with, sometimes, complicated layouts.

paulgekeler commented 3 days ago

@ep0p yes, I've experienced the same thing. When I try to run multi-page inference, I barely get any output, maybe the first couple of lines of text. My suspicion is that the compression of the visual information is too aggressive for dense text spanning multiple pages. I think their multi-page training consisted of multiple pages of sparse text.

Ucas-HaoranWei commented 3 days ago

Hi, it would help to use a for-loop for multi-page inference. Multi-page input is only used for training; more details can be found in the paper.
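The suggested per-page loop can be sketched like this. Here `infer` is a stand-in for whatever single-page call you use (the repo's demo exposes something like `model.chat(tokenizer, image_file, ocr_type='ocr')`, but check the README for the exact signature), and `ocr_pdf_pages` is my own hypothetical helper:

```python
def ocr_pdf_pages(page_image_paths, infer):
    """Run single-page inference over each page image in order and
    join the per-page results with blank lines."""
    results = []
    for path in page_image_paths:
        text = infer(path)
        results.append(text if text else "")  # keep page order even on empty output
    return "\n\n".join(results)
```

You would obtain `page_image_paths` by splitting the PDF into one image per page first, e.g. with the `pdf2image` package's `convert_from_path`.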

paulgekeler commented 3 days ago

@Ucas-HaoranWei thanks, I read the paper. I will try to fine-tune some more on multi-page data.

ep0p commented 3 days ago

@paulgekeler and @Ucas-HaoranWei In my case, I split the PDF into images and performed inference in a loop, page by page. Some pages were ignored, even though they had the same format as the others. However, it seemed to me that they were slightly tilted. I deskewed them, and this apparently helped because they were properly recognized afterward.

Would fine-tuning with skewed images help in this case?
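For reference, the deskewing I did can be sketched with a simple projection-profile estimator: try a range of rotations and keep the angle that makes the horizontal projection of the ink "spikiest" (highest variance), which is what aligned text lines look like. This is an illustrative sketch, not GOT code, and `estimate_skew`/`deskew` are my own names:

```python
import numpy as np
from PIL import Image

def estimate_skew(img, angles=np.arange(-5.0, 5.25, 0.25)):
    """Return the rotation angle (degrees, counterclockwise) that
    maximises the variance of the row-wise ink profile."""
    gray = np.asarray(img.convert("L"), dtype=np.float32)
    ink = Image.fromarray(((gray < 128) * 255).astype(np.uint8))  # dark pixels = ink
    best_angle, best_score = 0.0, -1.0
    for a in angles:
        rotated = np.asarray(ink.rotate(float(a))) > 0
        score = float(np.var(rotated.sum(axis=1)))  # peaked profile -> aligned lines
        if score > best_score:
            best_score, best_angle = score, float(a)
    return best_angle

def deskew(img):
    """Rotate the page by the estimated skew; white fill avoids
    black corners that could confuse the model."""
    return img.rotate(estimate_skew(img), expand=True, fillcolor="white")
```

Running `deskew` on each page image before inference is cheap and only searches a small angle range, which matched my "slightly tilted" pages.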

paulgekeler commented 3 days ago

@ep0p pretty sure it would. Nougat and Donut, for example, also distort some page images before training to increase robustness.

ep0p commented 2 days ago

@paulgekeler thanks a lot. I will add a skewed subset to my dataset as well and attempt fine-tuning.
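A skewed/noisy subset can be generated from existing pages with a tiny augmentation helper. This is a toy sketch under my own assumptions (the function name and parameters are not from GOT-OCR2.0); it returns the applied angle too, in case you want to keep it as metadata:

```python
import random
import numpy as np
from PIL import Image

def augment_page(img, max_skew_deg=4.0, noise_std=8.0, seed=None):
    """Return (augmented_image, angle): a randomly skewed, lightly
    noised grayscale copy of a page image."""
    angle = random.Random(seed).uniform(-max_skew_deg, max_skew_deg)
    out = img.convert("L").rotate(angle, expand=True, fillcolor=255)
    arr = np.asarray(out, dtype=np.float32)
    # additive Gaussian pixel noise, clipped back to valid intensities
    arr += np.random.default_rng(seed).normal(0.0, noise_std, arr.shape)
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8)), angle
```

Generating a few augmented copies per clean page, mixed with the originals, is the usual way to make the fine-tuning set skew- and noise-robust without collecting new documents.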

thhung commented 1 day ago

@ep0p Did you manage to fine-tune on your dataset? If you did so successfully, would you mind sharing the format of your data and your training settings?