NielsRogge / Transformers-Tutorials

This repository contains demos I made with the Transformers library by HuggingFace.

segment-level OCR for LiLT #235

Open minmin-intel opened 1 year ago

minmin-intel commented 1 year ago

Thanks for pointing out the performance impact of OCR on LiLT in your repo https://github.com/NielsRogge/Transformers-Tutorials/tree/master/LiLT, where you mention: "Please always use an OCR engine that can recognize segments, and use the same bounding boxes for all words that make up a segment. This will greatly improve performance."

I tried the LayoutLMv3 feature extractor with its default OCR (which I believe is Google's Tesseract OCR), but I found that the bounding boxes are per word, not per segment. Could you refer me to the segment-level OCR you used?

NielsRogge commented 1 year ago

I'd recommend using the Azure Read API with readingOrder="natural".
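
For reference, a minimal sketch (not part of the original reply) of calling the Read API over REST with readingOrder=natural and reusing each line's bounding box for all of its words. It assumes Read API v3.2; the endpoint, key and file name are placeholders:

```python
# Minimal sketch: Azure Read API v3.2 over REST with readingOrder=natural.
# ENDPOINT, KEY and the file name are placeholders.
import time
import requests

ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com"
KEY = "<your-key>"

url = f"{ENDPOINT}/vision/v3.2/read/analyze?readingOrder=natural"
headers = {"Ocp-Apim-Subscription-Key": KEY, "Content-Type": "application/octet-stream"}
with open("document.png", "rb") as f:
    response = requests.post(url, headers=headers, data=f.read())
response.raise_for_status()

# The service processes asynchronously; poll the Operation-Location URL.
operation_url = response.headers["Operation-Location"]
while True:
    result = requests.get(operation_url, headers={"Ocp-Apim-Subscription-Key": KEY}).json()
    if result["status"] in ("succeeded", "failed"):
        break
    time.sleep(1)

# Reuse each line's bounding box for all of its words (segment-level boxes).
# Note: boundingBox is an 8-value polygon in pixel coordinates; for LiLT it still
# needs to be reduced to [x_min, y_min, x_max, y_max] and normalized to 0-1000.
words, boxes = [], []
for page in result["analyzeResult"]["readResults"]:
    for line in page["lines"]:
        for word in line["words"]:
            words.append(word["text"])
            boxes.append(line["boundingBox"])  # same box for every word in the line
```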

minmin-intel commented 1 year ago

Thank you @NielsRogge. Is there any open-source OCR engine that can produce segment-level outputs?

ArlindNocaj commented 1 year ago

If you look at the Tesseract OCR engine and its TSV output format, you can see that it contains columns called block_num and par_num. You should be able to use these to determine the blocks of text that belong together.

How the layout analysis behaves also depends a bit on the page segmentation mode you choose (--psm parameter). A sample of the TSV output is shown below, with a sketch of the grouping after it.

level   page_num        block_num       par_num line_num        word_num        left    top     width   height  conf    text
1       1       0       0       0       0       0       0       640     500     -1
2       1       1       0       0       0       61      41      513     372     -1
3       1       1       1       0       0       61      41      513     372     -1
4       1       1       1       1       0       65      41      450     30      -1
5       1       1       1       1       1       65      41      46      20      96.063751       The
5       1       1       1       1       2       128     42      89      24      95.965691       (quick)
5       1       1       1       1       3       235     43      95      25      95.835831       [brown]
5       1       1       1       1       4       349     44      66      25      94.899742       {fox}
5       1       1       1       1       5       429     45      86      26      96.683357       jumps!
4       1       1       1       2       0       65      72      490     31      -1
5       1       1       1       2       1       65      72      60      20      96.912064       Over
5       1       1       1       2       2       140     73      37      20      96.887390       the
5       1       1       1       2       3       194     73      139     24      93.263031       $43,456.78
5       1       1       1       2       4       350     76      85      25      90.893219       <lazy>
5       1       1       1       2       5       451     77      44      19      96.820717       #90
5       1       1       1       2       6       511     78      44      25      96.538940       dog
4       1       1       1       3       0       64      103     458     26      -1
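
Building on the TSV columns above, here is a minimal sketch (assuming pytesseract and Pillow are installed; document.png is a placeholder) that unions the word boxes per line and assigns that segment box to every word, as the LiLT README recommends. Dropping line_num from the key groups per paragraph instead, matching the block_num/par_num suggestion above:

```python
# Minimal sketch: turn Tesseract's word-level TSV output into segment-level boxes.
# All words sharing (block_num, par_num, line_num) get the union box of their line.
import pytesseract
from pytesseract import Output
from PIL import Image

image = Image.open("document.png")  # placeholder input file
data = pytesseract.image_to_data(image, output_type=Output.DICT, config="--psm 3")

# Collect the union (min/max) box per segment key.
segments = {}
for i, text in enumerate(data["text"]):
    if not text.strip() or float(data["conf"][i]) < 0:
        continue  # skip empty cells and layout-only rows (conf == -1)
    key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
    left, top = data["left"][i], data["top"][i]
    right, bottom = left + data["width"][i], top + data["height"][i]
    if key in segments:
        l, t, r, b = segments[key]
        segments[key] = (min(l, left), min(t, top), max(r, right), max(b, bottom))
    else:
        segments[key] = (left, top, right, bottom)

# Assign the same segment box to every word in that segment.
words, boxes = [], []
for i, text in enumerate(data["text"]):
    if not text.strip() or float(data["conf"][i]) < 0:
        continue
    key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
    words.append(text)
    boxes.append(list(segments[key]))  # still needs normalizing to 0-1000 for LiLT
```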
minmin-intel commented 1 year ago

@ArlindNocaj Thanks! This is helpful

mariobrosse44140 commented 1 year ago

Hi @NielsRogge. Thanks for all this information. I ran into trouble using my own OCR together with the LiLT model. Everything is fine during training (following your notebooks), but I can't make it work during inference. The problem lies in the format of the encoding I pass to the model (output = model(**encoding)). Could you please give me some clues on this issue?
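
For anyone hitting the same issue, a minimal inference sketch, assuming a checkpoint whose tokenizer accepts a list of words plus boxes (as in the LiLT notebooks based on SCUT-DLVCLab/lilt-roberta-en-base); the words and boxes below are placeholders and must already be normalized to 0-1000:

```python
# Minimal LiLT inference sketch (assumption: the checkpoint's tokenizer accepts
# word lists plus boxes, like the one shipped with SCUT-DLVCLab/lilt-roberta-en-base).
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Replace with your fine-tuned checkpoint; the base checkpoint only gets a
# randomly initialized classification head.
model_id = "SCUT-DLVCLab/lilt-roberta-en-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

# Placeholder OCR output: words plus one segment-level box per word,
# normalized to the 0-1000 range expected by the model.
words = ["Invoice", "number:", "12345"]
boxes = [[75, 40, 340, 65], [75, 40, 340, 65], [75, 40, 340, 65]]

encoding = tokenizer(words, boxes=boxes, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**encoding)

# One predicted label id per (sub)word token.
predictions = outputs.logits.argmax(-1).squeeze().tolist()
```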