minmin-intel opened 1 year ago
I'd recommend using the Azure Read API with `readingOrder="natural"`.
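For reference, here is a minimal sketch of calling the Read API (v3.2) with that setting, using only the standard library. The endpoint, key, and image URL are placeholders, and the exact API version may differ for your resource:

```python
import json
import urllib.parse
import urllib.request

def build_read_request(endpoint: str, key: str, image_url: str) -> urllib.request.Request:
    """Build the async Read 'analyze' POST request.

    The Read API is asynchronous: this call returns an Operation-Location
    header from which the OCR result is fetched later with GET requests.
    """
    # readingOrder=natural asks the service to order lines as a human would read them
    query = urllib.parse.urlencode({"readingOrder": "natural"})
    url = f"{endpoint}/vision/v3.2/read/analyze?{query}"
    return urllib.request.Request(
        url,
        data=json.dumps({"url": image_url}).encode(),
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_read_request("https://<resource>.cognitiveservices.azure.com",
                         "<key>", "https://example.com/page.png")
```

In practice you would send `req` with `urllib.request.urlopen`, then poll the URL in the `Operation-Location` response header until the analysis finishes.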
Thank you @NielsRogge! Is there any open-source OCR engine that can produce segment-level outputs?
If you look at the Tesseract OCR engine and its TSV output format, you can see that it contains columns called block_num and par_num. You should be able to use these to determine which blocks of text belong together.
How the layout analysis works depends a bit on the page segmentation mode you choose (the `--psm` parameter).
```
level  page_num  block_num  par_num  line_num  word_num  left  top  width  height  conf       text
1      1         0          0        0         0         0     0    640    500     -1
2      1         1          0        0         0         61    41   513    372     -1
3      1         1          1        0         0         61    41   513    372     -1
4      1         1          1        1         0         65    41   450    30      -1
5      1         1          1        1         1         65    41   46     20      96.063751  The
5      1         1          1        1         2         128   42   89     24      95.965691  (quick)
5      1         1          1        1         3         235   43   95     25      95.835831  [brown]
5      1         1          1        1         4         349   44   66     25      94.899742  {fox}
5      1         1          1        1         5         429   45   86     26      96.683357  jumps!
4      1         1          1        2         0         65    72   490    31      -1
5      1         1          1        2         1         65    72   60     20      96.912064  Over
5      1         1          1        2         2         140   73   37     20      96.887390  the
5      1         1          1        2         3         194   73   139    24      93.263031  $43,456.78
5      1         1          1        2         4         350   76   85     25      90.893219  <lazy>
5      1         1          1        2         5         451   77   44     19      96.820717  #90
5      1         1          1        2         6         511   78   44     25      96.538940  dog
4      1         1          1        3         0         64    103  458    26      -1
```
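To make this concrete, here is a sketch (not an official utility) of grouping such TSV output into segment-level boxes: level-5 rows are words, and a `(block_num, par_num, line_num)` key groups the words of one line into a segment whose box is the union of its word boxes:

```python
import csv
import io
from collections import defaultdict

def segments_from_tsv(tsv_text: str) -> dict:
    """Group Tesseract TSV words into segments.

    Returns {(block_num, par_num, line_num): (words, bbox)} where bbox is
    (x0, y0, x1, y1), the union of the word boxes in that line.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        if int(row["level"]) != 5:  # level 5 rows are individual words
            continue
        key = (int(row["block_num"]), int(row["par_num"]), int(row["line_num"]))
        x0, y0 = int(row["left"]), int(row["top"])
        box = (x0, y0, x0 + int(row["width"]), y0 + int(row["height"]))
        groups[key].append((row["text"], box))
    segments = {}
    for key, items in groups.items():
        words = [word for word, _ in items]
        boxes = [box for _, box in items]
        bbox = (min(b[0] for b in boxes), min(b[1] for b in boxes),
                max(b[2] for b in boxes), max(b[3] for b in boxes))
        segments[key] = (words, bbox)
    return segments
```

You could also key on `(block_num, par_num)` alone if you want paragraph-level rather than line-level segments.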
@ArlindNocaj Thanks! This is helpful.
Hi @NielsRogge. Thanks for all this information. I ran into trouble using my own OCR together with the LiLT model. Everything is fine during training (following your notebooks), but I can't make it work during inference. The problem lies in the encoding format I pass to the model (`output = model(**encoding)`). Could you please give me some clues on this issue?
Thanks for pointing out the performance impact of OCR on LiLT in your repo https://github.com/NielsRogge/Transformers-Tutorials/tree/master/LiLT, where you mentioned: "Please always use an OCR engine that can recognize segments, and use the same bounding boxes for all words that make up a segment. This will greatly improve performance."
I tried the LayoutLMv3 feature extractor with its default OCR (which I believe is Tesseract), but I found that the bounding boxes are per word, not per segment. Could you refer me to the segment-level OCR you used?
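If it helps, the tip quoted above ("use the same bounding boxes for all words that make up a segment") can be applied on top of any word-level OCR output, given some segment assignment per word. A minimal sketch (the function name and the segment-id input are my own assumptions, not from the tutorial):

```python
def apply_segment_boxes(word_boxes, segment_ids):
    """Replace each word's box with the union box of its segment.

    word_boxes:  list of (x0, y0, x1, y1) word-level boxes
    segment_ids: parallel list assigning each word to a segment
    Returns a parallel list where every word in a segment shares one box.
    """
    union = {}
    for box, seg in zip(word_boxes, segment_ids):
        if seg not in union:
            union[seg] = list(box)
        else:
            u = union[seg]
            u[0] = min(u[0], box[0])  # leftmost x
            u[1] = min(u[1], box[1])  # topmost y
            u[2] = max(u[2], box[2])  # rightmost x
            u[3] = max(u[3], box[3])  # bottommost y
    return [tuple(union[seg]) for seg in segment_ids]
```

The resulting boxes can then be fed to the tokenizer in place of the word-level ones before normalizing to the 0-1000 range the LayoutLM-family models expect.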