dinosauria123 / gcv2hocr

gcv2hocr converts from Google Cloud Vision OCR output to hocr to make a searchable pdf.

gcv2hocr2.py output - Correct bbox for individual words, but "ocr_lines" completely busted #38

Open hengyu95 opened 3 years ago

hengyu95 commented 3 years ago

I had to set page_width and page_height manually to match my PDF images before the words would align. I checked the coordinates of each word by hand and am confident the word boxes are correct. The ocr_line boxes are not: each one seems to copy the bbox of the last word of the previous sentence, like so:

#sentence 1
                <span class='ocrx_word' id='word_2_1_50' title='bbox 658 495 664 518'>”</span>
                <span class='ocrx_word' id='word_2_1_51' title='bbox 675 495 691 518'>of</span>
                <span class='ocrx_word' id='word_2_1_52' title='bbox 698 495 785 518'>Leninism</span>
                <span class='ocrx_word' id='word_2_1_53' title='bbox 789 495 791 518'>,</span>
            </span>
#sentence 2
            <span class='ocr_line' id='line_2_1_4' title='bbox 789 495 791 518; baseline 0 0'>
                <span class='ocrx_word' id='word_2_1_54' title='bbox 120 522 172 548'>social</span>
                <span class='ocrx_word' id='word_2_1_55' title='bbox 183 522 283 548'>democracy</span>
                <span class='ocrx_word' id='word_2_1_56' title='bbox 285 522 289 548'>,</span>
                <span class='ocrx_word' id='word_2_1_57' title='bbox 297 522 316 548'>or</span>

I haven't been able to figure out the significance of "baseline". Should I be tweaking those values to get correct lines?
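
For reference, an ocr_line bbox would normally be expected to enclose all of the words inside it. In the hOCR specification, "baseline" gives a slope and offset relative to the bottom-left corner of the line's bbox, so "baseline 0 0" just describes a flat baseline along the bottom edge, which makes it an unlikely cause of the misplaced lines. The sketch below is a minimal illustration of recomputing a line box as the union of its word boxes, using the four words shown for line_2_1_4. The helper names union_bbox and word_boxes are hypothetical and are not taken from gcv2hocr2.py, and the regex parsing assumes the single-quoted attribute layout seen in the excerpt above.

```python
import re

# Matches the four integers of an hOCR "bbox x0 y0 x1 y1" property.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def union_bbox(boxes):
    """Return the smallest (x0, y0, x1, y1) box enclosing every box given."""
    boxes = list(boxes)
    if not boxes:
        raise ValueError("need at least one box")
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def word_boxes(hocr_fragment):
    """Collect bbox tuples from the ocrx_word spans in an hOCR fragment.

    Assumes attributes are single-quoted, as in the output quoted above.
    """
    boxes = []
    for tag in re.findall(r"<span class='ocrx_word'[^>]*>", hocr_fragment):
        match = BBOX_RE.search(tag)
        if match:
            boxes.append(tuple(int(v) for v in match.groups()))
    return boxes


# The four words listed under line_2_1_4 in the excerpt.
sample = """
<span class='ocrx_word' id='word_2_1_54' title='bbox 120 522 172 548'>social</span>
<span class='ocrx_word' id='word_2_1_55' title='bbox 183 522 283 548'>democracy</span>
<span class='ocrx_word' id='word_2_1_56' title='bbox 285 522 289 548'>,</span>
<span class='ocrx_word' id='word_2_1_57' title='bbox 297 522 316 548'>or</span>
"""

line_box = union_bbox(word_boxes(sample))
print("bbox %d %d %d %d" % line_box)  # bbox 120 522 316 548
```

Since the excerpt is cut off after "or", the real line box would extend further right once the remaining words are included. Either way, it would start near x=120, y=522 rather than at 789 495.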

OttomanZ commented 2 years ago

Hey @hengyu95, quick question: is this bug still present in gcv2hocr2.py? If not, can you share a code outline or a gist of your edited script? I have updated my own copy with many improvements and am interested in yours too. Share it here so I can improve mine. :)