dinosauria123 / gcv2hocr

gcv2hocr converts from Google Cloud Vision OCR output to hocr to make a searchable pdf.

characters or words are being written as lines #31

Closed prashantguleria closed 4 years ago

prashantguleria commented 4 years ago

When I run hocr-lines on the generated hocr file, each character and each word is displayed as a separate line. As a result, libraries like PDFBox can't search for several words together.

I'm attaching a screenshot of the hocr-lines output for the output hocr file in the sample directory.

dinosauria123 commented 4 years ago

Thank you for your report. gcv2hocr may not output correct hocr files. Which version did you use, C or Python? I think the Python version's output is better than the C version's. I don't plan to fix this issue, because the output is sufficient to make a searchable pdf.

prashantguleria commented 4 years ago

@dinosauria123 Thanks for your reply. I used only the Python version. I did fix the issue, but I did it in Java for Microsoft OCR output. I think Google doesn't provide line bboxes, whereas Microsoft provides all the necessary information.
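
For readers hitting the same problem: when the OCR output has only word-level boxes, one possible approach is to group words into lines by vertical position and then emit one `ocr_line` span per group. The following is a minimal Python sketch of that idea. It is not the Java fix mentioned above and not part of gcv2hocr; the `Word` class and the helper names `group_words_into_lines`, `line_bbox`, and `to_hocr_line` are hypothetical, and the grouping rule (a word joins the current line if its vertical center falls inside that line's vertical extent) is an assumption that will not hold for skewed or multi-column pages.

```python
import html
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    """Hypothetical word record: text plus bbox in pixel coordinates."""
    text: str
    x0: int
    y0: int
    x1: int
    y1: int


def group_words_into_lines(words: List[Word]) -> List[List[Word]]:
    """Greedily group words whose vertical center lies within the current line."""
    lines: List[List[Word]] = []
    for w in sorted(words, key=lambda w: (w.y0 + w.y1) / 2):
        center = (w.y0 + w.y1) / 2
        if lines:
            current = lines[-1]
            top = min(x.y0 for x in current)
            bottom = max(x.y1 for x in current)
            if top <= center <= bottom:
                current.append(w)
                continue
        lines.append([w])
    # Order the words of each line from left to right.
    return [sorted(line, key=lambda w: w.x0) for line in lines]


def line_bbox(line: List[Word]) -> Tuple[int, int, int, int]:
    """Union of the word boxes in one line."""
    return (
        min(w.x0 for w in line),
        min(w.y0 for w in line),
        max(w.x1 for w in line),
        max(w.y1 for w in line),
    )


def to_hocr_line(line: List[Word], index: int) -> str:
    """Render one line as an hOCR ocr_line span that wraps ocrx_word spans."""
    x0, y0, x1, y1 = line_bbox(line)
    words = " ".join(
        f"<span class='ocrx_word' title='bbox {w.x0} {w.y0} {w.x1} {w.y1}'>"
        f"{html.escape(w.text)}</span>"
        for w in line
    )
    return (
        f"<span class='ocr_line' id='line_{index}' "
        f"title='bbox {x0} {y0} {x1} {y1}'>{words}</span>"
    )
```

With lines grouped this way, each `ocr_line` span carries its own bbox and contains several words, which is the structure that line-oriented tools expect.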