Closed prashantguleria closed 4 years ago
Thank you for your report. gcv2hocr may not output correct hOCR files. Which version did you use, C or Python? I think the Python version's output is better than the C version's. I don't plan to fix this issue because the output is sufficient for making a searchable PDF.
@dinosauria123 Thanks for your reply. I used the Python version only. I did fix the issue, but I did it in Java and for Microsoft OCR output. I think Google doesn't provide line bounding boxes, whereas Microsoft provides all the necessary information.
When running hocr-lines on the produced hOCR file, each character and word is displayed as a separate line. Libraries like PDFBox aren't able to search for multiple words together.
Attaching a screenshot of the hocr-lines output for the hOCR file in the sample directory.
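For anyone hitting the same problem, one possible workaround is to rebuild the lines from the word bounding boxes before writing the hOCR. The snippet below is only a rough sketch of that idea and is not how gcv2hocr works: the `(text, x0, y0, x1, y1)` tuple format, the function names, and the 0.5 overlap threshold are all assumptions made for illustration. It groups words whose boxes overlap vertically and wraps each group in a single `ocr_line` span.

```python
import html
from typing import List, Tuple

# Hypothetical word format: (text, x0, y0, x1, y1) in pixel coordinates.
Word = Tuple[str, int, int, int, int]


def group_words_into_lines(words: List[Word], min_overlap: float = 0.5) -> List[List[Word]]:
    """Group words into lines when their vertical extents overlap enough.

    min_overlap is an assumed threshold: the fraction of the shorter
    height (word or line) that must be shared.
    """
    lines: List[List[Word]] = []
    # Visit words top to bottom, then left to right.
    for word in sorted(words, key=lambda w: (w[2], w[1])):
        _, _, y0, _, y1 = word
        for line in lines:
            line_y0 = min(w[2] for w in line)
            line_y1 = max(w[4] for w in line)
            overlap = min(y1, line_y1) - max(y0, line_y0)
            shorter = min(y1 - y0, line_y1 - line_y0)
            if shorter > 0 and overlap / shorter >= min_overlap:
                line.append(word)
                break
        else:
            # No existing line matched, so start a new one.
            lines.append([word])
    # Restore reading order inside each line.
    for line in lines:
        line.sort(key=lambda w: w[1])
    return lines


def line_bbox(line: List[Word]) -> Tuple[int, int, int, int]:
    """Return the smallest box that contains every word in the line."""
    return (
        min(w[1] for w in line),
        min(w[2] for w in line),
        max(w[3] for w in line),
        max(w[4] for w in line),
    )


def to_hocr_line(line: List[Word], line_id: int) -> str:
    """Render one line as an ocr_line span containing ocrx_word spans."""
    x0, y0, x1, y1 = line_bbox(line)
    word_spans = " ".join(
        "<span class='ocrx_word' title='bbox {} {} {} {}'>{}</span>".format(
            wx0, wy0, wx1, wy1, html.escape(text)
        )
        for text, wx0, wy0, wx1, wy1 in line
    )
    return "<span class='ocr_line' id='line_{}' title='bbox {} {} {} {}'>{}</span>".format(
        line_id, x0, y0, x1, y1, word_spans
    )
```

This simple overlap test assumes roughly horizontal, single-column text. Skewed scans or multi-column layouts would probably need something more careful.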