dinosauria123 / gcv2hocr

gcv2hocr converts from Google Cloud Vision OCR output to hocr to make a searchable pdf.

Issue when converting to hocr #13

Open ebriquet64 opened 6 years ago

ebriquet64 commented 6 years ago

Hello,

Thanks for your code!

I have an issue with this file when I try to convert it to hOCR.

The JPEG-to-JSON conversion was done with gcvocr.sh.

Please find an example attached. Thanks for your help.

MAXX.json.zip

dinosauria123 commented 6 years ago

Thank you for your comment.

This is a bug in Google's JSON output.

You can see that each word has eight numbers.

Some of the words are missing an x or y number.

Please add the missing numbers manually, then rerun gcv2hocr until the segfault goes away.
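
For anyone who would rather not patch the file by hand, here is a minimal Python sketch of the same fix. It assumes the eight numbers are the `x`/`y` values of the four entries in each `vertices` list of the Cloud Vision JSON, and that a missing value can be filled with 0. Both points are assumptions, and `fill_missing_vertices` is a hypothetical helper, not part of gcv2hocr.

```python
import json


def fill_missing_vertices(node, default=0):
    """Recursively add missing 'x'/'y' keys to every entry of any 'vertices' list.

    Returns the number of coordinates that were filled in.
    Assumption: a missing coordinate can safely be replaced by `default`.
    """
    fixed = 0
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "vertices" and isinstance(value, list):
                for vertex in value:
                    if isinstance(vertex, dict):
                        for axis in ("x", "y"):
                            if axis not in vertex:
                                vertex[axis] = default
                                fixed += 1
            else:
                fixed += fill_missing_vertices(value, default)
    elif isinstance(node, list):
        for item in node:
            fixed += fill_missing_vertices(item, default)
    return fixed


# Small made-up sample: the first vertex has no 'x', the third has no 'y'.
sample = {
    "boundingPoly": {
        "vertices": [{"y": 5}, {"x": 40, "y": 5}, {"x": 40}, {"x": 3, "y": 20}]
    }
}
count = fill_missing_vertices(sample)
print(count, json.dumps(sample))
```

To repair a real file, load it with `json.load`, call the function on the result, and write it back with `json.dump` before running gcv2hocr again.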

ebriquet64 commented 6 years ago

Thanks for your very quick response!

I am trying to find a good solution. I have tested the Microsoft API, but its JSON format is not the same as Google's. Is there any way to use it?

Thanks for your help

dinosauria123 commented 6 years ago

Thank you for your comment.

An hOCR file needs each word (or letter) and its coordinates.

You have to extract the words and their coordinates from the MS JSON output, then rearrange them into hOCR format.

My C code is not good, but it helps to understand what is going on.

If you prefer Python, the Python port is also included in the gcv2hocr sources.
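
As a rough illustration of the "extract, then rearrange" step, here is a minimal Python sketch that turns a list of words with corner coordinates into a bare-bones hOCR page. The input layout (a word plus four `(x, y)` corner points) and the function names are hypothetical. This is not the gcv2hocr code and it does not assume any particular MS JSON field names, so you would still need to write the part that pulls the words and coordinates out of the MS output.

```python
from html import escape


def polygon_to_bbox(points):
    """Reduce a list of (x, y) corner points to (x0, y0, x1, y1)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


def words_to_hocr(words, page_width, page_height):
    """Build a minimal hOCR document from (text, corner_points) pairs.

    `words` is a hypothetical intermediate format: whatever OCR service
    is used, its output has to be converted into this list first.
    """
    lines = [
        "<!DOCTYPE html>",
        "<html>",
        "<head>",
        "<meta charset='utf-8'/>",
        "<meta name='ocr-capabilities' content='ocr_page ocrx_word'/>",
        "</head>",
        "<body>",
        f"<div class='ocr_page' id='page_1' "
        f"title='bbox 0 0 {page_width} {page_height}'>",
    ]
    for i, (text, points) in enumerate(words, start=1):
        x0, y0, x1, y1 = polygon_to_bbox(points)
        lines.append(
            f"<span class='ocrx_word' id='word_1_{i}' "
            f"title='bbox {x0} {y0} {x1} {y1}'>{escape(text)}</span>"
        )
    lines += ["</div>", "</body>", "</html>"]
    return "\n".join(lines)


# Made-up example data: two words on a 200x100 pixel page.
example_words = [
    ("Hello", [(10, 20), (60, 20), (60, 40), (10, 40)]),
    ("R&D", [(70, 20), (110, 20), (110, 40), (70, 40)]),
]
print(words_to_hocr(example_words, 200, 100))
```

The sketch writes every word directly under the page element. Grouping words into lines (`ocr_line`) is left out to keep it short.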