dinosauria123 / gcv2hocr

gcv2hocr converts from Google Cloud Vision OCR output to hocr to make a searchable pdf.

gcv2ocr.py does not convert json #35

Open sarepal opened 4 years ago

sarepal commented 4 years ago

I'm working with the attached JSON file from GCV, but when I run gcv2ocr.py, the resulting hocr contains only metadata and no content. Attachment: osh-sample-1911a-0001.json.zip

dinosauria123 commented 4 years ago

Thank you for your report. Did you use gcvocr.sh to get the JSON file?

sarepal commented 4 years ago

No, I used a script based on a Google Cloud Vision tutorial. I'll look into using the shell script instead.

svamsip commented 4 years ago

@sarepal @dinosauria123 Any update on how to convert the JSON file attached above to hocr? Thanks in advance.

sarepal commented 4 years ago

Update: I got the correct API key to generate the json using gcvocr.sh and was able to convert it to hocr with gcv2ocr.py.

However, I noticed in the hocr output that there is a <span class='ocr_line'....> around every word instead of every line of text.

@dinosauria123 does gcv2ocr.py only deal with the data in the json's "textAnnotations" and not the data in "fullTextAnnotation"? Thanks.
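For context on the two fields named above: in a Vision API response, "textAnnotations" is a flat list of detected text entries, while "fullTextAnnotation" nests pages, blocks, paragraphs, words and symbols, and marks line ends through each symbol's detectedBreak property. The following is a minimal sketch of how words could be grouped into lines from that nested structure. It is not taken from gcv2ocr.py or gcv2hocr2.py; the helper names (`make_word`, `lines_from_full_text`) and the sample data are hypothetical, and treating only EOL_SURE_SPACE and LINE_BREAK as line ends is an assumption.

```python
# Break types assumed to end a line (HYPHEN is deliberately left out here).
LINE_ENDINGS = {"EOL_SURE_SPACE", "LINE_BREAK"}


def make_word(text, break_type=None):
    """Build a hypothetical GCV-style word dict, one symbol per character."""
    symbols = [{"text": ch} for ch in text]
    if break_type and symbols:
        symbols[-1]["property"] = {"detectedBreak": {"type": break_type}}
    return {"symbols": symbols}


def lines_from_full_text(annotation):
    """Group words into lines using the break type on each word's last symbol."""
    lines, current = [], []
    for page in annotation.get("pages", []):
        for block in page.get("blocks", []):
            for paragraph in block.get("paragraphs", []):
                for word in paragraph.get("words", []):
                    symbols = word.get("symbols", [])
                    current.append("".join(s.get("text", "") for s in symbols))
                    last_break = None
                    if symbols:
                        prop = symbols[-1].get("property", {})
                        last_break = prop.get("detectedBreak", {}).get("type")
                    if last_break in LINE_ENDINGS:
                        lines.append(" ".join(current))
                        current = []
                # Flush a paragraph that ended without an explicit line break.
                if current:
                    lines.append(" ".join(current))
                    current = []
    return lines


# Hypothetical sample: two lines of text in a single paragraph.
sample = {
    "pages": [{
        "blocks": [{
            "paragraphs": [{
                "words": [
                    make_word("Hello", "SPACE"),
                    make_word("world", "EOL_SURE_SPACE"),
                    make_word("again", "LINE_BREAK"),
                ]
            }]
        }]
    }]
}

print(lines_from_full_text(sample))  # ['Hello world', 'again']
```

A converter working from the nested structure this way could emit one ocr_line span per group rather than one per word.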

sarepal commented 4 years ago

I see that gcv2hocr2.py does handle fullTextAnnotation. When I try to run it this is the output I receive:

python ../gcv2hocr2.py osh-sample-1911a-0001.jpg.json > output.hocr

Traceback (most recent call last):
  File "../gcv2hocr2.py", line 184, in <module>
    page = fromResponse(resp, str(args.gcv_file.rsplit('.',1)[0]), **args.__dict__)
  File "../gcv2hocr2.py", line 103, in fromResponse
    for page_id, page_json in enumerate(resp['fullTextAnnotation']['pages']):
KeyError: 'fullTextAnnotation'

The JSON does contain a fullTextAnnotation object so I don't know why this error would occur. I'm attaching the JSON I tried to process. If there's a way to get this script to successfully run, I would be very grateful. Thanks again. osh-sample-1911a-0001.jpg.json.zip

sarepal commented 4 years ago

UPDATE: I now have gcv2hocr2.py working. I just edited line 103 to this and it worked:

for page_id, page_json in enumerate(resp['responses'][0]['fullTextAnnotation']['pages']):
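The edit above works because this JSON wraps the result in a top-level "responses" list, so "fullTextAnnotation" sits one level deeper than line 103 expected. A minimal sketch of a lookup that tolerates both shapes is shown below; `get_full_text_annotation` is a hypothetical helper, not part of gcv2hocr2.py, and the sample documents are made up for illustration.

```python
import json


def get_full_text_annotation(data):
    """Return the fullTextAnnotation dict from either JSON shape, or None."""
    # Shape 1: the annotation sits at the top level.
    if "fullTextAnnotation" in data:
        return data["fullTextAnnotation"]
    # Shape 2: the annotation is wrapped in a "responses" list.
    for response in data.get("responses") or []:
        if "fullTextAnnotation" in response:
            return response["fullTextAnnotation"]
    return None


# Hypothetical minimal documents illustrating both shapes.
wrapped = json.loads('{"responses": [{"fullTextAnnotation": {"pages": [{"width": 10}]}}]}')
bare = json.loads('{"fullTextAnnotation": {"pages": [{"width": 20}]}}')

for doc in (wrapped, bare):
    annotation = get_full_text_annotation(doc)
    for page_id, page_json in enumerate(annotation["pages"]):
        print(page_id, page_json["width"])
# prints "0 10" then "0 20"
```

Returning None when neither shape matches lets the caller report a clear message instead of raising a KeyError.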