Open sarepal opened 4 years ago
Thank you for your report. Did you use gcvocr.sh to get the JSON file?
No, I used a script based on a Google Cloud Vision tutorial. I'll look into using the shell script instead.
@sarepal @dinosauria123 Any update on how to convert the JSON file attached above to hOCR? Thanks in advance.
Update: I got the correct API key to generate the json using gcvocr.sh and was able to convert it to hocr with gcv2ocr.py.
However, I noticed in the hocr output that there is a <span class='ocr_line'....> around every word instead of every line of text.
@dinosauria123 does gcv2ocr.py only deal with the data in the json's "textAnnotations" and not the data in "fullTextAnnotation"? Thanks.
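For anyone comparing the two fields: below is a minimal sketch of how they differ in a Cloud Vision text detection response. It uses a tiny hand-written response, not real output, and the `sample` dict and `summarize` helper are made up for illustration (they are not part of this repository). `textAnnotations` is a flat list whose first entry holds the whole text and whose remaining entries are individual words, while `fullTextAnnotation` nests pages, blocks, paragraphs, words, and symbols.

```python
# Hand-written miniature of one Cloud Vision response (illustrative, not real output).
sample = {
    "textAnnotations": [
        {"description": "Hello world\n"},  # entry 0: the full detected text
        {"description": "Hello"},          # remaining entries: one per word
        {"description": "world"},
    ],
    "fullTextAnnotation": {
        "text": "Hello world\n",
        "pages": [{
            "blocks": [{
                "paragraphs": [{
                    "words": [
                        {"symbols": [{"text": c} for c in "Hello"]},
                        {"symbols": [{"text": c} for c in "world"]},
                    ]
                }]
            }]
        }],
    },
}


def summarize(response):
    """Return (flat word count, words rebuilt from the hierarchy) for one response dict."""
    # Skip entry 0 of textAnnotations, which is the whole text rather than a word.
    flat_words = len(response.get("textAnnotations", [])[1:])
    words = []
    for page in response.get("fullTextAnnotation", {}).get("pages", []):
        for block in page.get("blocks", []):
            for paragraph in block.get("paragraphs", []):
                for word in paragraph.get("words", []):
                    words.append("".join(s["text"] for s in word.get("symbols", [])))
    return flat_words, words


print(summarize(sample))
```

A converter that reads only the flat list sees words with no paragraph grouping, which might be related to the per-word `ocr_line` spans, though I have not confirmed that against the script.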
I see that gcv2hocr2.py does handle fullTextAnnotation. When I try to run it this is the output I receive:
python ../gcv2hocr2.py osh-sample-1911a-0001.jpg.json > output.hocr
Traceback (most recent call last):
  File "../gcv2hocr2.py", line 184, in <module>
    page = fromResponse(resp, str(args.gcv_file.rsplit('.',1)[0]), **args.__dict__)
  File "../gcv2hocr2.py", line 103, in fromResponse
    for page_id, page_json in enumerate(resp['fullTextAnnotation']['pages']):
KeyError: 'fullTextAnnotation'
The JSON does contain a fullTextAnnotation object so I don't know why this error would occur. I'm attaching the JSON I tried to process. If there's a way to get this script to successfully run, I would be very grateful. Thanks again. osh-sample-1911a-0001.jpg.json.zip
UPDATE: I now have gcv2hocr2.py working. I just edited line 103 to this and it worked:
for page_id, page_json in enumerate(resp['responses'][0]['fullTextAnnotation']['pages']):
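The KeyError makes sense given that fix: the Cloud Vision REST endpoint wraps its results in a top-level "responses" list, so `fullTextAnnotation` sits one level down rather than at the root of the file. A rough sketch of a lookup that accepts both shapes follows. Note that `iter_pages` is a hypothetical helper written for this comment, not something in gcv2hocr2.py, and the inline JSON is a made-up stand-in for a real response.

```python
import json


def iter_pages(resp):
    """Yield (page_id, page_json) from a bare or 'responses'-wrapped dict."""
    # REST output wraps per-image results in a "responses" list;
    # treat a bare result as a one-element list so both shapes work.
    responses = resp["responses"] if "responses" in resp else [resp]
    page_id = 0
    for response in responses:
        annotation = response.get("fullTextAnnotation")
        if annotation is None:
            continue  # this entry carries no full-text result
        for page_json in annotation.get("pages", []):
            yield page_id, page_json
            page_id += 1


# Made-up stand-in for a real response file.
wrapped = json.loads('{"responses": [{"fullTextAnnotation": {"pages": [{"width": 10}]}}]}')
bare = wrapped["responses"][0]

# Both shapes yield the same pages.
print(list(iter_pages(wrapped)) == list(iter_pages(bare)))
```

Unlike the one-line edit above, this also skips entries without `fullTextAnnotation` instead of raising, which may or may not be the behavior you want.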
I'm working with the attached JSON file from GCV, but when I run gcv2ocr.py, the hOCR output contains only metadata and no content. osh-sample-1911a-0001.json.zip