dinosauria123 / gcv2hocr

gcv2hocr converts Google Cloud Vision OCR output to hOCR to make a searchable PDF.

Could not convert json output #16

Open heroturtle opened 6 years ago

heroturtle commented 6 years ago

I tried to convert the JSON output on Google's page (https://cloud.google.com/vision/docs/ocr) using gcv2hocr.py:

Traceback (most recent call last):
  File "gcv2hocr2.py", line 146, in <module>
    page = fromResponse(resp, **args.__dict__)
  File "gcv2hocr.py", line 99, in fromResponse
    word.htmlid="word%d%d" % (len(page.content) - 1, len(curline.content))
AttributeError: 'NoneType' object has no attribute 'content'

Thanks

dinosauria123 commented 6 years ago

Thank you for using gcv2hocr. Please upload your JSON output file and I will check it.

heroturtle commented 6 years ago

Thanks for the quick reply. I used this Response from https://cloud.google.com/vision/docs/ocr: test2.jpg.json.zip

dinosauria123 commented 6 years ago

Thank you for uploading your file. I could convert it to an hOCR file using the C version of gcv2hocr, and I confirmed that the conversion fails with the Python version. I will fix the Python version. Sorry for the inconvenience.

dinosauria123 commented 6 years ago

I have modified gcv2hocr.py. I hope this fixes the issue.
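
The actual change to gcv2hocr.py is not shown in this thread. As an illustration only, the traceback in the first comment indicates that `curline` was still `None` when the first word was processed. A minimal sketch of one way to avoid that, assuming a hypothetical `Node` class and a hypothetical `group_words` helper (neither is taken from gcv2hocr), is to create the first line lazily:

```python
class Node:
    """Hypothetical container for a page, line, or word (not gcv2hocr's class)."""

    def __init__(self, htmlid=None):
        self.htmlid = htmlid
        self.content = []


def group_words(words):
    """Group (text, starts_new_line) pairs into lines on a single page.

    A line is created before the first word is attached, so the current
    line is never None when its content is accessed.
    """
    page = Node("page_0")
    curline = None
    for text, starts_new_line in words:
        # Guard: open a line if none exists yet or a new one is requested.
        if curline is None or starts_new_line:
            curline = Node("line_%d" % len(page.content))
            page.content.append(curline)
        word = Node("word_%d_%d" % (len(page.content) - 1, len(curline.content)))
        word.text = text
        curline.content.append(word)
    return page
```

With this guard, an input whose first word does not carry a line-start marker still lands in a valid line instead of raising `AttributeError`.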

heroturtle commented 6 years ago

Thanks for the prompt fix. It works now. May I ask two things? 1) I assume DOCUMENT_TEXT_DETECTION doesn't work yet. 2) I assume that for line_detection the image needs to be deskewed: it worked in the test sample but not in the sample I provided. In addition, the output of the C and Python versions is slightly different. Thanks for your work.

dinosauria123 commented 6 years ago
  1. I think DOCUMENT_TEXT_DETECTION supports some languages (English, etc.) but not all.

  2. The image needs to be deskewed to get a good recognition result. But I think that should be done by another application or command; it is not part of gcv2hocr.

The output of the C and Python versions is different. Historically, the Python version was not committed by me. The Python output is better than the C output in terms of the hOCR format (text structure). But the Python output fails to place characters correctly in Japanese vertical text (the purpose I made gcv2hocr for), because ReportLab (which generates the PDF output) does not support Japanese vertical text. So, in the C output, a CR/LF is added after every single word (character) to preserve its position in Japanese vertical text...
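
To illustrate the one-word-per-line idea described above, here is a rough sketch, not gcv2hocr's actual code. It assumes the Cloud Vision response layout in which `textAnnotations[0]` holds the full text and the remaining entries hold individual words with `boundingPoly.vertices`; the function names are hypothetical. Each word becomes its own hOCR `ocrx_word` span with an explicit `bbox`, so its position does not depend on line grouping:

```python
import html


def annotation_to_hocr_word(annotation, index):
    """Render one Cloud Vision text annotation as an hOCR word span."""
    vertices = annotation["boundingPoly"]["vertices"]
    # Cloud Vision may omit a coordinate whose value is 0, so default to 0.
    xs = [v.get("x", 0) for v in vertices]
    ys = [v.get("y", 0) for v in vertices]
    bbox = "bbox %d %d %d %d" % (min(xs), min(ys), max(xs), max(ys))
    text = html.escape(annotation["description"])
    return "<span class='ocrx_word' id='word_%d' title='%s'>%s</span>" % (
        index, bbox, text)


def words_to_hocr_lines(response):
    """Emit one hOCR word span per output line, skipping the full-text entry."""
    annotations = response.get("textAnnotations", [])[1:]
    return "\n".join(
        annotation_to_hocr_word(a, i)
        for i, a in enumerate(annotations, start=1)
    )
```

Because every span carries its own bounding box, a PDF generator can place each word or character independently, which is what vertical text needs.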