skylord123 closed this issue 6 years ago
I have the script compiled to an exe because our system admin didn't want to install Python on the server (I am trying to push him to go with Linux, but until then...). When I run the exe I get this:
When I run it as a Python script on the same image I get this:
Traceback (most recent call last):
  File ".\gcv2hocr.py", line 144, in <module>
    resp = json.load(instream)['responses'][0]
  File "C:\Users\Skylar\AppData\Local\Programs\Python\Python36-32\lib\json\__init__.py", line 296, in load
    return loads(fp.read(),
  File "C:\Users\Skylar\AppData\Local\Programs\Python\Python36-32\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 513: character maps to <undefined>
So it looks like it is having an issue because of the way gcv2hocr expects an item to be under the responses JSON array.
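Separately from the empty-response case, the traceback above fails inside cp1252.py while json.load is reading the file. On Windows, Python 3.6 opens text files with the locale encoding (cp1252 here) unless told otherwise, and byte 0x90 is undefined in cp1252 but common in UTF-8 text. A minimal sketch of reading the response with an explicit encoding, assuming the input comes from a file path; `load_gcv_response` is a hypothetical helper name, not part of gcv2hocr:

```python
import json


def load_gcv_response(path):
    """Read a Google Vision JSON response as UTF-8 (hypothetical helper).

    Passing encoding="utf-8" keeps open() from falling back to the
    platform default, which is cp1252 in the traceback above.
    """
    with open(path, encoding="utf-8") as instream:
        return json.load(instream)
```

This only addresses the decode error; whether the compiled exe hangs for the same reason is not something the traceback can confirm.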
I think generating an HOCR file with no ocr_line elements would work (looks like hocr2pdf would handle this). Now I just need to update the gcv2hocr script to do that.
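A rough sketch of that idea, not the actual gcv2hocr patch: detect a response with no text and emit an hOCR page that has an ocr_page element but no ocr_line children. The helper names, the template, and the `width`/`height` parameters are all assumptions for illustration; it also assumes a blank page comes back with a missing or empty `responses` entry, or one without `textAnnotations`.

```python
from html import escape

# Minimal hOCR page with no ocr_line elements (illustrative template).
BLANK_HOCR = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>{title}</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="ocr-system" content="gcv2hocr" />
<meta name="ocr-capabilities" content="ocr_page" />
</head>
<body>
<div class="ocr_page" id="page_1" title="bbox 0 0 {width} {height}">
</div>
</body>
</html>
"""


def is_blank_response(gcv):
    """Return True when the parsed response carries no text annotations."""
    responses = gcv.get("responses") or []
    # Covers {}, {"responses": []} and {"responses": [{}]}.
    return not responses or not responses[0].get("textAnnotations")


def blank_page_hocr(width, height, title=""):
    """Build an hOCR document for a page of the given pixel size."""
    return BLANK_HOCR.format(
        title=escape(title), width=int(width), height=int(height)
    )
```

The script could call `is_blank_response` before indexing into `responses` and write the output of `blank_page_hocr` instead of failing. Whether hocr2pdf keeps such a page in the combined PDF would still need to be checked against a real file.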
Thank you for using gcv2hocr and for contributing good patches.
In my opinion, gcv2hocr should generate an hOCR file even when Google responds with blank-page JSON.
If you make a patch, I will merge it.
Sorry for the inconvenience.
On 2018/07/31 at 6:27, Skylar Sadlier (notifications@github.com) wrote:
I think generating an HOCR file with no ocr_line elements would work (looks like hocr2pdf would handle this). Now just need to update the gcv2hocr script to do that..
I've been processing several thousand PDFs using gcv2hocr and have been slowly fixing issues as they arise. I finally hit an issue that I could use some help fixing.

I have a huge PDF that has been failing on a blank page. The Google Vision response is empty (because it's a blank page), which causes gcv2hocr to hang on the page (it doesn't error out as far as I can tell, because the application that calls it hits the timeout for running gcv2hocr, which I currently have set at 2 minutes).

The problem I have is: what should I do with blank pages? How should the HOCR be generated? I use hocr2pdf to later combine the HOCR and JPEG files into a new PDF, and I want to make sure the blank page remains in the new PDF. Do I generate an HOCR file with no content?

Once I have advice on what direction to take, I can create the PR to fix it.
Here is the blank page's Google Vision response: page_011.json
Here is the blank page jpg: