dinosauria123 / gcv2hocr

gcv2hocr converts from Google Cloud Vision OCR output to hocr to make a searchable pdf.

Blank pages fail to process #19

Closed skylord123 closed 6 years ago

skylord123 commented 6 years ago

I've been processing several thousand PDFs using gcv2hocr and have been slowly fixing issues as they arise. I've finally hit one that I could use some help with.

I have a huge PDF that keeps failing on a blank page. The Google Vision response is empty (because the page is blank), which causes gcv2hocr to hang on that page. As far as I can tell it doesn't error out: the application that calls gcv2hocr just hits its timeout, which I currently have set to 2 minutes.

My question is what to do with blank pages. How should the HOCR be generated? I use hocr2pdf later to combine the HOCR and jpeg files into a new PDF, and I want to make sure the blank page remains in the new PDF. Do I generate an HOCR file with no content?

Once I have advice on what direction to take I can create the PR to fix it.

Here is the blank page's google vision response: page_011.json

{"responses":[{}]}

Here is the blank page jpg: page_011
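For reference, the blank-page response above parses cleanly; it is the first item under `responses` that is an empty object. A minimal sketch of detecting that case follows. It assumes the converter reads the `textAnnotations` field of the Cloud Vision response (an assumption about gcv2hocr's internals), and `has_text` is a hypothetical helper name, not something the script defines.

```python
import json


def has_text(response):
    """Return True if a single Cloud Vision response carries text annotations."""
    # An empty response ({}) has no "textAnnotations" key at all,
    # so .get() is used instead of direct indexing.
    return bool(response.get("textAnnotations"))


# The blank-page payload quoted in this issue.
blank = json.loads('{"responses":[{}]}')
first = blank["responses"][0]

print(first)            # {}
print(has_text(first))  # False
```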

skylord123 commented 6 years ago

I have the script compiled to exe form because our system admin didn't want to install python on the server (and I am trying to push him to go with linux but until then..). When I run the exe form of the program I get this:

(screenshot not included)

When I run as a python script on the same image I get this:

Traceback (most recent call last):
  File ".\gcv2hocr.py", line 144, in <module>
    resp = json.load(instream)['responses'][0]
  File "C:\Users\Skylar\AppData\Local\Programs\Python\Python36-32\lib\json\__init__.py", line 296, in load
    return loads(fp.read(),
  File "C:\Users\Skylar\AppData\Local\Programs\Python\Python36-32\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 513: character maps to <undefined>

So it looks like the failure comes from gcv2hocr expecting a non-empty item under the responses JSON array.
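One detail in the traceback may be worth a second look: the exception is a UnicodeDecodeError raised while the file is being decoded as cp1252, before any JSON is parsed, and position 513 is well past the end of the short blank-page response quoted earlier. So the encoding used to open the file could be a separate problem from the empty response. A minimal sketch, assuming the input is a UTF-8 JSON file; `load_first_response` is a hypothetical helper, not part of gcv2hocr.

```python
import json


def load_first_response(path):
    """Read a Cloud Vision JSON file as UTF-8 and return its first response.

    Returns an empty dict when there is no usable first response.
    """
    # An explicit encoding avoids falling back to the platform default
    # (cp1252 on many Windows setups, as seen in the traceback).
    with open(path, encoding="utf-8") as instream:
        document = json.load(instream)
    # Tolerate a missing or empty "responses" array as well as an empty item.
    responses = document.get("responses") or [{}]
    return responses[0] or {}
```

If the file passed in is not JSON at all (for example the jpg instead of the json), this still fails, but with an error raised at the decode or parse step rather than a hang.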

skylord123 commented 6 years ago

I think generating an HOCR file with no ocr_line elements would work (it looks like hocr2pdf would handle this). Now I just need to update the gcv2hocr script to do that.
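A rough sketch of what such a file could look like: a single `ocr_page` element with a bounding box and no `ocr_line` children. This is not gcv2hocr's actual template. `blank_hocr` is a hypothetical function, the page dimensions are assumed to be known to the caller (the 1240x1754 values in the usage line are made up), and whether hocr2pdf accepts the result would still need to be checked.

```python
from html import escape

# Minimal hOCR skeleton: one ocr_page div, no ocr_line or ocrx_word elements.
HOCR_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title></title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="ocr-system" content="blank-page-sketch" />
<meta name="ocr-capabilities" content="ocr_page" />
</head>
<body>
<div class='ocr_page' id='page_{page_id}' title='image "{image}"; bbox 0 0 {width} {height}; ppageno {page_no}'>
</div>
</body>
</html>
"""


def blank_hocr(image_name, width, height, page_no=0):
    """Return an hOCR document describing a page with no recognized text."""
    return HOCR_TEMPLATE.format(
        image=escape(image_name, quote=True),  # keep the title attribute well-formed
        width=int(width),
        height=int(height),
        page_no=page_no,
        page_id=page_no + 1,
    )


# Hypothetical dimensions for the blank page image.
hocr = blank_hocr("page_011.jpg", 1240, 1754)
```

Keeping the `ocr_page` element with the full-page bbox is what should let the page survive into the combined PDF, since the page size stays described even though no text is present.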

dinosauria123 commented 6 years ago

Thank you for using gcv2hocr and for making good patches.

In my opinion, gcv2hocr should generate an hocr file even when Google responds with a blank-page JSON.

If you make a patch, I will submit it.

Sorry for the inconvenience.

On 2018/07/31 at 6:27, Skylar Sadlier (notifications@github.com) wrote:

I think generating an HOCR file with no ocr_line elements would work (looks like hocr2pdf would handle this). Now just need to update the gcv2hocr script to do that..
