dinosauria123 / gcv2hocr

gcv2hocr converts from Google Cloud Vision OCR output to hocr to make a searchable pdf.
99 stars 33 forks

Fix missing bounding boxes X/Y for gcv2hocr #17

Closed skylord123 closed 6 years ago

skylord123 commented 6 years ago

If the X or Y key of a bounding box vertex is missing, assume its value is 0 (I believe Google drops the key when the value is zero).

Fixes #6

I found a Stack Overflow question that answers this: https://stackoverflow.com/questions/39378862/incomplete-coordinate-values-for-google-vision-ocr

Example of missing bounding box keys (that made the code fail for me on several PDF files):


        {
          "locale": "en",
          "description": "<text removed>",
          "boundingPoly": {
            "vertices": [
              {
                "x": 194
              },
              {
                "x": 1333
              },
              {
                "x": 1333,
                "y": 1835
              },
              {
                "x": 194,
                "y": 1835
              }
            ]
          }
        },

This is the first item in textAnnotations, which means it covers the full page text. Here the first two vertices have no "y" key, so their Y value should be read as 0.
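For illustration, the idea behind the fix can be sketched in Python roughly as follows. This is not the code from the patch itself: the helper names `vertex_xy` and `bounding_box` are hypothetical, and the sketch only assumes the JSON layout shown in the example above, with a missing "x" or "y" key read as 0.

```python
def vertex_xy(vertex):
    """Return (x, y) for one vertex, treating a missing key as 0."""
    # Assumption: an absent "x" or "y" key means the value was 0.
    return vertex.get("x", 0), vertex.get("y", 0)


def bounding_box(annotation):
    """Return (x0, y0, x1, y1) covering all vertices of one annotation."""
    points = [vertex_xy(v) for v in annotation["boundingPoly"]["vertices"]]
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


# Same shape as the example above: the first two vertices have no "y" key.
annotation = {
    "locale": "en",
    "description": "<text removed>",
    "boundingPoly": {
        "vertices": [
            {"x": 194},
            {"x": 1333},
            {"x": 1333, "y": 1835},
            {"x": 194, "y": 1835},
        ]
    },
}

print(bounding_box(annotation))  # (194, 0, 1333, 1835)
```

Using `dict.get` with a default avoids a `KeyError` on the incomplete vertices while leaving complete vertices unchanged.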

dinosauria123 commented 6 years ago

Thank you for your patch. Good work! I have to fix my C version of gcv2hocr as well.