dinosauria123 / gcv2hocr

gcv2hocr converts Google Cloud Vision OCR output to hOCR so that a searchable PDF can be made from it.

gcv2hocr segfaults when the X or Y coordinate of the text is missing from the json file #6

Closed dinosauria123 closed 6 years ago

dinosauria123 commented 7 years ago

Google Cloud Vision OCR output json files sometimes lack the X or Y coordinate of the recognized text.

If you hit a segfault, please check whether an X or Y coordinate is missing from the json file.

I will fix it later.
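
One way to do the check described above is a small script. This is only a sketch in Python, not part of gcv2hocr; the function name `find_incomplete_vertices` is hypothetical, and it assumes each annotation is a dict shaped like the entries of the `textAnnotations` array (a `boundingPoly` holding a `vertices` list).

```python
def find_incomplete_vertices(annotations):
    """Report every vertex that lacks an "x" or "y" key.

    Returns a list of (annotation_index, vertex_index, missing_keys) tuples.
    The input shape (boundingPoly -> vertices) is an assumption based on
    the json shown in this thread.
    """
    problems = []
    for a_idx, annotation in enumerate(annotations):
        # Tolerate annotations that have no boundingPoly at all.
        vertices = annotation.get("boundingPoly", {}).get("vertices", [])
        for v_idx, vertex in enumerate(vertices):
            missing = [key for key in ("x", "y") if key not in vertex]
            if missing:
                problems.append((a_idx, v_idx, missing))
    return problems


if __name__ == "__main__":
    # Hypothetical annotation with two incomplete vertices.
    demo = [
        {
            "description": "example",
            "boundingPoly": {
                "vertices": [{"x": 10}, {"x": 20, "y": 5}, {"y": 5}, {"x": 10, "y": 0}]
            },
        }
    ]
    for a_idx, v_idx, missing in find_incomplete_vertices(demo):
        print(f"annotation {a_idx}, vertex {v_idx}: missing {', '.join(missing)}")
```

An empty result means every vertex carries both coordinates, so the segfault would have to come from something else.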

dinosauria123 commented 7 years ago

Here's an example of such json.

        {
          "description": "|",
          "boundingPoly": {
            "vertices": [
              {},
              {
                "x": -1
              },
              {
                "x": -1,
                "y": -1
              },
              {
                "y": -1
              }
            ]
          }
        },
skylord123 commented 6 years ago

I have the same issue. The first couple of PDFs I tested were fine, but when I ran some production data through it I hit the problem on several PDF files. After looking into it, I don't think Google sets a coordinate when its value is 0 (the field is simply omitted): https://stackoverflow.com/questions/39378862/incomplete-coordinate-values-for-google-vision-ocr

Because of this I will submit a PR that fixes it by assuming 0 whenever a value is not set.
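
The "assume 0 if not set" idea can be sketched like this. It is a Python illustration only, not the code from the PR; `fill_missing_coordinates` is a hypothetical name, and the input is assumed to be the `vertices` list of a `boundingPoly`.

```python
def fill_missing_coordinates(vertices):
    """Return a new list of vertices in which a missing "x" or "y" becomes 0.

    Assumption: an omitted coordinate means 0, as suggested in this thread.
    Values that are present (including negative ones) are kept unchanged,
    and the input list is not modified.
    """
    return [{"x": v.get("x", 0), "y": v.get("y", 0)} for v in vertices]


if __name__ == "__main__":
    # Hypothetical vertices list with some coordinates omitted.
    incomplete = [{}, {"x": 40}, {"x": 40, "y": 12}, {"y": 12}]
    print(fill_missing_coordinates(incomplete))
```

Normalizing the vertices once, right after parsing, means the code that computes the bounding box can rely on every vertex having both keys.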

skylord123 commented 6 years ago

For me this occurred on the first object in the textAnnotations array, the one that contains the full page text.


        {
          "locale": "en",
          "description": "<text removed>",
          "boundingPoly": {
            "vertices": [
              {
                "x": 194
              },
              {
                "x": 1333
              },
              {
                "x": 1333,
                "y": 1835
              },
              {
                "x": 194,
                "y": 1835
              }
            ]
          }
        },