dinosauria123 / gcv2hocr

gcv2hocr converts from Google Cloud Vision OCR output to hocr to make a searchable pdf.

gcv2hocr doesn't rectify negative coordinates in GCV API response #39

Open SoloSynth1 opened 3 years ago

SoloSynth1 commented 3 years ago

According to the hOCR standard (v1.2 is the latest as of March 2021), the bbox property is defined in terms of uint, so all of its values must be unsigned. (http://kba.cloud/hocr-spec/1.2/#propdef-bbox)

However, the textAnnotation response from the GCV API can contain negative coordinates for boxes that extend out of bounds, as in the example below:

{
  "description": "2-3/4300/62",
  "boundingPoly": {
    "vertices": [
      {
        "x": 4727,
        "y": -1
      },
      {
        "x": 4927,
        "y": 0
      },
      {
        "x": 4927,
        "y": 44
      },
      {
        "x": 4727,
        "y": 43
      }
    ],
    "normalizedVertices": []
  },
  "mid": "",
  "locale": "",
  "score": 0,
  "confidence": 0,
  "topicality": 0,
  "locations": [],
  "properties": []
}
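A minimal sketch of how a polygon like the one above could be reduced to a non-negative bbox, assuming the vertices arrive as a list of dicts. The function name `clamped_bbox` is hypothetical and is not part of gcv2hocr; `.get(..., 0)` is used on the assumption that a coordinate may be missing from a vertex.

```python
def clamped_bbox(vertices):
    """Return (x0, y0, x1, y1) for a GCV boundingPoly, with negatives clamped to 0.

    Hypothetical helper, not taken from gcv2hocr.
    """
    # Treat a missing coordinate as 0 and clamp anything below 0 up to 0.
    xs = [max(0, v.get("x", 0)) for v in vertices]
    ys = [max(0, v.get("y", 0)) for v in vertices]
    return min(xs), min(ys), max(xs), max(ys)


vertices = [
    {"x": 4727, "y": -1},
    {"x": 4927, "y": 0},
    {"x": 4927, "y": 44},
    {"x": 4727, "y": 43},
]
print("bbox %d %d %d %d" % clamped_bbox(vertices))  # prints: bbox 4727 0 4927 44
```

Clamping keeps the box inside the page at the cost of shrinking it by the out-of-bounds amount, which is usually a pixel or two.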

The current gcv2hocr script writes such cases into the .hocr file without rectification, resulting in lines like this:

<span class='ocr_line' id='line_1_2' title="bbox 4727 -2 4927 44 ; baseline 0 -5; x_size 89; x_descenders 20; x_ascenders 21"><span class='ocrx_word' id='word_1_2' title='bbox 4727 -2 4927 44 ; x_wconf 85' lang='eng' dir='ltr'>  2-3/4300/62  </span>

This causes hocr-pdf to fail when it tries to parse the illegal ocr_line. hocr-pdf seems to work fine once its parsing regex is altered, but it would be great if the script could rectify the negative values so that its output adheres to the current hOCR standard. Thanks!
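As a stopgap for .hocr files that have already been generated, the negative values could also be clamped after the fact. This is only a sketch: `clamp_bboxes` is a hypothetical name, and the regex assumes each bbox is written as four whitespace-separated integers, as in the line quoted above. Only numbers that follow the `bbox` keyword are touched, so a `baseline` entry keeps its negative values.

```python
import re

# Matches "bbox x0 y0 x1 y1", where each value may carry a leading minus sign.
BBOX_RE = re.compile(r"bbox\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)")


def clamp_bboxes(hocr_text):
    """Rewrite every bbox in the text so that no coordinate is below 0."""
    def fix(match):
        coords = [max(0, int(value)) for value in match.groups()]
        return "bbox %d %d %d %d" % tuple(coords)

    return BBOX_RE.sub(fix, hocr_text)


line = "title='bbox 4727 -2 4927 44 ; x_wconf 85'"
print(clamp_bboxes(line))  # prints: title='bbox 4727 0 4927 44 ; x_wconf 85'
```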