dinosauria123 / gcv2hocr

gcv2hocr converts from Google Cloud Vision OCR output to hocr to make a searchable pdf.

gcv2hocr2.py - Support using vertices instead of normalizedVertices for bbox #40

Open SoloSynth1 opened 3 years ago

SoloSynth1 commented 3 years ago

Currently in gcv2hocr2.py, the bounding-box coordinates for blocks, paragraphs, and words are taken from their respective boundingBox.normalizedVertices:

https://github.com/dinosauria123/gcv2hocr/blob/40adc1026fc10a0fbe746a0a26329d0e9bcd527a/gcv2hocr2.py#L123
https://github.com/dinosauria123/gcv2hocr/blob/40adc1026fc10a0fbe746a0a26329d0e9bcd527a/gcv2hocr2.py#L129
https://github.com/dinosauria123/gcv2hocr/blob/40adc1026fc10a0fbe746a0a26329d0e9bcd527a/gcv2hocr2.py#L135

Would it be possible to add a new flag to the argparse.ArgumentParser so that the script uses boundingBox.vertices when creating the boxes, for cases where boundingBox.normalizedVertices is not available? Thanks!

Edit: typo
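
A minimal sketch of what such a fallback could look like. The flag name --use-vertices and the helper pick_vertices are hypothetical and not part of gcv2hocr2.py. It assumes normalizedVertices are fractions of the page size and vertices are already in pixels:

```python
import argparse


def pick_vertices(bounding_box, page_width, page_height, use_vertices=False):
    """Return a list of (x, y) pixel pairs for a GCV boundingBox dict.

    Hypothetical helper: reads 'vertices' when use_vertices is set or when
    'normalizedVertices' is missing from the response.
    """
    if not use_vertices and 'normalizedVertices' in bounding_box:
        # Normalized coordinates are fractions of the page, so scale them.
        return [(v.get('x', 0) * page_width, v.get('y', 0) * page_height)
                for v in bounding_box['normalizedVertices']]
    # Pixel coordinates are used as-is, with no scaling.
    return [(v.get('x', 0), v.get('y', 0)) for v in bounding_box['vertices']]


parser = argparse.ArgumentParser()
# Hypothetical flag name, for illustration only.
parser.add_argument('--use-vertices', action='store_true',
                    help='read boundingBox.vertices instead of normalizedVertices')
args = parser.parse_args(['--use-vertices'])
```

With a helper like this, a response that only carries vertices would no longer raise a KeyError, even without the flag.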

UBISOFT-1 commented 2 years ago

@SoloSynth1 I am getting an error related to this:

python gcv2hocr2.py --title "SuperOCR" --lang "ara" --savefile 1.hocr ~/LibraryOCR/program_work/c2296cb8aa9672b7d092e5d22e910948/final_images/1.jpeg.json 
Traceback (most recent call last):
  File "gcv2hocr2.py", line 191, in <module>
    page = fromResponse(resp, str(args.gcv_file.rsplit('.',1)[0]), **args.__dict__)
  File "gcv2hocr2.py", line 123, in fromResponse
    box = block_json['boundingBox']['normalizedVertices']
KeyError: 'normalizedVertices'

My JSON Response File is: https://gist.github.com/UBISOFT-1/167970005efe98c1058a99103d7d74cd

SoloSynth1 commented 2 years ago

@UBISOFT-1 Yes, this is what I was talking about. If you want to fix this, a quick and dirty way is to change every normalizedVertices to vertices, so that:

box = word_json['boundingBox']['normalizedVertices'] 

becomes:

box = word_json['boundingBox']['vertices']

UBISOFT-1 commented 2 years ago

Yeah @SoloSynth1, I have already done that, but take a look at my hOCR file: the box is getting weird values for the $x0 $y0 $x1 $y1 placeholders of the template object.

Take a look at this: https://gist.github.com/UBISOFT-1/f71045efe2e33de1333f17c576acfb6b

I am currently working on a fix for this by using the vertices of the box correctly. The bbox values are ludicrously high. Is this expected, or is there a quick fix for this?

Example

<span class='ocrx_word' id='word_0_0_1' title='bbox 2110000 441000 2268000 537600'>يتأتی</span>

Fix

By taking a look at the gcv2hocr.py file and asking a question on StackOverflow, I came to the following solution.

self.x0 = box[0][0]  # top-left x (was box[0][1], which duplicated y0)
self.y0 = box[0][1]  # top-left y
self.x1 = box[1][0]  # top-right x
self.y1 = box[2][1]  # bottom-right y

Edit: Typo :)

SoloSynth1 commented 2 years ago

@UBISOFT-1 Very interesting indeed... I noticed in line 13 of your gist that the image is only 2000 by 1400 pixels. Perhaps these ludicrous bbox values are caused by the GCV-->hOCR conversion?
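
One quick way to test that hypothesis is to divide the bbox from the example above by the page size. This is only a sketch, and it assumes the image is 2000 px wide and 1400 px high (the order of the two dimensions is a guess):

```python
# Assumption: the page is 2000 px wide and 1400 px high.
page_width, page_height = 2000, 1400

# bbox values from the hOCR example earlier in this thread
x0, y0, x1, y1 = 2110000, 441000, 2268000, 537600

# Undo a suspected multiplication of pixel coordinates by the page size.
pixels = (x0 / page_width, y0 / page_height,
          x1 / page_width, y1 / page_height)
print(pixels)  # (1055.0, 315.0, 1134.0, 384.0)
```

All four results are whole numbers that fit inside a 2000 by 1400 image, which is consistent with pixel vertices having been scaled by the page width and height as if they were normalized.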

OttomanZ commented 2 years ago

Hey @SoloSynth1, check out this script that I updated. It works like a charm now: https://gist.github.com/OttomanZ/96814cfabf102d2001dfdd488ca18339

Here are lines 73-77 (the excerpt stops before the matching except clause):

        try:
            self.x0 = int(float(box[0]['x'] if 'x' in box[0] and box[0]['x'] > 0 else 0))
            self.y0 = int(float(box[0]['y'] if 'y' in box[0] and box[0]['y'] > 0 else 0))
            self.x1 = int(float(box[2]['x'] if 'x' in box[2] and box[2]['x'] > 0 else 0))
            self.y1 = int(float(box[2]['y'] if 'y' in box[2] and box[2]['y'] > 0 else 0))

Removing the page width and height multiplication and reading the ['vertices'] object instead of ['normalizedVertices'] makes this work like a charm.
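
For reference, the same idea as a self-contained sketch. The function name bbox_from_vertices and the sample vertex values are illustrative, not taken from the gist. Unlike the excerpt above, which reads corners 0 and 2, this variant takes the min and max over all four corners. It defaults a missing or negative coordinate to 0, as the 'x' in box[0] checks in the excerpt do:

```python
def bbox_from_vertices(vertices):
    """Return (x0, y0, x1, y1) in pixels from a list of GCV vertex dicts."""
    # Treat a missing or negative coordinate as 0.
    xs = [max(int(v.get('x', 0)), 0) for v in vertices]
    ys = [max(int(v.get('y', 0)), 0) for v in vertices]
    # min/max over all corners does not depend on the corner order.
    return min(xs), min(ys), max(xs), max(ys)


# Illustrative word entry; the coordinates are made up for this example.
word_json = {'boundingBox': {'vertices': [
    {'x': 1055, 'y': 315}, {'x': 1134, 'y': 315},
    {'x': 1134, 'y': 384}, {'x': 1055, 'y': 384}]}}

print(bbox_from_vertices(word_json['boundingBox']['vertices']))
# (1055, 315, 1134, 384)
```

No page width or height is involved, since the vertices are assumed to be pixel coordinates already.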