dinosauria123 / gcv2hocr

gcv2hocr converts from Google Cloud Vision OCR output to hocr to make a searchable pdf.

Converting JSON to HOCR (Segmentation Fault) #21

Open pauf opened 5 years ago

pauf commented 5 years ago

First off, thanks for an awesome piece of software. For the most part, it works great!

For some reason, after converting many thousands of pages, I've come across this error for one page only:

gcv2hocr "/mydir/error1.json" "/mydir/test.hocr"

Response: "Segmentation fault"

Initially I wondered whether the JSON was too complex, or whether it held so much information that something overflowed, but judging by some of the other pages I've run through the software, that does not appear to be the case.

Hope this helps.

pauf commented 5 years ago

After some further experimentation, I think I've found the issue:

    {
      "description": "R&D",
      "boundingPoly": {
        "vertices": [
          {
            "x": 1307,
            "y": 1130
          },
          {
            "x": 1342,
            "y": 1129
          },
          {
            "x": 1342,
            "y": 1141
          },
          {
            "x": 1307,
            "y": 1142
          }
        ]
      }
    },

Doesn't work (Segfault)

    {
      "description": "RAD", <--------------------------- CHANGE
      "boundingPoly": {
        "vertices": [
          {
            "x": 1307,
            "y": 1130
          },
          {
            "x": 1342,
            "y": 1129
          },
          {
            "x": 1342,
            "y": 1141
          },
          {
            "x": 1307,
            "y": 1142
          }
        ]
      }
    },

Does work.

It would seem the C version of the code (I haven't checked the Python implementation) doesn't like the ampersand character (&). As this is valid output from Google, it's probably worth fixing this where possible.
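For anyone trying to locate the offending entry in a large response, a minimal Python sketch along these lines may help. It walks the JSON and lists every "description" containing a character that is special in XML. The `find_risky_descriptions` helper and the `textAnnotations` wrapper in the sample are assumptions for illustration; only the "description" and "boundingPoly" fields come from the snippet above.

```python
import json

# Characters that must be escaped in XML/hOCR text content.
XML_SPECIAL = set("&<>")

def find_risky_descriptions(node, path="$"):
    """Yield (path, description) for every "description" containing &, < or >."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if key == "description" and isinstance(value, str):
                if XML_SPECIAL & set(value):
                    yield child, value
            else:
                yield from find_risky_descriptions(value, child)
    elif isinstance(node, list):
        for i, item in enumerate(node):
            yield from find_risky_descriptions(item, f"{path}[{i}]")

# Hypothetical sample shaped like the two entries quoted above.
sample = json.loads('''
{"textAnnotations": [
  {"description": "R&D", "boundingPoly": {"vertices": [{"x": 1307, "y": 1130}]}},
  {"description": "RAD", "boundingPoly": {"vertices": [{"x": 1307, "y": 1130}]}}
]}
''')

hits = list(find_risky_descriptions(sample))
for where, text in hits:
    print(where, repr(text))
```

To scan a real file, replace `sample` with `json.load(open(path))`.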

dinosauria123 commented 5 years ago

Thank you for using gcv2hocr and for finding the issue.

I will fix it, please wait for a while...

“&” has to be replaced with “&amp;”. This is already implemented for single letters, but this problem comes from words made of several letters.

pauf commented 5 years ago

Thanks for the quick reply!

No problem, I found a solution in the meantime, which might help while we wait:

sed -i -e 's/&/&amp;/g' /path/to/json/file.json
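One caveat with a blanket substitution is that running it twice on the same file would escape the ampersands a second time. A rough Python sketch of a repeatable variant follows; `escape_bare_ampersands` is a hypothetical helper, not part of gcv2hocr, and it only rewrites an "&" that does not already begin an entity.

```python
import re

# Match "&" unless it already starts a named or numeric entity
# such as &amp; or &#38; (assumption: such sequences are intentional).
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)")

def escape_bare_ampersands(text):
    """Replace only unescaped ampersands, so repeated runs are harmless."""
    return BARE_AMP.sub("&amp;", text)

once = escape_bare_ampersands('"description": "R&D"')
twice = escape_bare_ampersands(once)
print(once)
print(twice)
```

The trade-off is that OCR text which happens to look like an entity (for example a word followed directly by a semicolon after "&") is left untouched.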

junior1q94 commented 5 years ago

Hello, @dinosauria123 @pauf

I have encountered the same issue and decided to make a patch. It should work for any XML entity that needs to be escaped.

Hope this is useful.
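The patch itself is not reproduced here. As an illustration of the general idea only, escaping the text at the point where the hOCR markup is written could look like the Python sketch below. The `hocr_word` helper, the `word_1_1` id and the exact span layout are assumptions, not gcv2hocr's actual code; `xml.sax.saxutils.escape` is the standard-library function that handles "&", "<" and ">".

```python
from xml.sax.saxutils import escape

def hocr_word(description, vertices, word_id):
    """Build one hOCR word span with the text safely escaped."""
    # Treat a missing coordinate as 0 (assumption for this sketch).
    xs = [v.get("x", 0) for v in vertices]
    ys = [v.get("y", 0) for v in vertices]
    bbox = f"{min(xs)} {min(ys)} {max(xs)} {max(ys)}"
    return (f"<span class='ocrx_word' id='{word_id}' "
            f"title='bbox {bbox}'>{escape(description)}</span>")

# Vertices taken from the failing entry quoted earlier in this thread.
vertices = [
    {"x": 1307, "y": 1130},
    {"x": 1342, "y": 1129},
    {"x": 1342, "y": 1141},
    {"x": 1307, "y": 1142},
]
span = hocr_word("R&D", vertices, "word_1_1")
print(span)
```

Escaping at output time leaves the JSON from Google untouched, so the input files do not need to be edited beforehand.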

IAutil commented 4 years ago

Hi @dinosauria123 and everybody, I have an issue with gcv2hocr: it looks like Google has changed something recently. I ran gcv2hocr on the test.json that ships with the project and the result is fine. But if I run Google OCR on test.jpg and pass the resulting JSON to gcv2hocr, I get a different hOCR. The most important things I noticed are that the "lang" field wasn't parsed and that the letters are now numbers. It looks like an encoding mistake or something similar, but it's really difficult to handle.

Here is an example comparing the project's test.hocr with mine:

1. test.hocr of the project:

O p t i c a l

2. test.hocr of new gcv execution:

81 104 194 104 338 104 80 179 119 177 197 177 221 178

dinosauria123 commented 4 years ago

Thank you for your report. I will check the JSON output, but a patch may be delayed because I am busy with my job right now.

dinosauria123 commented 4 years ago

I have checked gcv2hocr, and the output seems to be fine. Did you use gcvocr.sh to get the JSON output? Please attach your JSON output to your comment.