pauf opened 5 years ago
After some further experimentation, I think I've found the issue:
{
"description": "R&D",
"boundingPoly": {
"vertices": [
{
"x": 1307,
"y": 1130
},
{
"x": 1342,
"y": 1129
},
{
"x": 1342,
"y": 1141
},
{
"x": 1307,
"y": 1142
}
]
}
},
Doesn't work (Segfault)
{
"description": "RAD", <--------------------------- CHANGE
"boundingPoly": {
"vertices": [
{
"x": 1307,
"y": 1130
},
{
"x": 1342,
"y": 1129
},
{
"x": 1342,
"y": 1141
},
{
"x": 1307,
"y": 1142
}
]
}
},
Does work.
It would seem the C version of the code (I haven't checked the Python implementation) doesn't like the ampersand character (&). Since this is valid output from Google, it's probably worth fixing where possible.
Thank you for using gcv2hocr and for finding this issue.
I will fix it; please wait a while...
`&` has to be replaced with `&amp;`. This was already implemented for single letters, but this problem comes from a conjoined word.
Thanks for the quick reply!
No problem, I found a solution in the meantime, which might help while we wait:
sed -i -e 's/&/\&amp;/g' /path/to/json/file.json
Hello, @dinosauria123 @pauf
I have encountered the same issue and decided to make a patch. It should work for any XML entity that needs to be escaped.
Hope this is useful.
Hi @dinosauria123 and everybody, I have an issue with gcv2hocr. It looks like Google has changed something recently... I've run the project's own test.json through gcv2hocr and it's OK. But if I run Google OCR on test.jpg and send that json to gcv2hocr, I get a different hocr. The most important things I saw are that the "lang" field wasn't parsed and the letters are now numbers... It looks like an encoding mistake or something like that, but it's really difficult to handle.
I'll paste examples of the project's test.hocr and my test.hocr:
1. test.hocr of the project:
Thank you for your report. I will check the json output, but patches may be delayed because I am busy with my job right now.
I have checked gcv2hocr, but the output seems to be fine. Did you use gcvocr.sh to get the json output? Please attach your json output to your comment.
First off, thanks for an awesome piece of software. For the most part, it works great!
For some reason, after converting many thousands of pages, I've come across this error for one page only:
gcv2hocr "/mydir/error1.json" "/mydir/test.hocr"
Response: "Segmentation fault"
Initially I wondered whether the JSON was too complex, or whether there was too much information leading to overflows, but looking at some of the other pages I've run through the software, that would certainly not appear to be the case.
Hope this helps.