GCV to HOCR or PAGE conversion not working

OmriPi commented 4 years ago

Hi @dinosauria123! This is the issue I posted on ocr-fileformat: https://github.com/UB-Mannheim/ocr-fileformat/issues/121 As per your request I'm opening the issue here, copying the text:

I have the JSON output of google vision OCR of a PDF (emphasis on PDF and not an image). I would like to create a searchable version of that PDF using the OCR results. I have tried using gcv2hocr but it doesn't seem to work on PDFs, or it has some other error, because the HOCR output I'm getting from it is basically just the metadata. I tried using ocr-fileformat on the same file, but once again I get only the metadata as a result. Trying to convert it to PAGE fails as well, with the result being some java lines indicating exceptions have occurred. Does ocr-fileformat supports GCV JSON generated from PDF?

The file I'm trying to run it on is the sample file from google: gs://cloud-samples-data/vision/pdf_tiff/census2010.pdf

And the JSON is generated following this tutorial: https://cloud.google.com/vision/docs/pdf#vision_text_detection_pdf_gcs-python

If you could assist me or point me in the direction of how to solve it I would be very grateful, as I'm in an urgent need to solve this issue.

I have used google vision to get the JSON, I already have it. I am having a problem with using the gcv to HOCR transformer found in this package. When I use it on the JSON I got from google vision, I am getting an almost blank output, with only the metadata.

When I'm trying to convert it to PAGE instead I get this result:

org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog. at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:203) at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:177) at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:400) at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:327) at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1472) at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:994) at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:602) at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:112) at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:505) at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:842) at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:771) at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141) at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213) at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:643) at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:327) at javax.xml.parsers.SAXParser.parse(SAXParser.java:195) at org.primaresearch.dla.page.io.xml.XmlPageReader.parse(XmlPageReader.java:169) at org.primaresearch.dla.page.io.xml.XmlPageReader.read(XmlPageReader.java:130) at org.primaresearch.dla.page.io.xml.PageXmlInputOutput.readPage(PageXmlInputOutput.java:212) at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:192) at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:130) org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog. at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1239) at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:643) at org.primaresearch.dla.page.io.xml.XmlPageReader.parse(XmlPageReader.java:204) at org.primaresearch.dla.page.io.xml.XmlPageReader.read(XmlPageReader.java:130) at org.primaresearch.dla.page.io.xml.PageXmlInputOutput.readPage(PageXmlInputOutput.java:212) at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:192) at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:130) Exception in thread "main" java.lang.NullPointerException at org.primaresearch.dla.page.converter.PageConverter.handleNegativeCoordinates(PageConverter.java:389) at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:216) at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:130) So I'm looking to understand why are the gcv converters in this module not working for me, despite the fact that I have a perfectly viable gcv JSON. I can send you the JSON generated from gcv and you can try for yourself to convert it, if it helps.

Thanks in advance!

dinosauria123 commented 4 years ago

Thank you for using gcv2hocr.

Your problem seems to conversion of pdf to hocr. I don't have a plan to support conversion pdf to hocr for gcv2hocr. But this request is twice, I began to think support this conversion....

If you want to convert pdf to searchable, this script may help you.

https://github.com/mah-jp/pdf4search

OmriPi commented 4 years ago

Thank you @dinosauria123! You're probably one of very few people in the world who are familiar enough with gcv output by now to make sense of it and make it possible! It would be amazing if you could add PDF support to gcv2hocr! I think that would make gcv2hocr even more useful as I have a hunch that more people need to OCR PDFs rather than image files... and hopefully the required change is not so big. Please consider adding support, I can assist with anything if you decide to do it!

Thanks!

dinosauria123 / gcv2hocr

GCV to HOCR or PAGE conversion not working #33