UB-Mannheim / ocr-fileformat

Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader)
https://digi.bib.uni-mannheim.de/ocr-fileformat/
MIT License

Proxy support #133

Closed mikegerber closed 3 years ago

mikegerber commented 3 years ago

When an HTTP proxy is needed, conversion from PAGE to ALTO fails:

# ocrd-fileformat-transform -I OCR-D-GT-PAGE -O ALTO
14:36:13.086 INFO ocrd-fileformat-transform - page --> alto: input file OCR-D-GT-PAGE_00000024 (PHYS_0024)
java.net.ConnectException: Connection timed out (Connection timed out)
        at java.base/java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.base/java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:399)
        at java.base/java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:242)
        at java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:224)
        at java.base/java.net.Socket.connect(Socket.java:609)
        at java.base/java.net.Socket.connect(Socket.java:558)
        at java.base/sun.net.NetworkClient.doConnect(NetworkClient.java:182)
        at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:474)
        at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:569)
        at java.base/sun.net.www.http.HttpClient.<init>(HttpClient.java:242)
        at java.base/sun.net.www.http.HttpClient.New(HttpClient.java:341)
        at java.base/sun.net.www.http.HttpClient.New(HttpClient.java:362)
        at java.base/sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1253)
        at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1187)
        at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1081)
        at java.base/sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:1015)
        at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1592)
        at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1520)
        at java.base/java.net.URL.openStream(URL.java:1140)
        at org.primaresearch.io.xml.XmlValidator.getSchema(XmlValidator.java:53)
        at org.primaresearch.dla.page.io.xml.XmlPageWriter_Alto.run(XmlPageWriter_Alto.java:200)
        at org.primaresearch.dla.page.io.xml.XmlPageWriter_Alto.write(XmlPageWriter_Alto.java:115)
        at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:282)
        at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:161)
Could not initialise ALTO XML writer
java.lang.NullPointerException
        at org.primaresearch.dla.page.io.xml.XmlPageWriter_Alto.run(XmlPageWriter_Alto.java:200)
        at org.primaresearch.dla.page.io.xml.XmlPageWriter_Alto.write(XmlPageWriter_Alto.java:115)
        at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:282)
        at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:161)
14:38:23.306 ERROR ocrd-fileformat-transform - Transformation exited with return value 0 but no file was written.

Unfortunately, with the network setup here, the connection error also takes a long time to appear because packets are simply dropped...

My preferred solution would be for ocr-fileformat to parse the de facto standard http_proxy environment variable and pass the corresponding parameters to java:

java -Dhttp.proxyHost=http-proxy.sbb.spk-berlin.de -Dhttp.proxyPort=3128 [...other parameters...]
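
A minimal sketch of what such parsing could look like in POSIX shell. The function name `proxy_to_java_opts` is hypothetical and not part of ocr-fileformat; the sketch assumes a proxy URL of the form `[scheme://][user:pass@]host[:port][/]` and does not handle IPv6 literals. When no port is given it falls back to 80, which is also Java's default for `http.proxyPort`.

```shell
#!/bin/sh
# Hypothetical helper (not part of ocr-fileformat): turn an
# http_proxy-style URL into the matching Java system properties.
proxy_to_java_opts() {
    url=$1
    # Nothing to do when no proxy is configured.
    [ -n "$url" ] || return 0
    hostport=${url#*://}       # strip the scheme, if any
    hostport=${hostport%%/*}   # strip a trailing path
    hostport=${hostport##*@}   # strip user:pass@ credentials
    host=${hostport%%:*}
    case $hostport in
        *:*) port=${hostport##*:} ;;
        *)   port=80 ;;        # Java's default for http.proxyPort
    esac
    printf '%s\n' "-Dhttp.proxyHost=$host -Dhttp.proxyPort=$port"
}

proxy_to_java_opts "http://http-proxy.sbb.spk-berlin.de:3128/"
# prints: -Dhttp.proxyHost=http-proxy.sbb.spk-berlin.de -Dhttp.proxyPort=3128
```

A wrapper script could then splice the result into the java call, for example `java $(proxy_to_java_opts "$http_proxy") [...other parameters...]`.
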
mikegerber commented 3 years ago

https://github.com/OCR-D/ocrd_fileformat/issues/29

mikegerber commented 3 years ago

Example http_proxy variable:

$ env | grep http_proxy
http_proxy=http://http-proxy.sbb.spk-berlin.de:3128/
mikegerber commented 3 years ago

I should mention that I know from PAGE Viewer that setting these command line parameters solves the issue there, so I strongly suspect they will also fix this problem with PAGE Converter.

mikegerber commented 3 years ago

After some research I found "the Java enterprise solution ™️", i.e. setting another env variable:

export JAVA_TOOL_OPTIONS="-Dhttp.proxyHost=http-proxy.sbb.spk-berlin.de -Dhttp.proxyPort=3128"

So I'm closing this issue! 😀
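
For anyone landing here later, a slightly more defensive variant of that export, as a sketch only. It keeps any options already present in JAVA_TOOL_OPTIONS and also sets the `https.proxyHost` / `https.proxyPort` properties, which Java reads separately for HTTPS URLs. Whether the same proxy also serves HTTPS is an assumption about the local network; the variable names `proxy_host`, `proxy_port` and `proxy_opts` are only illustrative.

```shell
#!/bin/sh
# Illustrative values taken from this issue; adjust to the local proxy.
proxy_host=http-proxy.sbb.spk-berlin.de
proxy_port=3128

# Java uses separate properties for HTTP and HTTPS URLs.
# Assumption: the same proxy handles both.
proxy_opts="-Dhttp.proxyHost=$proxy_host -Dhttp.proxyPort=$proxy_port"
proxy_opts="$proxy_opts -Dhttps.proxyHost=$proxy_host -Dhttps.proxyPort=$proxy_port"

# Append to existing options instead of overwriting them.
export JAVA_TOOL_OPTIONS="${JAVA_TOOL_OPTIONS:+$JAVA_TOOL_OPTIONS }$proxy_opts"
```

When the variable is set, the JVM reports it on startup with a line beginning `Picked up JAVA_TOOL_OPTIONS:`, which is a quick way to check that the setting reached the Java process.
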

stweil commented 3 years ago

Thanks for examining this issue. I think it would be good to document the solution here in the README and also in some suitable place for OCR-D. CC'ing @kba for his opinion.

Ideally, the software would give a user-friendly error message for connection errors and suggest typical solutions (or link to a page with such hints).

mikegerber commented 3 years ago

> Thanks for examining this issue. I think it would be good to document the solution here in the README and also in some suitable place for OCR-D. CC'ing @kba for his opinion.

I agree. It's not the first time I have looked for a solution to the "Java vs. HTTP proxy problem", and documentation of that JAVA_TOOL_OPTIONS solution is relatively hard to find.

mikegerber commented 3 years ago

I've opened https://github.com/OCR-D/ocrd_fileformat/issues/32 to address our immediate need for an offline conversion of PAGE → ALTO using ocrd_fileformat, but the solution could conceivably also be implemented here.