The default User-Agent HTTP header sent by java.net.URLConnection has the form Java/<version>, e.g. Java/1.8.0_162. Some hosting providers, e.g. Cloudflare, filter requests with such a User-Agent. Below is an example exception.
{ Error: Error running static method
java.io.IOException: Server returned HTTP response code: 403 for URL: https://www.london.gov.uk/press-releases/mayoral/londons-ai-start-ups-bid-for-cash-at-city-hall
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at sun.net.www.protocol.http.HttpURLConnection$10.run(HttpURLConnection.java:1944)
at sun.net.www.protocol.http.HttpURLConnection$10.run(HttpURLConnection.java:1939)
at java.security.AccessController.doPrivileged(Native Method)
at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1938)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1508)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1492)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:263)
at org.icij.nodetika.NodeTika.createInputStream(NodeTika.java:77)
at org.icij.nodetika.NodeTika.extractText(NodeTika.java:350)
at org.icij.nodetika.NodeTika.extractText(NodeTika.java:308)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
Caused by: java.io.IOException: Server returned HTTP response code: 403 for URL: https://www.london.gov.uk/press-releases/mayoral/londons-ai-start-ups-bid-for-cash-at-city-hall
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1894)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1492)
at sun.net.www.protocol.http.HttpURLConnection.getHeaderField(HttpURLConnection.java:3000)
at java.net.URLConnection.getContentType(URLConnection.java:512)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getContentType(HttpsURLConnectionImpl.java:415)
at org.icij.nodetika.NodeTika.createInputStream(NodeTika.java:74)
... 6 more
cause: nodeJava_java_io_IOException {} }
Setting the User-Agent to something less conspicuous (a regular browser agent) solved the issue for me. This patch sets the User-Agent for all requests to that of a Firefox browser on Linux.
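For illustration, here is a minimal sketch of how a browser-like User-Agent can be set on a java.net.URLConnection before the request is sent. It is not the exact code of this patch: the class name `UserAgentExample`, the helper `openWithBrowserUserAgent`, and the specific Firefox version string are assumptions chosen for the example.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

public class UserAgentExample {
    // Assumed example value: a Firefox-on-Linux style User-Agent string.
    // The exact string used by the patch may differ.
    public static final String BROWSER_USER_AGENT =
        "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0";

    // Hypothetical helper: opens a connection object and overrides the
    // default "Java/<version>" User-Agent. No network traffic happens
    // here; the request is only sent once the caller connects or reads.
    public static URLConnection openWithBrowserUserAgent(String url) throws IOException {
        URLConnection connection = new URL(url).openConnection();
        connection.setRequestProperty("User-Agent", BROWSER_USER_AGENT);
        return connection;
    }

    public static void main(String[] args) throws IOException {
        URLConnection connection = openWithBrowserUserAgent("https://example.com/");
        // Prints the header value that will be sent with the request.
        System.out.println(connection.getRequestProperty("User-Agent"));
    }
}
```

The header has to be set before the connection is established (that is, before calling `connect()`, `getInputStream()` or `getContentType()`), because `setRequestProperty` throws an `IllegalStateException` once the connection is open.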