ICIJ / node-tika

Apache Tika bridge for Node.js. Text and metadata extraction, language detection and more.
MIT License

Set a less conspicuous user agent and avoid unnecessary 403's. #24

Open critocrito opened 6 years ago

critocrito commented 6 years ago

The default User-Agent HTTP header of java.net.URLConnection is Java/ followed by the runtime version, e.g. Java/1.8.0_162. Some hosting providers, e.g. Cloudflare, filter requests that carry such a User-Agent and answer with HTTP 403. Below is an example exception.

{ Error: Error running static method
java.io.IOException: Server returned HTTP response code: 403 for URL: https://www.london.gov.uk/press-releases/mayoral/londons-ai-start-ups-bid-for-cash-at-city-hall
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at sun.net.www.protocol.http.HttpURLConnection$10.run(HttpURLConnection.java:1944)
        at sun.net.www.protocol.http.HttpURLConnection$10.run(HttpURLConnection.java:1939)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1938)
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1508)
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1492)
        at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:263)
        at org.icij.nodetika.NodeTika.createInputStream(NodeTika.java:77)
        at org.icij.nodetika.NodeTika.extractText(NodeTika.java:350)
        at org.icij.nodetika.NodeTika.extractText(NodeTika.java:308)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
Caused by: java.io.IOException: Server returned HTTP response code: 403 for URL: https://www.london.gov.uk/press-releases/mayoral/londons-ai-start-ups-bid-for-cash-at-city-hall
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1894)
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1492)
        at sun.net.www.protocol.http.HttpURLConnection.getHeaderField(HttpURLConnection.java:3000)
        at java.net.URLConnection.getContentType(URLConnection.java:512)
        at sun.net.www.protocol.https.HttpsURLConnectionImpl.getContentType(HttpsURLConnectionImpl.java:415)
        at org.icij.nodetika.NodeTika.createInputStream(NodeTika.java:74)
        ... 6 more
 cause: nodeJava_java_io_IOException {} }

Setting the User-Agent to something less conspicuous (a regular browser agent) solved the issue for me. This patch sets the User-Agent for all requests to that of a Firefox browser on Linux.
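
For illustration, here is a minimal sketch of the general technique: overriding the header on a java.net.URLConnection before the request is sent. It is not the actual diff of this patch. The class name UserAgentExample, the helper openWithUserAgent and the exact Firefox User-Agent string are assumptions made up for this example.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

class UserAgentExample {
    // Assumed, illustrative value: a Firefox-on-Linux style User-Agent.
    // The exact string used by the patch is not shown in this issue.
    static final String USER_AGENT =
        "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0";

    // Hypothetical helper: creates the connection object and replaces the
    // default "Java/<version>" User-Agent. No network traffic happens here;
    // the request is only sent once the caller connects or reads the stream.
    static URLConnection openWithUserAgent(String url) throws IOException {
        URLConnection connection = new URL(url).openConnection();
        connection.setRequestProperty("User-Agent", USER_AGENT);
        return connection;
    }

    public static void main(String[] args) throws IOException {
        URLConnection connection = openWithUserAgent("https://example.com/");
        // Prints the header value that would be sent with the request.
        System.out.println(connection.getRequestProperty("User-Agent"));
    }
}
```

The header has to be set before anything triggers the request. In the trace above that would be before the getContentType() call in NodeTika.createInputStream, since that call already connects to the server.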