GateNLP / gate-core

The GATE Embedded core API and GATE Developer application
GNU Lesser General Public License v3.0

Redirects across protocols aren't followed when loading documents #128

Closed greenwoodma closed 3 years ago

greenwoodma commented 3 years ago

If you try to load a document from a URL whose redirect chain crosses from http to https (or vice versa), the document may fail to load as expected: you might get an exception or, more likely, the document will contain a redirect message.

This is because the standard URL handling in Java does not follow redirects that involve a change of protocol. This was reported as a bug in Java and eventually closed as "won't fix", on the grounds that following such redirects would open other security issues (see https://bugs.java.com/bugdatabase/view_bug.do?bug_id=4620571). While this may be true in the general case, we think that when loading documents it should be fairly safe to follow these redirects, and not doing so is likely to be more confusing to users.

I intend to implement this as a utility method that resolves a URL by following redirect responses across protocol changes. This method will then be used to resolve the sourceUrl param when creating instances of gate.corpora.DocumentImpl. The resolved URL will be placed into the standard document feature so that other processes (such as document formats that reload the content directly) can use the final URL. The original URL will be saved into a new feature, gate.OriginalURL, in case it is required.
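
For illustration only, here is a rough sketch of what such a utility method might look like. It is not the code that ended up in gate-core: the class name `RedirectResolver`, the method names and the `MAX_REDIRECTS` limit are all hypothetical. The idea is to switch off Java's built-in redirect handling and follow each `Location` header by hand, so that a change of protocol no longer stops the chain.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper, not part of the GATE API.
final class RedirectResolver {

  /** Assumed upper bound on the chain length, to avoid redirect loops. */
  private static final int MAX_REDIRECTS = 10;

  /** True for the HTTP status codes that ask the client to follow a Location header. */
  static boolean isRedirect(int status) {
    return status == 301 || status == 302 || status == 303
        || status == 307 || status == 308;
  }

  /** Resolves a Location header, which may be relative, against the current URL. */
  static URL resolveLocation(URL current, String location) throws IOException {
    return new URL(current, location);
  }

  /**
   * Follows redirects by hand, including http -> https and https -> http,
   * and returns the final URL. Non-HTTP URLs are returned unchanged.
   */
  static URL resolveUrl(URL url) throws IOException {
    URL current = url;
    for (int i = 0; i < MAX_REDIRECTS; i++) {
      String protocol = current.getProtocol();
      if (!"http".equalsIgnoreCase(protocol) && !"https".equalsIgnoreCase(protocol)) {
        return current;
      }
      HttpURLConnection conn = (HttpURLConnection) current.openConnection();
      // Disable the built-in handling so every redirect is handled in this loop.
      conn.setInstanceFollowRedirects(false);
      try {
        int status = conn.getResponseCode();
        String location = conn.getHeaderField("Location");
        if (!isRedirect(status) || location == null) {
          return current;
        }
        current = resolveLocation(current, location);
      } finally {
        conn.disconnect();
      }
    }
    throw new IOException("Too many redirects while resolving " + url);
  }
}
```

A caller creating a document could then pass the result of `RedirectResolver.resolveUrl(sourceUrl)` on as the URL to load, and keep the original value for the gate.OriginalURL feature. Capping the chain length is a design choice of this sketch rather than something specified above.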