etf-validator / governance

ETF Steering Group and the Technical Committee documents

EntityResolver implementation to address Xerces issue in http->https redirection #116

Open jenriquesoriano opened 2 years ago


Background and Motivation:

ETF v2.1-RC includes the library xercesimpl-2.12.1. This Java library does not follow redirections that change protocol. The main issue motivating this EIP is redirection from HTTP to HTTPS: even if the HTTPS location mirrors the HTTP one, it is still a different protocol. As there is no way to disable this behaviour, the failed redirections surface as errors later on in the report.

Proposed change

After some discussion of the possible alternatives, the proposed solution is a change at ETF level: implementing the SAX EntityResolver interface for the Xerces classes that handle URLs.

This interface allows resource resolution to be customised, which enables specific behaviours and solves the redirection issue by substituting the redirected URL for the original one before it is used.

The EntityResolver class would be similar to this:

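    import java.io.IOException;
    import java.net.HttpURLConnection;
    import java.net.URL;

    import org.xml.sax.EntityResolver;
    import org.xml.sax.InputSource;
    import org.xml.sax.SAXException;

    public class RedirectEntityResolver implements EntityResolver {

        @Override
        public InputSource resolveEntity(String publicId, String systemId) throws SAXException, IOException {
            // Not an HTTP(S) resource: returning null lets the parser use its default resolution.
            if (systemId == null || !(systemId.startsWith("http:") || systemId.startsWith("https:"))) {
                return null;
            }

            URL url = new URL(systemId);
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();

            int status = conn.getResponseCode();
            if (status == HttpURLConnection.HTTP_MOVED_TEMP
                    || status == HttpURLConnection.HTTP_MOVED_PERM
                    || status == HttpURLConnection.HTTP_SEE_OTHER) {
                // Manage the HTTPS redirection: follow the Location header manually,
                // resolving it against the original URL in case it is relative.
                String newUrl = conn.getHeaderField("Location");
                conn.disconnect();
                url = new URL(url, newUrl);
                conn = (HttpURLConnection) url.openConnection();
            }

            InputSource source = new InputSource(conn.getInputStream());
            // Keep the final URL as system ID so that relative references still resolve.
            source.setSystemId(url.toString());
            return source;
        }
    }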

The entity resolver should then be set in the specific Xerces class: the class implements the EntityResolver interface, and an instance is registered with the SAX parser through the parser's setEntityResolver method. The entity resolver will check that the only change is the switch of protocol from HTTP to HTTPS and that the rest of the URL remains the same. It will be configurable at application level, through the ETF config properties, for the download of schemas.
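
The protocol-only check described above could be sketched as follows. This is a minimal illustration under stated assumptions, not the ETF implementation: the class name ProtocolUpgradeCheck and the method isHttpToHttpsUpgrade are hypothetical, and "the URL remains the same" is assumed to mean that host, port (treating 80 and 443 as the respective defaults), path and query are unchanged.

```java
import java.net.URI;
import java.util.Objects;

// Hypothetical helper for illustration; not part of ETF or Xerces.
final class ProtocolUpgradeCheck {

    private ProtocolUpgradeCheck() {
    }

    // Maps the scheme's default port to -1, the value URI.getPort() returns when no port is given.
    private static int normalizedPort(URI uri, int defaultPort) {
        return uri.getPort() == defaultPort ? -1 : uri.getPort();
    }

    // True only if 'redirected' differs from 'original' by the switch from http to https.
    static boolean isHttpToHttpsUpgrade(URI original, URI redirected) {
        if (!"http".equalsIgnoreCase(original.getScheme())
                || !"https".equalsIgnoreCase(redirected.getScheme())) {
            return false;
        }
        boolean sameHost = original.getHost() != null
                && original.getHost().equalsIgnoreCase(redirected.getHost());
        boolean samePort = normalizedPort(original, 80) == normalizedPort(redirected, 443);
        return sameHost
                && samePort
                && Objects.equals(original.getPath(), redirected.getPath())
                && Objects.equals(original.getQuery(), redirected.getQuery());
    }
}
```

A resolver could call this check on the original system ID and the Location value before following the redirect, and refuse (or fall back to the default behaviour) when it returns false. The resolver itself would be registered through setEntityResolver, which is available on both org.xml.sax.XMLReader and javax.xml.parsers.DocumentBuilder.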

Alternatives

An alternative would be to build a cache on the server from a list of HTTP locations, physically downloading each file to the server. Any HTTP request could then be intercepted and redirected to the server cache, which would return the file to the ETF. This solution would require registering every URL in a reverse proxy, creating a virtual host for each of them. Its disadvantage is that it complicates the architecture and maintenance of the system; its advantage is that it would provide a cache for these files.

Funding

Funding is provided by JRC through the INSPIRE validator team.

Additional information

This error may also occur in other places where redirection from one protocol to another is not possible. In the future, a more general approach may therefore be needed, one that resolves any redirection regardless of how the library reacts to it.
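
One possible shape for such a general approach is sketched below, purely as an assumption about what it might look like: redirects are followed manually, one hop at a time, so the result does not depend on the library's own redirect handling. The class name RedirectFollower and the limit MAX_REDIRECTS are hypothetical and are not part of ETF.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper for illustration; not part of ETF.
final class RedirectFollower {

    // Assumed upper bound on hops, to avoid redirect loops.
    static final int MAX_REDIRECTS = 5;

    private RedirectFollower() {
    }

    // Redirect statuses handled in this sketch.
    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303 || status == 307 || status == 308;
    }

    // Opens 'url', following redirects manually, including those that change protocol.
    static InputStream open(URL url) throws IOException {
        URL current = url;
        for (int hop = 0; hop <= MAX_REDIRECTS; hop++) {
            HttpURLConnection conn = (HttpURLConnection) current.openConnection();
            conn.setInstanceFollowRedirects(false); // handle every redirect ourselves
            int status = conn.getResponseCode();
            if (!isRedirect(status)) {
                return conn.getInputStream();
            }
            String location = conn.getHeaderField("Location");
            conn.disconnect();
            if (location == null) {
                throw new IOException("Redirect without Location header from " + current);
            }
            current = new URL(current, location); // also resolves relative Location values
        }
        throw new IOException("Too many redirects starting at " + url);
    }
}
```

Unlike the resolver proposed above, this sketch follows any redirect target. A real implementation would probably need a policy on which targets are acceptable.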