INSPIRE-MIF / helpdesk-validator

Community discussion forum for INSPIRE validation issues

Deployment of INSPIRE Validator behind a corporate proxy in a docker environment #311

Closed ghost closed 3 years ago

ghost commented 4 years ago

Hi,

We are using the v2020.1.2 branch in our environment to check metadata requirements. The test "md sds 3.4" fails with the following error:

[Screenshot of the error message]

It seems that the resource http://inspire.ec.europa.eu/metadata-codelist/SpatialDataServiceCategory/SpatialDataServiceCategory.en.xml is now redirected to HTTPS.

The same test passes in the INSPIRE Validator production instance (are the resources cached there?).

Any advice on what the problem could be?

Thanks in advance.


UPDATE 23.11.2020

The following aspects have been discussed in this issue so far:

  1. This problem was noticed when comparing validation results between INSPIRE Validator v2020.1 and INSPIRE Validator v2020.1.2.
  2. In versions before v2020.1, the INSPIRE Validator had a very simple infrastructure: only the etf-validator web application was needed to get exactly the same validation results as in the INSPIRE Validator. In v2020.1, a schema caching solution was introduced to make validation faster. It was provided together with the ETF web app in one Docker container.
  3. In v2020.1.2, the INSPIRE Validator became more "heavyweight" by introducing one more component bundled in the existing Docker container (see the diagram of the INSPIRE Validator architecture):
    • a reverse proxy (for handling the redirect to HTTPS in the INSPIRE Registry)
  4. For v2020.1.2 and the current release (v2020.3), the traditional way of deployment (only the etf-validator web app) leads to differences in validation results, because the ETF validator cannot deal with HTTP redirects for security reasons. Docker is therefore the optimal way to deploy the INSPIRE Validator.
  5. Deploying the current version on localhost via Docker (non-production use, for internal validation) is relatively simple, but this is not the use case for most organizations.
  6. Deploying the current version behind a corporate proxy via Docker (production use, for external and internal validation) seems to be very tricky, because the proxy configuration differs from organization to organization. This is nevertheless the most common use case in the community.
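
Point 4 above hinges on the registry answering plain-HTTP requests with a redirect to HTTPS. As a minimal sketch, a resource could be checked for this behaviour without following the redirect, roughly as below. The helper names `is_https_upgrade` and `probe` are hypothetical, and the sketch makes no claim about how ETF itself treats such responses.

```python
import http.client
from urllib.parse import urlsplit

REDIRECT_CODES = {301, 302, 303, 307, 308}

def is_https_upgrade(url, status, location):
    """True if the response redirects `url` to the same host and path over HTTPS."""
    if status not in REDIRECT_CODES or not location:
        return False
    src, dst = urlsplit(url), urlsplit(location)
    return (src.scheme == "http" and dst.scheme == "https"
            and src.hostname == dst.hostname and src.path == dst.path)

def probe(url, timeout=10):
    """Send one HEAD request over plain HTTP and return (status, Location header).

    http.client does not follow redirects, so the redirect itself is observed.
    """
    parts = urlsplit(url)
    conn = http.client.HTTPConnection(parts.hostname, parts.port or 80, timeout=timeout)
    try:
        conn.request("HEAD", parts.path or "/")
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()
```

Calling `is_https_upgrade(url, *probe(url))` for the codelist URL from the first post would then indicate whether that resource is affected; `probe` needs network access, so its result depends on the environment it runs in.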

UPDATE 28.06.2021

Following aspects have been documented in

HTTPS caching

  1. A squid caching mechanism is in place that caches HTTP requests, but squid is not able to cache HTTPS requests. As described above, all XSD files are requested via HTTPS and therefore cannot be cached by squid. During mass validation with parallel execution, this leads to hanging tests because the IP address gets blacklisted by *.europa.eu.
    • This means that a validator deployed without customization (local installation), following the deployment instructions, will have this problem and will not function properly.

Schema caching solution unstable

  1. The redirection system to the INSPIRE registry is unstable. This problem also occurs in validator deployments without customization.

danielnavarrogeo commented 4 years ago

Dear @DeordD

You are right, this resource is now redirected to HTTPS.

This issue was already fixed in the latest release, so if you are not using the latest Docker package, we recommend that you do.

https://github.com/INSPIRE-MIF/helpdesk-validator/releases/download/v2020.1.2/inspire-validator-2020.1.2.zip

Please let us know if this solves the issue.

Regards

lglref32team2 commented 4 years ago

Dear @danielnavarrogeo, we are deploying the WAR from 2020.1.2. I made the effort of comparing the WAR files with md5sum since v2020.1: the packaged WAR file has been the same since then (even the modification date after unzipping is the same: Mar 18 09:02).

md5sum 2020.1/validator.war
3746c958d032c0e348ea90f17424114b  2020.1/validator.war
md5sum 2020.1.1/validator.war
3746c958d032c0e348ea90f17424114b  2020.1.1/validator.war
md5sum 2020.1.2/validator.war
3746c958d032c0e348ea90f17424114b  2020.1.2/validator.war

I downloaded 2020.1 and 2020.1.2 again to rule out mistakes on my side.
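
The same comparison can be scripted when more releases need to be checked. The following is only a sketch; `md5_of` and `group_by_digest` are hypothetical helper names, and the paths in the usage note are the ones from the md5sum calls above.

```python
import hashlib
from pathlib import Path

def md5_of(path, chunk_size=1 << 16):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def group_by_digest(paths):
    """Map each digest to the list of paths whose content has that digest."""
    groups = {}
    for p in paths:
        groups.setdefault(md5_of(p), []).append(str(p))
    return groups
```

Running `group_by_digest(["2020.1/validator.war", "2020.1.1/validator.war", "2020.1.2/validator.war"])` and getting a single key back would mean the three archives are byte-identical.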

Regards, Alex

carlospzurita commented 4 years ago

Dear @lglref32team2, the handling of redirections from HTTP to HTTPS for the INSPIRE registry is performed in the Docker image, not in the .war file. In the Docker image we set up a proxy to redirect the INSPIRE requests, given that it is not good practice to let client applications follow redirections.

For now, we recommend that you either set up a proxy, use the pre-built image from the release, or build your own image using the Dockerfile and the other resources that you can find in the inspire-validator zip file.

ghost commented 4 years ago

Hi @carlospzurita , @danielnavarrogeo ,

I will give my feedback on this once the discussion in #319 is closed. It should then be clear whether the redirect from HTTP to HTTPS for the INSPIRE domain will remain stable and whether it works in accordance with the TGs.

ghost commented 4 years ago

Hi @carlospzurita , @danielnavarrogeo

After the comments from @MarcoMinghini here, the situation is much clearer now.

To have all possible options for solving this problem on the table (some of them are stated here): is it possible to solve this problem (validation results differing with and without a proxy) purely on the ets-repository side, without setting up a proxy?

I am aware this is not an optimal way to solve this, but it would be great to have the complete picture.

Thanks in advance!

carlospzurita commented 4 years ago

Dear @DeordD

For now, no change is foreseen on the ETS side for the HTTPS redirection. Given that most of the URLs are not pre-processed and all requests are handled by the underlying HTTPClient library, adding a check in every possible place that could contain an INSPIRE URI would make the tests excessively complex.

So the remaining options are to modify the services accordingly, set up the proxy in your own environment, or base your environment on the released Docker image.

hwbllmnn commented 4 years ago

I'm using a slightly modified version of the official docker image. When I request the file via curl http://localhost/metadata-codelist/SpatialDataServiceCategory/SpatialDataServiceCategory.en.xml from within the container I get:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>500 Proxy Error</title>
</head><body>
<h1>Proxy Error</h1>
The proxy server could not handle the request<p>Reason: <strong>Error during SSL Handshake with remote server</strong></p><p />
<hr>
<address>Apache/2.4.10 (Debian) Server at localhost Port 80</address>
</body></html>

Is the official docker image not based on the included dockerfile?

carlospzurita commented 4 years ago

Dear @hwbllmnn

The Docker image published in the production instance is based on the included Dockerfile (with slight modifications to deployment-specific options). The redirection is set up in the same way.

Please take into account that there are some configuration and setup steps for both the Apache server acting as a proxy for redirection and the squid3 service acting as a cache. The validator is configured to send its requests to the port where squid is running. The cache then uses its own hosts file to send all requests for the INSPIRE domain to the redirection server running at localhost. The Apache server is configured with a virtual domain to perform the redirection.

Take a look at the files in the res folder to check this configuration and how they are used in docker-entrypoint.sh.

Of course, this is something that can be altered in any way you see fit, taking into account that you need to modify the etf-config.properties file inside the WAR to use the proper HTTP proxy that you may set up.

hwbllmnn commented 4 years ago

Hi @carlospzurita ,

I've played around with the proxy some more, especially the SSL options, and arrived at the following configuration:

<VirtualHost *:*>
ServerAdmin carlospalma@guadaltel.com
ServerName inspire.ec.europa.eu
ErrorLog /var/log/apache2/inspire.ec.europa.eu-ssl-error_log
CustomLog /var/log/apache2/inspire.ec.europa.eu-ssl-access_log common

SSLProxyEngine On
SSLProxyCheckPeerName off
SSLProxyVerify none
SSLProxyCheckPeerCN off
ProxyPreserveHost On
DocumentRoot /var/www/html/

ProxyPass / https://inspire.ec.europa.eu/
ProxyPassReverse / https://inspire.ec.europa.eu/

</VirtualHost>

However, the server at inspire.ec.europa.eu delivers a 403 now:

root@86cc17f65c17:/var/lib/jetty# curl http://localhost/metadata-codelist/SpatialDataServiceCategory/SpatialDataServiceCategory.en.xml -D -
HTTP/1.1 403 Forbidden
Date: Wed, 12 Aug 2020 08:54:19 GMT
Server: Apache
X-Frame-Options: SAMEORIGIN
Content-Type: text/html; charset=iso-8859-1
Transfer-Encoding: chunked
Content-Language: en

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>403 Forbidden</title>
</head><body>
<h1>Forbidden</h1>
<p>You don't have permission to access /metadata-codelist/SpatialDataServiceCategory/SpatialDataServiceCategory.en.xml
on this server.</p>
</body></html>

After enabling trace8 logging in the apache config, I can see that the request is properly made via SSL. But it's the server that actually sends the 403. I assume I'll have to set a header to send with the proxy request in order to get the proxy to work?

carlospzurita commented 4 years ago

Dear @hwbllmnn

Please take into account that we are also using squid3 as a service to cache the schemas, and that service is handling the requests to the INSPIRE registry. Have you checked the configuration being done in the docker-entrypoint.sh file?

cp /etc/hosts /etc/squid_hosts
echo "127.0.0.1 inspire.ec.europa.eu" >> /etc/squid_hosts
service apache2 start
a2enmod ssl
a2enmod proxy
a2enmod rewrite
a2enmod proxy_http
a2dissite 000-default
a2ensite proxy.conf 
service apache2 reload

rm -rf /var/spool/squid3/*
service squid3 start

We use an additional hosts file for squid that redirects all calls to inspire.ec.europa.eu, as added in the second line of the script above. We keep this separate from the /etc/hosts file so as not to interfere with other operations of the operating system, and to perform this redirection only for the validator. This is configured in the file squid.conf, where the caching is activated as well.
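
The effect of the separate hosts file can be illustrated with a small lookup sketch. This is not squid's code: `parse_hosts` and `resolve` are hypothetical helpers, the sample file contents are illustrative, and letting the first entry for a name win is an assumption of the sketch.

```python
def parse_hosts(text):
    """Parse hosts-file text into {name: ip}, ignoring comments and blank lines."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            # Assumption of this sketch: the first entry for a name wins.
            table.setdefault(name.lower(), ip)
    return table

def resolve(name, hosts_text):
    """Return the IP the hosts text pins `name` to, or None if DNS would be used."""
    return parse_hosts(hosts_text).get(name.lower())

# Illustrative stand-in for the container's /etc/hosts.
SYSTEM_HOSTS = "127.0.0.1 localhost\n"
# The copy used by squid, with the line the entrypoint appends.
SQUID_HOSTS = SYSTEM_HOSTS + "127.0.0.1 inspire.ec.europa.eu\n"
```

With these samples, `resolve("inspire.ec.europa.eu", SYSTEM_HOSTS)` gives `None` (ordinary DNS applies to every other process), while `resolve("inspire.ec.europa.eu", SQUID_HOSTS)` gives `"127.0.0.1"`, so only traffic passing through squid is sent to the local Apache.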

carlospzurita commented 4 years ago

Dear @hwbllmnn

Did you have any success applying the latest comments on this issue? If you need anything else, please let us know.

hwbllmnn commented 4 years ago

Hi @carlospzurita ,

Unfortunately, no. We're using an unmodified version of docker-entrypoint.sh; the only changes to the Dockerfile are that we pre-download the scripts and add a few custom ones.

After that we're getting the SSL error from above. After applying the changes to the apache config from here we still get the 403.

carlospzurita commented 4 years ago

@hwbllmnn, would it be possible for you to send us the current state of your installation? All the files involved: Dockerfile, entrypoint, server configurations... It may be better for us to set up everything on our premises and work on that deployment with full information and room to modify things.

hwbllmnn commented 4 years ago

Hi @carlospzurita ,

sure, here it comes:

setup.zip

You'll have to remove the two lines copying in extra scripts near the end of the Dockerfile as I probably cannot give those away freely (and they won't have anything to do with the proxy). Make sure you have the current validator.war next to the Dockerfile when building the image.

Note that I included the above changes to the Apache proxy.conf, so you'll probably get the access denied error from above. Thank you for looking into this, it's much appreciated!

carlospzurita commented 3 years ago

Dear @hwbllmnn

We have been checking the files that you sent us. The Docker image built just fine, and the container started without any issue. All the modifications that you made to the proxy.conf file only affect what is running inside the container; that is, they have no bearing on how the container communicates with the rest of the network that the host machine is connected to.

The 403 that you are getting from cURL happens because the setup in the container is not intended to be used that way. If you don't update your /etc/hosts file, a request to localhost will not go through the Apache running under the virtual domain declared in proxy.conf. But modifying this file affects all requests from inside the container, not only the ones from the validator.

To handle this setup without interfering with any other requests, we set up an alternative hosts file in the first lines of docker-entrypoint.sh:

cp /etc/hosts /etc/squid_hosts
echo "127.0.0.1 inspire.ec.europa.eu" >> /etc/squid_hosts

This file is then used by squid, the caching system for the schemas. This is configured in the file squid.conf, in the following section:

#  TAG: hosts_file
#   Location of the host-local IP name-address associations
#   database. Most Operating Systems have such a file on different
#   default locations:
#   - Un*X & Linux:    /etc/hosts
#   - Windows NT/2000: %SystemRoot%\system32\drivers\etc\hosts
#              (%SystemRoot% value install default is c:\winnt)
#   - Windows XP/2003: %SystemRoot%\system32\drivers\etc\hosts
#              (%SystemRoot% value install default is c:\windows)
#   - Windows 9x/Me:   %windir%\hosts
#              (%windir% value is usually c:\windows)
#   - Cygwin:      /etc/hosts
#
#   The file contains newline-separated definitions, in the
#   form ip_address_in_dotted_form name [name ...] names are
#   whitespace-separated. Lines beginning with an hash (#)
#   character are comments.
#
#   The file is checked at startup and upon configuration.
#   If set to 'none', it won't be checked.
#   If append_domain is used, that domain will be added to
#   domain-local (i.e. not containing any dot character) host
#   definitions.
#Default:
hosts_file /etc/squid_hosts

Then, the validator is configured to use the HTTP port of squid in the following Dockerfile lines. Note that these variables refer to a host and port inside the container; no host machine or external server is referred to here.

# Activate HTTP proxy server by setting a host (IP or DNS name).
# Default: "none" for not using a proxy server
ENV HTTP_PROXY_HOST localhost
# HTTP proxy server port. Default 8080. If you are using Squid it is 3128
ENV HTTP_PROXY_PORT 3128

So any request to inspire.ec.europa.eu coming from the validator is sent to this port. There, squid resolves the domain to 127.0.0.1 through the alternative hosts file, so the request is rerouted to the Apache virtual domain, which handles the redirection from HTTP to HTTPS and forwards it to the real INSPIRE domain.
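
To exercise the same path by hand from inside the container, a request can be routed through squid the way the validator's proxy settings do. The sketch below assumes a Python interpreter is available in the container, which the stock image may not provide; `squid_opener` and `fetch_status` are hypothetical names, and the host and port are the values from the Dockerfile lines above.

```python
import urllib.request

PROXY_HOST = "localhost"  # ENV HTTP_PROXY_HOST in the Dockerfile
PROXY_PORT = 3128         # ENV HTTP_PROXY_PORT in the Dockerfile

def squid_opener(host=PROXY_HOST, port=PROXY_PORT):
    """Build an opener that sends plain-HTTP requests through the squid proxy."""
    proxy = urllib.request.ProxyHandler({"http": f"http://{host}:{port}"})
    return urllib.request.build_opener(proxy)

def fetch_status(url, opener=None, timeout=30):
    """Return the HTTP status for `url` as seen through the proxy.

    urllib raises urllib.error.HTTPError for 4xx/5xx responses, so a 403
    or 500 from the chain surfaces as an exception rather than a value.
    """
    opener = opener or squid_opener()
    with opener.open(url, timeout=timeout) as resp:
        return resp.status
```

Requesting the codelist URL with `fetch_status(...)` goes validator-style through squid and the Apache virtual domain, which is a different path from calling `http://localhost/...` directly.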

In any case, the configuration inside the container has no effect on this particular issue, because the issue is related to the networking of the Docker installation. If you are still getting errors accessing the codelists through the validator, it may be related to a configuration issue of your Docker client. Please check the latest release notes, mainly the "Exposing the validator through a proxy" section. There you will find an explanation of how to work around proxy issues.

If you need any more feedback or clarification, please contact us.

hwbllmnn commented 3 years ago

I'm not sure how this will help me. Since I need to use a corporate proxy, that one needs to be configured, so the proxy on localhost inside the container will not be used anyway?

Apart from that, I still get that 403 when requesting the Apache reverse proxy inside the container, so even if it were used, the codelists would not be available.

carlospzurita commented 3 years ago

The proxy on localhost inside the container will be used by the ETF to cache requests for external resources, and handle the redirection on the INSPIRE registry.

You need to configure your Docker client, that is, the installation of Docker on your machine, to use the corporate proxy and give the container access. That is what the resources in the release notes refer to, and what you would need to apply to your configuration.

The 403 code from inside the container will always persist if you use a tool such as cURL to perform the requests. As explained in my last comment, there is a special configuration for the cache system (squid3), which uses an alternate hosts file. This hosts file maps the domain inspire.ec.europa.eu to 127.0.0.1, where the internal Apache is running. As a result, the virtual domain set up on Apache receives the traffic from squid3 and from the validator. Any other request pointing directly to localhost will not produce a result, as the Apache server is not configured to run under that alias.

I hope that this diagram may clarify this. The red arrow is the configuration bit that is explained in the section "Exposing the validator through a proxy" on the release page.

[Diagram: proxy(1)]
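
One way to test this explanation by hand is to connect to 127.0.0.1 while naming inspire.ec.europa.eu in the Host header, which is roughly what squid's request looks like once the alternate hosts file has been applied. This is a sketch only: `plan_request` and `fetch_direct` are hypothetical names, port 80 is taken from the Apache error page quoted earlier, and whether the request succeeds depends on the Apache configuration in use, which has not been verified here.

```python
import http.client
from urllib.parse import urlsplit

# Where the alternate hosts file points inspire.ec.europa.eu.
APACHE_ADDR = ("127.0.0.1", 80)

def plan_request(url, addr=APACHE_ADDR):
    """Work out where to connect and what to send so the request resembles squid's."""
    parts = urlsplit(url)
    return {
        "connect_to": addr,
        "path": parts.path or "/",
        # Name the virtual domain instead of "localhost".
        "headers": {"Host": parts.hostname},
    }

def fetch_direct(url, timeout=30):
    """Send the planned GET to the local Apache and return the HTTP status code."""
    plan = plan_request(url)
    conn = http.client.HTTPConnection(*plan["connect_to"], timeout=timeout)
    try:
        conn.request("GET", plan["path"], headers=plan["headers"])
        return conn.getresponse().status
    finally:
        conn.close()
```

Comparing the status from `fetch_direct(...)` with the 403 obtained for `http://localhost/...` would show whether the Host name is what makes the difference.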

hwbllmnn commented 3 years ago

Ok, thanks for the clarification. I had not understood that we HAVE to configure Docker to use the proxy for all external traffic. Unfortunately, that does not seem to be an option for us (we're running on Kubernetes).

Thanks again for staying with us on this issue, it's much appreciated!

ghost commented 3 years ago

Hi @carlospzurita ,

Could you please check my summary of the discussion in this issue? (The intention was to write it in a non-technical way, so that people can understand the challenge here.)

  1. This problem was noticed when comparing validation results between INSPIRE Validator v2020.1 and INSPIRE Validator v2020.1.2.
  2. In versions before v2020.1, the INSPIRE Validator had a very simple infrastructure: only the etf-validator web application was needed to get exactly the same validation results as in the INSPIRE Validator. In v2020.1, a schema caching solution was introduced to make validation faster. It was provided together with the ETF web app in one Docker container.
  3. In v2020.1.2, the INSPIRE Validator became more "heavyweight" by introducing one more component bundled in the existing Docker container (see the diagram of the INSPIRE Validator architecture):
    • a reverse proxy (for handling the redirect to HTTPS in the INSPIRE Registry)
  4. For v2020.1.2 and the current release (v2020.3), the traditional way of deployment (only the etf-validator web app) leads to differences in validation results, because the ETF validator cannot deal with HTTP redirects for security reasons. Docker is therefore the optimal way to deploy the INSPIRE Validator.
  5. Deploying the current version on localhost via Docker (non-production use, for internal validation) is relatively simple, but this is not the use case for most organizations.
  6. Deploying the current version behind a corporate proxy via Docker (production use, for external and internal validation) seems to be very tricky, because the proxy configuration differs from organization to organization. This is nevertheless the most common use case in the community.

Feel free to change my summary in any way needed.

Thanks a lot!

carlospzurita commented 3 years ago

Dear @DeordD

I think your summary covers everything for this issue. One thing to point out is that the schema caching solution was already included in the Docker image of version 2020.1. Everything else is correct.

ghost commented 3 years ago

@carlospzurita I have changed it. Thanks a lot!

dperezBM commented 3 years ago

Dear all,

Thank you very much for your contributions, we hope everything is clear. Please, if you have any other questions or problems, do not hesitate to open another issue.

Best regards.

ghost commented 3 years ago

@carlospzurita Could you confirm my update from 28.06.2021?

ghost commented 3 years ago

The following issues have occurred during the 2021.2 deployment: