ckan / datapusher

A standalone web service that pushes data files from a CKAN site's resources into its DataStore
GNU Affero General Public License v3.0

datapusher problems when running CKAN behind cloudflare / other firewalls #153

Open terchris opened 6 years ago

terchris commented 6 years ago

Hi. It seems to me that datapusher uses the DNS name of the site to fetch a file that has been uploaded.

Correct me if I'm wrong, but from what I have observed, a file is handled like this:

  1. The file is first uploaded to the CKAN server.
  2. Datapusher fetches the file using the full domain name.
  3. Datapusher opens the file and updates the database with the file content.
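The three steps above can be sketched roughly as follows. This is only an illustration of the observed flow, not datapusher's actual code: `push_resource`, `fetch`, and `store` are hypothetical names, and the sketch assumes the resource is a UTF-8 CSV file. The point is that step 2 goes through whatever the public URL resolves to, which is where a proxy or firewall can interfere.

```python
import csv
import io
from typing import Callable, Dict, List


def push_resource(
    resource_url: str,
    fetch: Callable[[str], bytes],
    store: Callable[[List[Dict[str, str]]], None],
) -> int:
    """Fetch a CSV resource by URL, parse it, and hand the rows to `store`.

    `fetch` and `store` are injected so the flow can be exercised without
    a network or a database (both are hypothetical stand-ins).
    """
    raw = fetch(resource_url)  # step 2: download via the full URL
    text = raw.decode("utf-8")  # assumption: the file is UTF-8 CSV
    rows = list(csv.DictReader(io.StringIO(text)))  # step 3: open the file
    store(rows)  # step 3: update the database
    return len(rows)


# Usage with an in-memory stand-in for the download and the database.
saved: List[Dict[str, str]] = []
count = push_resource(
    "http://your.ckan.instance.com/example.csv",
    lambda url: b"a,b\n1,2\n3,4\n",
    saved.extend,
)
```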

This works fine when the domain name points directly to the CKAN server. But in my case I have two security measures in front of the CKAN server.

The public IP of the domain points to cloudflare.com. Cloudflare in turn points to the IP address of a Web Application Firewall (Application Gateway on Azure), which in turn points to the CKAN server.

So when a file is uploaded to the CKAN server and datapusher tries to fetch it, the request goes to Cloudflare. Cloudflare sees datapusher as illegal activity and blocks it.

In datapusher.error.log I see: HTTPError: DataPusher received a bad HTTP response when trying to download the data file status=4, and in the HTML text received I see the message from Cloudflare: "The owner of this website has banned your access based on your browser's signature"
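To tell this kind of block apart from an ordinary download failure, a small check on the response body can help. The sketch below is a hypothetical helper, not part of datapusher. It assumes the block page contains the phrase quoted above and that the status is an HTTP error code (400 or higher); the exact status code is not taken from the log.

```python
def looks_like_cloudflare_block(status: int, body: str) -> bool:
    """Return True if a failed download looks like the Cloudflare ban page.

    Assumptions: the ban page carries the phrase quoted in this issue,
    and the response has an HTTP error status (>= 400).
    """
    marker = "banned your access based on your browser's signature"
    return status >= 400 and marker in body


# Usage: feed it the status and HTML text of the failed download.
blocked = looks_like_cloudflare_block(
    403,
    "The owner of this website has banned your access "
    "based on your browser's signature",
)
```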

In the docs I see that ckan.site_url = http://your.ckan.instance.com is used to tell datapusher where to find CKAN, using the full domain name. I think this parameter is used for other things as well, and changing it would create other problems.

Is there a way to tell datapusher that it can find CKAN at an IP address?

Regards Terje

terchris commented 6 years ago

A workaround is to put the host name in the hosts file.
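A hosts file entry makes the machine resolve the host name to an address of your choosing, so the request never reaches the public proxy. The same idea can be sketched in code: rewrite the URL to point at an internal IP while keeping the original name in the Host header. This is a hypothetical illustration, not a datapusher option; `direct_target` and the address 10.0.0.5 are made up. Like the hosts file, it bypasses the proxy entirely, and over HTTPS the certificate would be checked against the IP rather than the host name.

```python
from typing import Dict, Tuple
from urllib.parse import urlsplit, urlunsplit


def direct_target(url: str, overrides: Dict[str, str]) -> Tuple[str, Dict[str, str]]:
    """Rewrite `url` to an overridden IP and return (new_url, extra_headers).

    `overrides` maps host names to IP addresses, much like hosts file
    entries. URLs whose host is not listed are returned unchanged.
    """
    parts = urlsplit(url)
    host = parts.hostname
    ip = overrides.get(host) if host else None
    if ip is None:
        return url, {}
    # Preserve an explicit port on both the new netloc and the Host header.
    if parts.port is None:
        netloc, host_header = ip, host
    else:
        netloc, host_header = f"{ip}:{parts.port}", f"{host}:{parts.port}"
    new_url = urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )
    return new_url, {"Host": host_header}


# Usage: the internal address here is an assumption for illustration.
new_url, headers = direct_target(
    "http://your.ckan.instance.com/dataset/example.csv",
    {"your.ckan.instance.com": "10.0.0.5"},
)
```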

ThrawnCA commented 6 years ago

This appears to be the same issue as #50 - CloudFlare basically thinks that the DataPusher is a bad bot.

The workaround of using the hosts file will bypass CloudFlare entirely, which doesn't work for us because our certificate is at CloudFlare and we get SSL validation failures without it.