allenai / ir_datasets

Provides a common interface to many IR ranking datasets.
https://ir-datasets.com/
Apache License 2.0

Environment variable to disable SSL certificate checks #62

Closed seanmacavaney closed 3 years ago

seanmacavaney commented 3 years ago

Is your feature request related to a problem? Please describe.

Sometimes downloadable content isn't available due to expired or otherwise invalid SSL certificates.

Describe the solution you'd like

Since we always check the md5 hash of downloaded content, it's /probably/ safe to disable the certificate check in these cases. We probably do not want this to be the default behavior, however, for the reason given under the alternatives below. Several environment variables already control download behavior, so I propose adding one that disables certificate checks: IR_DATASETS_DL_CERT_CHECK=false. This would let users explicitly opt out of the SSL certificate check, accepting any potential risk.

This would work for both the Python interface and the CLI. For Python, the value should be checked ahead of every download so that users can change it without restarting a notebook session, for instance.
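A minimal sketch of how the proposal could behave, not the library's actual implementation: the variable is read each time a download is about to start, and a fresh SSL context is built from it. The helper names `cert_check_enabled` and `make_ssl_context` are hypothetical, and treating values other than `false` (such as `0` or `no`) as "disabled" is an assumption that goes beyond what is proposed here.

```python
import os
import ssl

# Variable name as proposed in this issue.
ENV_VAR = "IR_DATASETS_DL_CERT_CHECK"


def cert_check_enabled(environ=None):
    """Return False only when the variable is explicitly set to a false-like value."""
    environ = os.environ if environ is None else environ
    value = environ.get(ENV_VAR, "true").strip().lower()
    # Accepting "0", "no" and "off" in addition to "false" is an assumption.
    return value not in ("false", "0", "no", "off")


def make_ssl_context(environ=None):
    """Build a new context per download so a changed variable takes effect at once."""
    context = ssl.create_default_context()
    if not cert_check_enabled(environ):
        # check_hostname must be switched off before verify_mode can be CERT_NONE.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context
```

With the standard library, such a context could be passed as `urllib.request.urlopen(url, context=make_ssl_context())`. Because nothing is cached, setting `os.environ["IR_DATASETS_DL_CERT_CHECK"] = "false"` in a running notebook would affect the next download without a restart.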

I'm not sure what I'd like to do for tests. I guess the test could check this condition and yield a status other than PASS or FAIL for this (assuming md5 is correct).

Describe alternatives you've considered

We could also always ignore SSL certificates, since we're verifying the md5 hash anyway, but this seems unwise as potentially harmful content could be downloaded from an impersonator.
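For reference, the md5 verification mentioned above could look roughly like the sketch below. The function names `md5_of_file` and `md5_matches` are hypothetical and are not taken from the ir_datasets code base; the sketch only illustrates that the check compares file content against an expected hash and says nothing about who served the file.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_matches(path, expected_md5):
    """Return True when the file's md5 equals the expected hex digest."""
    return md5_of_file(path) == expected_md5.strip().lower()
```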

Additional context

Currently, some resources accessed from ir.nist.gov are not accessible, such as https://ir.nist.gov/covidSubmit/data/topics-rnd5.xml. Manual workarounds are available, but this would offer an easier workaround going forward.

If this problem is going to be long-term for these files, we'll need to decide what to do about them. One option is to disable the certificate check for these files in particular. Another would be to find a way to add the proper certificate (is there a Python package that updates the list of certificates to match what browsers ship with?).

ganeshan007 commented 3 years ago

Hello @seanmacavaney. I'm working on a problem related to retrieval of articles related to COVID-19 and wanted to access this data through this library. What workarounds would you suggest?

seanmacavaney commented 3 years ago

Hi @ganeshan007,

The SSL certificate issues have been fixed by NIST, and the downloads now work for me. Are you still getting SSL errors when downloading the data from them? If so, can you post them here so I can further investigate?

The workaround is this. When it starts downloading, it will give a message like this (depends on the particular file):

[INFO] If you have a local copy of https://ir.nist.gov/covidSubmit/data/topics-rnd5.xml, you can symlink it here to avoid downloading it again: ~/.ir_datasets/downloads/0307a37b6b9f1a5f233340a769d538ea

You can then manually download the file and move it to the location specified. So for instance, you could run:

curl https://ir.nist.gov/covidSubmit/data/topics-rnd5.xml > ~/.ir_datasets/downloads/0307a37b6b9f1a5f233340a769d538ea
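If the file has already been obtained some other way, the symlink route mentioned in the [INFO] message can be sketched in Python as below. The helper name `link_local_copy` is hypothetical; the target path should be copied from the message printed for the particular file rather than constructed by hand.

```python
import os
from pathlib import Path


def link_local_copy(local_copy, target):
    """Symlink an already-downloaded file to the path named in the [INFO] message."""
    local_copy = Path(local_copy).expanduser().resolve()
    target = Path(target).expanduser()
    if not local_copy.is_file():
        raise FileNotFoundError(local_copy)
    # Create the downloads directory if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    if target.is_symlink() or target.exists():
        # Leave an existing file or link untouched.
        return target
    os.symlink(local_copy, target)
    return target
```

For example, `link_local_copy("topics-rnd5.xml", "~/.ir_datasets/downloads/0307a37b6b9f1a5f233340a769d538ea")` would link a copy in the current directory to the location from the message above.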

ganeshan007 commented 3 years ago

Thanks for suggesting the workaround @seanmacavaney ! I checked once again, and now the downloads are working fine for me.