ESGF / esgf-wget

Service API endpoint for simplified wget scripts

whitelist for projects #25

Open sashakames opened 4 years ago

sashakames commented 4 years ago

To integrate wget into ESGF where it can't yet support restricted projects (as the auth infrastructure isn't deployment-ready), we should whitelist particular projects that are unrestricted. The project can be detected either in a search parameter (project=X) or, more commonly, as the first attribute in the dataset tuple. If any datasets are detected outside of the whitelist, we display an error message to the user stating that the requested dataset cannot be processed through this service and asking them to (1) redo the request for unrestricted data only, AND (2) create a request for restricted data at a non-LLNL site.

[CMIP6, cmip5, cmip3, input4MIPs, obs4MIPs, CREATE-IP, E3SM] are most of the projects to consider.
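A minimal sketch of the check described above, not the esgf-wget implementation. It assumes the "dataset tuple" is a dot-separated dataset ID whose first facet is the project, and that project names are compared case-insensitively (the list above mixes cases). The names ALLOWED_PROJECTS, project_of and find_disallowed are hypothetical.

```python
# Hypothetical whitelist; the project names are taken from the list in this issue.
ALLOWED_PROJECTS = {
    "CMIP6", "cmip5", "cmip3", "input4MIPs", "obs4MIPs", "CREATE-IP", "E3SM",
}


def project_of(dataset_id):
    """Return the first dot-separated facet of a dataset ID (assumed to be the project)."""
    return dataset_id.split(".", 1)[0]


def find_disallowed(dataset_ids, project_param=None, allowed=ALLOWED_PROJECTS):
    """Return the sorted list of requested projects that are not whitelisted.

    The project is taken from the optional `project` search parameter and
    from the first facet of each dataset ID. Comparison is case-insensitive,
    which is an assumption made for this sketch.
    """
    allowed_lower = {name.lower() for name in allowed}
    requested = set()
    if project_param:
        requested.add(project_param)
    requested.update(project_of(dataset_id) for dataset_id in dataset_ids)
    return sorted(name for name in requested if name.lower() not in allowed_lower)
```

An empty result means the request can proceed; a non-empty result names the projects to mention in the error message shown to the user.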

mauzey1 commented 4 years ago

@sashakames Should the whitelist be held in an external XML file like the Solr shard list?

sashakames commented 4 years ago

local_settings.py is good for the whitelist. The XML format is a legacy from the Java days.

mauzey1 commented 4 years ago

@sashakames We have recently replaced the local_settings.py file with an INI configuration file. We could either have a variable in the config file that is a comma-separated list of projects, or we could have the whitelist stored in a JSON file.
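For comparison, here is a rough sketch of how each option could be read with the Python standard library. The section name "django", the option name "allowed_projects" and both helper functions are made up for illustration and are not taken from the esgf-wget configuration.

```python
import configparser
import json

# Option 1: a comma-separated value in the INI file (section/option names are hypothetical).
INI_TEXT = """
[django]
allowed_projects = CMIP6, cmip5, cmip3, input4MIPs
"""

# Option 2: a JSON file holding a plain array of project names.
JSON_TEXT = '["CMIP6", "cmip5", "cmip3", "input4MIPs"]'


def projects_from_ini(text, section="django", option="allowed_projects"):
    """Split a comma-separated INI value into a list of trimmed project names."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    raw = parser.get(section, option)
    return [item.strip() for item in raw.split(",") if item.strip()]


def projects_from_json(text):
    """Parse a JSON array of project names, rejecting any other shape."""
    projects = json.loads(text)
    if not isinstance(projects, list) or not all(isinstance(p, str) for p in projects):
        raise ValueError("expected a JSON array of project names")
    return projects
```

With the INI route the splitting and trimming has to be done by hand, whereas JSON gives a list directly.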

sashakames commented 4 years ago

I'm fine with a .json file. INI doesn't work well with lists (picky about formatting in my opinion), though ESGConfigParser might help if you can get past the learning curve.

mauzey1 commented 4 years ago

esgf-wget currently uses environment variables for the config file path and Django secret key, INI for the initial settings, and XML for the Solr shards.

Maybe we could use JSON for all of these settings at some point. Not necessarily everything in one file but using JSON format for all config files.

sashakames commented 4 years ago

The Solr shards from esg-search are a bit of a legacy list. It's fine to migrate to a different form, but we would need to ensure the new list is kept up to date in the event another shard drops out. I consider .json to be the easiest machine-readable format for Python programmers but am open to other opinions on the matter.

philipkershaw commented 4 years ago

Just offering some thoughts on this issue: I'm wary of having a separate piece of metadata, apart from the auth layer itself, to indicate whether something is secured or not. I would not have this. Instead I would suggest the esgf-wget code verify by going to source: do a sample check on a download and see if it returns 401 Unauthorized. If so, you know to a good level of confidence that it is a secured dataset. If we make a separate list we run the risk that we have to maintain information about access control in two different places. These can get out of sync with one another, e.g. the policy for a dataset changes to be open but esgf-wget tells me I can't download it because it is not in the whitelist.
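A rough sketch of such a sample check, using only the standard library. It assumes the data node answers an unauthenticated HEAD request on a file URL with 401 for secured data; a node that redirects to a login page instead would not be detected by this check. The function name looks_restricted is hypothetical.

```python
import urllib.error
import urllib.request


def looks_restricted(url, timeout=10):
    """Return True if an unauthenticated HEAD request to `url` gets 401 Unauthorized.

    Returns False when the request succeeds. Any other HTTP error is
    re-raised so the caller can decide how to treat it.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout):
            return False
    except urllib.error.HTTPError as err:
        if err.code == 401:
            return True
        raise
```

Checking one sample file per dataset, rather than every file, would keep the extra request cost low; that sampling choice is left to the caller here.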

sashakames commented 4 years ago

@philipkershaw good suggestion. My suggestion was devised as a stop-gap for deployment at LLNL to reduce CMIP download errors that continue to plague the support list. We'll need to think more about how to handle the different download scenarios wrt restricted / unrestricted X anonymous / logged-in. I'd advocate that any user can get a wget script without a token and it should work well for unrestricted data.

philipkershaw commented 4 years ago

I'd advocate that any user can get a wget script without a token and it should work well for unrestricted data.

Agreed! :)

mauzey1 commented 3 years ago

@sashakames I think we can close this issue since the project whitelist has already been implemented in esgf-wget.