ERDDAP / erddap

ERDDAP is a scientific data server that gives users a simple, consistent way to download subsets of gridded and tabular scientific datasets in common file formats and make graphs and maps. ERDDAP is a Free and Open Source (Apache and Apache-like) Java Servlet from NOAA NMFS SWFSC Environmental Research Division (ERD).
Creative Commons Zero v1.0 Universal

How to request a protected dataset from a script ? #92

Open gmaze opened 1 year ago

gmaze commented 1 year ago

Hi! I recently ran into an issue I can't solve myself, so I would like to ask for your feedback and/or help here.

I maintain the argopy python library. It can be used to fetch Argo data from several sources (ftp, http, files), in particular the Ifremer erddap instance. Everything works very well as long as datasets are public (congratulations on your work, erddap is really a game changer for easy data access).

But recently we came across a new user requirement: using argopy to access protected data. We therefore set up an erddap server with the recommended ORCID authentication process. It works well using the web browser interface.

However, even if a user is logged in to the erddap server and can see/access the protected data using a web browser, I cannot manage to access/request this data using the argopy library from a CLI script, or even from a Jupyter notebook running in the same web browser.

Do you have any idea on how to solve this issue please ?

ps: I'm not even sure that having argopy authenticated by ORCID would make the erddap server allow requests to the protected dataset (https://github.com/euroargodev/argopy/issues/243).

ps: Maybe the real question is which http request header parameters the erddap server requires in order to consider a client request authenticated.

ps: I'm aware of the "https://coastwatch.pfeg.noaa.gov/erddap/download/AccessToPrivateDatasets.html" Scripts instructions. But it does not address the issue here (orcid login)

BobSimons commented 1 year ago

That is a challenging problem that affects a growing number of people.

See the "Scripts" section of https://coastwatch.pfeg.noaa.gov/erddap/download/AccessToPrivateDatasets.html. It probably isn't directly applicable, but it may give you a hint at how to solve the problem.

If it doesn't help, then one solution (already on the To Do list) is for us to add a feature to ERDDAP where a logged-in user can request a 24-hour (or user-specified duration?) temporary password, and where ERDDAP accepts this one time password when it is passed as a parameter from a script. The downside is that this is much less secure than OAuth authentication and so makes ERDDAP's protection of the data much less secure.

But I haven't kept up with how other software handles this problem. It is worth looking around for better solutions. I'll try to get Chris John involved.

gmaze commented 1 year ago

Thanks for your quick answer ! Indeed the temporary password solution would be much less secure than OAuth, and basically the point is for our library users to be able to run a data fetching script in bash mode

In a perfect world, we can imagine that the erddap server would have a settings page where registered users could request and manage secret keys. Users could assign a key to a specific program/client that sends requests to the erddap server. On the client library side (e.g. argopy), we would let users provide this key and automatically add it to http requests to the erddap server (as an x-param in the header, for instance). The erddap server would then check the validity of this key and allow or block the request.

but I'm afraid now that this is just paraphrasing your temporary password suggestion !

BobSimons commented 1 year ago

I'm not so keen on a settings page and having ERDDAP manage secrets for the long run. There are security advantages to having the password be valid for a short time rather than a long time. And there are security advantages if ERDDAP just has to keep secret info in memory and not store it to disk (for longer term use, and in case ERDDAP is restarted).

I'll add to your idea: the password could be tied to a specified IP address (not necessarily the computer the user is using to request the password). But I know that with some, e.g., Amazon setups, the script might run on multiple servers and you might not know the IP address of any of them.

gmaze commented 1 year ago

I think I understand your concern and design vision for ERDDAP

About attaching the IP address: indeed, this would prevent requests from being sent from the computing nodes of HPC or other cloud computing providers, or at least make this much more complicated.

To be sure I understand your suggestion, the implied workflow would be:

  1. In a browser, the user logs in to the erddap server using any login service the server provides (e.g. ORCID).
  2. In a browser, the user visits a dedicated webpage where they can request a temporary password (maximum duration 24 hours).
  3. In a script, the user provides the temporary password to argopy (using our option mechanism or method arguments).
  4. In a script, the user sends an argopy data fetching request to the erddap server, with argopy sending the temporary password in the http request header.
  5. The erddap server checks the validity of the password and, whatever the login status of the user, goes on to process the request if the password is valid.

If it works like this, it means that from the erddap server's point of view, access to a dataset depends on either the logged-in user's credentials (when visiting the protected dataset webpage) or the validity of the password (when fetching the protected dataset in a downloadable format like json or netcdf).
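Step 5 of that workflow (short-lived passwords kept only in memory and checked on each request) could be sketched as follows. This is purely illustrative: ERDDAP has no such feature in this thread, and the class name `TemporaryPasswordStore`, its methods, and the 24-hour cap are all hypothetical choices made for the example.

```python
import secrets
import time


class TemporaryPasswordStore:
    """Hypothetical in-memory store of short-lived passwords (not an ERDDAP API)."""

    def __init__(self, max_lifetime_s=24 * 3600, clock=time.monotonic):
        self._max_lifetime_s = max_lifetime_s
        self._clock = clock          # injectable so expiry can be tested
        self._entries = {}           # password -> (user, expiry time)

    def issue(self, user, lifetime_s=None):
        """Create a random password for `user`, capped at the maximum lifetime."""
        if lifetime_s is None:
            lifetime_s = self._max_lifetime_s
        lifetime_s = min(lifetime_s, self._max_lifetime_s)
        password = secrets.token_urlsafe(32)
        self._entries[password] = (user, self._clock() + lifetime_s)
        return password

    def user_for(self, password):
        """Return the user owning a still-valid password, else None."""
        entry = self._entries.get(password)
        if entry is None:
            return None
        user, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[password]  # forget expired passwords
            return None
        return user
```

Because nothing is written to disk, every issued password is lost when the process restarts, which matches the "keep secret info in memory" preference expressed earlier in the thread.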

rmendels commented 1 year ago

@gmaze This is not something I know a lot about, but I am interested in looking into it. Can you tell me what you are using at present to handle the ORCID ID and authentication within the Python program?

ChrisPJohn commented 1 year ago

I need to read more about ORCID and exactly how ERDDAP handles it. That said, I do think the access to private datasets page is a useful resource here, mostly for its general approach of using curl (or some other strategy) to make requests to the ERDDAP server. The requests for ORCID will be different from that example (the example is for Google login). As mentioned on the access to private datasets page, a useful way to understand which requests are required for ORCID authentication is to monitor the network tab of the developer's console while going through the login flow on the web.

There is a potential feature request to better support scripting authenticated access. I need to investigate what that would entail and how complex those changes would be though.

rmendels commented 1 year ago

@ChrisPJohn @gmaze My experience with R suggests there is not a whole lot more that can be done in ERDDAP, though I may be wrong. The issue in a script is you need something that mimics logging into ORCID, storing the cookie, and then have a communication protocol that allows that cookie to be used in the request. R now has some packages that can do that (usually providing some way to mimic a login and a front-end to curl). I would imagine Python has that capability somewhere, I am just not certain which packages. ORCID I believe has an API that perhaps can be used for the first step (as well as a Python wrapper for that), would have to look up options on different Python libraries on how to include that cookie.
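As a rough illustration of the "store the cookie, then reuse it in the request" part in Python, here is a minimal sketch using only the standard library. It does not perform the ORCID login itself; it assumes the session cookie value was obtained some other way. The cookie name `JSESSIONID` is an assumption about the servlet session cookie, and the helper names, file path, and host are hypothetical.

```python
import http.cookiejar
import urllib.request


def save_session_cookie(path, value, domain, name="JSESSIONID"):
    """Write one session cookie to a Netscape-format cookie file."""
    cookie = http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=True, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )
    jar = http.cookiejar.MozillaCookieJar(path)
    jar.set_cookie(cookie)
    # Session cookies are "discard" cookies, so they must be kept explicitly.
    jar.save(ignore_discard=True)


def request_with_saved_cookie(path, url):
    """Build (but do not send) a request carrying any matching saved cookie."""
    jar = http.cookiejar.MozillaCookieJar(path)
    jar.load(ignore_discard=True)
    request = urllib.request.Request(url)
    jar.add_cookie_header(request)  # only adds cookies valid for this URL
    return request
```

The returned request can then be passed to `urllib.request.urlopen`. The cookie jar only attaches the cookie to requests for the matching host over https, which limits accidental leakage to other servers.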

rmendels commented 1 year ago

@ChrisPJohn @gmaze For example the following package should allow you to get the ORCID programmatically:

https://github.com/ORCID/python-orcid

Then, if any of the url packages like urllib allow the header to be set, include that in the header. But of course, since I haven't actually implemented it, those could be famous last words, and since I don't have an ORCID account I have no way of testing.

rmendels commented 1 year ago

@gmaze @ChrisPJohn see also:

https://orcid.github.io/orcid-api-tutorial/get/

gmaze commented 1 year ago

> @gmaze This is not something I know a lot about, but I am interested in looking into it. Can you tell me what you are using at present to handle the ORCID ID and authentication within the Python program?

At the present, we don't have any authentication mechanism in argopy, it's being discussed here: https://github.com/euroargodev/argopy/issues/243

gmaze commented 1 year ago

> There is a potential feature request to better support scripting authenticated access. I need to investigate what that would entail and how complex those changes would be though.

Surely, that would be great !

gmaze commented 1 year ago

> @ChrisPJohn @gmaze For example the following package should allow you to get the ORCID programmatically:
>
> https://github.com/ORCID/python-orcid

This package does not look supported anymore; for instance, it is not compatible with the latest ORCID api version (https://github.com/ORCID/python-orcid/issues/32). So I would not rely on it.

gmaze commented 1 year ago

> The issue in a script is you need something that mimics logging into ORCID, storing the cookie, and then have a communication protocol that allows that cookie to be used in the request.

Indeed, this looks like the key issue ! especially the 1st part (logging and storing cookie)...

Here is a small procedure that works on our test server and demonstrates how to do the 2nd part:

  1. Go to the erddap webpage and login with orcid
  2. Open the devtools and get the value of the cookie named JSESSIONID
  3. Now you can send a request to the erddap using this cookie:
    
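    import aiohttp
    import pandas as pd

    # Run in a Jupyter notebook, where top-level `await` is allowed.
    url = 'https://erddap-val.ifremer.fr/erddap/info/index.json'
    # Replace the placeholder with the JSESSIONID value copied from the devtools
    # (the value was left out of the original post).
    cookies = {'JSESSIONID': '<JSESSIONID value>'}
    async with aiohttp.ClientSession(cookies=cookies) as session:
        async with session.get(url) as resp:
            data = await resp.json()
    # ERDDAP returns {'table': {'columnNames': [...], 'rows': [...]}}
    df = pd.DataFrame(data['table']['rows'], columns=data['table']['columnNames'])
    df = df[['Accessible', 'Dataset ID', 'Title']]
    df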

Accessible | Dataset ID | Title
-- | -- | --
public | allDatasets | * The List of All Active Datasets in this ERDD...
yes | Argo-ref-ctd | CTD Reference Measurements
public | Argo-ref-ctd-public | CTD Reference Measurements

The request above will indeed return all the datasets on the server, including the protected one named "Argo-ref-ctd".
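For use outside a notebook, the same cookie-based request could be sketched with the standard library only. This is a minimal sketch, not argopy code: the helper names `build_request`, `table_to_records`, and `fetch_records` are hypothetical, and it assumes the response has the `table` / `columnNames` / `rows` layout used in the snippet above.

```python
import json
import urllib.request


def build_request(url, jsessionid):
    """Return a request carrying the browser session cookie (not sent yet)."""
    request = urllib.request.Request(url)
    request.add_header("Cookie", f"JSESSIONID={jsessionid}")
    return request


def table_to_records(payload):
    """Turn {'table': {'columnNames': [...], 'rows': [...]}} into a list of dicts."""
    table = payload["table"]
    return [dict(zip(table["columnNames"], row)) for row in table["rows"]]


def fetch_records(url, jsessionid):
    """Send the request and parse the JSON table (requires network access)."""
    with urllib.request.urlopen(build_request(url, jsessionid)) as response:
        return table_to_records(json.load(response))
```

Calling `fetch_records(url, value)` with a valid cookie value should then list the protected dataset alongside the public ones, as in the table above, for as long as the server-side session is alive.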

The same request with an empty cookie:
```python
import aiohttp
import pandas as pd

url = 'https://erddap-val.ifremer.fr/erddap/info/index.json'
cookies = {'JSESSIONID': None}
async with aiohttp.ClientSession(cookies=cookies) as session:
    async with session.get(url) as resp:
        data = await resp.json()
df = pd.DataFrame(data['table']['rows'], columns=data['table']['columnNames'])
df = df[['Accessible', 'Dataset ID', 'Title']]
df
```

Accessible | Dataset ID | Title
-- | -- | --
public | allDatasets | * The List of All Active Datasets in this ERDD...
public | Argo-ref-ctd-public | CTD Reference Measurements

rmendels commented 1 year ago

@gmaze Nice. Thanks for posting this.

gmaze commented 1 year ago

@rmendels is it ok if I put some of this content into a Discussion/Q&A post ? I now have also another code snippet to show how to retrieve protected data when the erddap server is using a simple login user/password protection (not OAUTH2 like above)

rmendels commented 1 year ago

@gmaze I'm not quite certain that I understand what you are asking, and I don't control the group either, but it would be great to get some of that content posted.

gmaze commented 1 year ago

I mean that I think these code examples are not the solution to this "issue" and are more "quick and dirty" solutions that could fit into a FAQ, that's why I'd like to cc them in here: https://github.com/ERDDAP/erddap/discussions/categories/q-a

BobSimons commented 1 year ago

I think that in general we are encouraging using GitHub for programmer-related discussions and issues (e.g., bugs, new features) and are encouraging using the ERDDAP Google Group for end-user-related discussions. Certainly, there are far more users in the ERDDAP Google Group than here. Since this information is useful for users, maybe the appropriate place to post it is in the Google Group.

ChrisJohnNOAA commented 1 year ago

I think having more documentation/information in the GitHub repo is a good thing. I'd be happy for you to post your code examples in the Q&A section. If you were to send a message to the erddap users group, you could link to that post.