boettiger-lab / earthdatalogin

Making public access of public data a bit easier
https://boettiger-lab.github.io/earthdatalogin/

Encourage users to use own credentials with `edl_netrc()`? #13

Open ateucher opened 2 months ago

ateucher commented 2 months ago

Even though I have a ~/.netrc file with my credentials, I only just realized (admittedly because I didn't read the docs) that calling edl_netrc() with no arguments wasn't actually using that - it was using the default credentials.

I'm wondering if there are some tweaks we could make to edl_netrc() to encourage users to set their own username/password in the .netrc file, and make it smoother to do so, while still falling back to using the default values?

A few ideas:

What do you think?

cboettig commented 2 months ago

This is a great question. The simplest way to override the defaults is to set the env vars EARTHDATA_USER and EARTHDATA_PASSWORD. https://github.com/boettiger-lab/earthdatalogin/blob/main/R/default_auth.R The nice thing about this choice is that it works across NASA's three different mechanisms for auth (basic auth with netrc, OAuth with bearer tokens, S3 auth requesting S3 tokens).
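That default-with-env-var-override pattern could be sketched in plain R like this (`edl_default_auth()` is a hypothetical stand-in for the package's internal lookup; the default values shown are fake):

```r
# Sketch of the env-var-with-fallback pattern described above.
# edl_default_auth() is hypothetical; the "shared-default-*" values
# are placeholders, not the package's real bundled credentials.
edl_default_auth <- function() {
  list(
    user     = Sys.getenv("EARTHDATA_USER", unset = "shared-default-user"),
    password = Sys.getenv("EARTHDATA_PASSWORD", unset = "shared-default-pass")
  )
}

# With no env vars set, the shared defaults are used.
# A user opts in to personal credentials just by setting env vars:
Sys.setenv(EARTHDATA_USER = "me", EARTHDATA_PASSWORD = "secret")
auth <- edl_default_auth()  # now returns the personal credentials
```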

edl_netrc() won't overwrite your ~/.netrc, though (if it is doing so, that's a bug!). CRAN forbids packages from writing directly to $HOME, so it writes to the package's tools::R_user_dir() data directory instead. Old versions of GDAL (< 3.7) don't support alternate netrc paths, so in that case it does temporarily link its own netrc into the home directory, but only if ~/.netrc doesn't already exist. It should never overwrite a user's ~/.netrc that was created some other way (obviously that would be bad).
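For reference, the package-managed location can be computed with base R; the netrc contents shown are illustrative placeholders:

```r
# Sketch: where a package-managed netrc would live. tools::R_user_dir()
# (R >= 4.0) gives the CRAN-sanctioned per-package data directory.
netrc_dir  <- tools::R_user_dir("earthdatalogin", which = "data")
netrc_path <- file.path(netrc_dir, ".netrc")

# A basic-auth netrc entry for Earthdata looks like this (fake creds):
netrc_lines <- c(
  "machine urs.earthdata.nasa.gov",
  "login my_username",
  "password my_password"
)
```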

I think setting credentials via env vars is generally the right way to go; that mechanism usually plays nicely with automation (our current GitHub CI issues notwithstanding :-) ). Though of course it's not ideal: secure credential management is a whole thing.

But maybe we could be more pushy about setting credentials on interactive use. I promise I don't have strong opinions about everything, but I'm not a superfan of NASA pushing credential-based access to public resources here. I understand they need an escape-hatch mechanism because AWS charges them for egress (in complete contrast to, say, NOAA's data on AWS, where presumably NOAA is not charged for egress and thus does not put this walled garden around it). But most users shouldn't need credentials to get started accessing public data. Telling users to create a username and password is a non-trivial source of friction for new users, and a needless security risk: there's some risk that users re-use passwords, and NASA has then taken on the obligation of storing them securely. Creating security risks and friction for new and educational use cases is not a great way for NASA to handle its need to avoid large egress charges. Does that make sense?

ateucher commented 2 months ago

Thanks Carl. I think you've homed in on it here:

> But maybe we could be more pushy about setting credentials on interactive use.

Probably the best thing is to more strongly encourage users (and provide instructions) to set those environment variables in their .Renviron. I am more worried about people hardcoding their credentials in a script in a call to edl_netrc() than I am about overusing the default credentials.
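For example, a `~/.Renviron` with placeholder values (if you use usethis, `usethis::edit_r_environ()` will open this file for you):

```
EARTHDATA_USER=my_username
EARTHDATA_PASSWORD=my_password
```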

Sorry, you're right: it's not overwriting my ~/.netrc, but it does overwrite the one in the user data dir. It would be nice if, having set that once, subsequent calls would use it, so that even without those env vars set it would still use my credentials in the netrc in tools::R_user_dir(). And/or if I have a ~/.netrc with Earthdata credentials, it would be nice if it used that. Though maybe that is too complicated and we should just (as you mention) really encourage the env-var route.

> But most users shouldn't need credentials to get started accessing public data. Telling users to create a username and password is a non-trivial source of friction for new users, and a needless security risk: there's some risk that users re-use passwords, and NASA has then taken on the obligation of storing them securely.

I agree with this 100%. I fought many battles in my old public service job trying to lower barriers to access government data.

> The nice thing about this choice is that it works across NASA's three different mechanisms for auth (basic auth with netrc, OAuth with bearer tokens, S3 auth requesting S3 tokens).

Can you point me to any resources to learn more about these mechanisms? Do any of them rely on the env vars or are they simply a mechanism to populate the .netrc?

cboettig commented 2 months ago

Right, encouraging people to use .Renviron instead of writing credentials into R code is definitely a good idea. One option is simply to remove the username and password arguments from edl_netrc() entirely, allowing the user to set these only with environment variables or stick with the defaults. That seems like an unusual pattern, but maybe it is justified here? (A weaker solution might be to make them 'hidden' args, like .password instead?)
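A sketch of that hidden-argument idea (hypothetical signature, not the package's current API):

```r
# Sketch of the 'hidden argument' pattern: dot-prefixed arguments still
# work but are visually de-emphasized, nudging users toward env vars.
# edl_netrc_sketch() is hypothetical, not the package's real function.
edl_netrc_sketch <- function(.username = Sys.getenv("EARTHDATA_USER"),
                             .password = Sys.getenv("EARTHDATA_PASSWORD")) {
  # The real function would write these to the netrc file; the sketch
  # just returns them invisibly.
  invisible(list(username = .username, password = .password))
}
```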

Other mechanisms are:

S3 vignette here: https://boettiger-lab.github.io/earthdatalogin/articles/non-egressed.html

They are kinda ignored in the documentation on purpose, though: the S3 tokens only work on AWS machines, while, even more weirdly, the bearer tokens fail on AWS machines but work on any other machine (perhaps that can be solved one day). netrc is the most 'portable' option; it's semi-ridiculous to me to have code that you cannot copy-paste and have work both on and off the openscapes hub. (err, rant here: https://boettiger-lab.github.io/earthdatalogin/articles/motivations.html)

ateucher commented 2 months ago

> One option is simply to remove the username and password arguments from edl_netrc() entirely, allowing the user to set these only with environment variables or stick with the defaults.

We could add one more option to this scenario: if interactive AND env vars are empty/not present, show a prompt to enter them?
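That fallback could be sketched like this (`prompt_for_user_sketch()` is hypothetical; a real implementation might use askpass so the password isn't echoed):

```r
# Sketch of the proposed prompt: only fire in interactive sessions when
# no env var is set, and never block scripted use. readline() keeps the
# sketch dependency-free; askpass::askpass() would be better for the
# password itself.
prompt_for_user_sketch <- function() {
  user <- Sys.getenv("EARTHDATA_USER")
  if (!nzchar(user) && interactive()) {
    user <- readline("Earthdata username (leave blank for defaults): ")
  }
  user  # may still be "": caller falls back to the packaged defaults
}
```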

Thanks for the info about authentication; it is becoming clearer! One question about S3, as I am also thinking about storage policies in the hub. If we start to push users towards storing data in S3 buckets rather than in the home directory, is there a benefit to using S3 when they are downloading data to their S3 bucket? I.e., is NASA S3 -> a user's hub S3 better than downloading data via normal HTTPS to their S3 bucket? Probably not cost-wise, at least, since egress via HTTPS is free? I don't know if this is a use case we need to spend much time on right now (especially as we are encouraging cloud-native range requests over downloading huge data files anyway), but I am curious.

cboettig commented 2 months ago

> We could add one more option to this scenario: if interactive AND env vars are empty/not present, show a prompt to enter them?

My difficulty with this design is that this is the default situation for most new users, meaning that users immediately hit the access wall and are pushed to create credentials.

I think a natural analogy here is remotes::install_github() (& pak, etc.) and other functions from RStudio that bundle a default GITHUB_TOKEN. iirc, some of these functions prompt users with a one-time message to use their own token, while others don't, but they never hit a prompt that prevents the command from running at all (and I'm guessing many users are happy to ignore the message). I think our use case is similar (rate limiting, plus a pre-packaged default credential), and that this is a well-tested paradigm already familiar to R users.

Re S3 -- such a good question. The S3 access protocol supports lots of different configurations, but NASA chooses a configuration that is almost uniquely restrictive. "Normally" one can access an S3 bucket from anywhere, but NASA configures things to reject any request coming from outside the AWS us-west-2 range of IP addresses. S3 also allows tokens to expire automatically after a set time period; NASA uses this feature to issue short-lived tokens that expire after 1 hour. (This isn't truly NASA's fault -- AWS limits a bucket to at most 5,000 tokens issued at any one time, iirc, so if the tokens didn't expire quickly they'd hit this limit!) Lastly, NASA makes the S3 tokens specific to the DAAC (I think this is also a technical constraint around expiring tokens per bucket, and not really NASA's fault).

These issues make these particular S3 tokens rather cumbersome, and they don't reflect the way our S3 bucket is set up on the openscapes hub, which I gather uses S3 in a more classical, lower-friction way. Another way to say this: S3 tokens are designed for private/authenticated data access and don't really scale to public access due to the 5K cap, and the work-around with expiring tokens seems to me more trouble than it is worth.

Performance-wise, S3 doesn't offer any better access than HTTPS (modulo a few redirects in the case of earthdata). Capability-wise, it can sometimes do things you can't do over HTTPS, like requesting a list of files, mirroring an entire folder, or checking md5sums, but only if those abilities are granted.

ateucher commented 2 months ago

> My difficulty with this design is that this is the default situation for most new users, meaning that users immediately hit the access wall and are pushed to create credentials.

Yeah, that is a good point. Maybe a good middle ground, then, is to alert users when the default username and password are used, and with that alert encourage them to set their own in the env vars?

Also perhaps, if they have called edl_netrc() with the defaults and no env vars are present, check for the presence of a netrc file with earthdatalogin credentials in either of the two locations, and use that?

My vision of the workflow for some users would be:

  1. Call edl_netrc(), setting username and password explicitly; this saves the .netrc file to the earthdatalogin user data dir.
  2. Subsequent calls without arguments (and no env vars set) use that file.

This would allow a user to set persistent credentials without touching env vars, which not everyone is familiar with.
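That lookup order could be sketched as a hypothetical helper (`resolve_user_sketch()` and the `"shared-default-user"` value are illustrative, not the package's real internals):

```r
# Sketch of the proposed lookup order:
# 1) explicit argument, 2) env var, 3) a previously saved netrc in the
# package's user data dir, 4) the shared packaged default.
resolve_user_sketch <- function(username = NULL,
                                netrc_path = file.path(
                                  tools::R_user_dir("earthdatalogin", "data"),
                                  ".netrc"
                                )) {
  if (!is.null(username)) return(username)
  env_user <- Sys.getenv("EARTHDATA_USER")
  if (nzchar(env_user)) return(env_user)
  if (file.exists(netrc_path)) {
    login <- grep("^login ", readLines(netrc_path), value = TRUE)
    if (length(login) > 0) return(sub("^login ", "", login[1]))
  }
  "shared-default-user"  # placeholder for the packaged default
}
```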

Thanks for the extra info about S3!