2i2c-org / features

Temporary location for feature requests sent to 2i2c

Allow JupyterHub admins different cloud permissions than standard users #9

Open yuvipanda opened 2 years ago

yuvipanda commented 2 years ago

Context

@rabernat brought up the point that it's important for hub admins to be able to create cloud buckets whenever they want, without having to rely entirely on 2i2c. This can be accomplished by giving hub admin accounts a different set of cloud credentials than regular users when they're logged in to the hub - that way, we can scope the credentials to just the extra permissions admins need (probably full GCS / S3 access) without having to give them full ownership of the cloud project.

Proposal

We already provide cloud credentials via Workload Identity on GCP and IRSA on AWS. Both work by mapping a Kubernetes service account to a GCP service account / AWS IAM role. We can give admins a different Kubernetes service account and thus grant it different rights.
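A minimal sketch of that idea, assuming KubeSpawner: regular users get the default Kubernetes service account, admins get a separate one that Workload Identity / IRSA maps to a more-privileged cloud identity. The service account names here ("user-sa", "admin-sa") are hypothetical, not what our config actually uses.

```python
# jupyterhub_config.py sketch: swap the Kubernetes service account for admins.
c.KubeSpawner.service_account = "user-sa"  # default for regular users

def use_admin_service_account(spawner):
    # spawner.user.admin is True for JupyterHub admin users
    if spawner.user.admin:
        spawner.service_account = "admin-sa"

c.KubeSpawner.pre_spawn_hook = use_admin_service_account
```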

Updates and actions

No response

rabernat commented 2 years ago

How would the UI side of this work? Would they just run aws s3 commands from the terminal? I rely heavily on the aws / gcp console for this currently.

rabernat commented 2 years ago

Credentials for cloud storage use the cloud provider's IAM system. In my ideal world, credentials for these buckets would be automatically populated based on hub identity. However, since hub identity is different from cloud-provider identity, that's not trivial to do, and would require some kind of database mapping hub users to projects and project storage buckets. The concept of "groups" in JupyterHub could be very helpful here. Developing a general solution to this problem as part of z2jh would have a huge impact.
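A speculative sketch of how the "groups" idea could follow the same pattern as the admin case: map JupyterHub group membership to per-group Kubernetes service accounts, which Workload Identity / IRSA then maps to per-group bucket permissions. Group and service account names below are made up for illustration.

```python
# jupyterhub_config.py sketch: pick a service account based on group membership.
GROUP_TO_SERVICE_ACCOUNT = {
    "project-a": "project-a-bucket-access",
    "project-b": "project-b-bucket-access",
}

def use_group_service_account(spawner):
    for group in spawner.user.groups:
        sa = GROUP_TO_SERVICE_ACCOUNT.get(group.name)
        if sa:
            spawner.service_account = sa
            return

c.KubeSpawner.pre_spawn_hook = use_group_service_account
```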

yuvipanda commented 2 years ago

There are two separate parts here:

  1. Different cloud credentials just for JupyterHub admins,
  2. Different cloud credentials per-group

(1) is easier to do than (2) now, since we already have code that has special overrides for hub admins (that's how we do the shared dir). I want to focus this issue on (1).

And yes, any AWS command / tool should 'just work' - aws on the terminal would work with all the permissions granted.

scottyhq commented 2 years ago

Just wanted to chime in here to say this would be really useful! I can think of a couple cases that (might?) be relatively straightforward to implement before tackling group-based permissions.

  1. an admin creates a bucket (with no lifecycle policy) that everyone automatically has read-only access to (similar to the current ~/shared folder)

  2. an admin modifies the base service account policy to add additional buckets everyone can access. For example, in AWS you have to explicitly list buckets that are in other accounts but are "requester pays". It seems many public datasets have the requester-pays configuration, and it would be nice to access those in addition to the scratch bucket (see the sketch after this list): https://registry.opendata.aws/usgs-landsat/
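For context, here is a sketch of what accessing a public requester-pays bucket such as usgs-landsat looks like from the hub: the caller has to opt in to paying the transfer costs, and the hub's service account policy needs s3:GetObject on that bucket. The object key below is hypothetical.

```python
# Read from a requester-pays bucket; credentials come from the pod's
# service account (IRSA), no keys needed in user code.
import boto3

s3 = boto3.client("s3")
obj = s3.get_object(
    Bucket="usgs-landsat",
    Key="collection02/catalog.json",   # hypothetical key, for illustration
    RequestPayer="requester",          # accept the data-transfer charges
)
print(obj["Body"].read()[:200])
```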

rabernat commented 2 years ago

As we begin the new semester, I am pinging this issue to remind the team that this is an extremely high-value feature that would really accelerate the use of data on our hubs.

yuvipanda commented 2 years ago

@rabernat ok, so to be more specific, we want to allow admins to create buckets, right? And implement that in a way that generalizes?

rabernat commented 2 years ago

Correct. This will empower the hub communities to manage their own cloud storage, rather than relying on 2i2c admins. Using object storage (rather than NFS mount) is key for more cloud-native-style workflows.

rabernat commented 1 year ago

I'm checking in on this issue. We continue to have requests from M2LInES and LEAP users to have a non-scratch bucket in which to store their data and share it with the hub team (but not the public).

yuvipanda commented 1 year ago

I've dealt with the specific issue here in https://github.com/2i2c-org/infrastructure/pull/1776 by making PERSISTENT_BUCKET a feature. That PR will enable it for LEAP and m2lines. How do we make sure that it doesn't balloon costs by users unexpectedly leaving data there?
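A usage sketch, assuming PERSISTENT_BUCKET is exported into user pods as something like `gs://<hub>-persistent` (the exact form depends on the PR): users write outputs under their own prefix and are responsible for cleaning up, which is where the cost concern above comes in.

```python
# Write an output file to the hub's persistent bucket via fsspec/gcsfs.
import os
import fsspec

dest = os.environ["PERSISTENT_BUCKET"] + "/my-username/results.csv"  # hypothetical path
with fsspec.open(dest, "w") as f:
    f.write("run_id,score\n1,0.93\n")
```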

jmunroe commented 1 year ago

I am caught between two ways of solving this issue of hub admins creating cloud storage on hubs.

1) It is "easy" to create buckets using the Google Console or the command line, assuming you have the right permissions. We could set it up with instructions to default to "requester pays" and then give guidance on how to set lifecycle rules (see the sketch after this list). That would solve the immediate problem of letting admins create whatever storage buckets they want. I think this would only be an option on a "dedicated" cluster where the community partner is paying (either directly or via 2i2c) the entire cloud costs. It would then be the community partner's responsibility to manage the costs and lifecycle rules associated with that cloud storage.

2) But I think that is not the "right" way to set it up (the way I would expect 2i2c cloud engineers to create and manage cloud storage on a hub). I assume we would modify the correct Terraform configuration files so that we are practicing infrastructure-as-code and other devops goodness. I see this as being especially important in cases where we are asked to migrate a hub to another availability zone, decommission a hub, or facilitate the right to replicate: if the entire infrastructure is not managed as code, we run the risk of "forgetting" some resource at some point down the road. Is there potential to automate this process with a UI so that hub admins could deploy cloud storage in a managed way?
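To make option 1 concrete, here is a sketch (not 2i2c's actual tooling) of what the manual route amounts to: create a bucket, enable requester pays, and attach a lifecycle rule. The project, bucket name, and retention period are made up.

```python
# Create a bucket, then enable requester pays and a 90-day deletion rule.
from google.cloud import storage

client = storage.Client(project="my-hub-project")
bucket = client.create_bucket("my-hub-extra-storage", location="us-central1")

bucket.requester_pays = True
bucket.add_lifecycle_delete_rule(age=90)  # delete objects older than 90 days
bucket.patch()
```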

Are cloud buckets something that needs to be created/destroyed frequently? What is the true "cost" to having 2i2c create this resource on behalf of users?

As a research user, waiting for "I.T." to deploy some resource like extra storage was frustrating when I knew it would be "easy" to do if I just had admin access to my own infrastructure. But thinking about it from a sustainability side, I am more hesitant to bypass any recommended cloud engineering best practices.

To be clear, it may be that for M2LInES and LEAP we just create the hubs for them so they can proceed with their work. My comments here are about the more general question of what 2i2c is providing in a "research hub" and how that should be represented on our product roadmap.

yuvipanda commented 5 months ago

This is currently being done in https://github.com/2i2c-org/infrastructure/pull/3932 for AWS