danielfrg / s3contents

Jupyter Notebooks in S3 - Jupyter Contents Manager implementation
Apache License 2.0

Do not expose S3 credentials to JupyterHub users #47

Open martinzugnoni opened 6 years ago

martinzugnoni commented 6 years ago

While configuring s3contents, we need to provide credentials to connect to the S3 service. We can do this either by writing them directly into the ~/.jupyter/jupyter_notebook_config.py config file or by reading them from env variables.
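For reference, a typical configuration (along the lines of the s3contents README; bucket name is a placeholder) looks like this, and either way the secrets end up in plain sight:

```python
# ~/.jupyter/jupyter_notebook_config.py: typical s3contents setup.
# Whether hardcoded or pulled from env vars, the secrets are readable
# by anyone who can open this file or print the environment.
import os
from s3contents import S3ContentsManager

c = get_config()  # provided by Jupyter when it loads this file
c.NotebookApp.contents_manager_class = S3ContentsManager
c.S3ContentsManager.bucket = "my-notebooks-bucket"  # placeholder
c.S3ContentsManager.access_key_id = os.environ["AWS_ACCESS_KEY_ID"]
c.S3ContentsManager.secret_access_key = os.environ["AWS_SECRET_ACCESS_KEY"]
```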

In either case, any logged-in user can read the config file or inspect the exposed env variables from a terminal session.

I can't find a way to securely connect to S3 without exposing the credentials to the JupyterHub users.

I'm using the dockerspawner, with the official jupyterhub/singleuser image.

Any suggestions?

Thanks.

danielfrg commented 6 years ago

Definitely an issue. I'm not sure how to solve this on this side; it might be something you have to ask on the Jupyter side, i.e. whether there is a way to disable logging of specific variables in the config.

martinzugnoni commented 6 years ago

I'm thinking about using a proxy to connect to S3 and providing unique access tokens for each Hub user. The idea is that the proxy evaluates the token, the logged-in user, and the action they're trying to perform, and determines whether the action is valid.

It's important to mention that I'm using a user-based prefix strategy, where each user has their own namespace within the S3 bucket. Each user should only have permission to read/write their own namespace in the bucket (see the policy sketch at the end of this comment).

It might work, but it requires some implementation on top of your s3contents app. ¯\_(ツ)_/¯
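For reference, the restriction itself is simple to state. A sketch of the per-user policy that the enforcement layer (proxy or IAM) would need to implement, with hypothetical bucket and prefix names:

```python
# Sketch of a prefix-scoped, per-user S3 policy document (the bucket
# and prefix layout are hypothetical): list only under the user's
# prefix, and read/write/delete only objects under it.
def user_policy(bucket: str, username: str) -> dict:
    prefix = f"users/{username}/"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}},
            },
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
        ],
    }
```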

danielfrg commented 6 years ago

That makes sense.

You can take a look at this issue, which faces a similar problem: https://github.com/danielfrg/s3contents/issues/45. It doesn't solve the credentials issue, though.

One solution might be to pass the credentials from a JupyterHub setting that the users have to input.

martinzugnoni commented 6 years ago

I wrote that issue. 😂

danielfrg commented 6 years ago

ROFL :)

rgbkrk commented 6 years ago

There is always the option of setting up an IAM role for the host. That does mean users have access from a server standpoint (they can fetch whatever is allowed by that role), but at least it doesn't expose the tokens directly in the config.

martinzugnoni commented 6 years ago

Yes, we thought about IAM roles as well. But, as you say, the correct way would be to have a different AWS user for each user in your system, which is not possible if you have a ton of users. That's why we discarded that option.

milutz commented 5 years ago

@martinzugnoni Is this issue still a problem for you? I have a few different workarounds that I could type up if they would help you.

I have one block of JupyterHub config that drives AWS to dynamically create a new IAM user for each user on JupyterHub. It creates them at first login if they don't already exist; either way, it pulls fresh IAM user keys at each login (which makes it really hard to use the keys for anything else). It should work for up to 4999 users, since one IAM user is consumed by JupyterHub itself to do the work.
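Roughly, that block boils down to something like this (a sketch assuming boto3; names are hypothetical, and the real config would run it from a JupyterHub hook and inject the returned keys into the spawner environment):

```python
# Sketch: one IAM user per Hub user, fresh access keys every login.
import boto3

iam = boto3.client("iam")

def keys_for_hub_user(username: str) -> dict:
    iam_name = f"jupyterhub-{username}"  # hypothetical naming scheme
    try:
        iam.create_user(UserName=iam_name)
        # Here you would also attach a prefix-scoped policy, e.g. via
        # iam.put_user_policy() with a document like the sketch above.
    except iam.exceptions.EntityAlreadyExistsException:
        pass  # created at an earlier login
    # Pull fresh keys at every login. IAM allows at most two access
    # keys per user, so delete the old ones first; this is what makes
    # the keys hard to use for anything else.
    for key in iam.list_access_keys(UserName=iam_name)["AccessKeyMetadata"]:
        iam.delete_access_key(UserName=iam_name,
                              AccessKeyId=key["AccessKeyId"])
    created = iam.create_access_key(UserName=iam_name)["AccessKey"]
    return {
        "AWS_ACCESS_KEY_ID": created["AccessKeyId"],
        "AWS_SECRET_ACCESS_KEY": created["SecretAccessKey"],
    }
```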

I also have a block that starts out the same but then generates temporary keys off of the IAM user keys, so the keys expire and become harmless, in exchange for a maximum time the user's session can run. Depending on the style chosen, the max time on the temp keys is 12 or 36 hours (min time 10 minutes).
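The temporary-key variant adds one STS call on top of the per-user keys. A sketch, again assuming boto3 (36 hours is GetSessionToken's cap for an IAM user, while a role-assumption variant caps at 12, hence the "12 or 36" above):

```python
# Sketch: trade long-lived per-user IAM keys for temporary ones that
# expire on their own.
import boto3

def temporary_keys(access_key_id: str, secret_access_key: str,
                   hours: int = 12) -> dict:
    # Call STS *as the per-user IAM user*, so the temporary credentials
    # inherit that user's (prefix-scoped) permissions.
    sts = boto3.client(
        "sts",
        aws_access_key_id=access_key_id,
        aws_secret_access_key=secret_access_key,
    )
    creds = sts.get_session_token(DurationSeconds=hours * 3600)["Credentials"]
    return {
        "AWS_ACCESS_KEY_ID": creds["AccessKeyId"],
        "AWS_SECRET_ACCESS_KEY": creds["SecretAccessKey"],
        # Temporary credentials only work together with a session token.
        "AWS_SESSION_TOKEN": creds["SessionToken"],
    }
```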

The above could also be switched to federated users, which allows a basically unlimited number of users but forces a 12-hour max on the keys.

Assuming you still have an issue, let me know which constraint is more important to you and I'll see if I can put together some sample code.

guimou commented 5 years ago

@martinzugnoni We have a different approach that may be useful for you or other people dealing with this problem. Our installation is entirely based on OpenShift, but it should work in any Kubernetes environment at least, and probably in other environments as well.

We have a central data lake based on Ceph. Each user has their own S3 credentials and a set of buckets they have access to (instead of one bucket with prefixes). We store all user information (credentials) in a HashiCorp Vault instance.

Users authenticate to JupyterHub through OAuth using Keycloak. We store the access information, basically the JWT access_token and refresh_token (encrypted), in the JupyterHub database, and the tokens are refreshed periodically.

When a user launches a notebook, we use the pre_spawn_start function from JupyterHub to connect to Vault using the access_token and retrieve the user's aws_key and secret. A dynamic policy in Vault, attached to the path where we store secrets (/some_path/user_id/secret), allows each user to retrieve their own secrets and only those; this is why we need a valid access token from Keycloak, to enforce that policy.
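A rough sketch of what such a hook can look like, assuming the hvac client and Vault's JWT auth method; the Vault URL, role, secret path, and field names are hypothetical stand-ins for the real config:

```python
# Sketch of the pre_spawn_start idea, as a method on a custom
# Authenticator subclass (assumes hvac and JupyterHub auth_state
# enabled so the Keycloak access_token is available).
import hvac

async def pre_spawn_start(self, user, spawner):
    auth_state = await user.get_auth_state()

    # Vault validates the Keycloak JWT; the attached policy only
    # grants read access to this user's own secret path.
    client = hvac.Client(url="https://vault.example.com")
    client.auth.jwt.jwt_login(role="notebook", jwt=auth_state["access_token"])

    secret = client.secrets.kv.v2.read_secret_version(
        path=f"users/{user.name}/s3"
    )["data"]["data"]

    # Inject the user's own credentials into their single-user server.
    spawner.environment["AWS_ACCESS_KEY_ID"] = secret["aws_key"]
    spawner.environment["AWS_SECRET_ACCESS_KEY"] = secret["aws_secret"]
```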

Then we simply inject the secrets as env vars in the notebook and use S3Contents to connect. It's no problem if the user sees them, since they are their own! In fact it's a bit more convoluted than that: we use HybridContentsManager to also connect the local filesystem, and to mount each of the accessible buckets at a different path. The notebook code is here if you want to have a look: https://github.com/guimou/jupyter-notebooks-s3/blob/0e0979ae3fdb303e84a58ac85b6ec99357457ee6/minimal-notebook/jupyter_notebook_config.py#L28. In the coming days I will release a clean version, along with the JupyterHub, Keycloak, and Vault configurations and an article explaining everything in detail. I'll update this comment when it's ready.
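For illustration, the shape of that config (a sketch assuming the hybridcontents package; bucket names are placeholders, and the linked repo has the real version):

```python
# jupyter_notebook_config.py sketch: local files at the root, each
# accessible bucket mounted at its own path. Credentials come from
# the env vars injected at spawn time, so they are the user's own.
import os
from hybridcontents import HybridContentsManager
from notebook.services.contents.largefilemanager import LargeFileManager
from s3contents import S3ContentsManager

c = get_config()
c.NotebookApp.contents_manager_class = HybridContentsManager
c.HybridContentsManager.manager_classes = {
    "": LargeFileManager,            # local filesystem at the root
    "my-bucket": S3ContentsManager,  # one entry per accessible bucket
}
c.HybridContentsManager.manager_kwargs = {
    "": {"root_dir": os.path.expanduser("~")},
    "my-bucket": {
        "bucket": "my-bucket",
        "access_key_id": os.environ["AWS_ACCESS_KEY_ID"],
        "secret_access_key": os.environ["AWS_SECRET_ACCESS_KEY"],
    },
}
```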

guimou commented 5 years ago

@martinzugnoni Here is the article I published with more details on our implementation. The repos with everything are now up: one for JupyterHub and one for the notebooks. A huge thanks to @danielfrg for his work; if we ever have the chance to meet, the beer's on me!

chenglinzhang commented 4 years ago

Thanks @guimou for the Medium article. A question if you don't mind: in the "Directories mapping" section of jupyter_notebook_config.py, how do you specify the certificate file location so that S3ContentsManager can access S3 over HTTPS? I looked through the properties of c.S3ContentsManager and c.NotebookApp and don't have a clue.

nlhnt commented 1 year ago

Is this still an issue? I've tried looking for jupyter_notebook_config.py on my bare-metal (two-node) k3s cluster from the view of a regular user, and I can't seem to find it in the ~/.jupyter dir.
Is this only an issue for dockerspawner? Does it apply to Kubespawner?