2i2c-org / docs

Documentation for 2i2c community JupyterHubs.
https://docs.2i2c.org

How to work with virtual environments (installing different kernels) [Openscapes] #81

Closed: betolink closed this issue 3 years ago

betolink commented 3 years ago

Is there a way to have multiple kernels in my session if I'm not a hub administrator? I tried to install some packages from the terminal, and it seems I have no sudo either. Is there a particular reason why?

betolink commented 3 years ago

Looks like this works in user space so no sudo required.

conda activate {environment}
ipython kernel install --name "{environment}" 

However, this doesn't seem to be persistent: when my instance was restarted, the kernel was gone.
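
A minimal sketch of the same idea with an explicit `--user` flag, which asks ipykernel to write the kernel spec under the home directory rather than a system location. It assumes `ipykernel` is installed in the target environment, that Jupyter uses its default Linux data directory, and that the environment name is lowercase; the function names and the `my-env` name are hypothetical. The kernel spec only points at the environment's Python, so the environment itself also has to live somewhere that survives a restart.

```shell
# Hypothetical helper: where a per-user kernel spec lands with default
# Jupyter data paths on Linux (assumption: no JUPYTER_DATA_DIR override).
user_kernel_dir() {
  printf '%s\n' "${HOME}/.local/share/jupyter/kernels/$1"
}

# Hypothetical helper: register an existing conda environment as a
# per-user Jupyter kernel. Assumes ipykernel is installed in that env.
register_user_kernel() {
  local env_name="$1"
  conda run -n "$env_name" python -m ipykernel install \
    --user --name "$env_name" --display-name "Python ($env_name)"
  echo "Kernel spec expected at: $(user_kernel_dir "$env_name")"
}
```

Running `register_user_kernel my-env` and then `jupyter kernelspec list` should show whether the kernel was registered where expected.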

consideRatio commented 3 years ago

I think that by adding this section to the Dockerfile, conda will default to creating environments in the home folder, which is persistent.

# Configure conda/mamba to create new environments within the home folder by
# default. This allows the environments to remain in between restarts of the
# container if only the home folder is persisted.
RUN conda config --system --prepend envs_dirs '~/.conda/envs'

This example is from the hub.jupytearth.org Dockerfile for the user environment image.
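
For someone without control over the image, a user-level variant may be worth trying: without `--system`, `conda config` writes to `~/.condarc`, which lives in the persisted home folder. A minimal sketch, assuming `conda` is on the PATH; the function name is hypothetical and the default directory mirrors the Dockerfile snippet.

```shell
# Hypothetical helper: make conda prefer an environments directory inside
# the (persisted) home folder, for the current user only.
persist_conda_envs() {
  local envs_dir="${1:-$HOME/.conda/envs}"
  mkdir -p "$envs_dir"
  # No --system flag, so the setting is stored in the user's ~/.condarc.
  conda config --prepend envs_dirs "$envs_dir"
  # Print the resulting search order so the change can be checked.
  conda config --show envs_dirs
}
```

After calling `persist_conda_envs`, newly created named environments should land under `~/.conda/envs`, provided that directory ends up first in the printed list.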

@2i2c-org/2i2c-team is this perhaps sensible to add for the pilot hubs default Dockerfile?

damianavila commented 3 years ago

@2i2c-org/2i2c-team is this perhaps sensible to add for the pilot hubs default Dockerfile?

IMHO, that is something users should decide. There are several reproducibility/replicability workflows where starting from a fresh environment that is "codified" in an image or Dockerfile helps others reproduce what you did. In fact, I would be surprised to see environments persisted by default after restarting my pod :wink:

betolink commented 3 years ago

I think some Pangeo deployments let you pick the user base image? (with Openscapes we only pick the EC2 instance type). Maybe something like that would be useful. A hub administrator could add different repos for different environments.

At the moment Openscapes is mainly working on https://github.com/NASA-Openscapes/earthdata-cloud-cookbook, which requires some initial prototyping. Environment persistence between restarts would be handy to have until we are in "production mode".

damianavila commented 3 years ago

Your Hub administrator should be able to set up a customized environment: https://pilot.2i2c.org/en/latest/admin/howto/environment.html. If the environment persistence is useful/needed in your use case, a custom Dockerfile adding the lines @consideRatio suggested should be enough to support it, IMHO. Btw, we are in fact testing some new tooling to allow admins to self-serve the creation of the environment they are going to put in front of the users so they do not need to build it by themselves, just configure it.

betolink commented 3 years ago

I guess I need to find out who is our hub admin and see if we can get the Dockerfile approach + persistence. One thing I noticed from the documentation is that you discourage the use of quay.io/my-user/my-image:latest and for prototyping I was precisely thinking about having something like that (so if I modify the environment a hub admin doesn't have to update the build tag).

damianavila commented 3 years ago

One thing I noticed from the documentation is that you discourage the use of quay.io/my-user/my-image:latest

Yes, having specific references (tags) is important to really know which environment you are working with.

and for prototyping I was precisely thinking about having something like that (so if I modify the environment a hub admin doesn't have to update the build tag).

As I said before, we are currently testing some new tooling to prototype/test and eventually self-serve the environment customization. Currently, the process looks like this: https://github.com/2i2c-org/peddie-image

Would you be interested in having something like this for Openscapes?

betolink commented 3 years ago

I just read about this tooling, and it looks like step 4 is what I wanted to avoid, since it requires a hub admin:

Open the Configurator for the peddie hub (you need to be logged in as an admin).

The important part would be to have an agile way of altering the environment while we are prototyping. I think just persisting my home directory as @consideRatio suggested would be enough for now.

damianavila commented 3 years ago

@betolink, FYI, we are discussing the pros vs cons of shipping this by default. In the meantime, I encourage you to ping your hub admin so they can customize the image with the snippet @consideRatio shared above. In that way, we decouple the current technical discussion about this change from the customization you may need (that could be done by your hub admin without us being a blocker for your use case).

betolink commented 3 years ago

@damianavila, I just got admin credentials this morning and went to the "configurator" page. I see a box to enter a Docker image name for the users and the default interface (RStudio, Lab or classic notebooks), but I don't see which image the users are running now. I don't want to disrupt what other users are doing by just entering my customized image. Is there a way to find out which image users are running now, so that I can at least clone those dependencies and add the edit to persist the environment?

sgibson91 commented 3 years ago

Hi @betolink - you can see the image reference in these lines of the config file

https://github.com/2i2c-org/pilot-hubs/blob/a6f2e354399cc08275c16f49b0d92f75e11e6030/config/hubs/openscapes.cluster.yaml#L63-L65

betolink commented 3 years ago

Thanks @sgibson91! is 783616723547.dkr.ecr.us-west-2.amazonaws.com/user-image coming from https://github.com/2i2c-org/openscapes-image/? Oh I have so many questions and I don't want to spam you all.

I guess I could open an issue on the openscapes image repository to add what @consideRatio suggested. I assume there is a good reason why the image is being pushed to AWS ECR instead of Dockerhub.

sgibson91 commented 3 years ago

Thanks @sgibson91! is 783616723547.dkr.ecr.us-west-2.amazonaws.com/user-image coming from https://github.com/2i2c-org/openscapes-image/? Oh I have so many questions and I don't want to spam you all.

Yes, it does look like that repository is the source of the image.

I assume there is a good reason why the image is being pushed to AWS ECR instead of Dockerhub.

I am not sure, actually. Our default image registry is quay.io, as it doesn't have the same rate-limiting issues as DockerHub.

betolink commented 3 years ago

One last thing (perhaps): I noticed a substantial performance hit when I installed a conda environment in my home directory. My guess is that this may be related to the home directory being mounted on EFS?

How to reproduce?

mamba env create -f environment.yml 

vs

mamba env create -f environment.yml -p /home/jovyan/{environment}
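
To put a number on the difference, the two commands can be timed. A rough sketch, assuming `mamba` is on the PATH; the function name and the prefixes in the usage note are hypothetical.

```shell
# Hypothetical helper: create an environment at a given prefix and report
# the elapsed wall-clock time in whole seconds.
time_env_create() {
  local spec="$1" prefix="$2" start end
  start="$(date +%s)"
  mamba env create -f "$spec" -p "$prefix"
  end="$(date +%s)"
  echo "${prefix}: $((end - start))s"
}
```

For example, `time_env_create environment.yml /tmp/envs/demo` could be compared with `time_env_create environment.yml "$HOME/envs/demo"`, on the assumption that `/tmp` sits on container-local disk while the home directory is network-mounted.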

choldgraf commented 3 years ago

Hey all - just wanted to boost this comment as well, which might be an interesting option for managing different conda environments from within Jupyter: https://github.com/2i2c-org/pilot-hubs/issues/562#issuecomment-891740990

betolink commented 3 years ago

nb-conda-kernels sounds like a good option. We would still need some form of persistence, right? Otherwise we'll have to install an environment every time we start our instance. I wonder, is there a way for Jupyter hubs to configure base images per user and not hub-wide? A bit like Binder + user-space persistence?
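
For context, a rough sketch of how nb_conda_kernels is usually wired up: the extension goes into the environment that runs Jupyter, and each environment that should appear as a kernel needs `ipykernel`. It assumes `mamba` is on the PATH and that both packages are available from the configured channels; the function and environment names are hypothetical. On a hub, the first step probably belongs in the image, since changes outside the home directory are lost on restart.

```shell
# Hypothetical helper: let Jupyter (running in env $1) discover the conda
# environment $2 as a kernel via nb_conda_kernels.
expose_env_as_kernel() {
  local jupyter_env="$1" target_env="$2"
  # The extension lives alongside Jupyter itself.
  mamba install --yes -n "$jupyter_env" nb_conda_kernels
  # Environments are only picked up if they contain a kernel package.
  mamba install --yes -n "$target_env" ipykernel
}
```

A call such as `expose_env_as_kernel base my-env` would then be followed by restarting Jupyter so the new kernel list is read.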

damianavila commented 3 years ago

My guess is that this may be related to the home directory being mounted on EFS?

Most likely that is the case; EFS is slow for this kind of conda operation. So you get persistence at the cost of performance...

I wonder, is there a way for Jupyter hubs to configure base images per user and not hub-wide?

Not a per-user option, but maybe using different profiles pointing to different images that you, as a specialized user, can customize?

https://zero-to-jupyterhub.readthedocs.io/en/latest/jupyterhub/customizing/user-environment.html#using-multiple-profiles-to-let-users-select-their-environment

I imagine your use case like this:

One X profile in addition to the base one. That X profile loads a Docker image whose Dockerfile creates the environments you need (and maybe installs nb_conda_kernels to manage them). That Dockerfile could also contain the customizations Erik proposed, so your conda envs are saved in /home (and persisted). Users who want that experience would select the X profile and get all the environments predefined in the Dockerfile, plus any new ones they create "live", which are persisted in /home.

If a user modifies one of the environments "coming" from the Dockerfile, they can "promote" the customization by modifying the Dockerfile, pushing it, and using the Configurator to update the reference. (You could even use a latest reference, in which case the Configurator step would not be needed, although using latest is not recommended unless you have a really good reason 😜.)

If a user works with one of their /home-backed environments, it is automatically persistent (at the cost of EFS slowness), and it can be "promoted" to the Dockerfile once the user is happy enough with it.

sgibson91 commented 3 years ago

I think this is another use case where bringing the JupyterHub and BinderHub helm charts closer together will provide a solution, as we will be able to provide workflows closer to what the persistent BinderHub helm chart does (https://github.com/gesiscss/persistent_binderhub), i.e. a user can create an environment on the fly from a repo using repo2docker, and these environments are persisted.

betolink commented 3 years ago

I think having something like what you both described would simplify many workflows. A hub admin would be responsible for infrastructure (i.e. credentials, shared mounts, instance types). Researchers would build their environment from a GitHub repo (using repo2docker or similar) and select the instance type they want to run this environment on. I think just having the flexibility to bootstrap an environment like Binder does will reduce the need for persisting changes to the base image, since we can make those changes in the original repository, and persistence would presumably be used only for work in progress or sample data, not whole conda environments.

jules32 commented 3 years ago

Hi 2i2c team, thanks for all the discussion here and in https://github.com/2i2c-org/pilot-hubs/issues/562. Does this sound like something 2i2c can support? @betolink and @amfriesz can start coordinating/preparing stuff on our end but we wanted to first confirm if this is something you'll be moving forward with, and if you know a rough timeline. @choldgraf I'm happy to chat about it too if you'd like

betolink commented 3 years ago

I think this issue can be closed. We ended up managing it at the custom base image level. Another option for future deployments would be for the configuration to allow multiple user images (JupyterHub profiles).