2i2c-org / infrastructure

Infrastructure for configuring and deploying our community JupyterHubs.
https://infrastructure.2i2c.org
BSD 3-Clause "New" or "Revised" License

Investigate conda-store to understand if / how it could be used in JupyterHub #786

Closed yuvipanda closed 2 years ago

yuvipanda commented 3 years ago

Description

For https://github.com/2i2c-org/meta/issues/252, https://github.com/2i2c-org/features/issues/3, https://github.com/2i2c-org/features/issues/6, and https://github.com/pangeo-data/jupyter-earth/issues/79, we want to spend some time investigating conda-store. We should have a deeper understanding of what it is and how it works, so we can figure out where (and if) it can be helpfully deployed for us. It's also important to see if we can contribute back upstream productively, as community ownership & governance is very important for long term sustainability of projects.

Yuvi will try to write a TLJH plugin to get conda-store to work with TLJH. The goal of this exercise is to learn more about conda-store and how it might be used in a JupyterHub. This helps him understand what it takes to deploy the simplest possible production JupyterHub with a working conda-store implementation, and might independently be useful for TLJH users as well.

Value / benefit

Getting conda-store to work with tljh helps me fill out https://hackmd.io/_hQdrilJQFCGExUqYgwY3Q?edit with a quick evaluation of conda-store, to help understand how it can fit our use cases. It also helps figure out how upstream collaboration would work, as we'll need to send patches upstream for sure - https://github.com/Quansight/conda-store/pull/196 is a start.

Ultimately, this will help us figure out how many resources we can spend on improving and using conda-store.

Tasks to complete

Updates

costrouc commented 3 years ago

@yuvipanda please let me know how I can help and feel free to open issues on conda-store regarding the integration. Also happy to meet over a call sometime if you would like.

dharhas commented 3 years ago

community ownership & governance is very important for long term sustainability of projects

This is missing from conda-store right now but is the direction we want to go.

costrouc commented 3 years ago

@yuvipanda I commented on the hackmd as well.

costrouc commented 3 years ago

Additionally @yuvipanda I've completed the systemd example of running conda-store, which might help you with the TLJH plugin. See https://github.com/Quansight/conda-store/tree/main/examples/ubuntu2004. It has the systemd configuration files and shows a minimal setup.

yuvipanda commented 3 years ago

Thanks a lot for the merge and the new release, @costrouc! I'll continue to open issues and PRs as I go along :)

And thanks for responding on the hackmd too! I'd love for you to transition to FastAPI (or another async framework) - I think moving from sync to async, especially in a process involving db transactions, becomes difficult once the codebase reaches critical mass. And currently, almost all the server-side software in this space (dask-gateway, jupyter, etc.) is async, so it would be great to follow suit - it helps with code re-use too wherever possible.

yuvipanda commented 3 years ago

So, the hub environment in TLJH uses a virtualenv, and conda isn't really installable in anything other than conda environments. The setup.py doesn't actually work in a fresh python environment due to https://github.com/conda/conda/issues/10691. I fell into that rabbit hole a tiny bit (see https://github.com/conda/conda/pull/11014) but have pulled back. I think conda-store can't run in the same environment as the hub as that is a virtualenv, but we can create a new conda env for conda-store to run out of instead.
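To make the virtualenv / conda env distinction above concrete, here is a minimal sketch of how a script could tell which kind of environment it is running in. It relies on two general facts: conda keeps package metadata in a `conda-meta` directory under the environment prefix, and venv / virtualenv make `sys.prefix` differ from `sys.base_prefix`. The function name `classify_prefix` is made up for illustration and is not part of TLJH or conda-store.

```python
import sys
from pathlib import Path


def classify_prefix(prefix: str, base_prefix: str) -> str:
    """Classify a Python environment as 'conda', 'virtualenv', or 'system'."""
    # conda environments contain a conda-meta directory with package records.
    if (Path(prefix) / "conda-meta").is_dir():
        return "conda"
    # Inside a venv/virtualenv, sys.prefix points at the environment while
    # sys.base_prefix still points at the interpreter it was created from.
    if prefix != base_prefix:
        return "virtualenv"
    return "system"


if __name__ == "__main__":
    print(classify_prefix(sys.prefix, sys.base_prefix))
```

Run from the TLJH hub environment, this would be expected to report `virtualenv`, which is why a separate conda env is needed for conda-store.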

damianavila commented 3 years ago

I think conda-store can't run in the same environment as the hub as that is a virtualenv, but we can create a new conda env for conda-store to run out of instead.

How about installing miniconda as part of the bootstrapping that TLJH already does? Then you can create the conda env and install conda-store in it (modulo conda-store not needing to run from the base environment, which would already be available after installing miniconda, so it should not be an issue, I think).

yuvipanda commented 3 years ago

@damianavila that's what I ended up doing https://github.com/yuvipanda/tljh-conda-store/blob/0ea1a4ad6018447f995deb02b571f740dffaee40/tljh_conda_store/__init__.py#L40
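For readers who don't want to follow the link, the general shape of "create a dedicated conda env and install conda-store into it" can be sketched as below. This is not the code from tljh-conda-store: the env path is hypothetical, and the package name `conda-store-server` on the conda-forge channel is an assumption. The sketch only builds the command lines; it does not run them.

```python
import shlex
from typing import List


def conda_store_env_commands(conda_exe: str, env_prefix: str) -> List[List[str]]:
    """Return the commands that would create a dedicated env for conda-store."""
    return [
        # Create a fresh env at an explicit prefix, separate from the hub's virtualenv.
        [conda_exe, "create", "--yes", "--prefix", env_prefix,
         "--channel", "conda-forge", "python"],
        # Install the server into that env (package name is an assumption).
        [conda_exe, "install", "--yes", "--prefix", env_prefix,
         "--channel", "conda-forge", "conda-store-server"],
    ]


if __name__ == "__main__":
    # "/opt/tljh/conda-store-env" is a hypothetical location.
    for cmd in conda_store_env_commands("mamba", "/opt/tljh/conda-store-env"):
        print(shlex.join(cmd))
```

Each command could then be passed to `subprocess.check_call` by a plugin's install hook.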

damianavila commented 3 years ago

So, mamba is there, nice! Btw, the linked code makes sense to me.

yuvipanda commented 2 years ago

Deeply tied into conda

As is clear from the name, conda-store is deeply tied into conda - it's even part of all the database schemas. This is its core value proposition - being tied into conda means it can provide perfectly reproducible environments by taking advantage of how conda works (deeply inspired by the Nix ecosystem, of course). However, this is also a disadvantage - as far as I can tell, you can't really step outside of the conda ecosystem. Based on my experience helping set up environments in educational and some research spaces, this has two main issues:

But maybe I'm looking at this the wrong way totally, and an analogy to repo2docker (which builds docker images, rather than environments) isn't quite right. But given that is my baseline, I think the lack of support for things that aren't just in conda is a serious limitation. This isn't a dig at the folks doing wonderful work in the R ecosystem for conda, but a reflection of current observed preferences of R users.

Users can make as many conda envs as they want!

conda-store ships with its own concepts of namespaces (users), and environments stored in a db. So individual users can create environments, and provide appropriate permissions to other users on the hub to use them. This is very helpful when you have one big hub that is used by a lot of fairly advanced users doing their own thing - as you might in an enterprise organization. This is my favorite feature, but it also scares me - it's versioning and storing environment definitions in a database, where I'd prefer to keep that in something like git. Either way, this is the exciting part of conda-store, and something I hope can be replicated in other tools that are more generic.

Many moving pieces

I tried to make the simplest possible setup for TLJH, but for production use elsewhere I think the following separate processes will need to run:

  1. The conda-store API server
  2. A PostgreSQL database
  3. conda-store workers (celery workers) for actually building the environments
  4. A message queue (like Redis) for communication between the API server and the celery workers
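The division of labour in the list above (an API side that accepts build requests, a queue, and workers that do the builds) can be illustrated with a small in-process stand-in. This is only a sketch of the pattern: `queue.Queue` plays the role of the message queue and a thread plays the role of a worker. It does not use celery, Redis, or any conda-store code, and `run_builds` is a made-up name.

```python
import queue
import threading
from typing import Dict, Iterable


def run_builds(specs: Iterable[str]) -> Dict[str, str]:
    """Push build requests onto a queue and let a worker thread process them."""
    jobs: "queue.Queue" = queue.Queue()
    results: Dict[str, str] = {}

    def worker() -> None:
        while True:
            spec = jobs.get()
            if spec is None:  # sentinel: no more work
                break
            # A real worker would build a conda environment here.
            results[spec] = "built"

    thread = threading.Thread(target=worker)
    thread.start()
    for spec in specs:  # the "API server" side only enqueues requests
        jobs.put(spec)
    jobs.put(None)
    thread.join()
    return results


if __name__ == "__main__":
    print(run_builds(["env-a", "env-b"]))
```

In a real deployment each of these roles is a separate process, which is where the operational overhead comes from.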

I've run celery based setups with message queues in the past, and they do work great. But I try to avoid running them wherever possible :D In binderhub, we simply use kubernetes directly with an idempotency property rather than celery for similar effect. I am always just a little bit worried about the extra complexity a messaging system brings...
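One way to read the idempotency idea: derive the name of the build from its inputs, so that asking for the same build twice finds the existing one instead of starting a second. The sketch below uses an in-memory dict as a stand-in for a cluster. It is not BinderHub's code and does not call the Kubernetes API; `build_name` and `FakeCluster` are hypothetical names.

```python
import hashlib
import json
from typing import Dict, Tuple


def build_name(spec: dict) -> str:
    """Derive a deterministic name from the build inputs."""
    canonical = json.dumps(spec, sort_keys=True).encode()
    return "build-" + hashlib.sha256(canonical).hexdigest()[:12]


class FakeCluster:
    """In-memory stand-in for a cluster that holds named build jobs."""

    def __init__(self) -> None:
        self.pods: Dict[str, dict] = {}

    def ensure_build(self, spec: dict) -> Tuple[str, bool]:
        """Create the build if missing. Returns (name, created)."""
        name = build_name(spec)
        if name in self.pods:
            # Same inputs were already submitted: reuse, do not start again.
            return name, False
        self.pods[name] = spec
        return name, True


if __name__ == "__main__":
    cluster = FakeCluster()
    print(cluster.ensure_build({"repo": "example/repo", "ref": "abc123"}))
    print(cluster.ensure_build({"ref": "abc123", "repo": "example/repo"}))
```

Because the name is a pure function of the inputs, retries and duplicate requests are harmless, which covers much of what a queue with deduplication would otherwise provide.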

Next steps

I think the next step is to find a community that has pre-existing users who are interested in creating a lot of varied custom conda environments to share amongst themselves, but don't use R, and then possibly try rolling this out. I don't think any of our current communities fit the bill. So I think from the 2i2c side, now we just wait until someone else asks for this. Our current efforts are probably better put towards moving more folks into https://github.com/jupyterhub/repo2docker-action and perhaps https://github.com/yuvipanda/jupyterhub-configurator.

I'd also like to actually finish tljh-conda-store, so I can form more opinions from actually trying it out and using it.

choldgraf commented 2 years ago

Thanks for this update @yuvipanda - it sounds like next steps with conda-store need to wait for finding a community that has the right use-case for it. I'll close this one and we can open a new one to track new implementation when the time is right. If you'd rather keep this one open feel free to re-open!

damianavila commented 2 years ago

@yuvipanda, first of all, thanks for the summary! I think the next steps section is reasonable.

Our current efforts are probably better put towards moving more folks into https://github.com/jupyterhub/repo2docker-action and perhaps https://github.com/yuvipanda/jupyterhub-configurator.

Makes sense to me. Although I would love to read more about your experiences with tljh-conda-store in the future!

In binderhub, we simply use kubernetes directly with an idempotency property rather than celery for similar effect.

Btw, do you have a link that I can take a look at about this piece? 😉