dask / dask-cloudprovider

Cloud provider cluster managers for Dask. Supports AWS, Google Cloud, Azure, and more.
https://cloudprovider.dask.org
BSD 3-Clause "New" or "Revised" License

Add documentation for setting environment variables (e.g. for installing extra packages) #169

Open eric-czech opened 3 years ago

eric-czech commented 3 years ago

I could be wrong, but this seems like a fairly important detail in some deployments. Creating custom OS images with Packer is obviously better but that's a fairly high bar for a lot of users, probably. I found that setting environment variables is currently supported, with the following caveats:

  1. It cannot be configured in cloudprovider.yaml or via environment variables on the client (i.e. DASK_CLOUDPROVIDER__GCP__ENV_VARS is ignored).
  2. If passed directly in code, it must be quoted. For example, GCPCluster(..., env_vars=dict(EXTRA_CONDA_PACKAGES="numba xarray")) will not work because the render code (here) does not quote values. Since the docker run statement being built is already YAML-escaped with single quotes, the values must be double quoted, i.e. passed as EXTRA_CONDA_PACKAGES="\"numba xarray\"" and not EXTRA_CONDA_PACKAGES="'numba xarray'" (see the sketch below).
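
To make that second caveat concrete, here is a minimal sketch of the workaround, assuming the dask_cloudprovider.gcp.GCPCluster import path and the Dask Docker image's EXTRA_CONDA_PACKAGES hook:

from dask_cloudprovider.gcp import GCPCluster

# Broken today: the value is rendered unquoted into the single-quoted
# docker run command in the startup script, so "numba xarray" is split.
# cluster = GCPCluster(env_vars={"EXTRA_CONDA_PACKAGES": "numba xarray"})

# Workaround: wrap the value in escaped double quotes so it survives the
# YAML/shell escaping when the startup script is rendered.
cluster = GCPCluster(env_vars={"EXTRA_CONDA_PACKAGES": '"numba xarray"'})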

Documentation for new users on this would be very helpful. Allowing the env_vars to come from cloudprovider.yaml would also be nice.

jacobtomlinson commented 3 years ago

Thanks for raising this @eric-czech.

If passed directly in code, it must be quoted.

The quoting problem you describe is definitely a bug. It would probably be best to add quotes in the render code.

It cannot be configured in cloudprovider.yaml ...

Env vars are currently not a cloud-provider-specific option; they are a generic option, so they cannot be configured via the GCP config, for example. I feel like we have a few options here:

Typically I consider these environment variables a hack or workaround to allow folks to move fast and try things out. Given that these dependencies will be installed every time a worker is started, it is definitely not the most optimal way to install dependencies.

I would expect that folks using Dask Cloudprovider regularly will move to using custom Docker images which contain all of their dependencies.

A best practice setup for GCP would be a custom Docker image (probably stored in GCR) which is prebaked into a custom machine image with packer.
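
For illustration, a rough sketch of what that setup could look like, assuming GCPCluster's docker_image and source_image keyword arguments (all names below are placeholders):

from dask_cloudprovider.gcp import GCPCluster

# Custom Docker image hosted in GCR, prebaked into a machine image built
# with Packer so workers don't pull the image on boot. Placeholder names.
cluster = GCPCluster(
    docker_image="gcr.io/my-project/my-dask-image:latest",
    source_image="projects/my-project/global/images/my-packer-built-image",
    n_workers=2,
)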

but that's a fairly high bar for a lot of users, probably

From a technical point of view I had hoped that Packer would be pretty easy for folks to pick up. But I appreciate that it involves learning about concepts which are tangential to what you are actually trying to achieve, and is therefore a barrier in terms of time investment.

I do wonder if we should expose packer's functionality through Dask Cloudprovider. Perhaps add an optimize kwarg or something which would bake an image with packer the first time you use it, then subsequent cluster instantiations would make use of the cached image.
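
Purely as a sketch of the idea (the optimize kwarg does not exist today; everything here is hypothetical):

from dask_cloudprovider.gcp import GCPCluster

# Hypothetical API: the first call bakes a machine image with Packer and
# caches it; later instantiations reuse the cached image so clusters start
# faster. Nothing here is implemented today.
cluster = GCPCluster(docker_image="daskdev/dask:latest", optimize=True)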

Although to be clear packer is orthogonal to this conversation about dependencies. All packer gives us is faster cluster creation times.

eric-czech commented 3 years ago

Typically I consider these environment variables a hack or workaround to allow folks to move fast and try things out. Given that these dependencies will be installed every time a worker is started, it is definitely not the most optimal way to install dependencies.

👍 -- I had two thoughts on that:

  1. Other than trying out Cloud Provider in general, I think it's helpful as a complement to the fact that a good number of dependencies in dask are optional. Most probably don't want every possible dependency a workflow could need in a VM image (e.g. hdf5 + tiledb + zarr -- you might need 0 or all of these). I think there is a good bit of value in being able to say something like "Can I switch from arrays to data frames at this point in a workflow efficiently?", add the pyarrow dependency transiently, answer that question, and then come back and decide if you want to bake that into the VM (a much slower process). I've gone through that process many times myself in the last year or so in learning xarray/dask. I'm assuming with little evidence that it's a common experience.
  2. A consensus about our target users (biologists) that has arisen from our developer group is that even Docker alone has too much of a learning curve to be something we expect them to be able to use, so Packer is also a stretch. Hypothetically, we could try to support whatever VMs they'd need on all public cloud providers but there is always the chance they want to add a dependency to those cluster images we didn't think of originally, and they're unlikely to learn the process for doing that thoroughly.

Overall, I 100% agree that Packer is a better solution but I know from doing very similar things with Docker alone that those quick hooks to try new software are important since the iteration loop for the environment is a slow one.

That said though, maybe there is a better way to solve the dependency problem? A --preload module/script for custom initialization perhaps?

jacobtomlinson commented 3 years ago

Overall, I 100% agree that Packer is a better solution but I know from doing very similar things with Docker alone that those quick hooks to try new software are important since the iteration loop for the environment is a slow one.

Absolutely. This is why the EXTRA_ environment variables exist in the Dask Docker image.

Alternatively, you can install dependencies on the fly, although if you scale up after this point new workers will not have the dependency.

import os

client.run(lambda: os.system("pip install <package>"))

I guess you could do this in a preload script, and I think you should be able to do this today in dask-cloudprovider.
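
For reference, a minimal sketch of such a preload script, assuming it is saved as install_deps.py and passed to workers via dask-worker --preload install_deps.py (or the distributed.worker.preload config); the package is a placeholder:

# install_deps.py
import subprocess
import sys


def dask_setup(worker):
    # Runs on each worker at startup, before it begins accepting tasks.
    subprocess.check_call([sys.executable, "-m", "pip", "install", "numba"])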

I've always wanted to see something in the scheduler where you could perhaps pass a function with the client which gets executed on all new workers. This way you wouldn't need to be in control of how the workers are created in order to inject your preload.

# Pseudo-code

def my_custom_preload():
    import os
    os.system("pip install <package>")

client.register_worker_preload_function(my_custom_preload)

That said though, maybe there is a better way to solve the dependency problem?

This is a large problem that the Dask community has been very keen to resolve, but it is a large undertaking. I guess this is why there are groups like Blazing, Coiled or Saturn popping up to try and provide this as a service.

... even Docker alone has too much of a learning curve ...

I previously worked in weather/climate sciences and totally sympathise with your point here. This is kind of what I was saying with "it involves learning about concepts which are tangential to what you are actually trying to achieve". It's not that the technologies are too complex, but rather that they are too far off the critical path.

To be a scientist/researcher in 2020 there are a number of technologies that you need to master in order to do your work. Things like Python, Conda, Git, Bash, etc. All of these are necessary tools in order to do your craft. I personally feel like Docker has joined that list in the last few years and learning it is part of the cost of doing business. However, I also feel like as a community we should be making this list smaller, not larger.

So this is where things get tricky. We seem to be drawing the line at conda in terms of managing environments. However conda alone is not enough for large scale workloads. So it becomes our responsibility to make use of more complex (or just nested) packaging methods like Docker and Packer, but work to abstract this away from the user. I'm generally fine with this, but sometimes struggle to draw the line in the right place.

I guess all Docker and Packer are giving you here is caching. If we built a conda environment from scratch every time we started a worker then your cluster would be very slow to scale. Projects like repo2docker exist to try and create a simple bridge from conda to docker. Perhaps we should also create something like repo2machine which continues to the level that packer does.

Sorry for all the thinking out loud here, but I'm finding this conversation useful!

mrocklin commented 3 years ago

I think that the PipInstall worker plugin was added to master recently. https://github.com/dask/distributed/pull/3216
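
For anyone finding this later, a minimal sketch of using that plugin, assuming a distributed version where PipInstall is importable from the top level (the scheduler address and package name are placeholders):

from distributed import Client, PipInstall

client = Client("scheduler-address:8786")

# Installs the packages on every current worker and on any worker that
# joins the cluster afterwards.
plugin = PipInstall(packages=["numba"], pip_options=["--upgrade"])
client.register_worker_plugin(plugin)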


jacobtomlinson commented 3 years ago

Wooo!

eric-czech commented 3 years ago

Things like Python, Conda, Git, Bash, etc. All of these are necessary tools in order to do your craft. I personally feel like Docker has joined that list in the last few years and learning it is part of the cost of doing business. However, I also feel like as a community we should be making this list smaller, not larger.

Nicely put. Smaller would be great but maybe the need for tools with layered complexity (e.g. numpy -> dask -> xarray) is inevitable to meet the maximum sophistication of a larger number of users. I always imagine that scientists collect at the ends of those spectrums though -- I wonder how true that is.

Projects like repo2docker exist to try and create a simple bridge from conda to docker. Perhaps we should also create something like repo2machine which continues to the level that packer does.

Oh interesting, had not seen that before. I like the caching metaphor. It still surprises me a bit that installation of compiled packages via conda can't be made to be faster than the download and execution of an entire containerized OS image that contains all those same packages. Have you used mamba much yet? I have not at all, but I'm curious if you see that as likely to close the gap.

jacobtomlinson commented 3 years ago

Have you used mamba much yet?

A little, and it seems nice. I think the performance increases will help things, but my worry is about reproducibility. The times I've tried to dump out a conda environment and recreate it recently have failed because packages go missing from anaconda.org, especially when developing on nightly builds.

This conversation reminds me of this meme.

[Image: "It works on my machine" Docker meme]