coiled / feedback

A place to provide Coiled feedback

List of all the libraries that need to match on client, scheduler and worker #10

Closed by ericdill 1 year ago

ericdill commented 4 years ago

The first time I tried to create a cluster, I got this helpful warning: [screenshot of a version-mismatch warning]

As a non-expert here, it would be useful to know all of the libraries that need to match on the client, scheduler, and worker(s). It would also be helpful to list somewhere which versions of those libraries are in the daskdev/dask:latest docker image that is listed on coiledhq.com. Finally, you might consider pinning a specific image in the default clusters that you provide. That would make it clear when the dependencies are going to move, so I can react accordingly.
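
For what it's worth, distributed can already report which packages differ across the cluster. A minimal sketch, assuming a cluster is reachable at an illustrative scheduler address:

from distributed import Client

# Connect to the running scheduler (address is illustrative).
client = Client("tcp://scheduler-address:8786")

# Compare package versions across client, scheduler, and workers.
# With check=True a mismatch in required packages raises an error;
# the returned dict lists the versions seen on each side.
versions = client.get_versions(check=True)
print(versions["client"]["packages"])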

mrocklin commented 4 years ago

We're improving the error messaging a bit upstream. https://github.com/dask/distributed/issues/3767

My medium-term plan is to host conda environments from Coiled. I'd be curious to learn more about your thoughts on that UX. The intent is something like ...

>>> CoiledCluster()
Warning: your software environment does not match.  Consider conda installing the remote environment and trying again

    conda env create -f https://dev.coiledhq.com/ericdill/default.conda

Thoughts?

ericdill commented 4 years ago

I would prefer a versioned metapackage that I can include in my environment spec, something like conda install -c coiledhq ericdill-default=<version>. That way I can effectively compose my runtime environment. In my past experience there were a few categories of libraries we needed to include in our analysis environments.

Having a full environment installable from coiledhq is necessary but insufficient.

The way I plan to solve this is to have a dtn-dask metapackage that gets installed as the only conda package on my dask workers, and also gets installed into the fat environment that I'll be using as my Jupyter kernel.

package:
  name: dtn-dask
  version: 0.1
requirements:
  run:
    - dask 2.15
    - distributed 2.15
    - numpy 1.??
    - msgpack 1.0.0
    - lz4 3.0.2
    - python-blosc 1.9.1
    - cloudpickle 1.4.0

test:
  imports:
    - dask
    - distributed
    - numpy
    - msgpack
    - lz4
    - blosc
    - cloudpickle

about:
  ...

Then my dask worker image gets built from something simple like:

FROM continuumio/miniconda:some-version
RUN conda install -c dtn dtn-dask=0.1 && conda clean -ay

And my Jupyter kernel environment gets specified as an environment.yaml plus a Dockerfile that sets up some extra Jupyter extensions, etc.

name: datascience-{{ version }}
channels:
  - dtn
  - conda-forge
dependencies:
  - python 3.8
  # distributed computing
  - dtn-dask 0.1
  # Data access
  - dtn.data
  - sqlalchemy
  - s3fs
  - awscli
  # Visualization
  - matplotlib
  - holoviews
  - panel
  - etc...
  # Might consider moving these into a dtn-notebook metapackage
  - notebook
  - jupyterlab
  - dask-labextension
  - jupyterlab-git
  - etc...

At least, this is my intent. I've had success with this pattern in the past; we'll see if it works for this use case too. The environment spec will get stored in our git server, built on CI/CD, archived with conda-pack (or just zip/tar), and then dumped on S3 somewhere. It will probably also get turned into a Docker image. Then the environment can be deployed pretty much anywhere.
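
As a rough sketch of the archive-and-upload step (the environment name, bucket, and paths are hypothetical), conda-pack's Python API plus s3fs would look something like:

import conda_pack
import s3fs

# Archive the built conda environment into a relocatable tarball.
conda_pack.pack(name="datascience-0.1", output="datascience-0.1.tar.gz")

# Upload the archive to S3 (bucket and key are hypothetical).
fs = s3fs.S3FileSystem()
fs.put("datascience-0.1.tar.gz", "my-bucket/envs/datascience-0.1.tar.gz")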

jennakwon06 commented 3 years ago

Hello - I am wondering whether variation in the Python version is OK. I have a situation where the client is on 3.7.9 and the scheduler/workers are on 3.7.7.

dantheman39 commented 3 years ago

Hi @jennakwon06! Since the difference is at the patch level (and not 3.7 vs. 3.8, for example), it will most likely be OK.
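
For anyone who wants to confirm this on their own cluster, the version report also includes the interpreter version on each machine. A small sketch, assuming a connected Client (the keys follow distributed's version report):

from distributed import Client

client = Client("tcp://scheduler-address:8786")  # address is illustrative

versions = client.get_versions()
print("client:   ", versions["client"]["host"]["python"])
print("scheduler:", versions["scheduler"]["host"]["python"])
for addr, info in versions["workers"].items():
    print(addr, info["host"]["python"])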

shughes-uk commented 1 year ago

Package sync resolves this fully.
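
For readers landing here later: package sync inspects the local Python environment and recreates it on the scheduler and workers, so the client and cluster stay in step without a hand-pinned image or metapackage. A minimal sketch, assuming the package_sync flag as documented at the time (it may since have become the default behaviour):

import coiled

# Package sync scans the local environment and replicates it on the cluster,
# so client/scheduler/worker versions match automatically.
cluster = coiled.Cluster(n_workers=4, package_sync=True)
client = cluster.get_client()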