Network issues - What ports and protocol does Coiled need? #53

Closed matthiasdv closed 4 years ago

matthiasdv commented 4 years ago

Running the Coiled getting started example works fine on my local development machine. But when I run it from a Kubernetes cluster on Google Compute Engine, I get the following error when instantiating the client:

Obviously, the client on the GCE machine is having issues when trying to connect to the Coiled managed scheduler. What would be the first things to check? What ports should be opened for which protocols?

Any help is much appreciated

mrocklin commented 4 years ago

Thank you for raising an issue @matthiasdv .

This is a signal that your dask.distributed versions are out of sync. I've raised a small PR in Dask to improve the error message here:

I'm curious, how did you install coiled? I would have expected it to require a sufficiently recent version of dask/distributed.

Can I ask for the following:

import coiled, dask, distributed

jrbourbeau commented 4 years ago

Thanks for raising an issue @matthiasdv! The KeyError: 'pickle-protocol' error that's popping up looks like it's related to some recent changes in distributed. Can you check that you're using distributed >= 2.23.0 ?
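A minimal sketch of that check, for readers who want to run it from the affected environment. It uses only the standard library (`importlib.metadata`, Python 3.8+); the helper names `parse_version`, `meets_minimum`, and `installed_version` are made up for illustration and are not part of Dask or Coiled. The version parsing is deliberately simple and ignores pre-release suffixes.

```python
from importlib import metadata


def parse_version(text):
    """Turn '2.23.0' into (2, 23, 0), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    # Pad so that '2.23' and '2.23.0' compare as equal.
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def meets_minimum(installed, minimum="2.23.0"):
    """Return True if the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Report what is installed in the environment that runs the client.
    for name in ("coiled", "dask", "distributed"):
        print(name, installed_version(name))
    found = installed_version("distributed")
    if found is not None:
        print("distributed >= 2.23.0:", meets_minimum(found))
```

Running this in the same interpreter that creates the client (for example, the notebook kernel) matters, since a `--user` install can shadow the version baked into the image.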

mrocklin commented 4 years ago

I think that this is the problem:

matthiasdv commented 4 years ago

I tried reproducing the issue, but unfortunately it solved itself. I re-installed the Coiled library using:

!pip install coiled --force-reinstall --user

And it seems to work now. This is what I believe must have happened:

I struggled to reproduce this in the actual environment. The versions on the Docker image are as follows:

distributed               2.18.0           py38h32f6830_0    conda-forge
dask                      2.18.1                     py_0    conda-forge

After running pip install coiled --force-reinstall:

import coiled, dask, distributed


The missing piece of information is the set of versions during the failed state. I did notice that --force-reinstall upgraded the distributed version:

Successfully installed Jinja2-2.11.2 MarkupSafe-1.1.1 aiobotocore-1.1.0 aiohttp-3.6.2 aioitertools-0.7.0 async-timeout-3.0.1 attrs-20.1.0 backcall-0.2.0 bokeh-2.2.0 botocore-1.17.44 chardet-3.0.4 click-7.1.2 cloudpickle-1.5.0 coiled-0.0.21 dask-2.24.0 decorator-4.4.2 **_distributed-2.24.0_** docutils-0.15.2 fsspec-0.8.0 heapdict-1.0.1 idna-2.10 ipython-7.17.0 ipython-genutils-0.2.0 jedi-0.17.2 jmespath-0.10.0 locket-0.2.0 msgpack-1.0.0 multidict-4.7.6 numpy-1.19.1 packaging-20.4 pandas-1.1.1 parso-0.7.1 partd-1.1.0 pexpect-4.8.0 pickleshare-0.7.5 pillow-7.2.0 prompt-toolkit-3.0.6 psutil-5.7.2 ptyprocess-0.6.0 pygments-2.6.1 pyparsing-2.4.7 python-dateutil-2.8.1 pytz-2020.1 pyyaml-5.3.1 s3fs-0.5.0 setuptools-49.6.0 six-1.15.0 sortedcontainers-2.2.2 tblib-1.7.0 toolz-0.10.0 tornado-6.0.4 traitlets-4.3.3 typing-extensions- urllib3-1.25.10 wcwidth-0.2.5 wrapt-1.12.1 yarl-1.5.1 zict-2.0.0

matthiasdv commented 4 years ago

@mrocklin Dask has a .get_versions(check=True) method that we use regularly to validate that both the scheduler and workers are running the desired versions of certain libraries. This is the first go-to in case of strange behaviour. Is there a way to use that in this particular situation?

matthiasdv commented 4 years ago

Not entirely there yet though:

/home/jovyan/.local/lib/python3.8/site-packages/distributed/ VersionMismatchWarning: Mismatched versions found

| Package     | client | scheduler | workers |
|-------------|--------|-----------|---------|
| dask        | 2.24.0 | 2.23.0    | 2.23.0  |
| distributed | 2.24.0 | 2.23.0    | 2.23.0  |
| python      |        |           |         |

It seems the install resolved to a higher version than the one running on the cluster.
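To make a table like the one above easier to inspect programmatically, here is a rough sketch that lists the packages whose versions differ between client, scheduler, and workers. It assumes the input is a dictionary shaped like the result of `client.get_versions()`, with `client`, `scheduler`, and `workers` keys that each carry a `packages` mapping; that shape is an assumption here and should be checked against the installed release. The function name `find_mismatches` and the worker address are hypothetical.

```python
def find_mismatches(versions):
    """Return {package: {role: version}} for packages whose versions differ."""
    by_role = {
        "client": versions["client"]["packages"],
        "scheduler": versions["scheduler"]["packages"],
    }
    for address, info in versions.get("workers", {}).items():
        by_role["worker " + address] = info["packages"]

    # Collect every package name reported by any role.
    packages = set()
    for pkgs in by_role.values():
        packages.update(pkgs)

    mismatches = {}
    for package in sorted(packages):
        seen = {role: pkgs.get(package) for role, pkgs in by_role.items()}
        if len(set(seen.values())) > 1:
            mismatches[package] = seen
    return mismatches


# Hand-written sample mirroring the dask and distributed rows of the table;
# the worker address is invented for illustration.
sample = {
    "client": {"packages": {"dask": "2.24.0", "distributed": "2.24.0"}},
    "scheduler": {"packages": {"dask": "2.23.0", "distributed": "2.23.0"}},
    "workers": {
        "tls://10.0.0.1:1234": {
            "packages": {"dask": "2.23.0", "distributed": "2.23.0"}
        }
    },
}

for package, seen in find_mismatches(sample).items():
    print(package, seen)
```

A mismatch found this way is only a prompt to look closer; whether a given difference actually causes problems depends on the packages involved.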

mrocklin commented 4 years ago

I suspect that that version combination should be fine, though.

mrocklin commented 4 years ago

2.23 introduced a significant break in the protocol. As long as you're above that you should be ok.
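As a rough illustration of that rule of thumb, the sketch below flags any role whose distributed version falls below the 2.23 boundary. It treats 2.23 itself as acceptable, matching the `distributed >= 2.23.0` suggestion earlier in the thread. The names `PROTOCOL_BREAK`, `major_minor`, and `roles_below_break` are hypothetical, and the check says nothing about compatibility beyond this one boundary.

```python
# Assumed boundary, taken from the thread: releases from 2.23 onward
# share the newer protocol.
PROTOCOL_BREAK = (2, 23)


def major_minor(version):
    """Return (major, minor) as integers from a string like '2.24.0'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def roles_below_break(versions):
    """Return the roles whose distributed version predates the boundary."""
    return [
        role
        for role, version in versions.items()
        if major_minor(version) < PROTOCOL_BREAK
    ]


# Versions from the mismatch warning in this thread: nothing is flagged.
print(roles_below_break(
    {"client": "2.24.0", "scheduler": "2.23.0", "workers": "2.23.0"}
))

# The 2.18.0 reported for the Docker image, against the same cluster.
print(roles_below_break(
    {"client": "2.18.0", "scheduler": "2.23.0", "workers": "2.23.0"}
))
```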

matthiasdv commented 4 years ago

Ok, wonderful. Currently computing some aggregations on the New York Taxi dataset. All in all Coiled was fairly easy to get up and running during first use.

jrbourbeau commented 4 years ago

Glad to hear it : )