Closed: matthiasdv closed this issue 4 years ago
Thank you for raising an issue @matthiasdv .
This is a signal that your dask.distributed versions are out of sync. I've raised a small PR in Dask to improve the error message here: https://github.com/dask/distributed/pull/4076
I'm curious, how did you install coiled? I would have expected it to require a sufficiently recent version of dask/distributed.
Can I ask for the following:
import coiled, dask, distributed
print(coiled.__version__)
print(dask.__version__)
print(distributed.__version__)
Thanks for raising an issue @matthiasdv! The KeyError: 'pickle-protocol' error that's popping up looks like it's related to some recent changes in distributed. Can you check that you're using distributed >= 2.23.0?
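One quick way to perform that check locally is to compare the installed release against the 2.23.0 minimum. This is only a sketch: the as_tuple helper below is illustrative, not part of dask or distributed.

```python
# Sketch: check whether the locally installed distributed meets the
# >= 2.23.0 requirement. The as_tuple helper is illustrative only.
import re
from importlib.metadata import version, PackageNotFoundError  # Python 3.8+

def as_tuple(v):
    """Leading numeric components of a version string: '2.24.0' -> (2, 24, 0)."""
    return tuple(int(n) for n in re.findall(r"\d+", v)[:3])

try:
    installed = version("distributed")
except PackageNotFoundError:
    installed = None

if installed is None:
    print("distributed is not installed")
elif as_tuple(installed) >= (2, 23, 0):
    print(f"distributed {installed}: new enough")
else:
    print(f"distributed {installed}: too old, need >= 2.23.0")
```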
I think that this is the problem: https://github.com/coiled/cloud/pull/735
I tried reproducing the issue, but unfortunately it solved itself. I re-installed the Coiled library using:
!pip install coiled --force-reinstall --user
And it seems to work now. What I believe must have happened is that the original
!pip install coiled --user
did not properly resolve the required distributed version. I struggled to reproduce this on the actual environment, but the versions on the Docker image are as follows:
distributed 2.18.0 py38h32f6830_0 conda-forge
dask 2.18.1 py_0 conda-forge
After running pip install coiled --force-reinstall:
import coiled, dask, distributed
print(coiled.__version__)
print(dask.__version__)
print(distributed.__version__)
0.0.21
2.24.0
2.24.0
The missing piece of information is the set of versions during the failed state. I did notice that the --force-reinstall upgraded the distributed version:
Successfully installed Jinja2-2.11.2 MarkupSafe-1.1.1 aiobotocore-1.1.0 aiohttp-3.6.2 aioitertools-0.7.0 async-timeout-3.0.1 attrs-20.1.0 backcall-0.2.0 bokeh-2.2.0 botocore-1.17.44 chardet-3.0.4 click-7.1.2 cloudpickle-1.5.0 coiled-0.0.21 dask-2.24.0 decorator-4.4.2 distributed-2.24.0 docutils-0.15.2 fsspec-0.8.0 heapdict-1.0.1 idna-2.10 ipython-7.17.0 ipython-genutils-0.2.0 jedi-0.17.2 jmespath-0.10.0 locket-0.2.0 msgpack-1.0.0 multidict-4.7.6 numpy-1.19.1 packaging-20.4 pandas-1.1.1 parso-0.7.1 partd-1.1.0 pexpect-4.8.0 pickleshare-0.7.5 pillow-7.2.0 prompt-toolkit-3.0.6 psutil-5.7.2 ptyprocess-0.6.0 pygments-2.6.1 pyparsing-2.4.7 python-dateutil-2.8.1 pytz-2020.1 pyyaml-5.3.1 s3fs-0.5.0 setuptools-49.6.0 six-1.15.0 sortedcontainers-2.2.2 tblib-1.7.0 toolz-0.10.0 tornado-6.0.4 traitlets-4.3.3 typing-extensions-3.7.4.3 urllib3-1.25.10 wcwidth-0.2.5 wrapt-1.12.1 yarl-1.5.1 zict-2.0.0
@mrocklin Dask has a .get_versions(check=True) method that we use regularly to validate that both the scheduler and workers are running the desired versions of certain libraries. This is the first go-to in case of strange behaviour. Is there a way to use that in this particular situation?
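For reference, a sketch of how such a check could be applied here. With a live cluster, client.get_versions(check=True) raises if versions disagree; the find_mismatches helper below is hypothetical, and it assumes a mapping shape with "client", "scheduler", and per-address "workers" entries that each carry a "packages" dict, which is worth verifying against your installed distributed.

```python
# Sketch of a cross-cluster version check. find_mismatches is a hypothetical
# helper; the mapping shape is an assumption about what get_versions returns.
def find_mismatches(versions, package):
    """Collect the reported version of `package` for each cluster component."""
    seen = {"client": versions["client"]["packages"].get(package),
            "scheduler": versions["scheduler"]["packages"].get(package)}
    for addr, info in versions["workers"].items():
        seen[addr] = info["packages"].get(package)
    return seen

# With a live cluster this would look like:
#   from distributed import Client
#   client = Client("<scheduler-address>")
#   client.get_versions(check=True)   # raises if versions disagree
# Here we exercise the helper on a hand-written mapping instead.
example = {
    "client": {"packages": {"distributed": "2.24.0"}},
    "scheduler": {"packages": {"distributed": "2.23.0"}},
    "workers": {"tcp://10.0.0.1:40000": {"packages": {"distributed": "2.23.0"}}},
}
print(find_mismatches(example, "distributed"))
```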
Not entirely there yet though:
/home/jovyan/.local/lib/python3.8/site-packages/distributed/client.py:1138: VersionMismatchWarning: Mismatched versions found
+-------------+---------------+---------------+---------------+
| Package | client | scheduler | workers |
+-------------+---------------+---------------+---------------+
| dask | 2.24.0 | 2.23.0 | 2.23.0 |
| distributed | 2.24.0 | 2.23.0 | 2.23.0 |
| python | 3.8.3.final.0 | 3.8.5.final.0 | 3.8.5.final.0 |
+-------------+---------------+---------------+---------------+
warnings.warn(version_module.VersionMismatchWarning(msg[0]["warning"]))
It seems the install resolved to a higher version than the one running on the cluster.
I suspect that that version combination should be fine, though.
2.23 introduced a significant break in the protocol. As long as you're above that you should be ok.
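That rule can be stated concretely as a sketch (the helper names here are illustrative): the versions do not need to be identical, but every component should be at or past the 2.23 protocol break.

```python
# Sketch of the compatibility rule described above: client, scheduler, and
# workers may differ, but all must be at or past the 2.23 protocol break.
PROTOCOL_BREAK = (2, 23)

def release(v):
    """Major/minor components of a version string: '2.24.0' -> (2, 24)."""
    return tuple(int(p) for p in v.split(".")[:2])

def protocol_compatible(*versions):
    return all(release(v) >= PROTOCOL_BREAK for v in versions)

# The mismatch table above: client 2.24.0, scheduler and workers 2.23.0.
print(protocol_compatible("2.24.0", "2.23.0", "2.23.0"))  # True
# An old 2.18.0 component would not be compatible.
print(protocol_compatible("2.24.0", "2.18.0"))  # False
```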
Ok, wonderful. I'm currently computing some aggregations on the New York Taxi dataset. All in all, Coiled was fairly easy to get up and running on first use.
Glad to hear it : )
Running the Coiled getting-started example works fine on my local development machine, but when I attempt to run it from a Kubernetes cluster on Google Compute Engine I run into the following error when trying to instantiate the client:
Inspecting the cluster logs in the dashboard I see the following stack trace from the scheduler:
Obviously, the client on the GCE machine is having issues connecting to the Coiled-managed scheduler. What would be the first things to check? Which ports should be open, and for which protocols?
Any help is much appreciated