dask / dask-yarn

Deploy dask on YARN clusters
http://yarn.dask.org
BSD 3-Clause "New" or "Revised" License

dask-yarn job fails with dumps_msgpack ImportError #147

Closed ikerforce closed 3 years ago

ikerforce commented 3 years ago

The following script fails when run on EMR and HDInsight.

import os
os.environ['ARROW_LIBHDFS_DIR'] = '/usr/hdp/4.1.4.0/'

from dask_yarn import YarnCluster
from dask.distributed import Client
import dask.dataframe as dd

env_path = 'hdfs:///conda_envs/dask_yarn.tar.gz'

cluster = YarnCluster(environment=env_path,
                      worker_vcores=2,
                      worker_memory="8GiB")

cluster.scale(1)

# if __name__ == '__main__':

client = Client(cluster)
path = 'hdfs:///samples/data_100K_dask_casted/data_100K_dask_casted/*'

# df = dd.read_csv('hdfs:///samples/test.csv')
df = dd.read_parquet(path, engine='pyarrow')

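# df.head() already returns a pandas DataFrame, which has no .compute();
# the traceback in this report comes from the call below.
print(df.count().compute())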

What happened:

The error is the following:

Traceback (most recent call last):
  File "dask_test.py", line 30, in <module>
    print(df.count().compute())
  File "/home/hadoop/miniconda3/envs/dask_yarn/lib/python3.8/site-packages/dask/base.py", line 284, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/home/hadoop/miniconda3/envs/dask_yarn/lib/python3.8/site-packages/dask/base.py", line 566, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/home/hadoop/miniconda3/envs/dask_yarn/lib/python3.8/site-packages/distributed/client.py", line 2646, in get
    futures = self._graph_to_futures(
  File "/home/hadoop/miniconda3/envs/dask_yarn/lib/python3.8/site-packages/distributed/client.py", line 2554, in _graph_to_futures
    dsk = dsk.__dask_distributed_pack__(self, keyset)
  File "/home/hadoop/miniconda3/envs/dask_yarn/lib/python3.8/site-packages/dask/highlevelgraph.py", line 946, in __dask_distributed_pack__
    from distributed.protocol.core import dumps_msgpack
ImportError: cannot import name 'dumps_msgpack' from 'distributed.protocol.core' (/home/hadoop/miniconda3/envs/dask_yarn/lib/python3.8/site-packages/distributed/protocol/core.py)
Exception ignored in: <function YarnCluster.__del__ at 0x7f6584a2ac10>

What you expected to happen:

Correct execution of the code, as on my local computer (without dask-yarn).

Anything else we need to know?:

I was able to get around this error by checking the changelogs and realising that dumps_msgpack was removed in the latest distributed release. However, I followed the exact steps from the latest official documentation, so I believe this should be corrected, or a note should be posted advising users to install distributed 2021.4.0 instead of the default 2021.4.1.

Environment:

# packages in environment at /home/hadoop/miniconda3/envs/dask_yarn:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       1_gnu    conda-forge
aiobotocore               1.3.0              pyhd8ed1ab_0    conda-forge
aiohttp                   3.7.4            py38h497a2fe_0    conda-forge
aioitertools              0.7.1              pyhd8ed1ab_0    conda-forge
async-timeout             3.0.1                   py_1000    conda-forge
attrs                     20.3.0             pyhd3deb0d_0    conda-forge
blas                      1.0                    openblas    anaconda
bokeh                     2.2.3                    py38_0    anaconda
boost-cpp                 1.74.0               hc6e9bd1_2    conda-forge
botocore                  1.20.49            pyhd8ed1ab_0    conda-forge
brotlipy                  0.7.0           py38h497a2fe_1001    conda-forge
bzip2                     1.0.8                h7f98852_4    conda-forge
c-ares                    1.17.1               h7f98852_1    conda-forge
ca-certificates           2020.12.5            ha878542_0    conda-forge
certifi                   2020.12.5        py38h578d9bd_1    conda-forge
cffi                      1.14.5           py38ha65f79e_0    conda-forge
chardet                   4.0.0            py38h578d9bd_1    conda-forge
click                     7.1.2              pyh9f0ad1d_0    conda-forge
cloudpickle               1.6.0                      py_0    conda-forge
conda-pack                0.6.0              pyhd3deb0d_0    conda-forge
cryptography              3.4.7            py38ha5dfef3_0    conda-forge
curl                      7.76.1               h979ede3_1    conda-forge
cytoolz                   0.11.0           py38h497a2fe_3    conda-forge
dask                      2021.4.0           pyhd3eb1b0_0  
dask-core                 2021.4.0           pyhd3eb1b0_0  
dask-yarn                 0.9              py38h578d9bd_0    conda-forge
distributed               2021.4.1         py38h578d9bd_0    conda-forge
freetype                  2.10.4               h5ab3b9f_0    anaconda
fsspec                    2021.4.0           pyhd8ed1ab_0    conda-forge
gettext                   0.19.8.1          h0b5b191_1005    conda-forge
greenlet                  1.0.0            py38h709712a_0    conda-forge
grpcio                    1.37.0           py38hdd6454d_0    conda-forge
heapdict                  1.0.1                      py_0    conda-forge
icu                       68.1                 h58526e2_0    conda-forge
idna                      3.1                pyhd3deb0d_0    conda-forge
jinja2                    2.11.2                     py_0    anaconda
jmespath                  0.10.0             pyh9f0ad1d_0    conda-forge
jpeg                      9b                   habf39ab_1    anaconda
krb5                      1.17.2               h926e7f8_0    conda-forge
lcms2                     2.11                 h396b838_0    anaconda
ld_impl_linux-64          2.35.1               hea4e1c9_2    conda-forge
libcurl                   7.76.1               hc4aaa36_1    conda-forge
libedit                   3.1.20191231         he28a2e2_2    conda-forge
libev                     4.33                 h516909a_1    conda-forge
libffi                    3.3                  h58526e2_2    conda-forge
libgcc-ng                 9.3.0               h2828fa1_19    conda-forge
libgcrypt                 1.9.3                h7f98852_0    conda-forge
libgfortran-ng            7.3.0                hdf63c60_0    anaconda
libgomp                   9.3.0               h2828fa1_19    conda-forge
libgpg-error              1.42                 h9c3ff4c_0    conda-forge
libgsasl                  1.8.0                         2    conda-forge
libhdfs3                  2.3               hb485604_1015    conda-forge
libiconv                  1.16                 h516909a_0    conda-forge
libnghttp2                1.43.0               h812cca2_0    conda-forge
libntlm                   1.4               h7f98852_1002    conda-forge
libopenblas               0.3.10               h5a2b251_0    anaconda
libpng                    1.6.37               hbc83047_0    anaconda
libprotobuf               3.15.8               h780b84a_0    conda-forge
libssh2                   1.9.0                ha56f1ee_6    conda-forge
libstdcxx-ng              9.3.0               h6de172a_19    conda-forge
libtiff                   4.1.0                h2733197_1  
libuuid                   2.32.1            h7f98852_1000    conda-forge
libxml2                   2.9.10               h72842e0_4    conda-forge
locket                    0.2.0                      py_2    conda-forge
lz4-c                     1.9.3                h9c3ff4c_0    conda-forge
markupsafe                1.1.1            py38h7b6447c_0    anaconda
msgpack-python            1.0.2            py38h1fd1430_1    conda-forge
multidict                 5.1.0            py38h497a2fe_1    conda-forge
ncurses                   6.2                  h58526e2_4    conda-forge
numpy                     1.19.1           py38h30dfecb_0    anaconda
numpy-base                1.19.1           py38h75fe3a5_0    anaconda
olefile                   0.46                       py_0    anaconda
openssl                   1.1.1k               h7f98852_0    conda-forge
packaging                 20.4                       py_0    anaconda
pandas                    1.1.3            py38he6710b0_0    anaconda
partd                     1.2.0              pyhd8ed1ab_0    conda-forge
pillow                    8.0.0            py38h9a89aac_0    anaconda
pip                       21.1               pyhd8ed1ab_0    conda-forge
protobuf                  3.15.8           py38h709712a_0    conda-forge
psutil                    5.8.0            py38h497a2fe_1    conda-forge
pyarrow                   4.0.0                    pypi_0    pypi
pycparser                 2.20               pyh9f0ad1d_2    conda-forge
pyopenssl                 20.0.1             pyhd8ed1ab_0    conda-forge
pyparsing                 2.4.7                      py_0    anaconda
pysocks                   1.7.1            py38h578d9bd_3    conda-forge
python                    3.8.8           hffdb5ce_0_cpython    conda-forge
python-dateutil           2.8.1                      py_0    anaconda
python_abi                3.8                      1_cp38    conda-forge
pytz                      2020.1                     py_0    anaconda
pyyaml                    5.4.1            py38h497a2fe_0    conda-forge
readline                  8.1                  h46c0cb4_0    conda-forge
s3fs                      2021.4.0           pyhd8ed1ab_0    conda-forge
setuptools                49.6.0           py38h578d9bd_3    conda-forge
six                       1.15.0             pyh9f0ad1d_0    conda-forge
skein                     0.8.1            py38h578d9bd_1    conda-forge
sortedcontainers          2.3.0              pyhd8ed1ab_0    conda-forge
sqlalchemy                1.4.11           py38h497a2fe_0    conda-forge
sqlite                    3.35.5               h74cdb3f_0    conda-forge
tblib                     1.7.0              pyhd8ed1ab_0    conda-forge
tk                        8.6.10               h21135ba_1    conda-forge
toolz                     0.11.1                     py_0    conda-forge
tornado                   6.1              py38h497a2fe_1    conda-forge
typing-extensions         3.7.4.3                       0    conda-forge
typing_extensions         3.7.4.3                    py_0    anaconda
urllib3                   1.26.4             pyhd8ed1ab_0    conda-forge
wheel                     0.36.2             pyhd3deb0d_0    conda-forge
wrapt                     1.12.1           py38h497a2fe_3    conda-forge
xz                        5.2.5                h516909a_1    conda-forge
yaml                      0.2.5                h516909a_0    conda-forge
yarl                      1.6.3            py38h497a2fe_1    conda-forge
zict                      2.0.0                      py_0    conda-forge
zlib                      1.2.11            h516909a_1010    conda-forge
zstd                      1.4.9                ha95c52a_0    conda-forge

jacobtomlinson commented 3 years ago

As you say, it looks like the dumps_msgpack function was removed in dask/distributed#4677 and dask/dask#7525.

It also looks like distributed 2021.4.1 should depend on dask 2021.4.1, and that is being discussed here.

As a workaround, could you ensure you have the latest versions of both dask and distributed installed? It looks like you have an older version of dask (2021.4.0) in your environment.
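
A quick way to spot this kind of mismatch before submitting a job is to compare the installed versions directly. The following is only a minimal sketch: the helper names installed_version and versions_match are made up for illustration, and it assumes dask and distributed are released in lockstep under the same version number, as with the 2021.4.x releases mentioned in this thread.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def versions_match(first, second):
    """True only when both version strings are present and identical."""
    return first is not None and first == second


if __name__ == "__main__":
    # Hypothetical check: assumes matching version numbers mean compatible releases.
    dask_version = installed_version("dask")
    distributed_version = installed_version("distributed")
    print(f"dask={dask_version} distributed={distributed_version}")
    if not versions_match(dask_version, distributed_version):
        print("Version mismatch: install the same release of dask and distributed.")
```

Run it inside the same conda environment that gets packed and shipped to YARN, since that is the environment the job actually uses.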

ikerforce commented 3 years ago

I followed your instructions and installed with

conda install -c conda-forge dask=2021.4.1 dask-core=2021.4.1 distributed=2021.4.1 dask-yarn

and the issue is gone.

I hope the default conda install works soon.

Thanks!

jacobtomlinson commented 3 years ago

That's great. I'm going to close this out as worked around; the underlying dependency pinning should be addressed in the conda recipe soon.