Closed johnbensnyder closed 5 years ago
@martindurant
It seems that I was causing the problem. The problem doesn't seem to be with PyArrow or Fastparquet: it happens when I read a CSV with dask.dataframe.read_csv and pass a dict_keys object, instead of a list, as the optional usecols argument. Dask then tries to serialize the following object (which can be seen in my previous post):
['usecols', dict_keys(...)]
Removing the usecols argument, or passing it as a list (usecols=list(...)), fixes the problem. I'm sorry for the trouble.
Do you want me to try to fix the problem and submit a pull request?
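A minimal sketch of why the list conversion helps, assuming the failure comes from dict_keys views not being picklable. Here pickle stands in for whatever serializer dask.distributed actually applies, and the `columns` dict and `is_picklable` helper are hypothetical names used only for illustration:

```python
import pickle


def is_picklable(obj):
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
    except Exception:
        return False
    return True


# Hypothetical column spec; only its keys matter for usecols.
columns = {"a": int, "b": float}

# A dict_keys view cannot be pickled, but a plain list of the same names can.
print(is_picklable(columns.keys()))
print(is_picklable(list(columns.keys())))
```

Under that assumption, passing usecols=list(columns.keys()) to read_csv gives dask a plain list it can send to the workers.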
I'm not sure there's anything to fix: the (pandas) docstring says that usecols must be list-like.
In any case, I'll close this, since we now know what's going on.
I got the same error on Jul 10, 2020 (MacPro 2017, macOS Catalina, miniconda env).
How can I solve this error? Is it because I have a single laptop, and distributed needs a cluster of multiple computers?
# imports
import numpy as np
import pandas as pd
import dask
import dask.dataframe as dd
import dask.array as da
import dask_ml
import pyarrow
print([(x.__name__, x.__version__) for x in
[np, pd, dask, dask_ml, pyarrow]])
[('numpy', '1.18.5'), ('pandas', '1.0.5'), ('dask', '2.20.0'), ('dask_ml', '1.5.0'), ('pyarrow', '0.17.1')]
# data
a = da.random.normal(size=(2000, 2000), chunks=(1000, 1000)) # data
res = a.dot(a.T).mean(axis=0) # operation
res = res.persist() # start computation in the background
# code
from dask.distributed import Client, progress
client = Client() # use dask.distributed by default
progress(res) # watch progress
res.compute() # convert to final result when done if desired
distributed.protocol.core - CRITICAL - Failed to deserialize
Traceback (most recent call last):
File "/Users/poudel/.local/lib/python3.7/site-packages/distributed-2.9.3-py3.7.egg/distributed/protocol/core.py", line 106, in loads
header = msgpack.loads(header, use_list=False, **msgpack_opts)
File "/Users/poudel/.local/lib/python3.7/site-packages/msgpack-1.0.0rc1-py3.7-macosx-10.7-x86_64.egg/msgpack/fallback.py", line 135, in unpackb
ret = unpacker._unpack()
File "/Users/poudel/.local/lib/python3.7/site-packages/msgpack-1.0.0rc1-py3.7-macosx-10.7-x86_64.egg/msgpack/fallback.py", line 676, in _unpack
ret[key] = self._unpack(EX_CONSTRUCT)
File "/Users/poudel/.local/lib/python3.7/site-packages/msgpack-1.0.0rc1-py3.7-macosx-10.7-x86_64.egg/msgpack/fallback.py", line 672, in _unpack
"%s is not allowed for map key" % str(type(key))
ValueError: <class 'tuple'> is not allowed for map key
distributed.core - ERROR - <class 'tuple'> is not allowed for map key
Traceback (most recent call last):
File "/Users/poudel/.local/lib/python3.7/site-packages/distributed-2.9.3-py3.7.egg/distributed/core.py", line 456, in handle_stream
msgs = await comm.read()
File "/Users/poudel/.local/lib/python3.7/site-packages/distributed-2.9.3-py3.7.egg/distributed/comm/tcp.py", line 212, in read
frames, deserialize=self.deserialize, deserializers=deserializers
File "/Users/poudel/.local/lib/python3.7/site-packages/distributed-2.9.3-py3.7.egg/distributed/comm/utils.py", line 69, in from_frames
res = _from_frames()
File "/Users/poudel/.local/lib/python3.7/site-packages/distributed-2.9.3-py3.7.egg/distributed/comm/utils.py", line 55, in _from_frames
frames, deserialize=deserialize, deserializers=deserializers
File "/Users/poudel/.local/lib/python3.7/site-packages/distributed-2.9.3-py3.7.egg/distributed/protocol/core.py", line 106, in loads
header = msgpack.loads(header, use_list=False, **msgpack_opts)
File "/Users/poudel/.local/lib/python3.7/site-packages/msgpack-1.0.0rc1-py3.7-macosx-10.7-x86_64.egg/msgpack/fallback.py", line 135, in unpackb
ret = unpacker._unpack()
File "/Users/poudel/.local/lib/python3.7/site-packages/msgpack-1.0.0rc1-py3.7-macosx-10.7-x86_64.egg/msgpack/fallback.py", line 676, in _unpack
ret[key] = self._unpack(EX_CONSTRUCT)
File "/Users/poudel/.local/lib/python3.7/site-packages/msgpack-1.0.0rc1-py3.7-macosx-10.7-x86_64.egg/msgpack/fallback.py", line 672, in _unpack
"%s is not allowed for map key" % str(type(key))
ValueError: <class 'tuple'> is not allowed for map key
distributed.core - ERROR - <class 'tuple'> is not allowed for map key
Traceback (most recent call last):
File "/Users/poudel/.local/lib/python3.7/site-packages/distributed-2.9.3-py3.7.egg/distributed/core.py", line 412, in handle_comm
result = await result
File "/Users/poudel/.local/lib/python3.7/site-packages/distributed-2.9.3-py3.7.egg/distributed/scheduler.py", line 2491, in add_client
await self.handle_stream(comm=comm, extra={"client": client})
File "/Users/poudel/.local/lib/python3.7/site-packages/distributed-2.9.3-py3.7.egg/distributed/core.py", line 456, in handle_stream
msgs = await comm.read()
File "/Users/poudel/.local/lib/python3.7/site-packages/distributed-2.9.3-py3.7.egg/distributed/comm/tcp.py", line 212, in read
frames, deserialize=self.deserialize, deserializers=deserializers
File "/Users/poudel/.local/lib/python3.7/site-packages/distributed-2.9.3-py3.7.egg/distributed/comm/utils.py", line 69, in from_frames
res = _from_frames()
File "/Users/poudel/.local/lib/python3.7/site-packages/distributed-2.9.3-py3.7.egg/distributed/comm/utils.py", line 55, in _from_frames
frames, deserialize=deserialize, deserializers=deserializers
File "/Users/poudel/.local/lib/python3.7/site-packages/distributed-2.9.3-py3.7.egg/distributed/protocol/core.py", line 106, in loads
header = msgpack.loads(header, use_list=False, **msgpack_opts)
File "/Users/poudel/.local/lib/python3.7/site-packages/msgpack-1.0.0rc1-py3.7-macosx-10.7-x86_64.egg/msgpack/fallback.py", line 135, in unpackb
ret = unpacker._unpack()
File "/Users/poudel/.local/lib/python3.7/site-packages/msgpack-1.0.0rc1-py3.7-macosx-10.7-x86_64.egg/msgpack/fallback.py", line 676, in _unpack
ret[key] = self._unpack(EX_CONSTRUCT)
File "/Users/poudel/.local/lib/python3.7/site-packages/msgpack-1.0.0rc1-py3.7-macosx-10.7-x86_64.egg/msgpack/fallback.py", line 672, in _unpack
"%s is not allowed for map key" % str(type(key))
ValueError: <class 'tuple'> is not allowed for map key
---------------------------------------------------------------------------
CancelledError Traceback (most recent call last)
<ipython-input-7-4d5bc96d9bb1> in <module>
1 progress(res) # watch progress
2
----> 3 res.compute() # convert to final result when done if desired
~/opt/miniconda3/envs/dsk/lib/python3.7/site-packages/dask/base.py in compute(self, **kwargs)
164 dask.base.compute
165 """
--> 166 (result,) = compute(self, traverse=False, **kwargs)
167 return result
168
~/opt/miniconda3/envs/dsk/lib/python3.7/site-packages/dask/base.py in compute(*args, **kwargs)
442 postcomputes.append(x.__dask_postcompute__())
443
--> 444 results = schedule(dsk, keys, **kwargs)
445 return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
446
~/.local/lib/python3.7/site-packages/distributed-2.9.3-py3.7.egg/distributed/client.py in get(self, dsk, keys, restrictions, loose_restrictions, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, actors, **kwargs)
2590 should_rejoin = False
2591 try:
-> 2592 results = self.gather(packed, asynchronous=asynchronous, direct=direct)
2593 finally:
2594 for f in futures.values():
~/.local/lib/python3.7/site-packages/distributed-2.9.3-py3.7.egg/distributed/client.py in gather(self, futures, errors, direct, asynchronous)
1888 direct=direct,
1889 local_worker=local_worker,
-> 1890 asynchronous=asynchronous,
1891 )
1892
~/.local/lib/python3.7/site-packages/distributed-2.9.3-py3.7.egg/distributed/client.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
766 else:
767 return sync(
--> 768 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
769 )
770
~/.local/lib/python3.7/site-packages/distributed-2.9.3-py3.7.egg/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
343 if error[0]:
344 typ, exc, tb = error[0]
--> 345 raise exc.with_traceback(tb)
346 else:
347 return result[0]
~/.local/lib/python3.7/site-packages/distributed-2.9.3-py3.7.egg/distributed/utils.py in f()
327 if callback_timeout is not None:
328 future = gen.with_timeout(timedelta(seconds=callback_timeout), future)
--> 329 result[0] = yield future
330 except Exception as exc:
331 error[0] = sys.exc_info()
~/opt/miniconda3/envs/dsk/lib/python3.7/site-packages/tornado/gen.py in run(self)
733
734 try:
--> 735 value = future.result()
736 except Exception:
737 exc_info = sys.exc_info()
CancelledError:
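The paths in the traceback above show distributed-2.9.3 and msgpack-1.0.0rc1 being loaded from ~/.local/lib/python3.7/site-packages, while dask (reported as 2.20.0) comes from the miniconda env. One possible explanation, which is an assumption and not confirmed anywhere in this thread, is a version mismatch between those installs. A small sketch for checking which copy of each package Python actually imports (`describe_packages` is a hypothetical helper, not part of dask):

```python
import importlib
import site


def describe_packages(names):
    """Map each package name to (version, file path), or None if it cannot be imported."""
    report = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = None
            continue
        # Not every module defines __version__ or __file__, so fall back to "unknown".
        version = getattr(module, "__version__", "unknown")
        location = getattr(module, "__file__", "unknown")
        report[name] = (version, location)
    return report


# Print where each package is loaded from, plus the per-user site-packages directory.
for name, info in describe_packages(["dask", "distributed", "msgpack"]).items():
    print(name, info)
print("user site-packages:", site.getusersitepackages())
```

If the reported paths point into the user site-packages directory rather than the conda env, the env is probably picking up older copies installed there.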
I'm not sure there's anything to fix: the (pandas) docstring says that usecols must be list-like.
Can't we warn users with a ValueError exception?
I don't think we're generally in a position to check the types of arguments that we pass on to other functions.
Reading from Parquet is failing with PyArrow 0.13. Downgrading to PyArrow 0.12.1 seems to fix the problem. I've only encountered this when using the distributed client. Using a Dask dataframe by itself does not appear to be affected.
For example,
Gives
Similarly,
Causes this error