dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License

Long key causes stuck cluster on evict() #3737

Open crusaderky opened 4 years ago

crusaderky commented 4 years ago

dask client: Linux x64, NFSv4
NFS server: Linux x64, btrfs

My dask cluster got completely stuck. The dashboard shows 5 tasks in the processing state, but everything is frozen at 0% CPU. The worker logs show:

Traceback (most recent call last):
  File "/home/crusaderky/miniconda3/envs/commrisk/lib/python3.8/site-packages/tornado/ioloop.py", line 743, in _run_callback
    ret = callback()
  File "/home/crusaderky/miniconda3/envs/commrisk/lib/python3.8/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
    future.result()
  File "/home/crusaderky/miniconda3/envs/commrisk/lib/python3.8/site-packages/distributed/worker.py", line 2653, in memory_monitor
    k, v, weight = self.data.fast.evict()
  File "/home/crusaderky/miniconda3/envs/commrisk/lib/python3.8/site-packages/zict/lru.py", line 89, in evict
    cb(k, v)
  File "/home/crusaderky/miniconda3/envs/commrisk/lib/python3.8/site-packages/zict/buffer.py", line 60, in fast_to_slow
    self.slow[key] = value
  File "/home/crusaderky/miniconda3/envs/commrisk/lib/python3.8/site-packages/zict/func.py", line 41, in __setitem__
    self.d[key] = self.dump(value)
  File "/home/crusaderky/miniconda3/envs/commrisk/lib/python3.8/site-packages/zict/file.py", line 79, in __setitem__
    with open(os.path.join(self.directory, _safe_key(key)), "wb") as f:
OSError: [Errno 36] File name too long: '/nfs/crusaderky/prg/github/CommRisk-Core/dask-worker-space/worker-bfz0z0y1/storage/mtm.compute_trade_item-mtm.compute_portfolio-PMI%2FPMI2013%2FPMI%20TRD%2FPMI%20TRD%20MEX%2FREFINADOS%2FGASOLINAS%20Y%20COMPONENTES%2FAlmacenes%20y%20ductos%202013%2FAlm%202013-10%2FCiudad%20Ju%C3%A1rez%202013-10%2FCJ%20Ducto%20Plains%202013-10-5ea19e760000000000000003'

mtm.compute_trade_item-mtm.compute_portfolio-PMI%2FPMI2013%2FPMI%20TRD%2FPMI%20TRD%20MEX%2FREFINADOS%2FGASOLINAS%20Y%20COMPONENTES%2FAlmacenes%20y%20ductos%202013%2FAlm%202013-10%2FCiudad%20Ju%C3%A1rez%202013-10%2FCJ%20Ducto%20Plains%202013-10-5ea19e760000000000000003 is a key of the dask graph that I hand-crafted.

By the very nature of evict(), this issue is particularly insidious: it only triggers once data volumes reach production scale and the worker starts spilling to disk.

Workaround

Change my application to scramble any key it generates that exceeds a maximum length.
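The workaround can be sketched roughly like this. This is a hypothetical helper, not part of dask; `MAX_KEY_LEN` of 200 is an assumption chosen to leave headroom below the common 255-byte filename limit after zict's quoting expands special characters:

```python
import hashlib

# Assumed budget; 255 bytes is a common per-component filename limit,
# and zict's percent-quoting can expand the key, so stay well under it.
MAX_KEY_LEN = 200

def shorten_key(key: str, max_len: int = MAX_KEY_LEN) -> str:
    """Return key unchanged if short enough, else truncate it and
    append a hash so distinct long keys remain distinct."""
    if len(key) <= max_len:
        return key
    digest = hashlib.sha1(key.encode()).hexdigest()[:16]
    # Keep a readable prefix for debuggability; uniqueness comes
    # from the 16-hex-digit hash suffix.
    return key[: max_len - len(digest) - 1] + "-" + digest
```

Short keys pass through untouched, so the human-readable graph keys stay readable wherever the limit isn't hit.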

Expected behaviour

Ideally, there should be no limit to the length of dask keys. As a second-best option, the client should deterministically raise an exception as soon as the user tries to upload an overly long key to the scheduler.

TomAugspurger commented 4 years ago

Thanks for tracking that down. This seems a bit tricky...

Ideally, there should be no limit to the length of dask keys.

For that to be an option, zict would need to deterministically hash keys into something that works for the OS, right? That seems doable.
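A deterministic key-to-filename mapping along those lines might look like this. To be clear, this is a sketch, not zict's actual implementation, and the 143-character budget is an arbitrary assumption:

```python
import hashlib
import re

def key_to_filename(key: str, max_name_len: int = 143) -> str:
    """Map an arbitrary dask key to a filesystem-safe name (sketch;
    not the real zict code). A readable, sanitized prefix is kept for
    debuggability; uniqueness comes from the hash suffix."""
    # Replace anything outside a conservative safe character set.
    safe = re.sub(r"[^A-Za-z0-9_.-]", "_", key)
    digest = hashlib.sha256(key.encode()).hexdigest()[:32]
    prefix = safe[: max_name_len - len(digest) - 1]
    return f"{prefix}-{digest}"
```

Because the name is derived purely from the key, every worker computes the same filename, and the length bound holds no matter how long the original key is.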

As a second best option, the client should deterministically raise an Exception as soon as the user tries uploading a very long key to the scheduler.

How would we know what "too long" is? Given a static set of workers, each worker could query the OS to find the maximum filename length... but what if a new worker comes along?
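For the per-worker query part, POSIX does expose this: `os.pathconf` with `"PC_NAME_MAX"` reports the maximum filename length for a given directory. A sketch (the 255 fallback is an assumption for platforms where the call is unavailable):

```python
import os

def max_filename_length(directory: str) -> int:
    """Ask the OS for the maximum filename component length in this
    directory (NAME_MAX on POSIX). Fall back to 255 if the query is
    unsupported, e.g. on Windows where os.pathconf does not exist."""
    try:
        return os.pathconf(directory, "PC_NAME_MAX")
    except (AttributeError, ValueError, OSError):
        return 255
```

Note this is a property of the directory's filesystem, so a worker spilling to NFS can get a different answer than one spilling to local disk, which is exactly the "new worker comes along" problem.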

crusaderky commented 4 years ago

How would we know what "too long" is? Given a static set of workers, each worker could query the OS to find the maximum file length

I was just thinking about hardcoding the lowest common denominator among modern OSes/filesystems.
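Such a hardcoded client-side check might look like the following. This is hypothetical, not implemented in dask; 255 bytes is NAME_MAX on ext4, btrfs, XFS, NTFS, and APFS, and the stricter 200-byte budget is an assumption that leaves room for zict's quoting and filename prefixes:

```python
# Assumed budget: under the 255-byte NAME_MAX common to modern
# filesystems, with headroom for quoting/prefixes added on spill.
MAX_SAFE_KEY_BYTES = 200

def validate_keys(graph: dict) -> None:
    """Raise early, on the client, for keys that could later fail
    to spill to disk on a worker (hypothetical check)."""
    for key in graph:
        name = key if isinstance(key, str) else str(key)
        nbytes = len(name.encode("utf-8"))
        if nbytes > MAX_SAFE_KEY_BYTES:
            raise ValueError(
                f"Dask key too long ({nbytes} bytes): {name[:80]}..."
            )
```

This turns the non-deterministic mid-computation OSError into a deterministic failure at graph-submission time, which was the second-best option requested above.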