coiled / feedback

A place to provide Coiled feedback

bad value(s) in fds_to_keep #65

Closed: mrocklin closed this issue 1 year ago

mrocklin commented 4 years ago

We're debugging a tricky issue between Dask and Django/Daphne. I thought I'd record some of this here in case other people run into this same problem in the future.

Coiled launches a Dask cluster as part of the web backend. We use this to offload computation for things like solving conda environments. Recently we did some work to merge the Dask and Django/Daphne event loops, and now we get the following error when we start up a Dask worker/nanny process:

  File "./backends/core.py", line 52, in start
    dashboard_address=":0",
  File "/home/mrocklin/miniconda/envs/coiled-37/lib/python3.7/site-packages/distributed/deploy/spec.py", line 386, in _
    await self._correct_state()
  File "/home/mrocklin/miniconda/envs/coiled-37/lib/python3.7/site-packages/distributed/deploy/spec.py", line 355, in _correct_state_internal
    await w  # for tornado gen.coroutine support
  File "/home/mrocklin/miniconda/envs/coiled-37/lib/python3.7/site-packages/distributed/core.py", line 305, in _
    await self.start()
  File "/home/mrocklin/miniconda/envs/coiled-37/lib/python3.7/site-packages/distributed/nanny.py", line 295, in start
    response = await self.instantiate()
  File "/home/mrocklin/miniconda/envs/coiled-37/lib/python3.7/site-packages/distributed/nanny.py", line 378, in instantiate
    result = await self.process.start()
  File "/home/mrocklin/miniconda/envs/coiled-37/lib/python3.7/site-packages/distributed/nanny.py", line 549, in start
    self.init_result_q = init_q = mp_context.Queue()
  File "/home/mrocklin/miniconda/envs/coiled-37/lib/python3.7/multiprocessing/context.py", line 102, in Queue
    return Queue(maxsize, ctx=self.get_context())
  File "/home/mrocklin/miniconda/envs/coiled-37/lib/python3.7/multiprocessing/queues.py", line 42, in __init__
    self._rlock = ctx.Lock()
  File "/home/mrocklin/miniconda/envs/coiled-37/lib/python3.7/multiprocessing/context.py", line 67, in Lock
    return Lock(ctx=self.get_context())
  File "/home/mrocklin/miniconda/envs/coiled-37/lib/python3.7/multiprocessing/synchronize.py", line 162, in __init__
    SemLock.__init__(self, SEMAPHORE, 1, 1, ctx=ctx)
  File "/home/mrocklin/miniconda/envs/coiled-37/lib/python3.7/multiprocessing/synchronize.py", line 80, in __init__
    register(self._semlock.name)
  File "/home/mrocklin/miniconda/envs/coiled-37/lib/python3.7/multiprocessing/semaphore_tracker.py", line 83, in register
    self._send('REGISTER', name)
  File "/home/mrocklin/miniconda/envs/coiled-37/lib/python3.7/multiprocessing/semaphore_tracker.py", line 90, in _send
    self.ensure_running()
  File "/home/mrocklin/miniconda/envs/coiled-37/lib/python3.7/multiprocessing/semaphore_tracker.py", line 71, in ensure_running
    pid = util.spawnv_passfds(exe, args, fds_to_pass)
  File "/home/mrocklin/miniconda/envs/coiled-37/lib/python3.7/multiprocessing/util.py", line 455, in spawnv_passfds
    False, False, None)
ValueError: bad value(s) in fds_to_keep
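
For context, the startup path that triggers this looks roughly like the sketch below. This is not Coiled's actual backend code, just an illustration: a cluster is created with asynchronous=True from inside the already-running Django/Daphne event loop, and processes=True wraps each worker in a Nanny, which is what creates the multiprocessing.Queue seen in nanny.py above.

# A minimal sketch (not Coiled's actual backend code) of the startup path.
from distributed import LocalCluster

async def start_cluster():
    # processes=True wraps each worker in a Nanny; creating the Nanny's
    # multiprocessing.Queue is where the traceback above begins.
    return await LocalCluster(
        asynchronous=True,
        processes=True,
        n_workers=1,
        dashboard_address=":0",
    )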

This fails on Python 3.7 and 3.8. It fails when using the spawn and forkserver multiprocessing contexts, but doesn't seem to fail when we use fork.
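
One way to act on that observation (not something we've adopted, and forking a threaded web process carries its own risks) is to point distributed's distributed.worker.multiprocessing-method config option at fork. Since distributed.utils.mp_context is built from that setting at import time, the config has to be in place before distributed is imported:

# A minimal sketch, assuming the distributed.worker.multiprocessing-method
# config option; it must be set before distributed is imported, because
# distributed.utils.mp_context is created at import time.
import dask

dask.config.set({"distributed.worker.multiprocessing-method": "fork"})

from distributed.utils import mp_context

assert mp_context.get_start_method() == "fork"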

No one has to do anything for this. I'm just recording this because it ended up being a tricky debugging issue, and because information on this error on the web is sparse.

mrocklin commented 4 years ago

Things also seem better if we spawn a process early during setup. In particular, I'm running the following:

import time
from distributed.utils import mp_context
# Start a throwaway process early to warm up multiprocessing.
mp_context.Process(target=time.sleep, args=(0,)).start()

This runs early in our ASGI application script, right around where django.setup() is called.
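
For reference, the placement looks roughly like the sketch below; the project and settings names are placeholders, not our actual code:

# asgi.py (sketch; project and settings names are hypothetical)
import os
import time

import django
from distributed.utils import mp_context

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myproject.settings")

# Warm up multiprocessing with a throwaway process before the Dask and
# Django/Daphne event loops are merged.
mp_context.Process(target=time.sleep, args=(0,)).start()

django.setup()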

shughes-uk commented 1 year ago

Closing this to tidy things up. It should still show up if people search for the error.