coiled / feedback

A place to provide Coiled feedback

Occasional deadlock during cluster creation with package_sync #213

Closed fjetter closed 1 year ago

fjetter commented 2 years ago

Fairly frequently I'm running into a weird deadlock during cluster creation when using package_sync.

I'm typically building distributed from a local folder and see the following output

Processing /Users/fjetter/workspace/distributed-main
Processing /Users/fjetter/workspace/dask
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Building wheels for collected packages: dask
  Building wheel for dask (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Building wheels for collected packages: distributed
  Building wheel for distributed (setup.py): started
  Building wheel for dask (setup.py): finished with status 'done'
  Created wheel for dask: filename=dask-2022.9.0+14.gb817dfb62-py3-none-any.whl size=1119872 sha256=0a5ae78a473299548b279e486a8792d1cf4440925b4034d5251f5e5a484dd3e2
  Stored in directory: /private/var/folders/h0/kd1gdptx7gzb6cbywplfrx1r0000gn/T/pip-ephem-wheel-cache-9whqfzg8/wheels/73/73/93/b55f69f2b4685dacb3af66c33555b6efa9bb04141b54cefc49
Successfully built dask
  Building wheel for distributed (setup.py): finished with status 'done'
  Created wheel for distributed: filename=distributed-2022.10.2+27.g61cb71965-py3-none-any.whl size=1732618 sha256=6a0264f959385a0858dfdf4e10858d005ca23dfa0acebd69dd20ab419c462cb3
  Stored in directory: /Users/fjetter/Library/Caches/pip/wheels/09/33/4f/4cfe956e813b7434fdfa0eb4fe25eb4e2189493ca03cfdd406
Successfully built distributed

Afterwards, nothing happens.

If successful, I see the following lines afterwards

Dropped Package - dask, Wheel built from /Users/fjetter/workspace/dask
Dropped Package - distributed, Wheel built from /Users/fjetter/workspace/distributed-main
╭───────────────────────────────────────── Package Issues ─────────────────────────────────────────╮
│                 ╷                                                                 ╷              │
│   Package       │ Issue                                                           │ Risk Level   │
│ ╶───────────────┼─────────────────────────────────────────────────────────────────┼────────────╴ │
│   distributed   │ Wheel built from /Users/fjetter/workspace/distributed-main      │ Critical     │
│   dask          │ Wheel built from /Users/fjetter/workspace/dask                  │ Critical     │
│   libgfortran5  │ 11.0.1.dev0 has no install candidate for linux-64               │              │
│   libgfortran   │ 5.0.0.dev0 has no install candidate for linux-64                │              │
│   grpc-cpp      │ 1.45.2 has no install candidate for linux-64                    │              │
│   arrow-cpp     │ 8.0.1 has no install candidate for linux-64                     │              │
│   openssl       │ Package ignored                                                 │              │
│   abseil-cpp    │ Package ignored                                                 │              │

To resolve this I need to retry, sometimes a couple of times. The issue does not show up in our internal monitoring (I don't see anything in live-errors).

shughes-uk commented 2 years ago

I've seen this sometimes but have not come up with a consistent reproducer. I suspect it's related to asyncio.create_subprocess_shell.

fjetter commented 2 years ago

If there are any debug logs or something like that I can enable to help us debug this further, let me know. I run into this fairly frequently.

shughes-uk commented 2 years ago

@fjetter yeah, just enable info logging for the coiled logger or pass configure_logging=True to Cluster. Might help a little.
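
For anyone following along, the first option could look roughly like the sketch below. It uses only the standard library and assumes the logger is named "coiled", as the comment above implies; the handler and format string are illustrative choices, not something Coiled requires.

```python
import logging

# Attach a basic stderr handler to the root logger (root level stays at WARNING).
logging.basicConfig(format="%(asctime)s %(name)s %(levelname)s %(message)s")

# Assumption: the client logs under the "coiled" logger name, per the comment above.
coiled_logger = logging.getLogger("coiled")
coiled_logger.setLevel(logging.INFO)

# Alternative mentioned above: pass configure_logging=True when creating the Cluster.
```

Setting the level on the named logger rather than on the root keeps other libraries at their default verbosity.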

shughes-uk commented 1 year ago

Switched to a threadpool for this and haven't seen anything locally since. I'm assuming we're good.
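
The thread does not show the actual change, so the following is only a minimal sketch of the general pattern: instead of asyncio.create_subprocess_shell, each command runs through the blocking subprocess.run in a worker thread and is awaited from async code. The helper names (run_blocking, run_in_threadpool, build_all) and the worker count are hypothetical and are not taken from the Coiled client.

```python
import asyncio
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def run_blocking(cmd):
    """Run one command to completion in the calling thread, capturing its output."""
    return subprocess.run(cmd, capture_output=True, text=True, check=False)


async def run_in_threadpool(cmd, executor):
    """Await a blocking subprocess call by handing it to a worker thread."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, partial(run_blocking, cmd))


async def build_all(commands):
    """Run several commands concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=4) as executor:
        tasks = [run_in_threadpool(cmd, executor) for cmd in commands]
        return await asyncio.gather(*tasks)
```

A usage example under the same assumptions, with two trivial Python commands standing in for the wheel builds:

```python
commands = [
    [sys.executable, "-c", "print('a')"],
    [sys.executable, "-c", "print('b')"],
]
results = asyncio.run(build_all(commands))
```

Each child process is then managed by an ordinary blocking call in its own thread, so the event loop only waits on a future and does not depend on asyncio's own subprocess handling.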