dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License
1.58k stars 718 forks source link

`SystemError: unknown opcode` with `SchedulerPlugin` when client uses Python3.10 and scheduler Python3.9 #7016

Open hendrikmakait opened 2 years ago

hendrikmakait commented 2 years ago

Reproducer

  1. Start a scheduler with a Python3.9 environment
  2. Run the following code with a Python3.10 environment
from __future__ import annotations

from distributed import Client
from distributed.diagnostics import SchedulerPlugin

client = Client("<ADDRESS>")

class TestPlugin(SchedulerPlugin):
    def start(self, scheduler):
        scheduler.handlers["test"] = self.test

    async def test(self, comm):
        return None

client.register_scheduler_plugin(TestPlugin(), idempotent=True)
client.sync(client.scheduler.test)

Outcome

Client

Traceback (most recent call last):
  File "/Users/hendrikmakait/projects/dask/distributed/reproducer.py", line 18, in <module>
    client.sync(client.scheduler.test)
  File "/Users/hendrikmakait/projects/dask/distributed/distributed/utils.py", line 339, in sync
    return sync(
  File "/Users/hendrikmakait/projects/dask/distributed/distributed/utils.py", line 406, in sync
    raise exc.with_traceback(tb)
  File "/Users/hendrikmakait/projects/dask/distributed/distributed/utils.py", line 379, in f
    result = yield future
  File "/opt/homebrew/Caskroom/mambaforge/base/envs/dask-distributed-py3.10/lib/python3.10/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/Users/hendrikmakait/projects/dask/distributed/distributed/core.py", line 1154, in send_recv_from_rpc
    return await send_recv(comm=comm, op=key, **kwargs)
  File "/Users/hendrikmakait/projects/dask/distributed/distributed/core.py", line 944, in send_recv
    raise exc.with_traceback(tb)
  File "/Users/hendrikmakait/projects/dask/distributed/distributed/core.py", line 770, in _handle_comm
    result = await result
  File "/Users/hendrikmakait/projects/dask/distributed/reproducer.py", line 13, in test
    async def test(self, comm):
SystemError: unknown opcode

Scheduler logs

XXX lineno: 13, opcode: 129
2022-09-07 16:05:48,646 - distributed.core - ERROR - Exception while handling op test
Traceback (most recent call last):
  File "/Users/hendrikmakait/projects/dask/distributed/distributed/core.py", line 770, in _handle_comm
    result = await result
  File "/Users/hendrikmakait/projects/dask/distributed/reproducer.py", line 13, in test
    async def test(self, comm):
SystemError: unknown opcode

In a more complicated use case, this has led to a segfault instead of a SystemError on Coiled:

ep  6 09:40:12 ip-10-0-1-28 cloud-init[1133]: 2022-09-06 09:40:12,001 - distributed.scheduler - INFO - Clear task state
Sep  6 09:40:13 ip-10-0-1-28 kernel: [   90.688881] python[1517]: segfault at 8 ip 0000563f29032c44 sp 00007ffc41695700 error 4 in python3.9[563f28f45000+209000]
Sep  6 09:40:13 ip-10-0-1-28 kernel: [   90.688895] Code: 0f 84 90 00 00 00 48 8d 15 19 f0 23 00 48 39 d7 74 0c 48 8d 0d ed ae 24 00 48 39 cf 75 08 31 c0 c3 0f 1f 44 00 00 48 83 ec 08 <48> 8b 77 08 4c 8b 46 60 4d 85 c0 74 2f 49 8b 40 48 48 85 c0 74 26
jrbourbeau commented 2 years ago

Thanks for reporting @hendrikmakait. In general I think we want Python versions to match.

hendrikmakait commented 2 years ago

@jrbourbeau: I'd agree with that and I think it's a larger discussion to have wrt. to dealing with mismatches. See #7017. My main problem with this bug is that according to my understanding, #4011 should enable us to be use mismatching versions and this particular mismatch does not even cause a warning (fix in #7018).