Open noloerino opened 2 years ago
Is it possible to upgrade your version of Dask? The most recent version of Dask is 2022.7.0.
Though it's worth noting that particular code path still exists: https://github.com/dask/distributed/blob/02f9c8f21adcda6026207942f14376ffeb0b4c62/distributed/scheduler.py#L5781-L5783
I'm not sure if I'm able to bump the Dask version without breaking anything, but it's hard to tell if doing so would fix the problem. Like you said, the code path is still around, and more importantly, the failure seems to be very transient (as far as I know it's the first time it's appeared in our CI, and rerunning the test made the error go away).
There are plans to re-implement `replicate` using AMM, since it already has a number of issues: https://github.com/dask/distributed/issues/6578. We can add this one to the list. cc @crusaderky.
I had this happen as well, but only after a very long sequence of homogeneous tasks was processed successfully with lots of worker pausing on the cluster (which aligns with @mvashishtha's theory). So I can confirm the race condition is still possible in version 2024.2.0, though I don't know how to reproduce it.
What happened: While running CI for another project, we encountered a `ValueError("Sample larger than population or is negative")` originating from this line of code in scheduler.py, which in turn is triggered by the user invoking `Client.replicate()`. The relevant CI run can be found here; the full stack trace is also pasted below.

Per @mvashishtha, a possible cause is that a Dask worker is paused between calculating `count` and computing `tuple(workers - ts._who_has)`, causing a crash when `random.sample` is called. We do not have a minimal example because we're unsure how to reliably set up this scenario.

Full stack trace:
What you expected to happen: The above crash does not occur.
Minimal Complete Verifiable Example: N/A, unsure how to set up relevant conditions
Anything else we need to know?:
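For illustration only, here is a minimal, hypothetical sketch of the suspected race in isolation (the names `workers`, `who_has`, and `count` merely stand in for the scheduler's internals; the real code is the `scheduler.py` lines linked in this thread). If the candidate pool shrinks after the sample size is computed but before the sample is drawn, `random.sample` raises exactly this `ValueError`:

```python
import random

# Hypothetical, simplified illustration of the suspected race; not the real
# scheduler code, which lives in distributed/scheduler.py.
workers = {"worker-1", "worker-2", "worker-3"}  # candidate workers
who_has = {"worker-1"}                          # workers already holding the key

# The sample size is computed from the candidate pool at one point in time...
count = len(workers - who_has)  # 2

# ...but if a worker pauses and drops out of the pool before sampling,
# the population shrinks while `count` stays the same.
workers.discard("worker-3")  # simulates a worker pausing in between

random.sample(tuple(workers - who_has), count)
# ValueError: Sample larger than population or is negative
```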
Environment: