Open hendrikmakait opened 1 year ago
The negative occupancy is caused by the barrier changing its size, which makes us subtract more from network_occ than we originally added:
distributed.scheduler - ERROR - ('shuffle-barrier-98dccb6094b5f2a50718c2178e4fe090', 24, 28)
Traceback (most recent call last):
  File "/opt/coiled/env/lib/python3.10/site-packages/distributed/scheduler.py", line 821, in add_replica
    assert nbytes == self.needs_what[ts][1], (
AssertionError: ('shuffle-barrier-98dccb6094b5f2a50718c2178e4fe090', 24, 28)
where 24 is the original size and 28 is the size we try to subtract.
Also TIL: sizeof(0) == 24 and sizeof(28) == 28
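To make the imbalance concrete, here is a minimal toy sketch of the accounting problem described above. It is not the actual distributed code: `ToyWorkerState`, `add_dependency`, `remove_dependency_buggy`, and `remove_dependency_safe` are hypothetical names, and `needs_what` is simplified to map a key directly to the recorded size. The sizes 24 and 28 are taken from the assertion message.

```python
class ToyWorkerState:
    """Hypothetical stand-in for per-worker bookkeeping (not distributed's class)."""

    def __init__(self):
        self.network_occ = 0
        # key -> nbytes recorded at the time the dependency was added
        self.needs_what = {}

    def add_dependency(self, key, nbytes):
        self.needs_what[key] = nbytes
        self.network_occ += nbytes

    def remove_dependency_buggy(self, key, current_nbytes):
        # Subtracts the size the task reports *now*, which may differ
        # from the size that was added earlier.
        del self.needs_what[key]
        self.network_occ -= current_nbytes

    def remove_dependency_safe(self, key):
        # Subtracts exactly what was recorded when the dependency was added.
        self.network_occ -= self.needs_what.pop(key)


# The barrier was added with size 24 but reports size 28 on removal.
ws = ToyWorkerState()
ws.add_dependency("shuffle-barrier", 24)
ws.remove_dependency_buggy("shuffle-barrier", 28)
buggy_occ = ws.network_occ

# Subtracting the recorded size keeps the counter balanced in this toy model.
ws2 = ToyWorkerState()
ws2.add_dependency("shuffle-barrier", 24)
ws2.remove_dependency_safe("shuffle-barrier")
safe_occ = ws2.network_occ

print(buggy_occ, safe_occ)
```

In this toy model the buggy path leaves the counter at -4, while the safe path returns it to 0. Whether subtracting the recorded size is the right fix for the real scheduler is a separate question.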
Hi, I keep getting a similar negative occupancy issue
  File "<...>/site-packages/distributed/scheduler.py", line 1818, in _calc_occupancy
    assert occ >= 0, occ
AssertionError: -7191.262672288433
Has this been fixed at some point? I currently have version 2023.5.1
@FlorianBury, this has not been fixed, but it has been circumvented for the original problem. Would you have a minimal reproducer for your problem that can help us investigate?
For some reason, WorkerState.network_occ can drop below zero and corrupt the state machine.
Reproducer:
Logs:
because we do not properly release the task, this in turn seems to cause
and on the worker:
XREF: #7538