Open fjetter opened 2 years ago
Strongly related:
We could scale test_RetireWorker_stress up to production size:
https://github.com/dask/distributed/blob/f7f650154fea29978906c65dd0225415da56ed11/distributed/tests/test_active_memory_manager.py#L1079-L1085
https://github.com/dask/distributed/blob/f7f650154fea29978906c65dd0225415da56ed11/distributed/tests/test_active_memory_manager.py#L1133-L1175
The integration test must replicate both use cases of the unit test above, with and without ReduceReplicas running alongside RetireWorker, as the two policies heavily interact with each other.
This story is done when the integration test portrays the behaviour of distributed on Coiled as described above. If it demonstrates a flaw, remediation of the flaw is out of scope.
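For illustration, the two variants (with and without ReduceReplicas) could be expressed as two sets of scheduler config overrides. This is a rough sketch, not the actual test setup: the helper name `amm_config` and the `"2s"` interval are assumptions, and how the config gets shipped to a Coiled scheduler is not shown. As far as I understand, RetireWorker does not need to be listed here, since the scheduler installs it on its own when workers are retired.

```python
# Sketch: scheduler config for the two variants of the integration test.
# The helper name and the interval value are illustrative assumptions.
REDUCE_REPLICAS = "distributed.active_memory_manager.ReduceReplicas"


def amm_config(use_reduce_replicas: bool) -> dict:
    """Return Active Memory Manager config overrides for one test variant."""
    # With ReduceReplicas the AMM also drops superfluous replicas while
    # workers are being retired; without it the policy list stays empty.
    policies = [{"class": REDUCE_REPLICAS}] if use_reduce_replicas else []
    return {
        "distributed.scheduler.active-memory-manager.start": True,
        "distributed.scheduler.active-memory-manager.interval": "2s",
        "distributed.scheduler.active-memory-manager.policies": policies,
    }


# One config per variant, keyed by whether ReduceReplicas is enabled.
variants = {flag: amm_config(flag) for flag in (True, False)}
```

Each dict could then be passed to `dask.config.set(...)` (or the cluster manager's equivalent) before the scheduler starts.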
Graceful worker restart
Can we run a workload where workers are calling close_gracefully every minute? Can we successfully shift data from the dying to the living?
Let's try transforming a sizeable Parquet dataset while also setting the worker lifetime to something like one minute, with restarts enabled.
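A minimal sketch of what the lifetime setting above might look like, assuming the standard `distributed.worker.lifetime.*` config keys and the matching `dask worker` flags. The helper names and the ten-second stagger are assumptions for illustration; the stagger is only there so that workers do not all shut down at the same moment.

```python
# Sketch only: the concrete values (one minute, ten seconds of stagger)
# and the helper names are assumptions, not part of the original proposal.
def lifetime_config(duration: str = "1 minute",
                    stagger: str = "10 seconds",
                    restart: bool = True) -> dict:
    """Config overrides that make each worker close gracefully periodically."""
    return {
        "distributed.worker.lifetime.duration": duration,
        "distributed.worker.lifetime.stagger": stagger,
        "distributed.worker.lifetime.restart": restart,
    }


def lifetime_cli_args(duration: str = "1m",
                      stagger: str = "10s",
                      restart: bool = True) -> list:
    """The same settings expressed as `dask worker` command line flags."""
    args = ["--lifetime", duration, "--lifetime-stagger", stagger]
    if restart:
        # Ask for the worker to be restarted after its lifetime expires.
        args.append("--lifetime-restart")
    return args


config = lifetime_config()
cli_args = lifetime_cli_args()
```

The config dict would need to be in effect on the workers themselves, for example through environment variables or the cluster's worker options, which depends on the deployment.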