fjetter commented 2 years ago

Graceful worker restart

Can we run a workload in which workers call close_gracefully every minute? Can we successfully move data from the dying workers to the surviving ones?

Let's try transforming a sizeable parquet dataset while setting the worker lifetime to something like one minute, with restarts enabled.
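
A minimal sketch of the config overrides this would need, assuming the standard `distributed.worker.lifetime.*` settings are used. The helper name and the stagger value are hypothetical, not something agreed in this issue:

```python
# Hypothetical helper: build Dask config overrides for short-lived workers.
# The keys are the distributed.worker.lifetime.* settings; the default values
# below (1 minute lifetime, 10 seconds stagger) are assumptions for this test.
def short_lifetime_config(duration="1 minute", stagger="10 seconds"):
    return {
        # Close the worker gracefully after roughly this long
        "distributed.worker.lifetime.duration": duration,
        # Random jitter so that workers do not all retire at the same moment
        "distributed.worker.lifetime.stagger": stagger,
        # Ask for the worker to be restarted after it closes
        "distributed.worker.lifetime.restart": True,
    }


overrides = short_lifetime_config()
```

The dict could be passed to `dask.config.set(overrides)` in the worker's environment before the worker starts. The `dask worker` CLI exposes the same settings as `--lifetime`, `--lifetime-stagger` and `--lifetime-restart`. How best to ship this to workers on Coiled is left open here.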

crusaderky commented 2 years ago

Strongly related: #140

We could scale test_RetireWorker_stress up to production size:

https://github.com/dask/distributed/blob/f7f650154fea29978906c65dd0225415da56ed11/distributed/tests/test_active_memory_manager.py#L1079-L1085

https://github.com/dask/distributed/blob/f7f650154fea29978906c65dd0225415da56ed11/distributed/tests/test_active_memory_manager.py#L1133-L1175

The integration test must replicate both use cases of the unit test above, with and without ReduceReplicas running alongside RetireWorker, as the two policies heavily interact with each other.
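
A rough sketch of how the two variants could be expressed as scheduler config, assuming the `distributed.scheduler.active-memory-manager.*` keys. The helper name and the interval value are hypothetical. RetireWorker is not listed as a policy because, as far as I understand, the scheduler registers it on its own when a worker is retired:

```python
# Dotted path of the ReduceReplicas policy class in distributed
REDUCE_REPLICAS = "distributed.active_memory_manager.ReduceReplicas"


# Hypothetical helper: scheduler config for one variant of the test.
def amm_config(use_reduce_replicas):
    # Only ReduceReplicas is toggled; an empty list means no permanent policies
    policies = [{"class": REDUCE_REPLICAS}] if use_reduce_replicas else []
    return {
        "distributed.scheduler.active-memory-manager.start": True,
        # Assumed interval; tune to the size of the workload
        "distributed.scheduler.active-memory-manager.interval": "2 seconds",
        "distributed.scheduler.active-memory-manager.policies": policies,
    }


# One config per use case: without and with ReduceReplicas
variants = {flag: amm_config(flag) for flag in (False, True)}
```

Each entry of `variants` would then drive one parametrized run of the integration test.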

crusaderky commented 2 years ago

DOD / AC

This story is done when the integration test reproduces the behaviour of distributed on Coiled as described above. If it reveals a flaw, remediating that flaw is out of scope.

coiled / benchmarks

Integration tests: Graceful worker restart #135