coiled / feedback

A place to provide Coiled feedback

Losing a spot instance fails poorly, sometimes leaving the cluster hanging #192

Closed · phobson closed this issue 1 year ago

phobson commented 2 years ago

It'd be nice if Coiled (or Dask?) had a Worker Plugin that listens for the signal from AWS that the instance is being reclaimed, allowing the worker to fail gracefully (rough sketch at the end of this comment).

Spark clusters are (reportedly) pretty good about this and self-heal after a spot instance fails.

This came up during a call with a customer.
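
As a rough illustration of the kind of plugin meant above, here's a minimal sketch (not something Coiled or Dask ships) of a dask `WorkerPlugin` that polls the EC2 instance-metadata endpoint for a spot interruption notice and retires the worker when one appears. The endpoint URL and the roughly two-minute warning are AWS behavior; the plugin name, the polling interval, and the use of `Worker.close_gracefully` are just one way to wire it up.

```python
import asyncio
import urllib.request

from distributed import WorkerPlugin

# Instance-metadata endpoint that starts returning 200 once AWS has scheduled
# this spot instance for interruption (roughly a two-minute warning).
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def _interruption_pending() -> bool:
    """Return True if a spot interruption notice has been issued."""
    try:
        with urllib.request.urlopen(SPOT_ACTION_URL, timeout=1):
            return True
    except OSError:
        # 404 (no notice yet), timeouts, and connection errors all land here.
        return False


class SpotInterruptionPlugin(WorkerPlugin):
    name = "spot-interruption"

    async def setup(self, worker):
        self.worker = worker
        self._task = asyncio.create_task(self._watch())

    def teardown(self, worker):
        self._task.cancel()

    async def _watch(self):
        # Poll every 5 seconds (interval chosen arbitrarily for the sketch).
        while True:
            if await asyncio.to_thread(_interruption_pending):
                # Ask the scheduler to move this worker's data and tasks to
                # peers, then shut down before AWS reclaims the instance.
                await self.worker.close_gracefully()
                return
            await asyncio.sleep(5)
```

Something like this would be registered from the client, e.g. `client.register_worker_plugin(SpotInterruptionPlugin())`.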

ntabris commented 2 years ago

I'm curious: does a SIGTERM to the nanny/worker trigger a graceful shutdown? Or does one have to explicitly call scheduler.retire_workers() for graceful shutdown (i.e., shuffling completed work off the worker before it disappears)?
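
For reference, the explicit path is just a client call. A hedged sketch, assuming a reachable scheduler (both addresses below are placeholders):

```python
from distributed import Client

client = Client("tcp://scheduler:8786")  # placeholder scheduler address

# Gracefully retire one worker: the scheduler copies its in-memory results to
# other workers before the worker shuts down.
client.retire_workers(workers=["tcp://10.0.0.12:34567"], close_workers=True)
```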

fjetter commented 2 years ago

We only listen for signals if the CLI is used.

fjetter commented 2 years ago

If the cluster is "hanging", that is something we need to investigate; this should never happen, regardless of failing workers. @phobson, can you point us to something more concrete?

ntabris commented 2 years ago

> If the cluster is "hanging", that is something we need to investigate; this should never happen, regardless of failing workers.

The customer hasn't tried spot instances for many months, so we encouraged them to try spot again; if it is still happening, they'll let us know. (We were hoping that whatever the issue was had already been fixed.)

phobson commented 2 years ago

@fjetter exactly what Nat said. It's been a long while since they've used spot instances that triggered this behavior, so hopefully this has been resolved by the latest stability improvements in distributed.

In any case, workers can die. What's everyone's feeling about a Spark-like self-healing cluster? I was under the impression that Coiled clusters would try to restart a worker if it failed, but perhaps that's different when using spot instances.

ntabris commented 2 years ago

> What's everyone's feeling about a Spark-like self-healing cluster? I was under the impression that Coiled clusters would try to restart a worker if it failed, but perhaps that's different when using spot instances.

Coiled per se doesn't currently do anything related to self-healing.

The nanny will (maybe) restart the worker if, e.g., the worker hits its memory limit (I know this because it works better now that the memory limit is set; see the config sketch at the end of this comment).

We're thinking of restarting (maybe just once) the container running Dask if it exits with an error code. We have to be careful here and not just keep restarting, which is why we're thinking of a single restart. (That's been on my TODO list; I should probably just go do it shortly.)

As for restarting the whole instance, again we want to be careful when we don't know why the instance died, but I think this probably makes sense for spot instance interruptions.
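
For context on the nanny-restart behavior mentioned above, a small sketch of the standard dask/distributed knobs involved; the values are illustrative, not Coiled's defaults:

```python
import dask
from distributed import LocalCluster

# Kill (and let the nanny restart) a worker once it passes 95% of its memory limit.
dask.config.set({"distributed.worker.memory.terminate": 0.95})

# Each worker gets an explicit 4 GiB limit, so the terminate fraction has a
# concrete number to work against.
cluster = LocalCluster(n_workers=2, memory_limit="4GiB")
```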

fjetter commented 2 years ago

> Coiled per se doesn't currently do anything related to self-healing.

This has been a recurring problem. Is there an issue to track this? I have thoughts, and a strong opinion that Coiled should restart workers all day long; Dask already has a mechanism to fail a computation if its workers have died too often. I think we should have this conversation on a dedicated ticket to settle the question.
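
The mechanism referred to here is the scheduler's allowed-failures counter: a task is marked as errored once the workers holding it have died too many times. A minimal sketch of raising it (the config key is the standard distributed one; the default is 3, the value below is illustrative):

```python
import dask

# Tolerate up to 10 worker deaths per task before giving up on the computation.
dask.config.set({"distributed.scheduler.allowed-failures": 10})
```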

ntabris commented 2 years ago

Specifically for self-healing spot clusters, I've created https://github.com/coiled/platform/issues/47

Mostly this falls under "we think this is a good idea but haven't gotten around to it yet".

shughes-uk commented 1 year ago

We restore the spot instance now. @ntabris, I can't remember: did we get to graceful worker shutdown? Pretty sure we did.

dchudz commented 1 year ago

> I can't remember: did we get to graceful worker shutdown? Pretty sure we did.

Yes, we did.