TBD54566975 / ftl

FTL - Towards a 𝝺-calculus for large-scale systems
https://tbd54566975.github.io/ftl/
Apache License 2.0

Handle failed deployments #3072

Open stuartwdouglas opened 1 month ago

stuartwdouglas commented 1 month ago

At present, if you deploy something that ends up in CrashLoopBackOff, FTL will wait forever. We need to be able to handle failed deployments without hanging.

alecthomas commented 1 month ago

IIRC this used to work prior to the change to a pull model. As part of the runner state machine, there were timeouts for readiness after which a runner was rejected by the controller and a new runner scheduled.

stuartwdouglas commented 1 month ago

AFAIK the runners do time out and restart; the issue is that if the new runner fails as well, the 'deploy' operation just hangs. At some point the controller needs to decide that the deployment just isn't working and abort, keeping the old deployment if it exists.

alecthomas commented 1 month ago

Ah, I see what you're saying 👍