ITISFoundation / osparc-simcore

🐼 osparc-simcore simulation framework
https://osparc.io
MIT License

Restarting a dynamic sidecar on an autoscaled node triggers node drain and termination #5840

Closed sanderegg closed 1 month ago

sanderegg commented 1 month ago

If the dynamic-sidecar of a service is restarted while it is running on an autoscaled node, there is a high chance that the node will be drained and terminated during the restart.

Why this happens:

  1. a dynamic-sidecar is running with its containers on an autoscaled node
  2. the dynamic-sidecar is restarted via Portainer
  3. autoscaling finds an empty node and sets its osparc-ready label to false
  4. the restarted sidecar is rejected
  5. autoscaling skips sidecars with a node.id constraint
  6. autoscaling then terminates the node
  7. the service is gone, probably with data loss

Another dynamic sidecar could probably also steal that node in the meantime.
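To make the sequence above easier to follow, here is a minimal toy model of it in Python. It is not the actual autoscaling code: `Node`, `autoscaling_pass`, `try_schedule` and `simulate_restart` are hypothetical names, and the model assumes that the restarted sidecar is rejected because it can only be placed on a node whose osparc-ready label is true.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    # Toy stand-in for an autoscaled node (hypothetical, not the real model).
    labels: Dict[str, str] = field(default_factory=lambda: {"osparc-ready": "true"})
    tasks: List[str] = field(default_factory=list)
    terminated: bool = False


def autoscaling_pass(node: Node) -> str:
    """One simplified autoscaling run over a single node."""
    if node.terminated:
        return "terminated"
    if node.tasks:
        return "in-use"
    if node.labels["osparc-ready"] == "true":
        # Empty node found: mark it as not ready.
        node.labels["osparc-ready"] = "false"
        return "marked-not-ready"
    # Still empty and already not ready: terminate it.
    node.terminated = True
    return "terminated"


def try_schedule(node: Node, task: str) -> bool:
    """Assumption: a task is only accepted on a node labelled as ready."""
    if node.terminated or node.labels["osparc-ready"] != "true":
        return False
    node.tasks.append(task)
    return True


def simulate_restart() -> List[str]:
    node = Node(tasks=["dynamic-sidecar"])
    events = [autoscaling_pass(node)]      # sidecar running: node is in use
    node.tasks.clear()                     # restart: the old sidecar is stopped
    events.append(autoscaling_pass(node))  # autoscaling runs inside the gap
    accepted = try_schedule(node, "dynamic-sidecar")
    events.append("scheduled" if accepted else "rejected")
    events.append(autoscaling_pass(node))  # the pending sidecar is not counted
    return events


print(simulate_restart())
```

In this model the node is lost as soon as a single autoscaling run lands between the stop and the start of the sidecar.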

### Tasks
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/5842

sanderegg commented 1 month ago

What happens when the sidecar is restarted:

(docker default behavior)

  1. the sidecar is stopped
  2. some time passes <-- if other dynamic sidecars are starting, they could steal the node; if there are none and autoscaling runs during that time, it might set the node to "not-ready"
  3. the replacing sidecar is started

Would changing the start phase to start-first fix it?

  1. the replacing sidecar is started first, but it cannot start due to missing resources
  2. the restart therefore fails, so this is not an option

Another option: add some delay in autoscaling.
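A delay like this could be sketched as a grace period: autoscaling only marks a node as not ready once it has been seen empty for some minimum time. The sketch below is only an illustration under that assumption; `EmptyNodeTracker`, `should_mark_not_ready` and the 60 second default are hypothetical and not taken from the autoscaling service.

```python
from typing import Dict

GRACE_PERIOD_S = 60.0  # hypothetical value, would need tuning


class EmptyNodeTracker:
    """Remembers when each node was first seen empty (hypothetical helper)."""

    def __init__(self, grace_period_s: float = GRACE_PERIOD_S) -> None:
        self.grace_period_s = grace_period_s
        self._empty_since: Dict[str, float] = {}

    def should_mark_not_ready(self, node_id: str, is_empty: bool, now: float) -> bool:
        if not is_empty:
            # The node is in use again: forget any earlier empty observation.
            self._empty_since.pop(node_id, None)
            return False
        # Record the first time the node was seen empty, keep it on later calls.
        first_seen = self._empty_since.setdefault(node_id, now)
        return now - first_seen >= self.grace_period_s


tracker = EmptyNodeTracker(grace_period_s=60.0)
# The sidecar is stopped at t=0 and its replacement is running again at t=40.
print(tracker.should_mark_not_ready("node-1", is_empty=True, now=0.0))
print(tracker.should_mark_not_ready("node-1", is_empty=True, now=30.0))
print(tracker.should_mark_not_ready("node-1", is_empty=False, now=40.0))
```

This only covers the autoscaling side of the gap. It would not prevent another dynamic sidecar from taking the node while the restarted one is down.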

sanderegg commented 1 month ago

remaining issue: