Restarting a dynamic sidecar on an autoscaled node triggers node drain and termination

sanderegg commented 1 month ago

If the dynamic-sidecar of a service is restarted while it runs on an autoscaled node, there is a high chance that the node will be drained and terminated during the restart.

Here is why this happens:

1. A dynamic-sidecar runs with its containers on an autoscaled node.
2. The dynamic-sidecar is restarted via Portainer.
3. Autoscaling finds an empty node and sets its osparc-ready label to false.
4. The restarted sidecar is rejected.
5. Autoscaling then skips sidecars with a node.id constraint.
6. Autoscaling terminates the node.
7. The service is gone, probably with data loss.

It is probably also possible for another dynamic sidecar to steal that node in the meantime.
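The sequence above can be sketched as a small simulation. This is only an illustrative model, not the actual autoscaling code: `Node`, `autoscaling_pass`, and `try_schedule` are hypothetical names, and the `ready` flag merely stands in for the osparc-ready label.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical, heavily simplified model of an autoscaled node.
    tasks: set = field(default_factory=set)
    ready: bool = True  # stands in for the osparc-ready label
    terminated: bool = False


def autoscaling_pass(node: Node) -> None:
    """One simplified autoscaling cycle (assumed behavior)."""
    if node.terminated:
        return
    if not node.ready:
        # A node already marked not-ready is terminated.
        node.terminated = True
    elif not node.tasks:
        # An empty node is marked not-ready first.
        node.ready = False


def try_schedule(node: Node, task: str) -> bool:
    """A task is only accepted on a node that is still ready."""
    if node.ready and not node.terminated:
        node.tasks.add(task)
        return True
    return False


node = Node(tasks={"sidecar"})
node.tasks.discard("sidecar")             # restart: the old sidecar is stopped
autoscaling_pass(node)                    # autoscaling runs during the gap
accepted = try_schedule(node, "sidecar")  # the replacement is rejected
autoscaling_pass(node)                    # node is not-ready, so it is terminated
print(accepted, node.terminated)
```

In this model the outcome depends only on whether an autoscaling cycle falls between the stop and the start of the sidecar, which matches the "high chance" wording above.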

### Tasks
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/5842

sanderegg commented 1 month ago

### What happens when the sidecar is restarted

(Docker default behavior)

1. The sidecar is stopped.
2. Some time passes. If other dynamic sidecars are starting during that time, they could steal the node. If there are none and autoscaling runs during that time, it might set the node to "not-ready".
3. The replacing sidecar is started.

### Would changing the update order to `start-first` fix it?

With `start-first`, the replacing sidecar would be started before the old one is stopped, but it would not start due to missing resources. The restart would therefore fail, so this is not an option.
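The resource argument can be illustrated with a minimal sketch. The CPU numbers and the `fits` helper are assumptions made for illustration only; the real scheduler considers more than CPU reservations.

```python
def fits(node_cpus: float, reserved_cpus: list, requested_cpus: float) -> bool:
    """Return True if the request fits into what is left of the node."""
    return sum(reserved_cpus) + requested_cpus <= node_cpus


NODE_CPUS = 4.0     # assumed node size
SIDECAR_CPUS = 3.0  # assumed reservation of the sidecar and its containers

# stop-first: the old sidecar is gone before the replacement is placed.
stop_first_ok = fits(NODE_CPUS, [], SIDECAR_CPUS)

# start-first: the old sidecar still holds its reservation.
start_first_ok = fits(NODE_CPUS, [SIDECAR_CPUS], SIDECAR_CPUS)

print(stop_first_ok, start_first_ok)
```

As long as a sidecar reserves more than half of its node, the replacement cannot be placed next to the instance it is meant to replace.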

### Add some delay in autoscaling

Adding a delay before a node is set to "not-ready" would probably fix the case where the restarted sidecar is rejected. It would NOT prevent another sidecar from stealing the node.
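A minimal sketch of such a delay, assuming autoscaling can track since when a node has been empty. `GRACE_PERIOD`, `should_mark_not_ready`, and the 30-second value are hypothetical and not taken from the autoscaling service.

```python
from datetime import datetime, timedelta
from typing import Optional

GRACE_PERIOD = timedelta(seconds=30)  # assumed value, would need tuning


def should_mark_not_ready(
    empty_since: Optional[datetime],
    now: datetime,
    grace: timedelta = GRACE_PERIOD,
) -> bool:
    """Mark a node not-ready only after it has been empty for the whole grace period."""
    if empty_since is None:
        # The node currently runs tasks, so it stays ready.
        return False
    return now - empty_since >= grace


stopped_at = datetime(2024, 1, 1, 12, 0, 0)
during_restart = should_mark_not_ready(stopped_at, stopped_at + timedelta(seconds=10))
long_after = should_mark_not_ready(stopped_at, stopped_at + timedelta(seconds=60))
print(during_restart, long_after)
```

A restart that completes within the grace period would leave the node ready. The sketch does nothing about another sidecar being scheduled onto the node in the meantime.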

sanderegg commented 1 month ago

remaining issue:

https://github.com/ITISFoundation/osparc-simcore/issues/5862

ITISFoundation / osparc-simcore

Restarting a dynamic sidecar on an autoscaled node triggers node drain and termination #5840
