If the dynamic-sidecar of a service is restarted while running on an autoscaled node, there is a high chance that the node will be drained and terminated during the restart.
Here is why this happens:
the dynamic-sidecar is running with its containers on an autoscaled node
the dynamic-sidecar is restarted via Portainer
autoscaling finds an empty node and sets the osparc-ready label to false
the restarted sidecar is rejected, since the node is no longer marked as ready
autoscaling then skips sidecars with a node.id constraint
autoscaling then terminates the node
the service is gone, with probable data loss
It is probably also possible that another dynamic sidecar would steal that node.
some time passes <-- if other dynamic sidecars are starting, they could steal the node. If there are none and autoscaling runs during that time, it might set the node to "not-ready"
the replacing sidecar is started
Would changing the start phase to start-first fix it?
the replacing sidecar is started before the old one is stopped (but would not start due to missing resources)
this would fail the restart due to missing resources, so it is not an option
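The resource argument against start-first can be illustrated with a minimal sketch. It assumes that the running sidecar reserves the whole node, which is what "missing resources" suggests; the `Node` class and the CPU numbers are hypothetical and not taken from the autoscaling code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical view of a node: total CPUs and CPUs already reserved."""

    total_cpus: float
    reserved_cpus: float

    def can_fit(self, requested_cpus: float) -> bool:
        # A new task fits only if its reservation does not exceed what is left.
        return self.reserved_cpus + requested_cpus <= self.total_cpus


# Assumed numbers: a 4-CPU node fully reserved by the running sidecar.
sidecar_cpus = 4.0
node = Node(total_cpus=4.0, reserved_cpus=sidecar_cpus)

# start-first: the old sidecar still holds its reservation when the
# replacement is scheduled, so the replacement does not fit.
start_first_ok = node.can_fit(sidecar_cpus)

# stop-first: the old reservation is released before the replacement is
# scheduled, so it fits (but the node looks empty in between).
node_after_stop = Node(total_cpus=4.0, reserved_cpus=node.reserved_cpus - sidecar_cpus)
stop_first_ok = node_after_stop.can_fit(sidecar_cpus)

print(f"start-first fits: {start_first_ok}, stop-first fits: {stop_first_ok}")
```

With stop-first the replacement fits, but only after a window in which the node appears empty, which is the window autoscaling can hit.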
Add some delay in autoscaling?
add a delay before a node is set to "not-ready"
--> would probably fix the issue where the sidecar is rejected
--> will NOT fix the possibility for another sidecar to steal the node
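A rough sketch of what such a delay could look like, assuming autoscaling periodically inspects each node and knows how many tasks run on it. `EmptyNodeTracker`, `grace_period_s` and `should_mark_not_ready` are hypothetical names for illustration, not the actual autoscaling API.

```python
from dataclasses import dataclass, field


@dataclass
class EmptyNodeTracker:
    """Remembers when each node was first seen empty (hypothetical helper)."""

    grace_period_s: float
    _empty_since: dict[str, float] = field(default_factory=dict)

    def should_mark_not_ready(self, node_id: str, running_tasks: int, now: float) -> bool:
        if running_tasks > 0:
            # The node is in use again: forget any earlier "empty" observation.
            self._empty_since.pop(node_id, None)
            return False
        # Record the first time the node was seen empty, then wait for the
        # grace period to elapse before allowing the "not-ready" label.
        first_seen = self._empty_since.setdefault(node_id, now)
        return now - first_seen >= self.grace_period_s


# Example timeline in seconds (assumed values): the sidecar restarts within
# the grace period, so the node is never marked "not-ready" during the restart.
tracker = EmptyNodeTracker(grace_period_s=60.0)
decisions = [
    tracker.should_mark_not_ready("node-1", running_tasks=1, now=0.0),    # sidecar running
    tracker.should_mark_not_ready("node-1", running_tasks=0, now=10.0),   # restart: node looks empty
    tracker.should_mark_not_ready("node-1", running_tasks=1, now=40.0),   # replacement started
    tracker.should_mark_not_ready("node-1", running_tasks=0, now=50.0),   # empty again, timer restarts
    tracker.should_mark_not_ready("node-1", running_tasks=0, now=110.0),  # empty for 60 s
]
print(decisions)
```

The grace period only covers the first failure mode. A different sidecar can still be scheduled on the node while it looks empty, so the stealing case remains open.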