As shown in the reference below, the node constraint on the dynamic-sidecar is placed too soon, when the dynamic-sidecar has not even started.
@GitHK so after some bug-tracking analysis I found the following issues; we can discuss them if you wish.
To follow up on @matusdrobuliak66's findings, I will add a few things here to summarize what is going on:
1. The user starts a service.
2. The dynamic scheduler starts a dynamic sidecar. Since no healthcheck is defined on the dynamic sidecar, docker swarm instantly reports it as "running". That is expected swarm behavior, but it is misleading, and this is where the bug comes from.
3. Since the dynamic sidecar is deemed running, the director-v2 sets the placement constraint on the sidecar.
4. After a while the dynamic sidecar fails for whatever reason (in this case while accessing RabbitMQ/storage). I agree that this is ok and would work if the machine were fixed.
5. The autoscaling runs roughly every 5 seconds, so depending on the user's luck the sidecar might restart during that window. The autoscaling looks for tasks on the nodes it created; if it does not find any, it sets the node to "not ready" by changing a label on the node (this change was made after thorough analysis by @YuryHrytsuk).
6. So, assuming the dynamic sidecar service's task did not restart by the time the autoscaling monitors the active nodes, the node is set to "not ready", and when the next dynamic sidecar task starts there, it is "rejected".
7. The dynamic sidecar service then reverts to "pending", BUT with a node.id placement constraint.
8. The autoscaling will see that service again but will skip it because it has the placement constraint (this is by design, since services that explicitly ask for a specific node can never be helped by the autoscaling). Therefore the dynamic sidecar will never start and remains pending forever.
9. The autoscaling will terminate the machine after 3 minutes unless some other service is freshly started.
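To make the race in steps 2-3 concrete, here is a minimal Python sketch of the ordering fix implied above: pin the node.id placement constraint only once the sidecar's API actually responds, instead of when swarm reports the task as "running". All names (`wait_for_sidecar_api`, `place_constraint_when_ready`, the probe/setter callables) are hypothetical, not the actual director-v2 code.

```python
import time
from typing import Callable


def wait_for_sidecar_api(
    probe: Callable[[], bool], timeout: float, interval: float
) -> bool:
    """Poll `probe` until it succeeds or `timeout` seconds elapse.

    `probe` stands in for an HTTP call to the sidecar's health endpoint.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False


def place_constraint_when_ready(
    probe: Callable[[], bool],
    set_constraint: Callable[[], None],
    timeout: float = 60.0,
    interval: float = 1.0,
) -> bool:
    # Only pin the service to a node.id once the sidecar is demonstrably up;
    # otherwise leave it free so swarm/autoscaling can still reschedule it.
    if wait_for_sidecar_api(probe, timeout, interval):
        set_constraint()
        return True
    return False
```

The point of the design is that a service that was never proven healthy is never pinned, so the "pending forever with a node.id constraint" state from step 7 above cannot be reached.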
# disabled healthcheck as director-v2 is already taking care of it;
# in order to have similar performance, a more aggressive healthcheck
# would be required.
# Removing the healthcheck does not cause any issues at this point.
# NOTE: when adding a healthcheck:
# - remove UpdateHealth - no longer required
# - remove WaitForSidecarAPI - no longer required
# - after `get_dynamic_sidecar_placement` inside the CreateSidecars call
#   (the sidecar's API will be up and running; guaranteed by the docker engine healthcheck),
#   add the following line: `scheduler_data.dynamic_sidecar.is_ready = True`
#   The healthcheck guarantees that the API is available
HEALTHCHECK NONE
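For reference, the "more aggressive healthcheck" the comment alludes to might look like the sketch below. The endpoint path and port are assumptions for illustration, not taken from the actual sidecar:

```dockerfile
# Hypothetical aggressive healthcheck: swarm only reports the task as
# "running" once the API answers, closing the race described above.
# The /health endpoint and port 8000 are assumed, not from the real image.
HEALTHCHECK --interval=2s --timeout=1s --start-period=5s --retries=3 \
    CMD curl --fail http://localhost:8000/health || exit 1
```

With such a check in place, swarm itself would gate the "running" state on API availability, which is why `UpdateHealth` and `WaitForSidecarAPI` would become redundant.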
As shown in the reference below, the node constraint on the dynamic-sidecar is placed too soon, when the dynamic-sidecar has not even started.
Originally posted by @sanderegg in #5826