ITISFoundation / osparc-simcore

🐼 osparc-simcore simulation framework
https://osparc.io
MIT License

Dynamic sidecar shall have a docker healthcheck so that placement constraint is placed only if healthy #5837

Closed: sanderegg closed this issue 4 months ago

sanderegg commented 4 months ago

As shown in the reference below, the node constraint on the dynamic-sidecar is placed too soon, while the dynamic-sidecar has not even started.

@GitHK so after some bug-tracking analysis I found the following issues, and we can discuss them if you wish. To follow up on @matusdrobuliak66's findings, here is a summary of what is going on.

  1. the user starts a service
  2. the dynamic scheduler starts a dynamic sidecar; since no healthcheck is defined on the dynamic sidecar, docker swarm instantly reports it as "running". That is expected swarm behaviour, but it is wrong for us, and this is where the bug comes from.
  3. since the dynamic sidecar is deemed running, the director-v2 sets the placement constraint on the sidecar
  4. after a while the dynamic sidecar fails for whatever reason (in this case accessing RabbitMQ/storage). I agree that this is ok and would work if the machine were fixed
  5. the autoscaling runs every 5 seconds or so. Depending on the user's luck, the sidecar might restart during that window. The autoscaling looks for tasks on the nodes it created; if it finds none, it sets the node to "not ready" by changing a label on the node (this change was made after thorough analysis by @YuryHrytsuk)
  6. so assuming the dynamic sidecar service's task had not restarted when the autoscaling monitored the active nodes, the node is set to "not ready", and when the next dynamic sidecar task starts there it is "rejected"
  7. the dynamic sidecar service then reverts to "pending", BUT with a node.id placement constraint
  8. autoscaling sees that service again but skips it because it has the placement constraint (this is by design: services that explicitly ask for a specific node can never be helped by the autoscaling). Therefore the dynamic sidecar will never start and remains pending forever
  9. autoscaling terminates the machine after 3 minutes unless some other service is freshly started
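The deadlock in steps 7-9 can be illustrated with a small toy model (purely illustrative; all names here are hypothetical, this is not osparc-simcore code): once the node.id constraint is pinned and the autoscaler has marked the node not-ready, every scheduling attempt stays pending.

```python
# Toy model of the deadlock described in steps 1-9.
# All names are hypothetical; this is NOT osparc-simcore code.

def schedule_tick(service: dict, node: dict) -> str:
    """One scheduling attempt for a service against its candidate node."""
    constraint = service.get("placement_constraint")
    if constraint is not None and constraint != node["id"]:
        return "pending"  # pinned to a different node
    if not node["ready"]:
        return "pending"  # node was marked not-ready by the autoscaler
    return "running"

# steps 1-3: sidecar with no healthcheck is reported running immediately,
# so director-v2 pins the node.id placement constraint right away
node = {"id": "node-1", "ready": True}
service = {"placement_constraint": "node-1"}

# steps 4-6: the sidecar task dies; the autoscaler finds no tasks on the
# node and flips it to "not ready" via a node label
node["ready"] = False

# steps 7-9: every subsequent attempt is rejected, and the autoscaler
# skips the service because of the explicit constraint, so it stays
# pending forever
states = [schedule_tick(service, node) for _ in range(3)]
print(states)  # every attempt stays "pending"
```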

Here is the relevant part of the dynamic-sidecar Dockerfile:

# disabled healthcheck as director-v2 is already taking care of it;
# in order to have similar performance a more aggressive healthcheck
# would be required.
# removing the healthcheck would not cause any issues at this point
# NOTE: When adding a healthcheck
# - remove UpdateHealth - no longer required
# - remove WaitForSidecarAPI  - no longer required
# - After `get_dynamic_sidecar_placement` inside CreateSidecars call
#   (the sidecar's API will be up and running; guaranteed by the docker engine healthcheck).
#   Add the following line `scheduler_data.dynamic_sidecar.is_ready = True`
#   The healthcheck guarantees that the API is available
HEALTHCHECK NONE
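
A possible shape for such a healthcheck is sketched below. This is only a sketch: the endpoint path, port, timings, and the availability of `curl` in the image are all assumptions, not the actual osparc-simcore configuration.

```dockerfile
# Sketch only: endpoint, port, and timings are assumptions
HEALTHCHECK --interval=5s --timeout=5s --start-period=30s --retries=3 \
    CMD curl --fail http://localhost:8000/health || exit 1
```

With a healthcheck in place, docker swarm would only report the task as "running" once the check passes, which is exactly the signal the director-v2 needs before pinning the placement constraint.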

and here is the code from the director-v2:

async def _get_task_data_when_service_running(service_id: str) -> Mapping[str, Any]:
    """
    Waits for dynamic-sidecar task to be `running` and returns the
    task data.
    """
    task = await _get_service_latest_task(service_id)
    service_state = task["Status"]["State"]

    if service_state not in TASK_STATES_RUNNING:
        raise TryAgain
    return task

task = await _get_task_data_when_service_running(service_id=service_id)
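For context, `TryAgain` is handled by the retry machinery wrapping this helper. A minimal, self-contained sketch of that polling behaviour is below; the fake task states and the stand-in `_get_service_latest_task` are invented for illustration, and the bare `while` loop stands in for the real retry decorator.

```python
import asyncio
from typing import Any, Mapping

TASK_STATES_RUNNING = {"running"}

class TryAgain(Exception):
    """Stand-in for the retry-library exception used in the real code."""

# Invented sequence: the swarm reports "running" only on the third poll
_FAKE_STATES = iter(["pending", "starting", "running"])

async def _get_service_latest_task(service_id: str) -> Mapping[str, Any]:
    # stand-in for the real docker API call
    return {"Status": {"State": next(_FAKE_STATES)}}

async def _get_task_data_when_service_running(service_id: str) -> Mapping[str, Any]:
    task = await _get_service_latest_task(service_id)
    if task["Status"]["State"] not in TASK_STATES_RUNNING:
        raise TryAgain
    return task

async def wait_until_running(service_id: str) -> Mapping[str, Any]:
    # minimal stand-in for the retry-on-TryAgain loop
    while True:
        try:
            return await _get_task_data_when_service_running(service_id)
        except TryAgain:
            await asyncio.sleep(0)  # the real code waits between attempts

task = asyncio.run(wait_until_running("some-service"))
print(task["Status"]["State"])  # -> running
```

Note that without a docker healthcheck, "running" here only means the container process started, not that the sidecar's API is actually up; that gap is the root of the bug described above.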

This is how it looks in Portainer: [screenshots]

So from here I will create one issue for the dynamic-sidecar. I think it just needs a docker healthcheck like every other service we use.

I am going to test what happens when I restart a dynamic sidecar on an autoscaled node, as this might also break.

Originally posted by @sanderegg in #5826

GitHK commented 4 months ago

@sanderegg so the idea from now on will be to have the http interface up and running before continuing, right?