ITISFoundation / osparc-simcore

🐼 osparc-simcore simulation framework
https://osparc.io
MIT License

Pending sidecar on Autoscaled node #5826

Closed — matusdrobuliak66 closed this 2 months ago

matusdrobuliak66 commented 2 months ago

Which deploy/s?

tip.science

Issues

I see 3 issues/observations:

  1. For some reason it takes a long time for dynamic sidecars to connect to Storage/RabbitMQ, which causes them to restart a couple of times before running.
  2. Sidecar doesn't start (REJECTED) - probably because of resources?
  3. Sidecar doesn't start because while restarting, another sidecar steals the node.

Details of each issue (dropdown)

1. Sidecars cannot connect to Storage/RabbitMQ

   ```
   log_level=ERROR | log_timestamp=2024-05-15 07:51:45,306 | log_source=uvicorn.error:send(121) | log_uid=None | log_msg=Traceback (most recent call last):
     File "/home/scu/.venv/lib/python3.10/site-packages/starlette/routing.py", line 677, in lifespan
       async with self.lifespan_context(app) as maybe_state:
     File "/home/scu/.venv/lib/python3.10/site-packages/starlette/routing.py", line 566, in __aenter__
       await self._router.startup()
     File "/home/scu/.venv/lib/python3.10/site-packages/starlette/routing.py", line 654, in startup
       await handler()
     File "/home/scu/.venv/lib/python3.10/site-packages/simcore_service_dynamic_sidecar/core/external_dependencies.py", line 35, in on_startup
       raise CouldNotReachExternalDependenciesError(failed=failed)
   simcore_service_dynamic_sidecar.core.external_dependencies.CouldNotReachExternalDependenciesError: Could not start because the following external dependencies failed: ["Could not contact service 'RabbitMQ' at 'amqp://...@production_rabbit:5672'. Look above for details.", "Could not contact service 'Storage' at 'http://production_storage:8080/v0/'. Look above for details."]
   ```

   ```
   log_level=WARNING | log_timestamp=2024-05-15 07:51:44,292 | log_source=simcore_service_dynamic_sidecar.modules.service_liveness:log_it(30) | log_uid=None | log_msg=Retrying (attempt 30) to contact 'RabbitMQ' at 'amqp://...@production_rabbit:5672' in 1.0 seconds.
   ```

   What is different in this deployment is the DNS override. But I am not sure whether that is causing this; I would even argue that it should not be the cause.
2. Sidecar doesn't start (REJECTED)

   1. first it fails (cannot connect to RabbitMQ/Storage)
   2. then it is rejected
   3. then pending
   4. then autoscaling removes the node
   5. it hangs there in a pending state

   ![image](https://github.com/ITISFoundation/osparc-simcore/assets/60785969/a0138c69-c473-4c1c-a7db-87c6a990b0ea) ![image](https://github.com/ITISFoundation/osparc-simcore/assets/60785969/5ac0544a-649c-4ba1-95b6-56a58974efbb) ![image](https://github.com/ITISFoundation/osparc-simcore/assets/60785969/87a2a3df-fe42-4498-ab45-ed8d58868eb8)
3. Sidecars don't start because another sidecar steals the node. This happened once, but I am not able to reproduce it anymore (I guess it might be rare). ![image](https://github.com/ITISFoundation/osparc-simcore/assets/60785969/8d8f9ca4-26b1-42bb-99a9-01b6e1eb40ac)
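The retry behaviour in the WARNING log from issue 1 (fixed one-second retries against RabbitMQ/Storage) can be sketched as a simple wait loop. This is an illustrative stand-in, not the actual `simcore_service_dynamic_sidecar.modules.service_liveness` code; the helper name and parameters are assumptions:

```python
import asyncio
from collections.abc import Awaitable, Callable


class CouldNotReachExternalDependenciesError(Exception):
    """Raised when an external service stays unreachable after all retries."""


async def wait_for_service(
    name: str,
    check: Callable[[], Awaitable[bool]],
    *,
    max_attempts: int = 30,
    delay_s: float = 1.0,
) -> int:
    """Poll `check` until it returns True; raise after `max_attempts`.

    Returns the attempt number on which the service became reachable.
    """
    for attempt in range(1, max_attempts + 1):
        if await check():
            return attempt
        # mirrors the log line: "Retrying (attempt N) to contact '<name>' in 1.0 seconds."
        print(f"Retrying (attempt {attempt}) to contact {name!r} in {delay_s} seconds.")
        await asyncio.sleep(delay_s)
    raise CouldNotReachExternalDependenciesError(f"Could not contact service {name!r}")
```

If the dependency stays down for all 30 attempts, the sidecar raises and the container exits, which is exactly the restart loop observed above.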
### Tasks
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/5837
- [ ] https://github.com/ITISFoundation/osparc-simcore/issues/5840
GitHK commented 2 months ago
  1. I think it's ok and is working as designed
  2. Maybe while they restart another container from the "oSPARC or OPS" stacks starts and uses up the resources. Could we double check this somehow?
  3. I think it could be a consequence of 2., but we do not have control over it, since it's docker that decides when and where to start containers. This might also be a consequence of the fact that while a container is restarting, the resources on that node become available for the taking and a new service can be scheduled there.
sanderegg commented 2 months ago

@GitHK so after some bug tracking I found the following issues; we can discuss them if you wish. To follow up on @matusdrobuliak66's findings, I will add a few things here to summarize what is going on.

  1. the user starts a service
  2. the dynamic scheduler starts a dynamic sidecar; since there is no healthcheck defined in the dynamic sidecar image, it is instantly reported as "running" by the docker swarm. That is expected swarm behaviour, but wrong for us, and this is where the bug comes from.
  3. since the dynamic sidecar is deemed running, the director-v2 sets the placement constraint on the sidecar
  4. after a while the dynamic sidecar fails for whatever reason (in this case accessing RabbitMQ/Storage); I agree that this is ok and would recover once the underlying problem is fixed
  5. the autoscaling runs every 5 seconds or so. Depending on the user's luck, the sidecar might be restarting during that window. The autoscaling looks for tasks on the nodes it created; if it finds none, it sets the node to "not ready" by changing a label on the node (this change was made after thorough analysis by @YuryHrytsuk)
  6. so assuming the dynamic sidecar service's task was not running when the autoscaling monitored the active nodes, the node is set to "not ready", and when the next dynamic sidecar task starts there it is "rejected"
  7. then the dynamic sidecar service reverts to "pending", BUT with a node.id placement constraint
  8. autoscaling sees that service again, but skips it because of the placement constraint (this is by design, as services that explicitly ask for a specific node can never be started by autoscaling anyway); therefore the dynamic sidecar will never start and remains pending forever
  9. autoscaling will terminate the machine after 3 minutes unless some other service is freshly started on it
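Step 8 is the crux: autoscaling deliberately skips pending tasks that pin a specific node. A minimal sketch of such a filter follows; the dict layout mimics the docker tasks API (`Spec.Placement.Constraints`, `Status.State`), but this is illustrative, not the actual autoscaling code:

```python
def is_task_autoscaling_candidate(task: dict) -> bool:
    """Decide whether a swarm task should trigger a scale-up.

    A pending task is a candidate only if it does NOT pin a node: a task
    with an explicit `node.id==...` constraint can never be satisfied by
    creating a *new* node, so autoscaling skips it (by design).
    """
    if task.get("DesiredState") != "running":
        return False
    if task.get("Status", {}).get("State") != "pending":
        return False
    constraints = task.get("Spec", {}).get("Placement", {}).get("Constraints") or []
    # normalize "node.id == xyz" -> "node.id==xyz" before matching
    return not any(c.replace(" ", "").startswith("node.id==") for c in constraints)
```

This is why steps 7 and 8 together produce the "pending forever" sidecar: the director-v2 added the `node.id` pin in step 3, and autoscaling then refuses to act on the task.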

Here is the relevant dynamic-sidecar Dockerfile part:

```dockerfile
# disabled healthcheck as director-v2 is already taking care of it;
# in order to have similar performance a more aggressive healthcheck
# would be required.
# removing the healthcheck would not cause any issues at this point
# NOTE: When adding a healthcheck
# - remove UpdateHealth - no longer required
# - remove WaitForSidecarAPI - no longer required
# - After `get_dynamic_sidecar_placement` inside CreateSidecars call
#   (the sidecar's API will be up and running; guaranteed by the docker engine healthcheck).
#   Add the following line `scheduler_data.dynamic_sidecar.is_ready = True`
#   The healthcheck guarantees that the API is available
HEALTHCHECK NONE
```
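For comparison, a conventional swarm healthcheck would keep the task in "starting" until the check passes, so "running" would actually mean "ready". A sketch of what that could look like; the port, endpoint, and timings here are assumptions for illustration, not the actual sidecar configuration:

```dockerfile
# hypothetical: assumes the sidecar API serves a /health endpoint on port 8000
HEALTHCHECK --interval=5s --timeout=5s --start-period=30s --retries=5 \
    CMD curl --fail http://localhost:8000/health || exit 1
```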

and here is the code from the director-v2:

```python
async def _get_task_data_when_service_running(service_id: str) -> Mapping[str, Any]:
    """
    Waits for the dynamic-sidecar task to be `running` and returns the
    task data.
    """
    task = await _get_service_latest_task(service_id)
    service_state = task["Status"]["State"]

    if service_state not in TASK_STATES_RUNNING:
        raise TryAgain
    return task


task = await _get_task_data_when_service_running(service_id=service_id)
```
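`TryAgain` here is tenacity's retry signal. The problem is that with `HEALTHCHECK NONE`, swarm reports "running" as soon as the container process starts, so this loop returns before the sidecar API is actually up. A defensive variant would combine the task-state check with an API probe; this is a self-contained sketch with injected callables and a plain polling loop, not the actual director-v2 code:

```python
import asyncio
from collections.abc import Awaitable, Callable, Mapping
from typing import Any

TASK_STATES_RUNNING = {"running"}


async def wait_until_sidecar_ready(
    get_latest_task: Callable[[], Awaitable[Mapping[str, Any]]],
    probe_api: Callable[[], Awaitable[bool]],
    *,
    poll_interval_s: float = 1.0,
    max_polls: int = 60,
) -> Mapping[str, Any]:
    """Return the task only once it is `running` AND the sidecar API answers.

    Without an image healthcheck, `running` alone proves nothing about the
    API, hence the extra probe.
    """
    for _ in range(max_polls):
        task = await get_latest_task()
        if task["Status"]["State"] in TASK_STATES_RUNNING and await probe_api():
            return task
        await asyncio.sleep(poll_interval_s)
    raise TimeoutError("dynamic sidecar never became ready")
```

With a proper `HEALTHCHECK` in the image this extra probe becomes redundant, because swarm would not report the task as "running" until the check passes, which is the fix proposed below.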

This is how it looks in Portainer (see the screenshots attached to the original issue).

So from here I will create one issue for the dynamic-sidecar. I think it just needs a docker HEALTHCHECK like every other service we use.

I am going to test what happens when I restart a dynamic sidecar on an autoscaled node, as this might also break.

sanderegg commented 2 months ago

@matusdrobuliak66 @GitHK there is a second issue: if a dynamic-sidecar runs on an autoscaled node (our 80% use case now) and someone restarts the dynamic-sidecar, the very same will happen. I guess I will need to create some spaghetti code for this. Will create the issue and find a solution.

sanderegg commented 2 months ago

The final issue that is still not resolved was moved to the following issue: