Closed matusdrobuliak66 closed 2 months ago
... but we do not have control over it, since it is Docker that decides when and where to start containers. This might also be a consequence of the fact that, while a container is restarting, the resources on that node become available for the taking and a new service can be scheduled there.

@GitHK after some bug-tracking analysis I found the following issues; we can discuss them if you wish.

To follow up on @matusdrobuliak66's findings, I will add a few things here to summarize what is going on.
Here is the relevant part of the dynamic-sidecar Dockerfile:

```dockerfile
# disabled healthcheck as director-v2 is already taking care of it
# in order to have similar performance a more aggressive healthcheck
# would be required.
# removing the healthcheck would not cause any issues at this point
# NOTE: When adding a healthcheck
# - remove UpdateHealth - no longer required
# - remove WaitForSidecarAPI - no longer required
# - After `get_dynamic_sidecar_placement` inside CreateSidecars call
#   (the sidecar's API will be up and running; guaranteed by the docker engine healthcheck).
#   Add the following line `scheduler_data.dynamic_sidecar.is_ready = True`
#   The healthcheck guarantees that the API is available
HEALTHCHECK NONE
```
And here is the code from director-v2:

```python
async def _get_task_data_when_service_running(service_id: str) -> Mapping[str, Any]:
    """
    Waits for dynamic-sidecar task to be `running` and returns the
    task data.
    """
    task = await _get_service_latest_task(service_id)
    service_state = task["Status"]["State"]
    if service_state not in TASK_STATES_RUNNING:
        raise TryAgain
    return task


# usage
task = await _get_task_data_when_service_running(service_id=service_id)
```
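The excerpt above raises `TryAgain` whenever the latest task is not yet running, so it depends on a retry wrapper that is not shown in this thread. The same "wait until running" idea can be sketched with the standard library only. This is a minimal sketch, not the director-v2 implementation: `wait_until_running`, `make_fake_task_source` and `RUNNING_STATES` are hypothetical names, and the fake task source stands in for the Docker API.

```python
import asyncio
from collections.abc import Awaitable, Callable, Mapping
from typing import Any

# Assumption: stand-in for TASK_STATES_RUNNING, whose real contents are not shown above.
RUNNING_STATES = {"running"}


async def wait_until_running(
    get_latest_task: Callable[[], Awaitable[Mapping[str, Any]]],
    *,
    poll_interval_s: float = 1.0,
    timeout_s: float = 60.0,
) -> Mapping[str, Any]:
    """Poll until the latest task reports a running state, or time out."""

    async def _poll() -> Mapping[str, Any]:
        while True:
            task = await get_latest_task()
            if task["Status"]["State"] in RUNNING_STATES:
                return task
            await asyncio.sleep(poll_interval_s)

    # wait_for raises a timeout error if the task never reaches a running state
    return await asyncio.wait_for(_poll(), timeout=timeout_s)


def make_fake_task_source(
    states: list[str],
) -> Callable[[], Awaitable[Mapping[str, Any]]]:
    """Hypothetical stand-in for the Docker API: reports the given states in order.

    The last state is repeated forever once the list is exhausted.
    """
    remaining = list(states)

    async def _get() -> Mapping[str, Any]:
        state = remaining.pop(0) if len(remaining) > 1 else remaining[0]
        return {"Status": {"State": state}}

    return _get
```

Note that a task stuck in `pending` or `rejected`, as in the screenshots in this issue, only ever ends this loop through the timeout. Polling the task state alone says nothing about whether the sidecar's API is reachable.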
This is how it looks in Portainer:
So from here I will create one issue for the dynamic-sidecar. I think it just needs a docker Healthcheck like every other service we use.
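For reference, a Docker `HEALTHCHECK` runs a command inside the container and treats exit code 0 as healthy and 1 as unhealthy. A probe for the sidecar could look roughly like the sketch below. The URL is an assumption: the sidecar's actual health route and port are not shown in this thread, and `is_healthy` and `main` are hypothetical names.

```python
import http.client
import urllib.error
import urllib.request

# Assumption: hypothetical endpoint; the sidecar's real health route and port
# are not shown in this thread.
DEFAULT_URL = "http://localhost:8000/health"


def is_healthy(url: str = DEFAULT_URL, timeout_s: float = 2.0) -> bool:
    """Return True if the endpoint answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # connection refused, DNS failure, timeout, HTTP error status,
        # broken response or malformed URL
        return False


def main() -> int:
    """Map the probe result onto the exit codes Docker expects (0 healthy, 1 unhealthy)."""
    return 0 if is_healthy() else 1
```

When wired in as the healthcheck command, the script would finish with `sys.exit(main())`. With a healthcheck in place, the engine only reports the task as healthy once the API answers, which is what the NOTE in the Dockerfile comment relies on.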
I am going to test what happens when I restart a dynamic-sidecar on an autoscaled node, as this might also break.
@matusdrobuliak66 @GitHK there is a second issue. If a dynamic-sidecar runs on an autoscaled node (our 80% use case now) and someone restarts the dynamic-sidecar, the very same thing will happen. I guess I will need to write some spaghetti code for this. I will create the issue and find a solution.
The final issue that is still not resolved was moved to the following issue:
Which deploy/s?
tip.science
Issues
I see 3 issues/observations:
Details of each issue (dropdown)
1. Sidecars can not connect to storage/rabbitmq
```
log_level=ERROR | log_timestamp=2024-05-15 07:51:45,306 | log_source=uvicorn.error:send(121) | log_uid=None | log_msg=Traceback (most recent call last):\n File "/home/scu/.venv/lib/python3.10/site-packages/starlette/routing.py", line 677, in lifespan\n async with self.lifespan_context(app) as maybe_state:\n File "/home/scu/.venv/lib/python3.10/site-packages/starlette/routing.py", line 566, in __aenter__\n await self._router.startup()\n File "/home/scu/.venv/lib/python3.10/site-packages/starlette/routing.py", line 654, in startup\n await handler()\n File "/home/scu/.venv/lib/python3.10/site-packages/simcore_service_dynamic_sidecar/core/external_dependencies.py", line 35, in on_startup\n raise CouldNotReachExternalDependenciesError(failed=failed)\nsimcore_service_dynamic_sidecar.core.external_dependencies.CouldNotReachExternalDependenciesError: Could not start because the following external dependencies failed: ["Could not contact service 'RabbitMQ' at 'amqp://...@production_rabbit:5672'. Look above for details.", "Could not contact service 'Storage' at 'http://production_storage:8080/v0/'. Look above for details."]\n
```

```
log_level=WARNING | log_timestamp=2024-05-15 07:51:44,292 | log_source=simcore_service_dynamic_sidecar.modules.service_liveness:log_it(30) | log_uid=None | log_msg=Retrying (attempt 30) to contact 'RabbitMQ' at 'amqp://...@production_rabbit:5672' in 1.0 seconds.
```

What is different in this deployment is the DNS override. However, I am not sure whether it is causing this; I would even argue that it should not be the cause.

2. Sidecar doesn't start (REJECTED)
1. first it failed (cannot connect to rabbitmq/storage)
2. then rejected
3. then pending
4. then autoscaling removes the node
5. hangs there in a pending state

![image](https://github.com/ITISFoundation/osparc-simcore/assets/60785969/a0138c69-c473-4c1c-a7db-87c6a990b0ea)
![image](https://github.com/ITISFoundation/osparc-simcore/assets/60785969/5ac0544a-649c-4ba1-95b6-56a58974efbb)
![image](https://github.com/ITISFoundation/osparc-simcore/assets/60785969/87a2a3df-fe42-4498-ab45-ed8d58868eb8)

3. Sidecars don't start. Another sidecar steals the node.
This issue happened once, and I have not been able to reproduce it since (I guess it is rare).

![image](https://github.com/ITISFoundation/osparc-simcore/assets/60785969/8d8f9ca4-26b1-42bb-99a9-01b6e1eb40ac)
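One way to reason about the "stolen node" case is the restart race mentioned in the discussion: while a sidecar's task is restarting, its reserved resources are free and another pending service can be placed on the node first. The toy model below illustrates only that ordering. It is a deliberately simplified sketch and does not reflect how Docker Swarm's scheduler is implemented. `Node`, `simulate_restart_race` and the CPU figures are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy node with a fixed CPU capacity (assumption: CPU is the only resource)."""

    name: str
    cpus: float
    reservations: dict[str, float] = field(default_factory=dict)

    def free_cpus(self) -> float:
        return self.cpus - sum(self.reservations.values())

    def try_place(self, service: str, cpus: float) -> bool:
        """Reserve capacity for the service if it fits; report whether it was placed."""
        if cpus <= self.free_cpus():
            self.reservations[service] = cpus
            return True
        return False

    def release(self, service: str) -> None:
        self.reservations.pop(service, None)


def simulate_restart_race() -> list[str]:
    """Replay the suspected ordering and return the observed events."""
    events: list[str] = []
    node = Node(name="autoscaled-node", cpus=4.0)

    # sidecar A is the service the node was started for and fills it completely
    node.try_place("sidecar-A", 4.0)
    events.append("A running")

    # A's task fails or is restarted: its reservation is released for a moment
    node.release("sidecar-A")
    events.append("A restarting")

    # a pending sidecar B is placed into the freed capacity first
    if node.try_place("sidecar-B", 4.0):
        events.append("B running")

    # A's replacement task no longer fits on the node and stays pending
    events.append("A running" if node.try_place("sidecar-A", 4.0) else "A pending")
    return events
```

In this model `simulate_restart_race()` ends with sidecar A pending and sidecar B holding the node, which matches the symptom described above. Whether this is what actually happened in the deployment is not confirmed in this thread.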