Closed sanderegg closed 1 year ago
This issue is a problem because, when it occurs, users are not able to access their studies!
A few observations from the morning of 2.6.:
2023-06-01 20:32:52.661649+00:00 [error] <0.29215.4> missed heartbeats from client, timeout: 60s
Failed to reopen channel
aiormq.exceptions.ChannelNotFoundEntity: NOT_FOUND - no queue 'amq_0xdcc3610d60705c149605566069a84e21' in vhost '/'
Even after Friday's maintenance of Dalco, I observed similar behavior over the last weekend. It happened twice (restarting the webserver temporarily fixed the problem).
Here is a summary of what is happening:
- The connection uses the robust_connect method of the aio-pika library --> this is supposed to auto-reconnect.

This can be reproduced like so:
1. make up-prod
2. docker pause the webserver, then wait 2 minutes until the RabbitMQ server shows the error that the client is missing the heartbeats.
3. docker unpause the webserver, and everything is broken.
--> add_topics on the TOPIC queues fails, because the previously saved queue name does not exist anymore (same with remove_topics).

The same problem occurred on the TIP production deployment on 12.6.-13.6. When we do the next release, https://github.com/ITISFoundation/osparc-simcore/pull/4316 should be included.
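The failure mode described above (a saved queue name that no longer exists once the broker has dropped the client) can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the actual webserver code or the fix from the linked PR: it uses an in-memory stand-in instead of RabbitMQ and aio-pika, and it assumes the queue is tied to the dropped connection (for example an exclusive or auto-delete queue). All names in it (FakeBroker, TopicSubscriber, QueueNotFound) are hypothetical. One possible recovery is to treat the saved name as stale on NOT_FOUND, declare a fresh queue, and re-bind every known topic.

```python
import asyncio
from typing import Dict, Iterable, Optional, Set


class QueueNotFound(Exception):
    """Hypothetical stand-in for the broker's NOT_FOUND channel error."""


class FakeBroker:
    """In-memory stand-in for a broker with connection-bound queues."""

    def __init__(self) -> None:
        self.queues: Dict[str, Set[str]] = {}
        self._counter = 0

    async def declare_queue(self) -> str:
        # The broker picks the queue name, as with server-named queues.
        self._counter += 1
        name = f"amq_{self._counter:04d}"
        self.queues[name] = set()
        return name

    async def bind(self, queue_name: str, topic: str) -> None:
        if queue_name not in self.queues:
            raise QueueNotFound(f"NOT_FOUND - no queue '{queue_name}'")
        self.queues[queue_name].add(topic)

    def drop_client(self) -> None:
        # Simulates missed heartbeats: the client's queues are deleted.
        self.queues.clear()


class TopicSubscriber:
    """Keeps the queue name and topics, and recovers if the name is stale."""

    def __init__(self, broker: FakeBroker) -> None:
        self._broker = broker
        self.queue_name: Optional[str] = None
        self._topics: Set[str] = set()

    async def add_topics(self, topics: Iterable[str]) -> None:
        new_topics = set(topics)
        self._topics |= new_topics
        if self.queue_name is None:
            self.queue_name = await self._broker.declare_queue()
        try:
            for topic in sorted(new_topics):
                await self._broker.bind(self.queue_name, topic)
        except QueueNotFound:
            # The saved name is stale: declare a new queue and
            # re-bind all topics seen so far, not only the new ones.
            self.queue_name = await self._broker.declare_queue()
            for topic in sorted(self._topics):
                await self._broker.bind(self.queue_name, topic)


async def demo() -> TopicSubscriber:
    broker = FakeBroker()
    subscriber = TopicSubscriber(broker)
    await subscriber.add_topics(["project.*"])
    broker.drop_client()  # broker forgets the queue, client keeps the name
    await subscriber.add_topics(["node.*"])  # recovers instead of failing
    return subscriber


if __name__ == "__main__":
    asyncio.run(demo())
```

Without the except branch, the second add_topics call would raise, which mirrors the ChannelNotFoundEntity error in the logs above. Whether re-declaring on demand or re-declaring in a reconnect callback is preferable depends on how the real client manages its channels.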
@matusdrobuliak66 is this still a thing? or shall we now close this issue?
It seems this is not a problem anymore, closing the issue.
On on-premise deployments, there seem to be network inconsistencies that lead the RabbitMQ server to disconnect clients that do not respond to heartbeats. The issue has happened several times since last week. Investigation is needed to see whether the webserver can be made to auto-recover.