Investigate heartbeat timeout with RabbitMQ 🚨

sanderegg commented 1 year ago

On on-premise deployments, network inconsistencies seem to lead the RabbitMQ server to disconnect clients that do not respond to heartbeats. This has happened several times since last week. Investigation is needed to see whether the webserver can be made to auto-recover.

matusdrobuliak66 commented 1 year ago

This issue is a problem because, when it occurs, users are not able to access studies!

A few observations from the morning of 2.6.:

The last incident started on 1.6. around 20:30 (UTC), for both RabbitMQ instances (staging and production) running on osparc-dalco-01: https://monitoring.osparc.speag.com/graylog/search/646f22bf94a91c3d2080a453?q=container_name%3A%2Fproduction%5C-simcore_production_rabbit.%2A%2F+AND+message%3Aerror&rangetype=absolute&from=2023-06-01T20%3A19%3A26.094Z&to=2023-06-01T20%3A33%3A29.000Z
```
2023-06-01 20:32:52.661649+00:00 [error] <0.29215.4> missed heartbeats from client, timeout: 60s
```

Right away, all webservers start to complain (e.g. staging: all three containers, running on osparc-dalco-08 & osparc-dalco-09):

Failed to reopen channel
aiormq.exceptions.ChannelNotFoundEntity: NOT_FOUND - no queue 'amq_0xdcc3610d60705c149605566069a84e21' in vhost '/'

Errors occurred from 20:30 until 4:30 (UTC), when we restarted the webserver services.

matusdrobuliak66 commented 1 year ago

Even after Friday's maintenance of Dalco, I observed similar behavior over the last weekend. It happened twice (restarting the webserver temporarily fixed the problem):

3.6. (0:20 - 8:00 UTC) - 2am-10am CET
4.6. (4:30 - 8:00 UTC) - 6:30-10am CET

sanderegg commented 1 year ago

Here is the summary of what is happening.

The RabbitMQ server and its clients share a configured heartbeat (default 60s). When a client stays silent for 2x the heartbeat, the server closes that client's connection (that means no network for about 2 minutes)
We use the connect_robust function of the aio-pika library --> this is supposed to auto-reconnect
We use exclusive queues for logs/progress/events, as there is one queue per webserver instance
Each exclusive queue has a unique name defined when it is declared on the client side (1 per webserver)
The exclusive queues are deleted when the connection is closed
The consumers on the queues are also deleted when the connection is closed
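
To make the points above concrete, here is a minimal in-memory sketch of the server-side behavior. It is a toy model only: `ToyBroker` and all of its methods are hypothetical names, not the RabbitMQ or aio-pika API, and the 2x factor is simply taken from the summary above.

```python
from dataclasses import dataclass, field


@dataclass
class ToyBroker:
    """Toy stand-in for the broker (hypothetical, not a real RabbitMQ API)."""

    heartbeat_s: float = 60.0
    last_seen: dict = field(default_factory=dict)  # connection id -> time of last heartbeat
    exclusive_queues: dict = field(default_factory=dict)  # queue name -> owning connection id
    consumers: dict = field(default_factory=dict)  # queue name -> consumer tag

    def connect(self, conn_id: str, now: float) -> None:
        self.last_seen[conn_id] = now

    def heartbeat(self, conn_id: str, now: float) -> None:
        if conn_id not in self.last_seen:
            raise ConnectionError(f"{conn_id} was already closed by the broker")
        self.last_seen[conn_id] = now

    def declare_exclusive_queue(self, conn_id: str, name: str) -> None:
        self.exclusive_queues[name] = conn_id

    def consume(self, name: str, consumer_tag: str) -> None:
        if name not in self.exclusive_queues:
            raise LookupError(f"no queue '{name}'")
        self.consumers[name] = consumer_tag

    def tick(self, now: float) -> None:
        """Close every connection that has been silent for 2x the heartbeat."""
        for conn_id, seen in list(self.last_seen.items()):
            if now - seen >= 2 * self.heartbeat_s:
                self._close(conn_id)

    def _close(self, conn_id: str) -> None:
        # Closing a connection also removes its exclusive queues and their consumers.
        del self.last_seen[conn_id]
        for name, owner in list(self.exclusive_queues.items()):
            if owner == conn_id:
                del self.exclusive_queues[name]
                self.consumers.pop(name, None)


broker = ToyBroker()
broker.connect("webserver-1", now=0.0)
broker.declare_exclusive_queue("webserver-1", "amq_toy")
broker.consume("amq_toy", "ctag-1")
broker.tick(now=119.0)  # still below 2 x 60 s: nothing is closed
broker.tick(now=120.0)  # silent for 2 x 60 s: connection, queue and consumer are dropped
print(broker.exclusive_queues, broker.consumers)
```

After the second `tick` both dictionaries are empty, which mirrors the state a paused webserver finds when it comes back: its queue name no longer refers to anything on the server.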

This can be reproduced like so:

make up-prod
docker pause the webserver, then wait 2 minutes until the RabbitMQ server shows the error that the client is missing heartbeats.
docker unpause the webserver: everything is broken

-->

When the webserver tries to add_topics to the TOPIC queues, this fails because the previously saved queue name does not exist anymore (same with remove_topics).
The consumers are also stopped --> all feedback through RabbitMQ is stopped until the webserver is restarted.
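
One possible way for a client to recover from this state is to remember its topics and, when the saved queue name turns out to be gone, declare a fresh exclusive queue and re-bind everything. The sketch below only illustrates that idea with in-memory stand-ins; `ToyChannel`, `TopicSubscriber` and `QueueNotFound` are hypothetical names, and this is not necessarily how the webserver or any linked PR handles it.

```python
class QueueNotFound(Exception):
    """Toy equivalent of a NOT_FOUND channel error (hypothetical)."""


class ToyChannel:
    """In-memory stand-in for a broker channel (hypothetical, not aio-pika)."""

    def __init__(self) -> None:
        self.queues: dict = {}  # queue name -> set of bound topics
        self._counter = 0

    def declare_exclusive_queue(self) -> str:
        self._counter += 1
        name = f"amq_toy_{self._counter}"
        self.queues[name] = set()
        return name

    def bind(self, queue_name: str, topic: str) -> None:
        if queue_name not in self.queues:
            raise QueueNotFound(f"no queue '{queue_name}'")
        self.queues[queue_name].add(topic)

    def drop_connection(self) -> None:
        # Exclusive queues disappear together with their connection.
        self.queues.clear()


class TopicSubscriber:
    """Keeps the wanted topics client-side so they can be restored."""

    def __init__(self, channel: ToyChannel) -> None:
        self._channel = channel
        self._topics: set = set()
        self.queue_name = channel.declare_exclusive_queue()

    def add_topics(self, topics) -> None:
        for topic in topics:
            try:
                self._channel.bind(self.queue_name, topic)
            except QueueNotFound:
                # The saved queue name is stale: rebuild the queue, then retry.
                self._restore()
                self._channel.bind(self.queue_name, topic)
            self._topics.add(topic)

    def _restore(self) -> None:
        self.queue_name = self._channel.declare_exclusive_queue()
        for topic in self._topics:
            self._channel.bind(self.queue_name, topic)


channel = ToyChannel()
subscriber = TopicSubscriber(channel)
subscriber.add_topics(["project.1"])
channel.drop_connection()  # simulates the server closing the connection
subscriber.add_topics(["project.2"])  # succeeds on a freshly declared queue
print(subscriber.queue_name, sorted(channel.queues[subscriber.queue_name]))
```

In a real client the consumers would need to be re-registered on the new queue as well; the sketch leaves that out to stay short.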

matusdrobuliak66 commented 1 year ago

The same problem occurred on the TIP production deployment on 12.6.-13.6. The next release should include https://github.com/ITISFoundation/osparc-simcore/pull/4316.

sanderegg commented 1 year ago

@matusdrobuliak66 is this still a thing? or shall we now close this issue?

matusdrobuliak66 commented 1 year ago

It seems this is not a problem anymore, closing the issue.

ITISFoundation / osparc-simcore

Investigate heartbeat timeout with RabbitMQ 🚨 #4301