ITISFoundation / osparc-simcore

🐼 osparc-simcore simulation framework
https://osparc.io
MIT License

Investigate heartbeat timeout with RabbitMQ 🚨 #4301

Closed sanderegg closed 1 year ago

sanderegg commented 1 year ago

On on-premise deployments, there seem to be network inconsistencies that lead the RabbitMQ server to disconnect clients that do not respond to heartbeats. This has happened several times since last week. We need to investigate whether the webserver can be made to auto-recover.
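For illustration only, here is a minimal sketch of what client-side auto-recovery could look like: retry the connection with exponential backoff after the broker drops it. This is not the fix adopted in this issue. The helper `connect_with_retry` and its parameters are hypothetical, and `connect` stands in for whatever coroutine opens the RabbitMQ connection. Clients built on aio-pika may be able to rely on `aio_pika.connect_robust`, which reconnects automatically, instead of hand-rolling this.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def connect_with_retry(
    connect: Callable[[], Awaitable[T]],
    *,
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
) -> T:
    """Await `connect()` until it succeeds, backing off exponentially.

    Hypothetical helper: `connect` is any coroutine factory that opens a
    connection and raises OSError (ConnectionError is a subclass) on failure.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return await connect()
        except OSError:
            # Give up after the last allowed attempt and surface the error.
            if attempt == max_attempts:
                raise
            await asyncio.sleep(delay)
            # Double the wait each time, capped at max_delay.
            delay = min(delay * 2, max_delay)

    raise AssertionError("unreachable: the loop always returns or raises")
```

A caller would wrap its own connection coroutine, for example `await connect_with_retry(open_connection)`, and call it again whenever a heartbeat timeout closes the connection. Capping the delay keeps recovery reasonably fast after a long network outage.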

matusdrobuliak66 commented 1 year ago

This issue is a problem because when it occurs, users are not able to access studies!


A few observations from the morning of 2.6.:

matusdrobuliak66 commented 1 year ago

Even after Friday's maintenance of Dalco, I observed similar behavior over the last weekend. It happened twice (restarting the webserver temporarily fixed the problem).

sanderegg commented 1 year ago

Here is the summary of what is happening.

This can be reproduced like so:


matusdrobuliak66 commented 1 year ago

The same problem occurred on the TIP production deployment on 12.6.-13.6. The next release should include https://github.com/ITISFoundation/osparc-simcore/pull/4316.

sanderegg commented 1 year ago

@matusdrobuliak66 is this still a thing? or shall we now close this issue?

matusdrobuliak66 commented 1 year ago

It seems this is not a problem anymore, closing the issue.