Particular / NServiceBus.RabbitMQ

RabbitMQ transport for NServiceBus
https://docs.particular.net/nservicebus/rabbitmq/

Previous connections stay active after reconnect, causing resource exhaustion #1435

Closed: danielmarbach closed this issue 3 weeks ago

danielmarbach commented 3 weeks ago

Symptoms

Network connections between the client and the RabbitMQ broker may be interrupted. During interruptions the transport attempts to reconnect to the broker. During reconnection attempts, previous connections may stay open and eventually lead to resource exhaustion on the client, which requires the client to be restarted.

Who's affected

All users are affected.

Root cause

Previous connections are not properly disposed, and a number of race conditions can lead to new connections being opened that remain active even after the transport has already been shut down.
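To illustrate the shape of the problem, here is a hypothetical sketch (not the transport's actual code) of a reconnection loop that overwrites a connection field without coordination: the old connection is abandoned rather than disposed, and a reconnect attempt that completes after shutdown leaves a live connection behind.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using RabbitMQ.Client;

// Hypothetical illustration of the root cause, not the transport's actual code.
class LeakyReconnector
{
    readonly ConnectionFactory factory = new();
    IConnection? connection;

    public async Task ReconnectLoopAsync(CancellationToken stopping)
    {
        while (!stopping.IsCancellationRequested)
        {
            try
            {
                // Problem 1: the previous value of 'connection' is overwritten
                // here without ever being disposed.
                // Problem 2: if the transport shuts down between the loop check
                // and this line, the freshly opened connection stays active
                // with nothing left to close it.
                connection = factory.CreateConnection();
                return;
            }
            catch (Exception)
            {
                await Task.Delay(TimeSpan.FromSeconds(5), stopping).ConfigureAwait(false);
            }
        }
    }
}
```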

Backported to

Original content

Replaces:

PR that brings to master the small adjustments needed to make things testable, plus the tests that reproduce the problem on master


This PR fixes #1268 and quite likely will eventually also address:

Changes required to make the code more testable

The PR attempts to strike a good enough tradeoff between not introducing too much test-induced damage and introducing tests that allow simulating the various races that can occur.

Given that the reconnection logic is not really on a critical hot path, the above changes seem to be a good tradeoff.

Discussed various ideas with @bording and @andreasohlund, including things like the time provider, more interfaces around the connection factory, etc., all of which come with their own issues. For example, the .NET 8 version can use the time provider, but then the backports would not be able to, and even with the time provider in place you would still need a mechanism to intercept the offloading of the reconnection loop in order to synchronize the test in various places.
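For reference, the TimeProvider idea that was considered would look roughly like the following hypothetical sketch; TimeProvider and the matching Task.Delay overload only exist on .NET 8, which is why the backports could not use it, and a test would still need a separate hook to synchronize with the loop.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch of the TimeProvider-based approach that was considered and not taken.
class ReconnectionLoop
{
    readonly TimeProvider timeProvider;
    readonly TimeSpan retryDelay;

    public ReconnectionLoop(TimeProvider timeProvider, TimeSpan retryDelay)
    {
        this.timeProvider = timeProvider;
        this.retryDelay = retryDelay;
    }

    public async Task RunAsync(Func<CancellationToken, Task<bool>> tryReconnect, CancellationToken cancellationToken)
    {
        while (!await tryReconnect(cancellationToken).ConfigureAwait(false))
        {
            // A test could inject a fake TimeProvider to skip the delay, but it
            // would still need a separate hook to know the loop has yielded here,
            // which is the synchronization problem mentioned above.
            await Task.Delay(retryDelay, timeProvider, cancellationToken).ConfigureAwait(false);
        }
    }
}
```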

Approach

Instead of directly accessing and overwriting the connection field, the new code creates new connections and atomically swaps the old connection out when the new connection is ready.
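A minimal sketch of that swap, assuming an illustrative holder class around the RabbitMQ.Client IConnection (names are hypothetical, not the transport's actual implementation):

```csharp
using System.Threading;
using RabbitMQ.Client;

// Hypothetical sketch of the atomic-swap approach described above.
sealed class ConnectionHolder
{
    IConnection connection;

    public ConnectionHolder(IConnection initialConnection) => connection = initialConnection;

    // Callers always see a fully established connection.
    public IConnection Current => Volatile.Read(ref connection);

    // Called by the reconnection logic once the new connection is ready.
    public void Swap(IConnection newConnection)
    {
        // Atomically publish the new connection; readers see either the old
        // or the new one, never a half-initialized instance.
        var previous = Interlocked.Exchange(ref connection, newConnection);

        // Dispose the previous connection so it does not stay active.
        previous?.Dispose();
    }
}
```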

This approach was chosen to preserve one of the design choices predating this change.

The decision to allow publish attempts to fail when the connection is down is deliberate and aligns with the overall strategy of balancing reliability and consistency in the system. The reasoning here is that it’s quite rare for the publish connection to be down while the message pump connection remains operational, so letting the publish fail allows the system to handle the failure more predictably. By doing this, we can lean on NServiceBus’s recoverability mechanisms, like retries and the broker’s ability to resend messages, rather than introducing complex logic to handle these failures internally.
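For reference, the recoverability mentioned here is the standard NServiceBus endpoint configuration: a failed send inside a handler fails the incoming message, which then goes through retries and eventually the error queue. The endpoint name and retry values below are illustrative only.

```csharp
using System;
using NServiceBus;

var endpointConfiguration = new EndpointConfiguration("Sales");

// Immediate retries, then delayed retries, then the error queue handle the
// message whose handler failed because the publish connection was down.
var recoverability = endpointConfiguration.Recoverability();
recoverability.Immediate(immediate => immediate.NumberOfRetries(5));
recoverability.Delayed(delayed =>
{
    delayed.NumberOfRetries(3);
    delayed.TimeIncrease(TimeSpan.FromSeconds(10));
});
endpointConfiguration.SendFailedMessagesTo("error");
```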

Blocking the send to wait for the connection to re-establish was ruled out because it could lead to ghost messages. These would occur if a handler blocks on sending, but then the message pump connection is reset, leading to a situation where the message is sent, but the handler fails. This creates an inconsistent state that is harder to manage. Instead, letting the failure bubble up immediately aligns better with the overall system design, where failures are clear, and the recovery path is well-defined.

Additionally, outside of handlers, it makes more sense to let the application’s existing exception handling manage these cases rather than attempting to mask the failure with retries or delays. This keeps the system behavior transparent and predictable, avoiding hidden complexities that could lead to harder-to-diagnose issues later on.