Azure / amqpnetlite

AMQP 1.0 .NET Library
Apache License 2.0
401 stars 143 forks source link

NotifyClosed not being called on connection loss detected by HeartBeat because state remains in ConnectionState.CloseSent #460

Closed ldruschk closed 3 years ago

ldruschk commented 3 years ago

I am trying to detect connection losses by using the HeartBeat. To simulate the connection loss during test I use iptables to block all incoming and outgoing traffic on the server. I have configured the HeartBeat using factory.AMQP.IdleTimeout = 5000.

After adding some debug prints in Connection.cs and AmqpObject.cs, I found out that doing this leads to the HeartBeat properly detecting the connection loss and calling closeInternal here: https://github.com/Azure/amqpnetlite/blob/762616b2244433495c09e35d8f0d54cb8415bf70/src/Connection.cs#L883

However, the callback registered using AddClosedCallback is never called.

The source code of closeInternal is here: https://github.com/Azure/amqpnetlite/blob/762616b2244433495c09e35d8f0d54cb8415bf70/src/AmqpObject.cs#L181

I also added a print to OnClose, which returns false because the ConnectionState is CloseSent and not End. However, since the server is "dead", the client will never receive a response and thus remains in the state CloseSent without ever notifying the client application.

I am not sure if this is intended behavior, since it could be argued that the connection is not considered closed unless both, the client and the server, have sent a close frame. However, this raises the question how a connection breakdown such as this is intended to be noticed and resolved by the client application. As far as I can tell, there is no relevant event/callback which would notify the client application about the state switching to CloseSent, at which point a new connection would need to be established for any future messages anyways regardless of whether the server does acknowledge the close or not.

I would appreaciate any pointers towards the intended way to deal with such scenarios.

xinchen10 commented 3 years ago

In this case, the socket never failed the I/O operation? The assumption is that there will be a response from the remote peer or there will be a I/O exception from the socket. Maybe the connection should have a time limit for waiting for the response and abort everything if it is not received within that time.

ldruschk commented 3 years ago

I ran some more tests and here is what I observed:

The SendAsync function on the SenderLink using the broken connection throws a System.TimeoutException after one minute, however, this does not trigger the OnClosed callback.

The OnClosed-Callback is finally triggered after over 16 minutes:

Exception while receiving message: Amqp.AmqpException: Unable to read data from the transport connection: Connection timed out.

I would assume this is related to the OS-specific TCP-settings, this was tested on the current Arch Linux version without any modifications made to the TCP parameters.

So while the original issue is not 100% correct since the callback is eventually called, I believe something like a time limit on waiting for the response like you suggest would still be a good idea.

xinchen10 commented 3 years ago

Added a 20 seconds timer for waiting for response.