Reliable failover? - Githubissues

erothmayer commented 5 years ago

I've followed the example in the Reconnect sample. However, when I test failover some of the time I'm finding that I get a series of inputs which don't end with a reconnect.

Specifically, sometimes I get a connection_error event with a message like ConnectionError: Could not process AMQP commands, followed by a connection_closed event. I do not get a disconnected event, and Rhea does not try to reconnect.

I believe this is because Rhea treats the error as a fatal error and it chooses not to reconnect in this case. However, this is thrown by the broker in the process of failing over / rebooting, and so the client should still try to reconnect and cycle through hosts provided by connection_details.

So my question is, how can I best achieve this? It looks like I could set non_fatal_errors to an empty list, and manually catch the connection_error / connection_closed event to reconnect. In that case, can I just use connection.connect() on the existing connection? Are there any risks in terms of re-creating links?

Alternatively, I could override Connection.is_fatal() to always return false and let the internal reconnect handle it.

Either way, I'm wondering whether this is something that could be improved in Rhea. It seems like the current implementation relies on a 'happy path' of failover behavior which isn't always the case in practice. (At least with Amazon MQ, which is where I'm testing)

grs commented 5 years ago

You can add the error condition to the list of non_fatal_errors or you can handle the reconnect yourself as you say.

The assumption in the code is that an explicit close is different from a disconnection (i.e. from a loss of the underlying socket). The non_fatal_errors are those conditions for which automatic reconnection is desireable. That list is just ['amqp:connection:forced' ] by default, but can be configured to include other conditions as desired.

erothmayer commented 5 years ago

Thanks for the quick response! I ended up doing roughly the following:

let connection = container.create_connection(opts);
connection.on('disconnected', () => {
  setTimeout(() => connection.reconnect(), 1000);
});
connection.connect();

What I've got above is working for now, but could definitely be improved. It effectively bypasses the backoff features in the library. Ideally I would prefer to alter the determination of what errors are fatal to just say "none are", but this doesn't seem possible with the current whitelist approach. I don't know what error might be thrown tomorrow that might be classified as fatal, so I don't think the list approach is workable for my case today.

What would you think of accepting a user function is_fatal_error/is_non_fatal_error, rather than a static list of non-fatal errors? That way clients could reliably hardcode an answer if that's what they prefer, rather than having to know all possible errors in advance?

amarzavery commented 5 years ago

What you are doing should definitely work. Another approach is to manage reconnect and thereafter the state of links and sessions yourself. I have implemented something on those lines for Azure EventHubs over here.

Basically listen for the "disconnected" event. If you maintain a map of sender/receiver links that you have created then you can reconnect them once the connection is opened again.

erothmayer commented 5 years ago

@amarzavery interesting -- what's the advantage of maintaining the links externally? When I tested the above, it seemed like Rhea was persisting them internally and re-creating all of them automatically after a reconnect without me having to do anything extra...

grs commented 5 years ago

@erothmayer Yes, I agree some way to indicate all connection errors should trigger reconnect would be a worthwhile improvement.

amarzavery commented 5 years ago

@erothmayer - My design needed having each link (sender/receiver) on a dedicated session. I wasn't able to get reconnect working in that scenario. More over Azure implements it's custom CBS (claims based auth service) Auth, that needs to be the first thing that needs to happen while creating any sender/receiver link. Hence maintaining them outside of rhea worked out well for me.

erothmayer commented 5 years ago

Got it, thanks!

I will take a look at PR based on the discussion above.

grs commented 5 years ago

Closed in error, sorry!

amqp / rhea

Reliable failover? #152