eiffel-community / eiffel-remrem-publish

eiffel-remrem-publish
https://eiffel-community.github.io/eiffel-remrem-publish
Apache License 2.0

Channel exceptions don't trigger a recovery #244

Closed: magnusbaeck closed this issue 1 year ago

magnusbaeck commented 2 years ago

Description

If the AMQP channel experiences an exception, REMReM Publish won't attempt to tear down the connection and reconnect to restore a working channel for publishing events. Instead, it'll just return HTTP 500 errors to clients and log the following:

com.rabbitmq.client.AlreadyClosedException: channel is already closed due to channel error; protocol method: #method<channel.close>(reply-code=403, reply-text=ACCESS_REFUSED - access to exchange 'eiffel.public' in vhost '/' refused for user 'svceiffelremrem', backend rabbit_auth_backend_ldap returned an error: ldap_connect_error
, class-id=60, method-id=40)
        at com.rabbitmq.client.impl.AMQChannel.ensureIsOpen(AMQChannel.java:253) ~[amqp-client-5.4.0.jar:5.4.0]
        at com.rabbitmq.client.impl.AMQChannel.transmit(AMQChannel.java:422) ~[amqp-client-5.4.0.jar:5.4.0]
        at com.rabbitmq.client.impl.ChannelN.basicPublish(ChannelN.java:704) ~[amqp-client-5.4.0.jar:5.4.0]
        at com.rabbitmq.client.impl.ChannelN.basicPublish(ChannelN.java:679) ~[amqp-client-5.4.0.jar:5.4.0]
        at com.rabbitmq.client.impl.ChannelN.basicPublish(ChannelN.java:669) ~[amqp-client-5.4.0.jar:5.4.0]
        at com.rabbitmq.client.impl.recovery.AutorecoveringChannel.basicPublish(AutorecoveringChannel.java:192) ~[amqp-client-5.4.0.jar:5.4.0]
        at com.ericsson.eiffel.remrem.publish.helper.RabbitMqProperties.send(RabbitMqProperties.java:427) ~[publish-common-2.0.19-3-g6b9e474.jar:na]
        at com.ericsson.eiffel.remrem.publish.helper.RMQHelper.send(RMQHelper.java:92) ~[publish-common-2.0.19-3-g6b9e474.jar:na]
        at com.ericsson.eiffel.remrem.publish.service.MessageServiceRMQImpl.sendMessage(MessageServiceRMQImpl.java:211) [publish-common-2.0.19-3-g6b9e474.jar:na]
        at com.ericsson.eiffel.remrem.publish.service.MessageServiceRMQImpl.send(MessageServiceRMQImpl.java:67) [publish-common-2.0.19-3-g6b9e474.jar:na]
        at com.ericsson.eiffel.remrem.publish.service.MessageServiceRMQImpl.send(MessageServiceRMQImpl.java:119) [publish-common-2.0.19-3-g6b9e474.jar:na]
        at com.ericsson.eiffel.remrem.publish.service.MessageServiceRMQImpl.send(MessageServiceRMQImpl.java:156) [publish-common-2.0.19-3-g6b9e474.jar:na]
        at com.ericsson.eiffel.remrem.publish.service.MessageServiceRMQImpl.send(MessageServiceRMQImpl.java:94) [publish-common-2.0.19-3-g6b9e474.jar:na]
        at com.ericsson.eiffel.remrem.publish.controller.ProducerController.generateAndPublish(ProducerController.java:203) [classes/:na]
...

Note what the RabbitMQ Java Client API Guide says about automatic recovery of bad connections and channels:

Automatic connection recovery, if enabled, will be triggered by the following events:

  • An I/O exception is thrown in connection's I/O loop
  • A socket read operation times out
  • Missed server heartbeats are detected
  • Any other unexpected exception is thrown in connection's I/O loop

whichever happens first.

...

Channel-level exceptions will not trigger any kind of recovery as they usually indicate a semantic issue in the application (e.g. an attempt to consume from a non-existent queue).

In the example above the channel exception was caused by a temporary problem with the LDAP server that RabbitMQ used for authentication and authorization. That caused publish commands from REMReM Publish to fail, and each failure caused the calling channel to end up in a bad state from which it wouldn't recover even when RabbitMQ was fine.

The service could be restored by restarting it or forcing the closure of the AMQP connection.
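The rule quoted from the API guide can be reduced to a small sketch: a shutdown that affects the whole connection starts automatic recovery, while one that affects a single channel does not. The types below (`Scope`, `Shutdown`, `triggersAutomaticRecovery`) are hypothetical and exist only for illustration; in the real Java client the corresponding distinction is reported by `ShutdownSignalException.isHardError()`.

```java
/** Hypothetical model of the quoted rule; none of these types come from the RabbitMQ client. */
final class RecoverySketch {
    /** What was shut down: the whole connection or a single channel. */
    enum Scope { CONNECTION, CHANNEL }

    /** Simplified shutdown signal carrying its scope and a human-readable reason. */
    static final class Shutdown {
        final Scope scope;
        final String reason;

        Shutdown(Scope scope, String reason) {
            this.scope = scope;
            this.reason = reason;
        }
    }

    /** Only connection-level failures start automatic recovery; channel errors never do. */
    static boolean triggersAutomaticRecovery(Shutdown s) {
        return s.scope == Scope.CONNECTION;
    }

    public static void main(String[] args) {
        Shutdown[] signals = {
            new Shutdown(Scope.CONNECTION, "missed server heartbeats"),
            new Shutdown(Scope.CHANNEL, "403 ACCESS_REFUSED on publish")
        };
        for (Shutdown s : signals) {
            System.out.println(s.reason + " -> automatic recovery: " + triggersAutomaticRecovery(s));
        }
    }
}
```

The second case is the one from the log above: the connection stays up, so nothing reopens the channel unless the application does it itself.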

We experienced this problem with commit 6b9e474 but AFAICT from reading the code the problem is still around.

Motivation

It's fine that REMReM Publish rejects publish requests from clients if RabbitMQ rejects the publish command, but once RabbitMQ recovers, so should REMReM Publish.

Exemplification

We had >1100 builds fail because of a six-second LDAP outage that required manual intervention on the Eiffel REMReM Publish side to recover from.

Benefits

A service that's more resilient against errors.

Possible Drawbacks

None.

jainadc9 commented 2 years ago

Hi Magnus, we found a solution but we are unable to verify it. Can you please provide steps to replicate the issue?

Regards, Jainad

magnusbaeck commented 2 years ago

In essence the publish operation failed, resulting in the channel being closed, and then REMReM Publish couldn't recover. In our case the failure was caused by a permission error and it's possible that you can reproduce the problem simply by attempting to publish with a user that doesn't have write permission to the exchange (you may have to revoke that permission after starting REMReM Publish). Once the write permission has been granted again the desired behavior is, of course, that a subsequent publish request to REMReM Publish causes the connection and channel to get reestablished and the event to get published.
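The reproduction steps above can be simulated in memory, without a broker. This is only a sketch under stated assumptions: `FakeBroker`, `FakeChannel` and `NaivePublisher` are hypothetical stand-ins, not RabbitMQ or REMReM Publish classes, and the fake channel reports the refusal synchronously, which the real client does not necessarily do.

```java
/** In-memory stand-ins for a broker and a channel; hypothetical, not the RabbitMQ API. */
final class ReproSketch {
    /** Fake broker whose write permission can be revoked and granted at runtime. */
    static final class FakeBroker {
        boolean writeAllowed = true;
        int published = 0;
    }

    /** Fake channel: a refused publish closes it for good, like an AMQP channel error. */
    static final class FakeChannel {
        private final FakeBroker broker;
        private boolean open = true;

        FakeChannel(FakeBroker broker) {
            this.broker = broker;
        }

        void publish(String event) {
            if (!open) {
                throw new IllegalStateException("channel is already closed");
            }
            if (!broker.writeAllowed) {
                open = false; // models the broker closing the channel with 403 ACCESS_REFUSED
                throw new IllegalStateException("ACCESS_REFUSED");
            }
            broker.published++;
        }
    }

    /** Publisher that keeps a single channel forever, i.e. the behavior reported in this issue. */
    static final class NaivePublisher {
        private final FakeChannel channel;

        NaivePublisher(FakeBroker broker) {
            this.channel = new FakeChannel(broker);
        }

        /** Returns true if the event was published, false if the channel rejected it. */
        boolean tryPublish(String event) {
            try {
                channel.publish(event);
                return true;
            } catch (IllegalStateException e) {
                return false;
            }
        }
    }

    public static void main(String[] args) {
        FakeBroker broker = new FakeBroker();
        NaivePublisher publisher = new NaivePublisher(broker);
        System.out.println("permission granted:  " + publisher.tryPublish("e1"));
        broker.writeAllowed = false; // revoke write permission after startup
        System.out.println("permission revoked:  " + publisher.tryPublish("e2"));
        broker.writeAllowed = true;  // grant it again
        System.out.println("permission restored: " + publisher.tryPublish("e3"));
    }
}
```

In this model the third publish still fails even though the permission is back, because the publisher never replaces its closed channel. The desired behavior is for that third call to succeed.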

vishnu-alapati commented 1 year ago

Solution: RabbitMQ doesn't provide channel recovery, so we implemented a channel pool mechanism. Channels are added to the pool, and we check whether a channel is open before publishing messages. If the channel is not available or not open, a new channel is created and added to the pool...

We made the necessary changes and tested them.
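The channel pool idea described in this comment might look roughly like the following. This is a minimal sketch, not the code from the PR: `PooledChannel`, `ChannelPool` and `FakeChannel` are hypothetical names, and the fake channel replaces a real broker. With the real Java client, the counterparts would presumably be `Channel.isOpen()` for the health check and `Connection.createChannel()` for the factory.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

final class ChannelPoolSketch {
    /** Minimal channel abstraction; hypothetical, stands in for an AMQP channel. */
    interface PooledChannel {
        boolean isOpen();
        void publish(String event);
    }

    /** Pool that hands out only open channels and creates a new one when none is usable. */
    static final class ChannelPool {
        private final Deque<PooledChannel> idle = new ArrayDeque<>();
        private final Supplier<PooledChannel> factory;

        ChannelPool(Supplier<PooledChannel> factory) {
            this.factory = factory;
        }

        synchronized PooledChannel borrow() {
            PooledChannel ch;
            while ((ch = idle.poll()) != null) {
                if (ch.isOpen()) {
                    return ch; // reuse a healthy channel
                }
                // closed channels are dropped and never handed out again
            }
            return factory.get();
        }

        synchronized void giveBack(PooledChannel ch) {
            if (ch.isOpen()) {
                idle.push(ch);
            }
        }

        void publish(String event) {
            PooledChannel ch = borrow();
            try {
                ch.publish(event);
            } finally {
                giveBack(ch); // a channel closed by the failed publish is not returned
            }
        }
    }

    /** Fake channel for the demo: a refused publish closes it permanently. */
    static final class FakeChannel implements PooledChannel {
        private final AtomicBoolean writeAllowed;
        private boolean open = true;

        FakeChannel(AtomicBoolean writeAllowed) {
            this.writeAllowed = writeAllowed;
        }

        public boolean isOpen() {
            return open;
        }

        public void publish(String event) {
            if (!open) {
                throw new IllegalStateException("channel is already closed");
            }
            if (!writeAllowed.get()) {
                open = false;
                throw new IllegalStateException("ACCESS_REFUSED");
            }
        }
    }

    /** Returns true if the event was published, false if the channel rejected it. */
    static boolean tryPublish(ChannelPool pool, String event) {
        try {
            pool.publish(event);
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        AtomicBoolean writeAllowed = new AtomicBoolean(true);
        ChannelPool pool = new ChannelPool(() -> new FakeChannel(writeAllowed));
        System.out.println("permission granted:  " + tryPublish(pool, "e1"));
        writeAllowed.set(false);
        System.out.println("permission revoked:  " + tryPublish(pool, "e2"));
        writeAllowed.set(true);
        System.out.println("permission restored: " + tryPublish(pool, "e3"));
    }
}
```

The request made while the permission is revoked still fails, which matches the motivation in the issue, but the next request gets a fresh channel instead of hitting the closed one.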

jainadc9 commented 1 year ago

PR: https://github.com/eiffel-community/eiffel-remrem-publish/pull/263