Foo-Foo-MQ / foo-foo-mq

Abstractions around RabbitMQ

Bubble channel failures up to the connection #16

Closed DavidWittman closed 5 months ago

DavidWittman commented 3 years ago

Send channel failures up to the connection so that they can be handled appropriately. Previously, errors on channels caused silent failures which could result in queues being unable to receive messages without any indication of failure, such as described in https://github.com/arobson/rabbot/issues/155.

This approach is a bit naive in that it causes the whole connection to fail when a single channel fails, but I don't think the abstractions in this library allow for the user to respond to channel-level events.

In my case, this type of failure occurred any time my consumers lost their connection to the RabbitMQ cluster and tried to send an ack for an old delivery tag to a different RMQ node in the cluster (i.e. the queue is re-established on a new node and the batch acks are sent after reconnecting). This would result in the following error in amqplib:

Error: Channel closed by server: 406 (PRECONDITION-FAILED) with message "PRECONDITION_FAILED - unknown delivery tag 42"

But from the perspective of foo-foo-mq/rabbot, everything was fine. With these changes in place, I can at least listen for a failed event on the broker and respond appropriately.
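
Roughly, the listener I have in mind looks like this (a sketch only; the failed event is what I'm listening for with this change in place, but the recovery action is just illustrative):

const rabbit = require('foo-foo-mq');

// With channel errors bubbled up to the connection, the broker reports a
// failure we can actually react to instead of silently stalling.
rabbit.on('failed', () => {
  console.error('RabbitMQ connection failed; exiting so the service restarts');
  // Recovery is up to the application: exit and let a process supervisor
  // restart the service, or re-run connection/subscription setup.
  process.exit(1);
});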

DavidWittman commented 3 years ago

I'm not very familiar with this code base, so I'd appreciate any advice you could give me toward resolving the build failures. Or, if I'm way off base and this is better suited as an issue, I'm happy to just open an issue instead.

zlintz commented 3 years ago

@DavidWittman are you able to provide steps to consistently reproduce? Are you actually losing messages?

Based on the description you provided, it sounds like the scenario in question is resulting in an error and ultimately a nack from the listener that is attempting to process the message. This would result in the message going back on the queue to be picked back up to process. Is this not the case for you?

As for tests, this proposal is causing some other tests to fail, as you can see in the details link -> https://travis-ci.org/github/Foo-Foo-MQ/foo-foo-mq/builds/710152398?utm_source=github_status&utm_medium=notification and more specifically https://travis-ci.org/github/Foo-Foo-MQ/foo-foo-mq/jobs/710152399

I would expect opened PRs to have corresponding tests, and existing tests should not be failing (or should be updated accordingly to match the new behavior).

Whether this should be an issue or a pull request comes down to whether you want to help fix it. A PR is the fastest way to get any fixes in, and an open issue without steps to reproduce will likely not get much attention.

We have some contribution guidelines; please read through https://github.com/Foo-Foo-MQ/foo-foo-mq/blob/master/HOW_TO_CONTRIBUTE.md

We have these guidelines for submitting PRs around things such as the format of the commit message (to help with auto-generation of the CHANGELOG.md file) and the requirement that tests exist and pass. In its current state the PR doesn't meet some of these guidelines, so we won't consider merging it.

DavidWittman commented 3 years ago

Are you able to provide steps to consistently reproduce?

Yes, sorry. I should have explained that better up front. Here's an overview of what happens:

RabbitMQ: Cluster of 3 RabbitMQ servers behind a Load Balancer (AWS ELB in this case)
Consumer: Subscribed to an exclusive queue (exclusive queues are never mirrored in RabbitMQ clustering)

  1. Consumer receives a message, processes it, and calls .ack(). Batch acks are enabled by default, so this doesn't immediately send the ack (see the sketch after this list).
  2. Some network event happens which causes the consumer to get temporarily disconnected. Networks are unreliable, it happens. During this time, the acks from Step 1 still have not been sent.
  3. Consumer is now connected to the queue on a different RabbitMQ node in the cluster
  4. The batch acks from Step 1 are sent. This causes a channel error in RabbitMQ because it does not recognize the delivery tag which it just received.
  5. Consumer has silently failed and no longer receives messages on the channel.
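
For reference, the consumer side of step 1 looks roughly like this (a sketch assuming the usual foo-foo-mq handle/ack API; the message type name is made up):

const rabbit = require('foo-foo-mq');

rabbit.handle('example.message', (msg) => {
  // Process the message, then ack it. With batch acks enabled (the default),
  // this only queues the ack; it is flushed to the broker later, which is how
  // a stale delivery tag can reach a different cluster node after a reconnect.
  msg.ack();
});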

RabbitMQ logs

Node A

2020-07-10 04:26:20.545 [warning] <0.27062.317> closing AMQP connection <0.27062.317> (172.29.22.181:58856 -> 172.29.19.214:5672, vhost: '/', user: 'alice'):
client unexpectedly closed TCP connection
2020-07-10 04:26:20.546 [warning] <0.22429.327> closing AMQP connection <0.22429.327> (172.29.22.181:12110 -> 172.29.19.214:5672, vhost: '/', user: 'alice'):
client unexpectedly closed TCP connection
2020-07-10 04:26:20.547 [warning] <0.30041.321> closing AMQP connection <0.30041.321> (172.29.22.181:55682 -> 172.29.19.214:5672, vhost: '/', user: 'alice'):
client unexpectedly closed TCP connection

Node B

2020-07-10 04:26:20.579 [info] <0.28750.325> accepting AMQP connection <0.28750.325> (172.29.19.198:19494 -> 172.29.21.226:5672)
2020-07-10 04:26:20.603 [info] <0.28750.325> connection <0.28750.325> (172.29.19.198:19494 -> 172.29.21.226:5672): user 'alice' authenticated and granted access to vhost '/'
2020-07-10 04:26:20.645 [info] <0.28804.325> Mirrored queue 'broker_9eefdb9fdfd50.response' in vhost '/': Adding mirror on node 'rabbit@ip-172-29-12-43.ec2.internal': <43793.22811.330>
2020-07-10 04:26:20.657 [info] <0.28804.325> Mirrored queue 'broker_9eefdb9fdfd50.response' in vhost '/': Synchronising: 0 messages to synchronise
2020-07-10 04:26:20.657 [info] <0.28804.325> Mirrored queue 'broker_9eefdb9fdfd50.response' in vhost '/': Synchronising: batch size: 4096
2020-07-10 04:26:20.660 [info] <0.28805.325> Mirrored queue 'broker_9eefdb9fdfd50.response' in vhost '/': Synchronising: all slaves already synced
2020-07-10 04:26:21.109 [error] <0.28746.325> Channel error on connection <0.28750.325> (172.29.19.198:19494 -> 172.29.21.226:5672, vhost: '/', user: 'alice'), channel 6:
operation basic.ack caused a channel exception precondition_failed: unknown delivery tag 286479

As for the "consistently reproducing" part, that's a little more difficult because it's a somewhat-complex distributed system and we're trying to mimic a network failure. However, I was able to trigger the same behavior (4 and 5 above) just by modifying foo-foo-mq to send a bad delivery tag down a channel to a single RabbitMQ instance.

Are you actually losing messages?

Losing messages? No. But the consumers effectively seize up and stop consuming from their queues when this happens.

Based on the description you provided, it sounds like the scenario in question is resulting in an error and ultimately a nack from the listener that is attempting to process the message. This would result in the message going back on the queue to be picked back up to process. Is this not the case for you?

This assumption is incorrect. My consumer has already acked the message, and it never receives an error.

zlintz commented 3 years ago

Thanks for the reply, logs, and description. I will look a little more at this as time allows and reach back out if I have further questions.

Feel free to add any additional context you think may be useful.

zlintz commented 1 year ago

@DavidWittman As you can see, time has not been on our side. I'm wondering if this is still an issue and/or if you are still using this library. I would be interested in any information you have regarding this.

If you feel this is no longer an issue, or you have found a solution, I would like to know so I can address it accordingly or close it.

DavidWittman commented 1 year ago

Still an issue, and we're still using this library. We have seen the issue in production on three distinct RabbitMQ HA architectures (including Amazon MQ).

We're currently working around the bug with a watchdog thread which restarts the services when they can no longer read from RabbitMQ. Honestly I'd like to spend more time resolving this particular issue so we can get rid of that, but it's a pain in the ass to reproduce and I've done a poor job of explaining the issue up to this point.
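
For illustration, the watchdog amounts to something like this (a rough sketch; the interval, threshold, and restart mechanism are not our exact production code):

// Updated from the message handlers whenever a message is processed.
let lastMessageAt = Date.now();

function markActivity () {
  lastMessageAt = Date.now();
}

// If nothing has been consumed for too long, assume the channel died silently
// and exit so the process supervisor restarts the service.
setInterval(() => {
  if (Date.now() - lastMessageAt > 5 * 60 * 1000) {
    console.error('No messages consumed for 5 minutes; exiting for restart');
    process.exit(1);
  }
}, 30 * 1000);

module.exports = { markActivity };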

doubliez commented 8 months ago

This is still an issue (I've been tracking this PR for a while) and seems to happen more frequently under high load.

It's a big problem for us that in this case, foo-foo-mq just stops consuming from a queue silently (when the 406 (PRECONDITION_FAILED) error happens).

I tested the simple code change you have in this PR @DavidWittman, and was able to confirm that it fixes the issue. It forces foo-foo-mq to reconnect to RabbitMQ and start consuming from the queue again. Unfortunately the error is hard to reproduce as you pointed out, and I didn't bother trying to reproduce in a dev environment. Instead, I just deployed the change to our live environment and waited for the error to happen again (it happens every couple days in one of our services that constantly receives 3k messages per second or more).

@zlintz Any chance this PR could be merged so that we don't have to use our own fork of foo-foo-mq? Do you have any concerns with this change?

zlintz commented 7 months ago

I don't think I have concerns with the change. If you/others are confident in it as well, I don't see this as an issue and I will get it added. It does need to follow conventional commits though: https://github.com/Foo-Foo-MQ/foo-foo-mq/blob/master/HOW_TO_CONTRIBUTE.md

Maybe just rebase your branch and use npm run commit. This should help.

DavidWittman commented 7 months ago

I rebased and switched to conventional commit logs, but npm run commit wasn't working for me because my changes were already staged/committed. Hopefully that's not a problem.

DavidWittman commented 5 months ago

@zlintz nudge :)

zlintz commented 5 months ago

@DavidWittman seems like this is breaking tests. Unfortunately I am going to have to revert it for the moment until it can pass the tests.

Please see https://github.com/Foo-Foo-MQ/foo-foo-mq/actions/runs/7660491629/job/20877897843 I may try to dig into why as well.

zlintz commented 5 months ago

@DavidWittman @doubliez I think I am approaching the answer. It looks like the this.handle("failed"); call is the problem. The code already handles the "failed" input as a result of the this.transition("failed"); call.

For more context: if you look here https://github.com/Foo-Foo-MQ/foo-foo-mq/blob/main/src/connectionFsm.js#L192, during the state transition _replay is called, which in turn does the this.handle.

So, relative to the tests, the problem appears to be that multiple #.connection.failed events were getting emitted when only one was necessary.
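
To illustrate the double emit (a simplified sketch only, not the actual connectionFsm.js code; the state name and emit mechanism are made up to mirror the behavior described above):

const { EventEmitter } = require('events');

class TinyFsm extends EventEmitter {
  transition (state) {
    this.state = state;
    // Entering a state replays deferred input, which already results in a
    // handle() call for that state (this is what _replay does in spirit).
    this.handle(state);
  }

  handle (input) {
    this.emit(`connection.${input}`);
  }
}

const fsm = new TinyFsm();
fsm.on('connection.failed', () => console.log('connection.failed emitted'));

// Doing both, as the original change did, makes listeners see the event twice:
fsm.transition('failed');
fsm.handle('failed'); // redundant second emit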

For anyone curious to try this out: put a .only on the mocha tests in badConnection.spec.js and addPassiveQueue.spec.js, put back the commented-out code here https://github.com/Foo-Foo-MQ/foo-foo-mq/blob/main/src/connectionFsm.js#L71-L75, and run the tests. They will fail. But if you remove the this.handle('failed'); call, the tests will pass.

What I could use from you all is confirmation and any additional context on why the this.handle('failed'); was added as part of this PR. Were you not getting the event to listen for just from the transition?