Closed DavidWittman closed 5 months ago
I'm not very familiar with this code base, so I'd appreciate any advice you could give me toward resolving the build failures. Or if I'm way off base and this is better suited as an issue I'm happy to just open an issue instead.
@DavidWittman are you able to provide steps to consistently reproduce? Are you actually losing messages?
Based on the description you provided, it sounds like the scenario in question is resulting in an error and ultimately a nack from the listener that is attempting to process the message. This would result in the message going back on the queue to be picked up and processed again. Is this not the case for you?
As for tests, this proposal is causing some other tests to fail, as you can see in the details link: https://travis-ci.org/github/Foo-Foo-MQ/foo-foo-mq/builds/710152398?utm_source=github_status&utm_medium=notification and more specifically https://travis-ci.org/github/Foo-Foo-MQ/foo-foo-mq/jobs/710152399
I would expect PRs that are opened to have corresponding tests, and existing tests should not be failing (or should be updated accordingly to match the new behavior).
As for whether this should be an issue or a pull request, that's up to whether you want to help fix it. A PR is the fastest way to get any fixes in, and an open issue without steps to reproduce will likely not get much attention.
We have some contribution guidelines; please read through https://github.com/Foo-Foo-MQ/foo-foo-mq/blob/master/HOW_TO_CONTRIBUTE.md
We have these guidelines for submitting PRs, covering things such as the commit message format (to help with auto-generation of the CHANGELOG.md file) and the requirement that tests exist and pass. In its current state the PR doesn't meet some of these guidelines, so we won't consider merging it.
Are you able to provide steps to consistently reproduce?
Yes, sorry. I should have explained that better up front. Here's an overview of what happens:

RabbitMQ: Cluster of 3 RabbitMQ servers behind a Load Balancer (AWS ELB in this case)
Consumer: Subscribed to an exclusive queue (exclusive queues are never mirrored in RabbitMQ clustering)

The consumer processes a message and calls `.ack()`. Batch acks are enabled by default, so this doesn't immediately send the ack. When the connection drops and the client reconnects, the batched ack is eventually sent down the new connection, and RabbitMQ logs the following:

```
2020-07-10 04:26:20.545 [warning] <0.27062.317> closing AMQP connection <0.27062.317> (172.29.22.181:58856 -> 172.29.19.214:5672, vhost: '/', user: 'alice'):
client unexpectedly closed TCP connection
2020-07-10 04:26:20.546 [warning] <0.22429.327> closing AMQP connection <0.22429.327> (172.29.22.181:12110 -> 172.29.19.214:5672, vhost: '/', user: 'alice'):
client unexpectedly closed TCP connection
2020-07-10 04:26:20.547 [warning] <0.30041.321> closing AMQP connection <0.30041.321> (172.29.22.181:55682 -> 172.29.19.214:5672, vhost: '/', user: 'alice'):
client unexpectedly closed TCP connection
2020-07-10 04:26:20.579 [info] <0.28750.325> accepting AMQP connection <0.28750.325> (172.29.19.198:19494 -> 172.29.21.226:5672)
2020-07-10 04:26:20.603 [info] <0.28750.325> connection <0.28750.325> (172.29.19.198:19494 -> 172.29.21.226:5672): user 'alice' authenticated and granted access to vhost '/'
2020-07-10 04:26:20.645 [info] <0.28804.325> Mirrored queue 'broker_9eefdb9fdfd50.response' in vhost '/': Adding mirror on node 'rabbit@ip-172-29-12-43.ec2.internal': <43793.22811.330>
2020-07-10 04:26:20.657 [info] <0.28804.325> Mirrored queue 'broker_9eefdb9fdfd50.response' in vhost '/': Synchronising: 0 messages to synchronise
2020-07-10 04:26:20.657 [info] <0.28804.325> Mirrored queue 'broker_9eefdb9fdfd50.response' in vhost '/': Synchronising: batch size: 4096
2020-07-10 04:26:20.660 [info] <0.28805.325> Mirrored queue 'broker_9eefdb9fdfd50.response' in vhost '/': Synchronising: all slaves already synced
2020-07-10 04:26:21.109 [error] <0.28746.325> Channel error on connection <0.28750.325> (172.29.19.198:19494 -> 172.29.21.226:5672, vhost: '/', user: 'alice'), channel 6:
operation basic.ack caused a channel exception precondition_failed: unknown delivery tag 286479
```
As for the "consistently reproducing" part, that's a little more difficult because it's a somewhat-complex distributed system and we're trying to mimic a network failure. However, I was able to trigger the same behavior (the channel exception and stalled consumer described above) just by modifying foo-foo-mq to send a bad delivery tag down a channel to a single RabbitMQ instance.
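The stale-tag failure mode can be sketched without a live broker. The `AckBatcher` and `makeChannel` helpers below are hypothetical, simplified stand-ins (not foo-foo-mq's actual code); the point is only that delivery tags are scoped to the channel that issued them, so a batcher that flushes old tags after a reconnect hits a channel exception:

```javascript
// Simplified illustration: a batcher holds acks, then flushes the highest
// delivery tag in one basic.ack. Tags from the old channel are meaningless
// on a new channel, which is what RabbitMQ rejects with 406.
class AckBatcher {
  constructor (flush) {
    this.pending = [];   // delivery tags awaiting a batched basic.ack
    this.flush = flush;  // sends an ack for the highest pending tag
  }
  ack (deliveryTag) {
    this.pending.push(deliveryTag);
  }
  flushBatch () {
    if (this.pending.length === 0) return null;
    const highest = Math.max(...this.pending);
    this.pending = [];
    return this.flush(highest);
  }
}

// Simulated broker channel: only accepts tags it actually delivered.
function makeChannel () {
  let nextTag = 0;
  const issued = new Set();
  return {
    deliver: () => { issued.add(++nextTag); return nextTag; },
    basicAck: (tag) => {
      if (!issued.has(tag)) {
        // RabbitMQ raises 406 PRECONDITION_FAILED and closes the channel
        throw new Error(`precondition_failed: unknown delivery tag ${tag}`);
      }
      return true;
    }
  };
}

let channel = makeChannel();
const batcher = new AckBatcher((tag) => channel.basicAck(tag));

batcher.ack(channel.deliver()); // tag 1, not yet flushed
channel = makeChannel();        // reconnect: fresh channel, tags reset
try {
  batcher.flushBatch();         // stale tag 1 goes to the *new* channel
} catch (err) {
  console.log(err.message);     // precondition_failed: unknown delivery tag 1
}
```

This is why the consumer never sees an error: the ack was accepted locally long before the flush failed at the broker.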
Are you actually losing messages?
Losing messages? No. But the consumers effectively seize up and stop consuming from their queues when this happens.
Based on the description you provided, it sounds like the scenario in question is resulting in an error and ultimately a nack from the listener that is attempting to process the message. This would result in the message going back on the queue to be picked back up to process. Is this not the case for you?
This assumption is incorrect. My consumer has already acked the message, and it never receives an error.
Thanks for the reply, logs, and description. I will look a little more at this as time allows and reach back out if I have further questions.
Feel free to add additional context as you see fit that may be useful
@DavidWittman As you can see time has not been on our side. Wondering if this is still an issue and/or if you are still using this library? I would be interested in any information you have regarding this.
If you feel this is no longer an issue or you have found a solution, I would like to know so I can address this accordingly or close it.
Still an issue, and we're still using this library. We have seen the issue in production on three distinct RabbitMQ HA architectures (including Amazon MQ).
We're currently working around the bug with a watchdog thread which restarts the services when they can no longer read from RabbitMQ. Honestly I'd like to spend more time resolving this particular issue so we can get rid of that, but it's a pain in the ass to reproduce and I've done a poor job of explaining the issue up to this point.
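A minimal sketch of that kind of watchdog (hypothetical `ConsumerWatchdog`, assuming the consumer handler calls `onMessage()` on every delivery and a supervisor such as systemd respawns the process on exit):

```javascript
// Hypothetical watchdog: restart the process if no message has been
// consumed within a stall window, since the underlying channel may be
// silently dead. Not foo-foo-mq code, just the shape of the workaround.
class ConsumerWatchdog {
  constructor ({ stallMs = 60000, restart = () => process.exit(1) } = {}) {
    this.stallMs = stallMs;
    this.restart = restart;
    this.lastActivity = Date.now();
  }
  onMessage () {
    // Call from the message handler to mark the consumer as alive.
    this.lastActivity = Date.now();
  }
  isStalled (now = Date.now()) {
    return now - this.lastActivity > this.stallMs;
  }
  start (intervalMs = 5000) {
    this.timer = setInterval(() => {
      if (this.isStalled()) this.restart();
    }, intervalMs);
    this.timer.unref(); // don't keep the process alive just for the check
  }
  stop () {
    clearInterval(this.timer);
  }
}
```

The obvious drawback (and why a real fix in the library is preferable) is that a legitimately idle queue is indistinguishable from a wedged one without extra heartbeat traffic.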
This is still an issue (I've been tracking this PR for a while) and seems to happen more frequently under high load.
It's a big problem for us that in this case foo-foo-mq just silently stops consuming from a queue (when the 406 (`PRECONDITION_FAILED`) error happens).
I tested the simple code change you have in this PR @DavidWittman, and was able to confirm that it fixes the issue. It forces foo-foo-mq to reconnect to RabbitMQ and start consuming from the queue again. Unfortunately the error is hard to reproduce as you pointed out, and I didn't bother trying to reproduce in a dev environment. Instead, I just deployed the change to our live environment and waited for the error to happen again (it happens every couple days in one of our services that constantly receives 3k messages per second or more).
@zlintz Any chance this PR could be merged so that we don't have to use our own fork of foo-foo-mq? Do you have any concerns with this change?
I don't think I have concerns about the change itself. If you/others are confident in it as well, I don't see a problem and I will get it added. It does need to follow conventional commits, though: https://github.com/Foo-Foo-MQ/foo-foo-mq/blob/master/HOW_TO_CONTRIBUTE.md
Maybe just rebase your branch and use `npm run commit`. This should help.
I rebased and switched to conventional commit logs, but `npm run commit` wasn't working for me because my changes were already staged/committed. Hopefully that's not a problem.
@zlintz nudge :)
@DavidWittman it seems like this is breaking tests. Unfortunately I am going to have to revert it for the moment until it can pass the tests.
Please see https://github.com/Foo-Foo-MQ/foo-foo-mq/actions/runs/7660491629/job/20877897843 I may try to dig into why as well.
@DavidWittman @doubliez I think I am approaching the answer. It looks like the `this.handle('failed');` is the problem. The code will already handle the "failed" input as a result of the `this.transition('failed');`
For more context: if you look here https://github.com/Foo-Foo-MQ/foo-foo-mq/blob/main/src/connectionFsm.js#L192, `_replay` is called during the state transition, which in turn calls `this.handle`. So what appears to be the problem, relative to the tests, is that multiple `#.connection.failed` events were getting emitted when only one was necessary.
For anyone curious to try this out: put a `.only` on the mocha tests in badConnection.spec.js and addPassiveQueue.spec.js, then put back in the commented-out code here https://github.com/Foo-Foo-MQ/foo-foo-mq/blob/main/src/connectionFsm.js#L71-L75 and run the tests. They will fail. But if you remove the `this.handle('failed');`, the tests will pass.
What I could use from you all is confirmation and any additional context on why the `this.handle('failed');` was added as part of this PR. Were you not getting the event to listen to just off of the transition?
Send channel failures up to the connection so that they can be handled appropriately. Previously, errors on channels caused silent failures which could result in queues being unable to receive messages without any indication of failure, such as described in https://github.com/arobson/rabbot/issues/155.
This approach is a bit naive in that it causes the whole connection to fail when a single channel fails, but I don't think the abstractions in this library allow the user to respond to channel-level events.
In my case, this type of failure occurred any time my consumers lost their connection to the RabbitMQ cluster and tried to send an ack for an old delivery tag to a different RMQ node in the cluster (i.e. the queue is re-established on a new node and batch acks are sent after reconnecting). This would result in the `precondition_failed: unknown delivery tag` channel exception from amqplib shown in the logs above.
But from the perspective of foo-foo-mq/rabbot, everything was fine. With these changes in place, I can at least listen for a `failed` event on the broker and respond appropriately.