Particular / NServiceBus

Build, version, and monitor better microservices with the most powerful service platform for .NET
https://particular.net/nservicebus/

Outbox re-dispatches messages already marked as dispatched #3070

Closed bmihart closed 8 years ago

bmihart commented 8 years ago

Symptoms

If a duplicate message arrives before the outbox persister has cleaned up the outgoing operations of dispatched outbox messages, all of those operations will be re-dispatched when the incoming message is deduplicated. This forces downstream endpoints to do unnecessary deduplication and also causes ripple effects, since those endpoints will most likely generate duplicates as well.
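The flow described above can be sketched as follows. The class and method names are hypothetical illustrations, not the actual NServiceBus persister code; the `dispatched` check marks where the fix applies:

```python
# Illustrative sketch of the outbox deduplication flow described above.
# All names here are hypothetical, not the actual NServiceBus internals.

class OutboxRecord:
    def __init__(self, operations):
        self.operations = operations   # outgoing messages captured while handling
        self.dispatched = False        # set once the transport acks the sends


class Outbox:
    def __init__(self):
        self.records = {}              # message_id -> OutboxRecord

    def process(self, message_id, handler, dispatch):
        record = self.records.get(message_id)
        if record is None:
            # First delivery: run the handler and capture its outgoing operations.
            record = OutboxRecord(handler())
            self.records[message_id] = record
        # The reported bug: without the check below, a duplicate arriving
        # before cleanup re-dispatches every captured operation downstream.
        if not record.dispatched:
            for op in record.operations:
                dispatch(op)
            record.dispatched = True
```

With the `dispatched` check in place, a duplicate arriving before cleanup neither invokes the handler nor re-sends the captured operations.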

Who's affected

All users of the Outbox feature who have duplicates arriving before the cleanup task has been executed.

Original bug report

Hi,

In our applications we are using NServiceBus 5 with the RabbitMQ transport, and NHibernate with SQL Server for persistence. We also have the Outbox feature enabled. While testing our applications we found the following behavior: if a message arrives a second time, the Outbox will do the following:

  1. The message will be de-duplicated and the handler for that message will not be invoked
  2. All the messages that were emitted as part of the original message processing will be re-emitted, even if the handler was not invoked.

We would like to know if this is the intended behavior for de-duplication, as we would like to use it as part of the system's auto-recovery mechanism, but we need to know whether this kind of behavior will be supported in future versions of NServiceBus and the Outbox.

Thank you, Bogdan Mihart

andreasohlund commented 8 years ago

@bmihart thanks for raising this. Point 2 is a flaw that we actually found last week, and we're going to adjust the outbox API in v6 to allow the core to know whether a message has already been dispatched properly.

In the meantime we have two hotfixes being prepared for the persisters that will solve this problem without needing a breaking change.

https://github.com/Particular/NServiceBus.NHibernate/commit/e30acd748a7df1b60051fe913535a4000b0cf3cc

Would you be able to test the NHibernate patch to verify that it solves the issue you raised?

If so, you can find the pre-release NServiceBus.NHibernate NuGet package here:

http://builds.particular.net/viewLog.html?buildId=133040&buildTypeId=NServiceBusNHibernate_Build&tab=artifacts#!

andreasohlund commented 8 years ago

@SzymonPobiega please review the description

bmihart commented 8 years ago

@andreasohlund, Thank you for your answer. We were thinking of using the fact that it re-emits messages as part of our recovery mechanism, so we think it is a useful feature. Perhaps you could add a configuration toggle and allow users to decide the re-emitting behaviour.

Thank you, Bogdan Mihart

andreasohlund commented 8 years ago

We were thinking of using the fact that it re-emits messages as part of our recovery mechanism, so we think it is a useful feature. Perhaps you could add a configuration toggle and allow users to decide the re-emitting behaviour.

I don't see a big benefit, since we don't mark the message as dispatched until the transport has "acked" that the messages are "safe" within the queuing system. What scenarios would you envision where a message could be lost from that point on?
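As an illustration of that ordering (hypothetical names, not the actual transport code): the outbox record is marked dispatched only after every send has been acked, so a failure before the ack leads to a re-dispatch on the next delivery, not a lost message.

```python
# Sketch of the dispatch/ack ordering described above (illustrative only):
# the record is marked dispatched only after the transport has acked every
# send, so a crash in between causes a re-dispatch, never a loss.

class Record:
    def __init__(self, operations):
        self.operations = operations
        self.dispatched = False


def try_dispatch(record, send):
    """Send pending operations; mark dispatched only after all sends are acked."""
    if record.dispatched:
        return []                  # duplicate delivery: nothing left to send
    for op in record.operations:
        send(op)                   # raises if the broker never acks
    record.dispatched = True       # reached only after every ack succeeded
    return list(record.operations)
```

A send that fails before the ack leaves `dispatched` false, so the next delivery attempt sends the operations again.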

bmihart commented 8 years ago

@andreasohlund,

Our workflow is divided between multiple NServiceBus endpoints, and we also need to provide resilience across multiple availability zones. To achieve that we have two RabbitMQ broker instances, one in each availability zone. Our NServiceBus endpoints are duplicated across the two availability zones, with each instance pointing to the broker in its zone. All the endpoints, no matter which availability zone they are located in, share the same SQL Server database. So in the happy scenario, a workflow that starts in one availability zone will end in the same availability zone.

But if, during the workflow, one RabbitMQ broker suffers permanent damage and we lose all the messages stored on that broker, we end up with partially finished workflows. Our plan is to store the messages that start the workflow in a separate database. Once the unhappy scenario described above happens, we can re-emit the messages that start the workflow, using the NServiceBus endpoints and RabbitMQ broker in the other availability zone. Since all NServiceBus endpoints share the same database, the re-emit behaviour assures us that the handlers will not be invoked again, but the relevant messages will be re-emitted and travel through the other NServiceBus endpoints until the workflow is complete.

andreasohlund commented 8 years ago

@bmihart I don't see us adding back the "re-emitting" of already dispatched messages, since the problem you describe seems better solved at the infrastructure level.

I'm closing this since the re-dispatch bug mentioned has been fixed in the persisters. Please reopen if you feel otherwise.