Particular / NServiceBus.MessagingBridge

Other
4 stars 1 forks source link

Add MessageHeader for FailedQ when message is shoveled into bridge.error queue #212

Closed Floopy-Doo closed 1 year ago

Floopy-Doo commented 1 year ago

We are currently working on a system to recover messages from the error queue of a bridge transport and retry them, when the error is fixed. Currently there is no header indicating in which queue the original message resided before it was moved to the bridge error queue. (The normal error queue of NSB normally adds a FailedQ header to the message).

Our feature request would be to add a FailedQ (or similar) header to messages put into the transport error queue.

Pro

Con

kbaley commented 1 year ago

Hi @Floopy-Doo. According to the documentation (and I admit I haven't verified yet myself), messages that are moved to the bridge error queue should have the FailedQ header set to allow scripted retries. Is this not happening in your system?

Floopy-Doo commented 1 year ago

Yes, we are missing this header and therefore maybe more. We are using the bridge to close the gap between MSMQ and ASB.

Packages: NServiceBus 8.0.3 NServiceBus.Transport.AzureServiceBus 3.1.2 NServiceBus.Transport.Bridge 1.0.1 NServiceBus.Transport.Msmq 2.0.0

Header we have on the messages: (Message from ASB side) "NServiceBus.Transport.Encoding" "NServiceBus.MessageId" "NServiceBus.MessageIntent" "NServiceBus.ConversationId" "NServiceBus.CorrelationId" "NServiceBus.OriginatingMachine" "NServiceBus.OriginatingEndpoint" "$.diagnostics.originating.hostid" "NServiceBus.ContentType" "NServiceBus.EnclosedMessageTypes" "NServiceBus.Version" "NServiceBus.TimeSent" "NServiceBus.Transport.Bridge.ExceptionInfo.ExceptionType" "NServiceBus.ExceptionInfo.InnerExceptionType" "NServiceBus.Transport.Bridge.ExceptionInfo.HelpLink" "NServiceBus.Transport.Bridge.ExceptionInfo.Message" "NServiceBus.Transport.Bridge.ExceptionInfo.Source" "NServiceBus.Transport.Bridge.ExceptionInfo.StackTrace" "NServiceBus.Transport.Bridge.TimeOfFailure"

According to nuget we are up to date (except Nsb.T.Asb by a patch).

Do we need to configure something except the ErrorQueue?

dvdstelt commented 1 year ago

If a message is moved to the error queue, the endpoint on which the error occured will add its own queue to the FailedQ header. As a result, once the error is solved, the message can be send back to that queue, so it can be retried by the endpoint. So, for example, if Sales receives the message in the sales queue and NServiceBus decides to send it to the error queue, it will add sales to the FailedQ header.

It's possible that Sales is on Azure Service Bus (ASB) and the error queue is on RabbitMQ, for example. The bridge picks up the message from ASB and shovels it to RabbitMQ and puts it in the error queue on RabbitMQ. You can see in this line of code the FailedQ header is even transformed on messages that are send to the error queue.

Imagine the Bridge fails at it and moves the message to the bridge.error queue. If it overwrites the FailedQ header, the message can never be send back and retried.

Now, if I recall correctly, messages that end up in the bridge.error queue should always be transferred back to the queue where the Bridge picked it up. And that's what your problem currently seems to be, as there's no automated way to figure out which queue that is?

Floopy-Doo commented 1 year ago

@dvdstelt Yes that's exactly the problem. We could parse the "NServiceBus.Transport.Bridge.ExceptionInfo.Message" because the content provides a mentioning of the queue which it failed from, but that seems a bit flimsy to relay on the error message not changing and text parsing in general.

I would envision to have a additional header like "NServiceBus.Transport.Bridge.FailedQ" that would not override the normal NServiceBus.FailedQ header.

kbaley commented 1 year ago

This sounds like a good addition. We do regular enhancements to the platform and will make sure this is added to the pool. I can't say for sure when it'll be addressed but don't be afraid to comment here to ask for updates. Some of us really enjoy seeing the words "Floopy-Doo" come through our queue.

ThreePinkApples commented 1 year ago

This feature becomes more critical when handling messages put on bridge.error due to the service crashing. The exception message does not contain the queue in these cases, just an IServiceProvider error. We're left having to manually figure out where each message is supposed to be.

MarcWils commented 1 year ago

Any update on this issue? Or any idea when this will be addressed?

kbaley commented 1 year ago

@MarcWils Still no set date but by asking about it, you've implicitly bumped the priority when we look for ways to improve the platform.

ramonsmits commented 1 year ago

@Floopy-Doo @MarcWils @ThreePinkApples Could you share the reasons for messages to be forwarded to the error queue? This information can help us improve the messaging bridge.

Floopy-Doo commented 1 year ago

@ramonsmits We had so far the following reasons:

We are using MSMQ and AzureServiceBus.

ThreePinkApples commented 1 year ago

@ramonsmits For us it's mainly due to our "BridgeService" crashing during startup and the message that are currently being processed is then dumped into the error queue. This can also happen during a regular restart/redeploy of the service where we do not have any crashes. Configuration issues do happen but are mostly prevented through unit tests.

The crashes are mostly caused by connection timeouts towards RabbitMQ, and we haven't found any solution to that issue. We have 39 MSMQ endpoints and 25 RabbitMQ endpoints being bridged and almost every time we redeploy/restart our "BridgeService" it needs several attempts before it is able to set up all RabbitMQ connections without any of the connections timing out. We experimented with increasing RabbitMQ resources and timeout settings but it hasn't helped us, and we've shifted our focus to moving more and more over to the RabbitMQ side so we can reduce the number of endpoints being bridged.

dvdstelt commented 1 year ago

The crashes are mostly caused by connection timeouts towards RabbitMQ, and we haven't found any solution to that issue.

Do you have the same timeout issues with regular endpoints? Not just the Bridge?

ramonsmits commented 1 year ago

@ThreePinkApples @Floopy-Doo If you have any details when/why the bridge fails at (re)start that would be helpful.

If you have many endpoints that need to be bridged you could consider hosting multiple bridges with a reduced number of endpoints.

ramonsmits commented 1 year ago

@ThreePinkApples @Floopy-Doo A shutdown of the messages bridge should not result in any messages being moved to the error queue.

ThreePinkApples commented 1 year ago

The crashes are mostly caused by connection timeouts towards RabbitMQ, and we haven't found any solution to that issue.

Do you have the same timeout issues with regular endpoints? Not just the Bridge?

No, but I guess the Bridge is unique since it is one service setting up so many connections at the same time

@ThreePinkApples @Floopy-Doo If you have any details when/why the bridge fails at (re)start that would be helpful.

If you have many endpoints that need to be bridged you could consider hosting multiple bridges with a reduced number of endpoints.

We did look into splitting the service up into smaller chunks, the problem is that there is so much "crosstalk" between services that there were few endpoints that could be isolated into its own Bridge service. We have a few cases where there is one MSMQ service that only communicates with RabbitMQ services, so we're working on moving these over to RabbitMQ first, that should at least significantly reduce the number of messages that end up on the error queue

Floopy-Doo commented 1 year ago

@ThreePinkApples @Floopy-Doo A shutdown of the messages bridge should not result in any messages being moved to the error queue.

We get the following error:

System.Exception: Failed to shovel message for endpoint somequeue with id 7b13bb61-fc3b-4c09-b7fb-6ff7e2e4b7f3\452901742 from msmq to azureservicebus ---> System.ObjectDisposedException: ServiceBusSender has already been closed and cannot perform the requested operation.
Object name: 'ServiceBusSender'.
   at Azure.Core.Argument.AssertNotDisposed(Boolean wasDisposed, String targetName)
   at Azure.Messaging.ServiceBus.ServiceBusSender.<CreateMessageBatchAsync>d__30.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at NServiceBus.Transport.AzureServiceBus.MessageDispatcher.<DispatchBatchForDestination>d__7.MoveNext() in /_/src/Transport/Sending/MessageDispatcher.cs:line 143
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at NServiceBus.Transport.AzureServiceBus.MessageDispatcher.<Dispatch>d__5.MoveNext() in /_/src/Transport/Sending/MessageDispatcher.cs:line 97
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at MessageShovel.<TransferMessage>d__1.MoveNext() in /_/src/NServiceBus.Transport.Bridge/MessageShovel.cs:line 65
   --- End of inner exception stack trace ---
   at MessageShovel.<TransferMessage>d__1.MoveNext() in /_/src/NServiceBus.Transport.Bridge/MessageShovel.cs:line 73
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at NServiceBus.Transport.Msmq.ReceiveStrategy.<TryProcessMessage>d__7.MoveNext() in /_/src/NServiceBus.Transport.Msmq/ReceiveStrategy.cs:line 112
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at NServiceBus.Transport.Msmq.ReceiveOnlyNativeTransactionStrategy.<ProcessMessage>d__2.MoveNext() in /_/src/NServiceBus.Transport.Msmq/ReceiveOnlyNativeTransactionStrategy.cs:line 93
ramonsmits commented 1 year ago

We get the following error:

@Floopy-Doo Does that happen immediately when shutting down? I would not expect code to throw an ObjectDisposedException during a regular graceful shutdown. Only after the endpoint shutdown will timeout such exceptions could occur because the code would then be in an undefined state.

By the way, this could just be an issue with https://github.com/Particular/NServiceBus.Transport.AzureServiceBus and not the bridge.

Do you also see this for regular endpoints in your system?

Floopy-Doo commented 1 year ago

@ramonsmits

We get the following error:

@Floopy-Doo Does that happen immediately when shutting down? I would not expect code to throw an ObjectDisposedException during a regular graceful shutdown. Only after the endpoint shutdown will timeout such exceptions could occur because the code would then be in an undefined state.

By the way, this could just be an issue with https://github.com/Particular/NServiceBus.Transport.AzureServiceBus and not the bridge.

We have this issue when we stop the service to do maintenance work (so a graceful shutdown) while its queues are not completely empty (that a majority of the time).

Do you also see this for regular endpoints in your system?

We haven't had that issues with any other endpoints as of now, just the bridge. But we have only a small portion of our services migrated to ASB.

dvdstelt commented 1 year ago

The newly added header will be called NServiceBus.MessagingBridge.FailedQ and released with version 2.0.1 soon.

dvdstelt commented 1 year ago

@Floopy-Doo @MarcWils @ThreePinkApples We just released version 2.0.1 that adds a header to the queue. Documentation can be found here. We'll announce it shortly on https://discuss.particular.net/ Let us know if this works better for you, even though it's not optimal as we don't provide the tools to move messages back. But it's on our list of things to do, although we might have some discussions what the best way is. For example, if people move from TransportA to TransportB and ServiceControl is running on TransportB, there's no way to use ServicePulse to retry those messages. So we might need something differently, but we're not sure what it'd be. Yet.