Particular / ServiceControl

Backend for ServiceInsight and ServicePulse
https://docs.particular.net/servicecontrol/

Audit message ingestion fails and doesn't forward messages to audit log queue (version 5.2.0) #4185

Open · Jokelab opened this issue 1 month ago

Jokelab commented 1 month ago

Describe the bug

This bug is caused by https://github.com/Particular/NServiceBus.Transport.AzureServiceBus/issues/994.

Description

After upgrading from ServiceControl 5.0.4 to 5.2.0, audit message ingestion fails and messages are not forwarded to the configured audit log queue.

Expected behavior

Audit messages are ingested and forwarded to the configured audit log queue.

Actual behavior

Audit message ingestion fails and messages stay in the audit queue; they are not forwarded to the configured audit log queue.

Versions

ServiceControl 5.2.0
Transport: Azure Service Bus (Premium). I mention this explicitly because our system sometimes needs to process messages slightly bigger than 1 MB.

Our endpoints are Azure Functions using the following versions:
NServiceBus 9.0.0
NServiceBus.Transport.AzureServiceBus 4.0.0
NServiceBus.AzureFunctions.Worker.ServiceBus 5.0.0

Steps to reproduce

Expected: messages are ingested from the audit queue and forwarded to the audit log queue.
Actual: messages stay in the audit queue and are not forwarded to the audit log queue.

Relevant log output

In the audit instance log:
2024-05-21 08:03:46.8118|16|Fatal|ServiceControl.Audit.Auditing.AuditIngestion|OnCriticalError. 'Failed to import too many times'|System.InvalidOperationException: The link 'G6:14740510:amqps://**namespace removed**.servicebus.windows.net/-dd910ad4;0:7:8' is force detached by the broker because publisher(link3082) received a batch message with no data in it. Detach origin: Publisher.
For troubleshooting information, see https://aka.ms/azsdk/net/servicebus/exceptions/troubleshoot.
   at Azure.Messaging.ServiceBus.Amqp.AmqpSender.SendBatchInternalAsync(AmqpMessage batchMessage, TimeSpan timeout, CancellationToken cancellationToken)
   at Azure.Messaging.ServiceBus.Amqp.AmqpSender.SendBatchInternalAsync(AmqpMessage batchMessage, TimeSpan timeout, CancellationToken cancellationToken)
   at Azure.Messaging.ServiceBus.Amqp.AmqpSender.SendBatchInternalAsync(IReadOnlyCollection`1 messages, TimeSpan timeout, CancellationToken cancellationToken)
   at Azure.Messaging.ServiceBus.Amqp.AmqpSender.<>c.<<SendBatchAsync>b__21_0>d.MoveNext()
--- End of stack trace from previous location ---
   at Azure.Messaging.ServiceBus.ServiceBusRetryPolicy.<>c__22`1.<<RunOperation>b__22_0>d.MoveNext()
--- End of stack trace from previous location ---
   at Azure.Messaging.ServiceBus.ServiceBusRetryPolicy.RunOperation[T1,TResult](Func`4 operation, T1 t1, TransportConnectionScope scope, CancellationToken cancellationToken, Boolean logTimeoutRetriesAsVerbose)
   at Azure.Messaging.ServiceBus.ServiceBusRetryPolicy.RunOperation[T1,TResult](Func`4 operation, T1 t1, TransportConnectionScope scope, CancellationToken cancellationToken, Boolean logTimeoutRetriesAsVerbose)
   at Azure.Messaging.ServiceBus.ServiceBusRetryPolicy.RunOperation[T1](Func`4 operation, T1 t1, TransportConnectionScope scope, CancellationToken cancellationToken)
   at Azure.Messaging.ServiceBus.Amqp.AmqpSender.SendBatchAsync(ServiceBusMessageBatch messageBatch, CancellationToken cancellationToken)
   at Azure.Messaging.ServiceBus.ServiceBusSender.SendMessagesAsync(ServiceBusMessageBatch messageBatch, CancellationToken cancellationToken)
   at NServiceBus.Transport.AzureServiceBus.MessageDispatcher.DispatchBatchForDestination(String destination, ServiceBusClient client, Transaction transaction, Queue`1 messagesToSend, CancellationToken cancellationToken) in /_/src/Transport/Sending/MessageDispatcher.cs:line 198
   at NServiceBus.Transport.AzureServiceBus.MessageDispatcher.Dispatch(TransportOperations outgoingMessages, TransportTransaction transaction, CancellationToken cancellationToken) in /_/src/Transport/Sending/MessageDispatcher.cs:line 101
   at ServiceControl.Audit.Auditing.AuditIngestor.Ingest(List`1 contexts) in /_/src/ServiceControl.Audit/Auditing/AuditIngestor.cs:line 72
   at ServiceControl.Audit.Auditing.AuditIngestion.Loop() in /_/src/ServiceControl.Audit/Auditing/AuditIngestion.cs:line 218
   at ServiceControl.Audit.Auditing.AuditIngestion.OnMessage(MessageContext messageContext, CancellationToken cancellationToken) in /_/src/ServiceControl.Audit/Auditing/AuditIngestion.cs:line 197
   at NServiceBus.Transport.AzureServiceBus.MessagePump.ProcessMessage(ServiceBusReceivedMessage message, ProcessMessageEventArgs processMessageEventArgs, String messageId, Dictionary`2 headers, BinaryData body, CancellationToken messageProcessingCancellationToken) in /_/src/Transport/Receiving/MessagePump.cs:line 285

Sample of an error message in the FailedImports folder:
Exception:
The link 'G6:14740510:amqps://**namespace removed**.servicebus.windows.net/-dd910ad4;0:7:8' is force detached by the broker because publisher(link3082) received a batch message with no data in it. Detach origin: Publisher.
For troubleshooting information, see https://aka.ms/azsdk/net/servicebus/exceptions/troubleshoot.
StackTrace:
   at Azure.Messaging.ServiceBus.Amqp.AmqpSender.SendBatchInternalAsync(AmqpMessage batchMessage, TimeSpan timeout, CancellationToken cancellationToken)
   at Azure.Messaging.ServiceBus.Amqp.AmqpSender.SendBatchInternalAsync(AmqpMessage batchMessage, TimeSpan timeout, CancellationToken cancellationToken)
   at Azure.Messaging.ServiceBus.Amqp.AmqpSender.SendBatchInternalAsync(IReadOnlyCollection`1 messages, TimeSpan timeout, CancellationToken cancellationToken)
   at Azure.Messaging.ServiceBus.Amqp.AmqpSender.<>c.<<SendBatchAsync>b__21_0>d.MoveNext()
--- End of stack trace from previous location ---
   at Azure.Messaging.ServiceBus.ServiceBusRetryPolicy.<>c__22`1.<<RunOperation>b__22_0>d.MoveNext()
--- End of stack trace from previous location ---
   at Azure.Messaging.ServiceBus.ServiceBusRetryPolicy.RunOperation[T1,TResult](Func`4 operation, T1 t1, TransportConnectionScope scope, CancellationToken cancellationToken, Boolean logTimeoutRetriesAsVerbose)
   at Azure.Messaging.ServiceBus.ServiceBusRetryPolicy.RunOperation[T1,TResult](Func`4 operation, T1 t1, TransportConnectionScope scope, CancellationToken cancellationToken, Boolean logTimeoutRetriesAsVerbose)
   at Azure.Messaging.ServiceBus.ServiceBusRetryPolicy.RunOperation[T1](Func`4 operation, T1 t1, TransportConnectionScope scope, CancellationToken cancellationToken)
   at Azure.Messaging.ServiceBus.Amqp.AmqpSender.SendBatchAsync(ServiceBusMessageBatch messageBatch, CancellationToken cancellationToken)
   at Azure.Messaging.ServiceBus.ServiceBusSender.SendMessagesAsync(ServiceBusMessageBatch messageBatch, CancellationToken cancellationToken)
   at NServiceBus.Transport.AzureServiceBus.MessageDispatcher.DispatchBatchForDestination(String destination, ServiceBusClient client, Transaction transaction, Queue`1 messagesToSend, CancellationToken cancellationToken) in /_/src/Transport/Sending/MessageDispatcher.cs:line 198
   at NServiceBus.Transport.AzureServiceBus.MessageDispatcher.Dispatch(TransportOperations outgoingMessages, TransportTransaction transaction, CancellationToken cancellationToken) in /_/src/Transport/Sending/MessageDispatcher.cs:line 101
   at ServiceControl.Audit.Auditing.AuditIngestor.Ingest(List`1 contexts) in /_/src/ServiceControl.Audit/Auditing/AuditIngestor.cs:line 72
   at ServiceControl.Audit.Auditing.AuditIngestion.Loop() in /_/src/ServiceControl.Audit/Auditing/AuditIngestion.cs:line 218
   at ServiceControl.Audit.Auditing.AuditIngestion.OnMessage(MessageContext messageContext, CancellationToken cancellationToken) in /_/src/ServiceControl.Audit/Auditing/AuditIngestion.cs:line 197
   at NServiceBus.Transport.AzureServiceBus.MessagePump.ProcessMessage(ServiceBusReceivedMessage message, ProcessMessageEventArgs processMessageEventArgs, String messageId, Dictionary`2 headers, BinaryData body, CancellationToken messageProcessingCancellationToken) in /_/src/Transport/Receiving/MessagePump.cs:line 285
Source:
Azure.Messaging.ServiceBus
TargetSite:
Void MoveNext()

Additional Information

Workarounds

Reinstalling everything with ServiceControl version 5.0.4 resolved the issue.

Additional information

I tried re-importing the messages as suggested in the documentation (https://docs.particular.net/servicecontrol/import-failed-messages), but this keeps triggering the same exception. To make sure it is related to this version and that the issue doesn't occur in 5.0.4, I installed 5.0.4 again and verified that everything was OK. Then I upgraded to 5.2.0 again and the issue reappeared.

adamralph commented 1 month ago

Hi @Jokelab, thanks for raising this. We'll try to reproduce the issue and then post an update here.

adamralph commented 1 month ago

@Jokelab I followed your repro steps but I couldn't reproduce the problem. After the upgrade to 5.2.0, messages were ingested from the audit queue and forwarded to the audit log queue, as expected.

FYI I made each message approximately 2MB in size.

Is there anything missing from your repro steps?

Jokelab commented 1 month ago

@adamralph Thank you very much for taking the time to try to reproduce the issue. I must be overlooking something. I tried to reproduce the issue myself today with a fresh Azure Service Bus namespace and a new Function App, and the issue didn't occur this time.

Therefore I tried to update our actual instance again, hoping it was just a fluke, but the issue reappeared :(. I noticed that not all messages fail to be forwarded, just a few of them, but I cannot find anything specific about the failing messages when I investigate them in the RavenDB database. They have different types and sizes. I will keep looking for differences between my repro steps and the actual situation and come back here once I figure it out.

Jokelab commented 1 month ago

So I investigated the issue a little more and have found a way to consistently reproduce it. It turns out that it's not related to the upgrade; it also occurs with a fresh install of ServiceControl 5.2.0.

To reproduce the issue, I slightly modified the ASBWorker sample code for Azure Functions (isolated worker) on your website. The code is attached to this comment. The HttpSender function now takes the following query parameters: followup, sizekb, and immediate.
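
A rough sketch of a handler consistent with those parameters follows (the actual attached code is not shown here; the message types TriggerMessage and FollowupMessage, their property names, and the payload construction are illustrative assumptions):

```csharp
using System.Threading.Tasks;
using NServiceBus;

// Illustrative handler: sends `followup` messages of roughly `sizekb` KB each.
// With immediate=false the sends are batched and dispatched together when the
// handler completes; with immediate=true each send is dispatched right away
// via SendOptions.RequireImmediateDispatch().
public class TriggerMessageHandler : IHandleMessages<TriggerMessage>
{
    public async Task Handle(TriggerMessage message, IMessageHandlerContext context)
    {
        for (var i = 0; i < message.Followup; i++)
        {
            var options = new SendOptions();
            options.RouteToThisEndpoint();

            if (message.Immediate)
            {
                options.RequireImmediateDispatch(); // bypass batched dispatch
            }

            await context.Send(
                new FollowupMessage { Payload = new string('x', message.SizeKb * 1024) }, // ~sizekb KB of data
                options);
        }
    }
}

public class TriggerMessage : ICommand
{
    public int Followup { get; set; }
    public int SizeKb { get; set; }
    public bool Immediate { get; set; }
}

public class FollowupMessage : ICommand
{
    public string Payload { get; set; }
}
```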

Steps to reproduce

Another simple way to get the same error message, but now on the endpoint side:

This works fine: /api/HttpSender?followup=10&sizekb=100&immediate=false

It also works fine when sent immediately: /api/HttpSender?followup=10&sizekb=100&immediate=true

But when I increase the message size to 110 KB, I get the error message: /api/HttpSender?followup=10&sizekb=110&immediate=true

I hope this helps to investigate the issue. Best regards!

adamralph commented 1 month ago

@Jokelab thanks for the detailed repro steps. I've managed to reproduce this now in the SC Audit instance. I'll continue to investigate and I'll comment back here when I have more info.


BTW, this alternative repro didn't work for me:

But when I increase the message size to 110 KB, I get the error message: /api/HttpSender?followup=10&sizekb=110&immediate=true

The messages were both sent and received without any errors.

Jokelab commented 1 month ago

Thank you for looking into this!

BTW, this alternative repro didn't work for me:

But when I increase the message size to 110 KB, I get the error message: /api/HttpSender?followup=10&sizekb=110&immediate=true

The messages were both sent and received without any errors.

I think this depends on the batch size. For me, 110 was enough to trigger the exception. Maybe increasing this to a larger number (e.g. 500) triggers it for you as well.

adamralph commented 1 month ago

@Jokelab I found the problem with the alternative repro: immediate has to be set to false, not true.

I.e. this reproduces the problem:

/api/HttpSender?followup=10&sizekb=110&immediate=false

adamralph commented 1 month ago

@Jokelab we've found the source of the problem.

By default, outgoing messages from a message handler are sent in batches, after the handler exits (that can be suppressed with SendOptions.RequireImmediateDispatch as your repro code demonstrates).

We've discovered that, if the size of the entire batch is more than the allowed size per message (which we are setting to 102,400 KB in this example), the Azure SDK client throws this rather misleading exception.
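
For context, here is a minimal sketch of the Azure.Messaging.ServiceBus batching API that the transport builds on, to make the batch-versus-per-message size distinction concrete. The connection string, queue name, message count, and sizes are placeholders; this is not the transport's actual dispatcher code:

```csharp
using System;
using Azure.Messaging.ServiceBus;

// Messages are accumulated into a ServiceBusMessageBatch and the whole batch
// is sent to the broker as a single AMQP batch message.
await using var client = new ServiceBusClient("<connection-string>");
await using ServiceBusSender sender = client.CreateSender("<audit-log-queue>");

using ServiceBusMessageBatch batch = await sender.CreateMessageBatchAsync();

for (var i = 0; i < 10; i++)
{
    var message = new ServiceBusMessage(new BinaryData(new byte[110 * 1024])); // ~110 KB each

    // TryAddMessage returns false when the message would not fit within the
    // batch's size limit; a dispatcher has to handle that case (for example by
    // flushing the current batch and starting a new one).
    if (!batch.TryAddMessage(message))
    {
        throw new InvalidOperationException($"Message {i} does not fit in the current batch.");
    }
}

// Per this thread: when the combined batch ends up larger than the per-message
// size the broker accepts, the failure surfaces as the misleading
// "force detached ... batch message with no data in it" error quoted in the
// logs above, rather than a clear "message too large" error.
await sender.SendMessagesAsync(batch);
```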

This manifests as a bug in our Azure Service Bus Transport, which in turn manifests as a bug in ServiceControl, because the transport is used internally in ServiceControl.

We will follow up with the Azure SDK team to see what can be done about this and get back to you.

I'm not sure if there's another workaround available in the meantime other than sticking with an older version of ServiceControl, but if we do identify one then we'll let you know.

Jokelab commented 1 month ago

@adamralph Thank you for clarifying! Is it safe to say that the issue that we can observe on the endpoint side (when triggering /api/HttpSender?followup=10&sizekb=110&immediate=false) is the same as on the ServiceControl side? For this reason we sometimes use immediate dispatch instead of the preferred method (in batch), with all the risk that entails. For now we'll stick with our current version (5.0.4).

adamralph commented 1 month ago

Is it safe to say that the issue that we can observe on the endpoint side (when triggering /api/HttpSender?followup=10&sizekb=110&immediate=false) is the same as on the ServiceControl side?

@Jokelab yes, it's exactly the same problem.

adamralph commented 1 month ago

@Jokelab FYI I've raised a corresponding bug for the transport — https://github.com/Particular/NServiceBus.Transport.AzureServiceBus/issues/994

adamralph commented 1 month ago

@Jokelab please follow both this issue and https://github.com/Particular/NServiceBus.Transport.AzureServiceBus/issues/994 for updates on the status of the bug fixes.

If at any point the bug fixes become critical for you, please request critical support via https://particular.net/support

adamralph commented 1 month ago

@Jokelab FYI we've prioritized this near the top of our list and will work on it as soon as we can.