Particular / ServiceControl

Backend for ServiceInsight and ServicePulse
https://docs.particular.net/servicecontrol/

Error message import fails, no new error messages imported #2058

Closed: pardahlman closed this issue 4 years ago

pardahlman commented 4 years ago

We are running NServiceBus (7.2.0) over RabbitMQ (5.1.2) with ServiceControl (4.6.0) and ServicePulse (1.24.3).

On two separate occasions the message import has failed, and both times this has halted message consumption from the error queue altogether. Looking in the RabbitMQ management tool, I can see that 30 messages are Unacked (being processed) and the number of Ready messages (waiting to be consumed) keeps building up. In this state, no new error messages will be imported into ServiceControl and ServicePulse won't be updated with new failed messages. This is a critical bug, as we rely on ServicePulse for error handling and also use it as an indication of the number of errors in our production environment.

If ServiceControl is restarted, message consumption starts again and the error messages are consumed, but there is still one problematic message that will get stuck and cause message processing to halt again.

When trying to run ServiceControl with --import-failed-errors it fails:

 .\ServiceControl.exe --import-failed-errors
2020-06-23 09:34:42.6456|1|Info|ServiceBus.Management.Infrastructure.Settings.Settings|No settings found for error log queue to import, default name will be used
2020-06-23 09:34:43.4322|1|Info|Particular.ServiceControl.Bootstrapper|
-------------------------------------------------------------
ServiceControl Version:             4.0.0.0
Audit Retention Period (optional):
Error Retention Period:             45.00:00:00
Ingest Error Messages:              False
Forwarding Error Messages:          False
Database Size:                      92274688 bytes
ServiceControl Logging Level:       Info
RavenDB Logging Level:              Info
Selected Transport Customization:   ServiceControl.Transports.RabbitMQ.RabbitMQConventionalRoutingTransportCustomization, ServiceControl.Transports.RabbitMQ
-------------------------------------------------------------
2020-06-23 09:34:44.8403|1|Info|ServiceControl.Infrastructure.RavenDB.RavenBootstrapper|Loading Embedded RavenDB license
2020-06-23 09:34:49.6988|1|Info|ServiceControl.Infrastructure.RavenDB.RavenBootstrapper|Index creation started
2020-06-23 09:34:52.8683|1|Info|NServiceBus.LicenseManager|Selected active license from C:\ProgramData\ParticularSoftware\license.xml
License Expiration: 2020-12-15
2020-06-23 09:34:55.4034|24|Warn|ServiceControl.Operations.FailedErrorImportCustomCheck|One or more error messages have failed to import properly into ServiceControl and have been stored in the ServiceControl database.
The import of these messages could have failed for a number of reasons and ServiceControl is not able to automatically reimport them. For guidance on how to resolve this see https://docs.particular.net/servicecontrol/import-failed-messages
2020-06-23 09:34:58.4302|7|Info|Particular.ServiceControl.Bootstrapper|Api is now accepting requests on http://localhost:33333/api
2020-06-23 09:34:59.3067|35|Error|ServiceControl.Operations.ImportFailedErrors|Error while attempting to re-import failed error message f9d7c411-ed87-4fdc-894a-abdf013d39df.
System.Collections.Generic.KeyNotFoundException: The given key was not present in the dictionary.
   at System.ThrowHelper.ThrowKeyNotFoundException()
   at System.Collections.Generic.Dictionary`2.get_Item(TKey key)
   at ServiceControl.Operations.FailedMessageFactory.ParseFailureDetails(IReadOnlyDictionary`2 headers)
   at ServiceControl.Operations.ErrorPersister.<Persist>d__2.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at ServiceControl.Operations.ErrorIngestor.<Ingest>d__1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at ServiceControl.Operations.ImportFailedErrors.<Run>d__1.MoveNext()

Based on the stack trace, I believe that this line of code is the source of the exception. It expects the FaultsHeaderKeys.FailedQ header to exist, but when looking at the problematic message, that header is not present. The message does have an NServiceBus.ExceptionInfo.StackTrace header with a stack trace from ServiceControl, which makes me believe that the message is published by ServiceControl itself (possibly when calling IErrorHandlingPolicyContext.MoveToErrorQueue?).

I'm not familiar with the inner workings of ServiceControl, but one theory is that the error message import failed for some reason (unknown, because the original stack trace is overwritten with one from ServiceControl?), the DefaultErrorHandlingPolicy was triggered and posted the message back to the error queue, and this re-posted message lacks the FaultsHeaderKeys.FailedQ header?
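
To illustrate the suspected failure mode, here is a minimal self-contained C# sketch (not the actual ServiceControl source; only the header names are taken from the stack trace above): indexing a header dictionary that lacks NServiceBus.FailedQ throws exactly the KeyNotFoundException reported by --import-failed-errors, while a TryGetValue lookup would tolerate the missing header.

using System;
using System.Collections.Generic;

class HeaderParsingSketch
{
    static void Main()
    {
        // Headers of the problematic message: note there is no "NServiceBus.FailedQ" entry.
        var headers = new Dictionary<string, string>
        {
            ["NServiceBus.ExceptionInfo.StackTrace"] = "at ServiceControl.Operations..."
        };

        try
        {
            // Equivalent of headers[FaultsHeaderKeys.FailedQ] in ParseFailureDetails.
            var failedQueue = headers["NServiceBus.FailedQ"];
            Console.WriteLine($"Failed queue: {failedQueue}");
        }
        catch (KeyNotFoundException)
        {
            // This is the exception reported during the failed reimport.
            Console.WriteLine("The given key was not present in the dictionary.");
        }

        // A defensive lookup would detect the missing header instead of throwing.
        if (!headers.TryGetValue("NServiceBus.FailedQ", out _))
        {
            Console.WriteLine("NServiceBus.FailedQ header is missing.");
        }
    }
}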

SzymonPobiega commented 4 years ago

Are there any errors in the SC log or RabbitMQ log around the time when it stops accepting new messages?

pardahlman commented 4 years ago

There are two timestamps in the failed message's headers that look relevant:

I can see no errors in the RabbitMQ broker logs from this period of time. Here is the ServiceControl log from around NServiceBus.TimeSent:

2020-06-19 19:13:32.1173|51|Info|ServiceControl.Recoverability.RetryProcessor|No batch found to stage.
2020-06-19 19:14:02.1339|66|Info|ServiceControl.Recoverability.RetryProcessor|No batch found to forward.
2020-06-19 19:14:02.1339|66|Info|ServiceControl.Recoverability.RetryProcessor|No batch found to stage.
2020-06-19 19:14:32.1520|65|Info|ServiceControl.Recoverability.RetryProcessor|No batch found to forward.
2020-06-19 19:14:32.1520|65|Info|ServiceControl.Recoverability.RetryProcessor|No batch found to stage.

The time of NServiceBus.TimeOfFailure is on the same day that we restarted ServiceControl to get the consumer up and running again. The log from that day contains stack traces similar to the one above (probably a result of running ServiceControl with the import flag). Otherwise, not much is happening:

2020-06-23 12:19:08.7807|27|Info|ServiceControl.Recoverability.RetryProcessor|No batch found to stage.
2020-06-23 12:19:38.7992|29|Info|ServiceControl.Recoverability.RetryProcessor|No batch found to forward.
2020-06-23 12:19:38.7992|55|Info|ServiceControl.Recoverability.RetryProcessor|No batch found to stage.
2020-06-23 12:20:08.8158|80|Info|ServiceControl.Recoverability.RetryProcessor|No batch found to forward.
2020-06-23 12:20:08.8158|55|Info|ServiceControl.Recoverability.RetryProcessor|No batch found to stage.

SzymonPobiega commented 4 years ago

@pardahlman can you create a support ticket? It sounds like we will need some more information. The lack of error messages in the logs may suggest that ServiceControl stopping message processing is not related to that unfortunate message that was missing the header.

When ServiceControl detects such a message, it retries processing it 5 times and then moves it to the FailedErrorImports collection in the RavenDB database. When you invoke the failed message reimport from the command line, the retried messages come from that collection.
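
For context, a rough, illustrative-only sketch of that retry-then-park behaviour (the type and member names below are made up for illustration and are not ServiceControl's actual API):

using System;
using System.Collections.Generic;
using System.Threading.Tasks;

class FailedImportSketch
{
    const int MaxAttempts = 5;

    // Stand-in for the FailedErrorImports collection in RavenDB.
    readonly List<(IReadOnlyDictionary<string, string> Headers, Exception Error)> failedErrorImports =
        new List<(IReadOnlyDictionary<string, string>, Exception)>();

    public async Task Ingest(IReadOnlyDictionary<string, string> headers, Func<Task> persist)
    {
        for (var attempt = 1; attempt <= MaxAttempts; attempt++)
        {
            try
            {
                // e.g. parsing headers such as NServiceBus.FailedQ and storing the failure.
                await persist();
                return;
            }
            catch (Exception ex)
            {
                if (attempt == MaxAttempts)
                {
                    // Park the message so a later `ServiceControl.exe --import-failed-errors`
                    // run can pick it up from this collection.
                    failedErrorImports.Add((headers, ex));
                    return;
                }
                // Otherwise fall through and retry.
            }
        }
    }
}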

ramonsmits commented 4 years ago

Related to https://github.com/Particular/ServiceControl/pull/2013

Missing header means messages are sent to the error queue which should not have been forwarded to the error queue.

pardahlman commented 4 years ago

Thanks for the input, @ramonsmits. Do you think that is what happened here, i.e. a message that shouldn't have been forwarded to the error queue was forwarded by ServicePulse?

@SzymonPobiega I took a closer look at the logs from the 19th (the day of failure). I realized that the timestamp in the message header is UTC while the timestamp in the log is local time, so there is a two-hour difference. There is something of relevance that stands out a few seconds after NServiceBus.TimeSent:

2020-06-19 21:15:01.3064|38|Info|NServiceBus.Raw.RunningRawEndpointInstance|Stopping receiver.
2020-06-19 21:15:04.8890|32|Info|ServiceControl.Recoverability.RetryProcessor|No batch found to forward.
2020-06-19 21:15:04.8890|38|Info|ServiceControl.Recoverability.RetryProcessor|No batch found to stage.
2020-06-19 21:15:34.8982|67|Info|ServiceControl.Recoverability.RetryProcessor|No batch found to forward.
2020-06-19 21:15:34.8982|32|Info|ServiceControl.Recoverability.RetryProcessor|No batch found to stage.
2020-06-19 21:16:04.9026|38|Info|ServiceControl.Recoverability.RetryProcessor|No batch found to forward.
2020-06-19 21:16:04.9026|38|Info|ServiceControl.Recoverability.RetryProcessor|No batch found to stage.

"Stopping receiver" is something from the raw endpoint that I believe is invoked by Watchdog.OnFailure.

Let me know if this is enough for you to keep investigating? (Perhaps a simple repro would be to send a message without the FailedQ header, along the lines of the sketch below, and see what happens internally in ServiceControl -- haven't tried this myself, though 🤷‍♂️ )
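
A possible (untested) repro sketch using the RabbitMQ .NET client's pre-7 API: publish a message straight to the error queue without the NServiceBus.FailedQ header. The queue name ("error"), the connection settings and the other header values are assumptions about a default setup, not details taken from our environment.

using System;
using System.Collections.Generic;
using System.Text;
using RabbitMQ.Client;

class MissingFailedQRepro
{
    static void Main()
    {
        var factory = new ConnectionFactory { HostName = "localhost" };
        using (var connection = factory.CreateConnection())
        using (var channel = connection.CreateModel())
        {
            var properties = channel.CreateBasicProperties();
            properties.MessageId = Guid.NewGuid().ToString();
            properties.Headers = new Dictionary<string, object>
            {
                // Deliberately no "NServiceBus.FailedQ" entry.
                ["NServiceBus.MessageId"] = properties.MessageId,
                ["NServiceBus.ExceptionInfo.StackTrace"] = "fake stack trace"
            };

            channel.BasicPublish(
                "",        // default exchange: routes by queue name
                "error",   // assumed name of ServiceControl's error queue
                properties,
                Encoding.UTF8.GetBytes("{}"));
        }
    }
}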

SzymonPobiega commented 4 years ago

"Stopping receiver" is something from the raw endpoint that I believe is invoked by Watchdog.OnFailure.

Is there anything interesting above this one? The receiver should not stop, and even if it does, it should automatically restart within one minute.
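
For illustration only (this is not ServiceControl's actual Watchdog code), the expected behaviour could be sketched roughly like this: on failure the receiver is stopped and a restart is attempted about once a minute until it succeeds.

using System;
using System.Threading.Tasks;

class WatchdogSketch
{
    readonly Func<Task> startReceiver;
    readonly Func<Task> stopReceiver;

    public WatchdogSketch(Func<Task> startReceiver, Func<Task> stopReceiver)
    {
        this.startReceiver = startReceiver;
        this.stopReceiver = stopReceiver;
    }

    public async Task OnFailure(Exception error)
    {
        Console.WriteLine($"Stopping receiver. Reason: {error.Message}");
        await stopReceiver();

        // Keep trying to bring the receiver back, roughly once a minute.
        while (true)
        {
            await Task.Delay(TimeSpan.FromMinutes(1));
            try
            {
                await startReceiver();
                Console.WriteLine("Receiver restarted.");
                return;
            }
            catch (Exception ex)
            {
                Console.WriteLine($"Restart failed, retrying in a minute: {ex.Message}");
            }
        }
    }
}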

pardahlman commented 4 years ago

To me, there is nothing standing out prior to the "Stopping receiver" entry, and I see nothing in the logs that suggests the receiver started again (but perhaps that would be logged at a lower level?). I'll attach the full log file so that you can investigate yourselves!

logfile.20200619.0.txt

SzymonPobiega commented 4 years ago

@pardahlman I managed to reproduce this behavior. We are going to make this issue high priority. I'll ping when we start working on it and when it is fixed.

TraGicCode commented 4 years ago

Hey @SzymonPobiega and @ramonsmits ,

We actually have this same issue happening, and it forces someone to intervene manually and restart things.

SzymonPobiega commented 4 years ago

We have it as number 3 in the priority bug queue, so we expect to start working on it soon. Sorry for the problems :(

TraGicCode commented 4 years ago

Hey @SzymonPobiega ,

FYI, we also noticed this even without running the import of failed error messages.

TraGicCode commented 4 years ago

Hey @SzymonPobiega ,

Any updates on the priority of this?

SzymonPobiega commented 4 years ago

@pardahlman @TraGicCode we have started working on this. We expect to release the fix within the next two weeks. Thanks for your patience.

SzymonPobiega commented 4 years ago

We released a fix for this in the latest release, 4.12.1. By accident we created a duplicate public issue here: https://github.com/Particular/ServiceControl/issues/2148

TraGicCode commented 4 years ago

Hey @SzymonPobiega ,

Thank you. I will figure out when we can upgrade and follow up if there is still an issue. Can you close this issue?