Closed pardahlman closed 4 years ago
Are there any errors in the SC log or RabbitMQ log around the time when it stops accepting new messages?
There are two timestamps that look relevant in the failed message's headers:
I can see no errors in the RabbitMQ broker logs from this period of time. Around the time of NServiceBus.TimeSent, the ServiceControl log shows:
2020-06-19 19:13:32.1173|51|Info|ServiceControl.Recoverability.RetryProcessor|No batch found to stage.
2020-06-19 19:14:02.1339|66|Info|ServiceControl.Recoverability.RetryProcessor|No batch found to forward.
2020-06-19 19:14:02.1339|66|Info|ServiceControl.Recoverability.RetryProcessor|No batch found to stage.
2020-06-19 19:14:32.1520|65|Info|ServiceControl.Recoverability.RetryProcessor|No batch found to forward.
2020-06-19 19:14:32.1520|65|Info|ServiceControl.Recoverability.RetryProcessor|No batch found to stage.
The time of NServiceBus.TimeOfFailure is the same day that we restarted ServiceControl to get the consumer up and running. That day has stack traces similar to the one above (probably as a result of running ServiceControl with the import flag). Otherwise there is not much happening:
2020-06-23 12:19:08.7807|27|Info|ServiceControl.Recoverability.RetryProcessor|No batch found to stage.
2020-06-23 12:19:38.7992|29|Info|ServiceControl.Recoverability.RetryProcessor|No batch found to forward.
2020-06-23 12:19:38.7992|55|Info|ServiceControl.Recoverability.RetryProcessor|No batch found to stage.
2020-06-23 12:20:08.8158|80|Info|ServiceControl.Recoverability.RetryProcessor|No batch found to forward.
2020-06-23 12:20:08.8158|55|Info|ServiceControl.Recoverability.RetryProcessor|No batch found to stage.
@pardahlman can you create a support ticket? It sounds like we will need some more information. The lack of error messages in the logs may suggest that ServiceControl stopping to process messages is not related to that unfortunate message that was missing the header.
When ServiceControl detects such a message, it retries processing it 5 times and then moves it to the FailedErrorImports collection in the RavenDB database. When you invoke the retry of failed message processing from the command line, the retried messages come from that collection.
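The retry-then-park behavior described above can be sketched roughly as follows. This is an illustrative Python sketch, not ServiceControl's actual code; the names `try_import`, `importer`, and `FAILED_ERROR_IMPORTS` are made up for the example, and only the header key `NServiceBus.FailedQ` is a real NServiceBus identifier.

```python
# Sketch of a retry-then-park import loop, loosely modelled on the
# behavior described above. All function and variable names here are
# illustrative, not ServiceControl's real API.

MAX_ATTEMPTS = 5
FAILED_ERROR_IMPORTS = []  # stands in for the RavenDB FailedErrorImports collection


def importer(message):
    # A real importer would read the failed-queue address from the headers;
    # this raises KeyError when the header is absent, mimicking the failure.
    _ = message["headers"]["NServiceBus.FailedQ"]


def try_import(message, do_import):
    """Attempt an import; park the message after MAX_ATTEMPTS failures."""
    for _attempt in range(MAX_ATTEMPTS):
        try:
            do_import(message)
            return True
        except KeyError:
            continue  # retry, as described above
    FAILED_ERROR_IMPORTS.append(message)
    return False


# A message without the FailedQ header is retried 5 times, then parked:
broken = {"headers": {"NServiceBus.TimeSent": "2020-06-19 19:13:27"}}
assert try_import(broken, importer) is False
assert broken in FAILED_ERROR_IMPORTS
```

Invoking the command-line retry would then correspond to draining `FAILED_ERROR_IMPORTS` and feeding each parked message back through the importer.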
Related to https://github.com/Particular/ServiceControl/pull/2013
The missing header means that messages were sent to the error queue that should never have been forwarded there.
Thanks for the input, @ramonsmits - do you think that is what happened here? I.e. a message that shouldn't have been forwarded to the error queue was forwarded by ServicePulse?
@SzymonPobiega I took a closer look at the logs from the 19th (day of failure). I realized that the timestamp in the message header is in UTC while the timestamp in the log is local time, so there is a two-hour difference. Something of relevance stands out a few seconds after NServiceBus.TimeSent:
2020-06-19 21:15:01.3064|38|Info|NServiceBus.Raw.RunningRawEndpointInstance|Stopping receiver.
2020-06-19 21:15:04.8890|32|Info|ServiceControl.Recoverability.RetryProcessor|No batch found to forward.
2020-06-19 21:15:04.8890|38|Info|ServiceControl.Recoverability.RetryProcessor|No batch found to stage.
2020-06-19 21:15:34.8982|67|Info|ServiceControl.Recoverability.RetryProcessor|No batch found to forward.
2020-06-19 21:15:34.8982|32|Info|ServiceControl.Recoverability.RetryProcessor|No batch found to stage.
2020-06-19 21:16:04.9026|38|Info|ServiceControl.Recoverability.RetryProcessor|No batch found to forward.
2020-06-19 21:16:04.9026|38|Info|ServiceControl.Recoverability.RetryProcessor|No batch found to stage.
"Stopping receiver" is something from the raw endpoint that I believe is invoked by Watchdog.OnFailure.
Let me know if this is enough for you to keep on investigating? (Perhaps a simple repro would be to send a message without the FailedQ header and see what happens internally in SP -- haven't tried this myself, though 🤷♂️ )
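For anyone attempting that repro: a failed message normally carries the NServiceBus.FailedQ header alongside the exception-info headers, so a minimal test message would use the usual header set with only that key omitted. A hedged sketch of building such headers (the header names are the real NServiceBus ones; the helper function and the placeholder values are made up):

```python
def build_error_headers(include_failed_q=True):
    """Build a minimal failed-message header set for a repro.

    The header names are the ones NServiceBus uses; the values and this
    function itself are illustrative.
    """
    headers = {
        "NServiceBus.ExceptionInfo.ExceptionType": "System.Exception",
        "NServiceBus.ExceptionInfo.StackTrace": "at Some.Handler.Handle(...)",
        "NServiceBus.TimeOfFailure": "2020-06-19 19:13:27:000000 Z",
    }
    if include_failed_q:
        # FaultsHeaderKeys.FailedQ resolves to this header name.
        headers["NServiceBus.FailedQ"] = "SomeEndpoint@machine"
    return headers


# The problematic message from this issue looks like the second case:
assert "NServiceBus.FailedQ" in build_error_headers()
assert "NServiceBus.FailedQ" not in build_error_headers(include_failed_q=False)
```

These headers could then be published straight to the error queue with any AMQP client to observe how ServiceControl reacts to the missing key.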
"Stopping receiver" is something from the raw endpoint that I believe is invoked by Watchdog.OnFailure.
Is there anything interesting above this one? The receiver should not stop and even if it stops, it should automatically restart in one minute.
To me, there is nothing standing out prior to the "Stopping receiver" entry, and I see nothing in the logs that suggests the receiver started again (but perhaps that would be logged at a lower level?). I'll attach the full log file so that you can investigate yourselves!
@pardahlman I managed to reproduce this behavior. We are going to make this issue high priority. I'll ping when we start working on it and when it is fixed.
Hey @SzymonPobiega and @ramonsmits ,
We actually have this same issue happening which forces someone to manually intervene to restart things.
We have it as number 3 in the priority bug queue so we expect to start working on it soon. Sorry for the problems :(
Hey @SzymonPobiega ,
FYI, we also noticed this even without importing failed error messages.
Hey @SzymonPobiega ,
Any updates on priority on this?
@pardahlman @TraGicCode we have started working on this. We expect to release the fix within the next two weeks. Thanks for your patience.
We released a fix for it in the latest version, 4.12.1. By accident we created a duplicate public issue here: https://github.com/Particular/ServiceControl/issues/2148
Hey @SzymonPobiega ,
Thank you. I will figure out when we can upgrade and follow up if there is still an issue. Can you close this issue?
We are running NServiceBus (7.2.0) over RabbitMQ (5.1.2) with ServiceControl (4.6.0) and ServicePulse (1.24.3).
On two separate occasions the message import has failed, and both times this has halted the message consumption from the error queue altogether. Looking in the RabbitMQ management tool, I can see that 30 messages are Unacked (being processed) and the number of Ready messages (waiting to be consumed) keeps on building up. In this state, no new error message will be imported to ServiceControl and ServicePulse won't be updated with new failed messages. This is a critical bug, as we rely on ServicePulse for error handling and also use it as an indication of the number of errors in our production environment.
If ServiceControl is restarted, message consumption starts again and the error messages are consumed, but there is still one problematic message that will get stuck and cause message processing to halt again.
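Until there is a fix, the stalled state described above is at least detectable: the error queue shows a fixed number of Unacked messages (the consumer prefetch) while Ready keeps growing. A hedged sketch of a health check over the queue JSON that the RabbitMQ management API returns (the `messages_ready` and `messages_unacknowledged` fields are real management-API fields; the prefetch threshold of 30 matches what was observed here, and the function itself is an assumption, not anything shipped with ServiceControl):

```python
def error_queue_stalled(queue_stats, prefetch=30):
    """Heuristic: consumption has likely halted if at least `prefetch`
    messages stay unacknowledged while a backlog of ready messages builds up.

    `queue_stats` is the JSON object the RabbitMQ management API returns
    for a queue (GET /api/queues/<vhost>/<name>).
    """
    return (
        queue_stats.get("messages_unacknowledged", 0) >= prefetch
        and queue_stats.get("messages_ready", 0) > 0
    )


# Example payloads, abbreviated to the two relevant fields:
stalled = {"messages_ready": 120, "messages_unacknowledged": 30}
healthy = {"messages_ready": 0, "messages_unacknowledged": 3}
assert error_queue_stalled(stalled)
assert not error_queue_stalled(healthy)
```

Polling this from a monitoring job would at least turn the silent halt into an alert instead of a surprise.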
When trying to run ServiceControl with --import-failed-errors, it fails. Based on the stack trace, I believe that this line of code is the source of the exception. It expects the header FaultsHeaderKeys.FailedQ to exist, but when looking at the problematic message, the header is not present. The message does have an NServiceBus.ExceptionInfo.StackTrace header with a stack trace from ServiceControl, which makes me believe that the message is published by ServiceControl (possibly when calling IErrorHandlingPolicyContext.MoveToErrorQueue?). I'm not familiar with the inner workings of ServiceControl, but one theory is that the error message import failed for some reason (not known, because the original stack trace is overwritten with one from ServiceControl?), the DefaultErrorHandlingPolicy is triggered, which posts the message to the error queue, and this message lacks the FaultsHeaderKeys.FailedQ header?
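If that theory holds, a defensive check on the import path would make the failure mode visible instead of silently poisoning the queue: fail fast, with a clear error, on messages whose headers lack FailedQ before the normal import touches them. A hypothetical sketch of such a guard (not ServiceControl's code; only the header name is real):

```python
class MissingFailedQ(Exception):
    """Raised when a message has no NServiceBus.FailedQ header."""


def assert_importable(headers):
    """Guard sketch: reject a message up front when the header the import
    code depends on is absent, rather than letting a KeyError surface deep
    in the pipeline."""
    if "NServiceBus.FailedQ" not in headers:
        raise MissingFailedQ(
            "message has no NServiceBus.FailedQ header; it may have been "
            "re-forwarded to the error queue by the error-handling policy"
        )


# The problematic message from this issue would be rejected up front:
try:
    assert_importable({"NServiceBus.ExceptionInfo.StackTrace": "..."})
except MissingFailedQ:
    pass  # clear, actionable failure instead of a halted consumer
```

Whether ServiceControl should park, reject, or synthesize such messages is a design decision for the maintainers; the point of the sketch is only that the missing header is cheap to detect early.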