Particular / ServiceControl

Backend for ServiceInsight and ServicePulse
https://docs.particular.net/servicecontrol/

Retry message fails from ServicePulse after upgrade to ServiceControl 5.2.0 #4180

Closed saschanm closed 4 months ago

saschanm commented 5 months ago

Describe the bug

Description

Retrying a message from ServicePulse fails after upgrading ServiceControl from v5.1.2 to v5.2.0.

Expected behavior

Retry of message from ServicePulse should return the messages to the failed queue for processing

Actual behavior

Results in Failed to execute recoverability policy for message with native ID: ...

Versions

Please list the version of the relevant packages or applications in which the bug exists.

SC: Version 5.2.0
SP: Version 1.38.3
NServiceBus: Version 7.7

Steps to reproduce

Note - the original failed message has the NServiceBus.FailedQ header entry as expected - while the retried message ending up in the error queue does not.
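One way to confirm an observation like this is to compare the header sets of the two messages. The following is a minimal Python sketch, assuming the headers of both messages have been exported as plain dictionaries. The helper name `missing_headers` and the sample values are hypothetical; only the `NServiceBus.FailedQ` header name comes from this report.

```python
def missing_headers(original: dict[str, str], retried: dict[str, str]) -> list[str]:
    """Return header names present on the original message but absent from the retried one."""
    return sorted(name for name in original if name not in retried)


# Hypothetical sample data; only the NServiceBus.FailedQ key is taken from the report.
original_headers = {
    "NServiceBus.MessageId": "example-id",
    "NServiceBus.FailedQ": "example-endpoint@example-machine",
}
retried_headers = {
    "NServiceBus.MessageId": "example-id",
}

print(missing_headers(original_headers, retried_headers))
```

With the sample data above, the only header reported as missing is `NServiceBus.FailedQ`, which matches the symptom described in the note.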

Relevant log output

No response

Additional Information

Workarounds

No workaround found

Possible solutions

If no resolution is found, I will attempt to downgrade back to SC v5.1.2, but would prefer not to given the potential risks.

Additional information

saschanm commented 5 months ago

A further issue appears to have occurred with the upgrade to 5.2.0.

We have an endpoint that handles custom check failure messages and posts a notification to a Teams channel. These notifications have also stopped working, though it is unclear where in the pipeline they are failing.

I have gone through the downgrade process in our test environment, which was on the same versions and showed the same issues.

I have confirmed that the downgrade resolves both issues, but the downgrade process resulted in the loss of failed messages in the RavenDB instance, so it is not a viable option in our production environment.

saschanm commented 5 months ago

Additional info: The instance/endpoints experiencing the problems are using MSMQ transport.

We have another instance for another project that is using Azure Service Bus. I have just confirmed that retries from that system are processing correctly.

andreasohlund commented 5 months ago

Thanks for the detailed bug report @saschanm, we are looking into it

andreasohlund commented 5 months ago

@saschanm would you be able to send the headers and the body of one of the failing messages to us? (support@particular.net)

saschanm commented 5 months ago

@andreasohlund Email with attachments sent to support@particular.net with subject line: Retry message fails from ServicePulse after upgrade to ServiceControl 5.2.0 (issue 4180)

andreasohlund commented 5 months ago

I've tried to reproduce this by:

  1. Installing SC 5.2
  2. Configuring it to use MSMQ
  3. Running https://docs.particular.net/samples/msmq/simple/ and simulating a failure
  4. Verifying that the failure got picked up by SC
  5. Retrying it via the ServicePulse UI
  6. Verifying that it got retried correctly by the endpoint

@saschanm does the above cover your scenario? (If yes, it looks like some specific details of the failing messages on your end might be causing this issue.)

andreasohlund commented 5 months ago

> Email with attachments sent to support@particular.net with subject line:

Thanks, we will take a deeper look

saschanm commented 5 months ago

Also note this example of an error in the ServiceControl logs, produced when it attempts to ingest the error queue message after a retry attempt. The problem may be upstream, but this is the end result.

2024-05-19 00:02:48.4450|52|Warn|ServiceControl.Operations.ErrorProcessor|Processing of message '25e9e93e-97b5-434f-a512-fc7b393b5bea\215537358' failed.
System.Exception: Missing 'NServiceBus.FailedQ' header. Message is poison message or incorrectly send to (error) queue.
   at ServiceControl.Operations.FailedMessageFactory.ParseFailureDetails(IReadOnlyDictionary`2 headers) in /_/src/ServiceControl/Operations/FailedMessageFactory.cs:line 41
   at ServiceControl.Operations.ErrorProcessor.ProcessMessage(MessageContext context, IIngestionUnitOfWork unitOfWork) in /_/src/ServiceControl/Operations/ErrorProcessor.cs:line 114
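The exception above suggests that ingestion rejects any error-queue message that lacks the `NServiceBus.FailedQ` header. The sketch below illustrates that kind of guard in Python. It is not ServiceControl's actual code (which is C#), and the function name `parse_failed_queue` is hypothetical; only the header name and the gist of the error text come from the log.

```python
FAILED_Q_HEADER = "NServiceBus.FailedQ"


def parse_failed_queue(headers: dict[str, str]) -> str:
    """Return the failed queue address, or raise if the header is missing."""
    try:
        return headers[FAILED_Q_HEADER]
    except KeyError:
        # Mirrors the gist of the logged error; the real check lives in ServiceControl's C# code.
        raise ValueError(f"Missing '{FAILED_Q_HEADER}' header.") from None


# Hypothetical usage: a retried message that lost the header is rejected.
try:
    parse_failed_queue({"NServiceBus.MessageId": "example-id"})
except ValueError as error:
    print(error)
```

Under this reading, a retried message that arrives in the error queue without the header can never be ingested again, which is consistent with the header difference noted in the original report.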

andreasohlund commented 5 months ago

@saschanm and others following this, we are discussing the cause and a potential fix here: https://github.com/Particular/NServiceBus.Transport.Msmq/pull/710

andreasohlund commented 4 months ago

@saschanm v5.2.1 has now been released with a fix for this

https://particular.net/start-servicecontrol-download

thanks again for your extensive help with the investigation ❤️