Particular / NServiceBus.Azure

Transports and storage for Windows Azure
https://docs.particular.net/nservicebus/azure/
Other
10 stars 15 forks source link

NServiceBus stops dequeueing messages from Azure Service Bus queues #295

Closed daerath closed 7 years ago

daerath commented 7 years ago

We've seen this issue off and on for a while now (in both NSB 5 and NSB 6). I have some new information though to does narrow things down a bit (hopefully).

First, here are the versions of what should be the relevant nugets:

<package id="log4net" version="2.0.5" targetFramework="net452" />
<package id="Microsoft.Azure.Common" version="2.1.0" targetFramework="net461" />
<package id="Microsoft.Azure.Common.Dependencies" version="1.0.0" targetFramework="net452" />
<package id="NHibernate" version="4.1.1.4000" targetFramework="net461" />
<package id="NHibernate.SqlAzure" version="2.1.5" targetFramework="net461" />
<package id="NServiceBus" version="6.1.4" targetFramework="net461" />
<package id="NServiceBus.Autofac" version="6.0.1" targetFramework="net461" />
<package id="NServiceBus.Azure" version="7.0.0" targetFramework="net461" />
<package id="NServiceBus.Azure.Transports.WindowsAzureServiceBus" version="7.2.0" targetFramework="net461" />
<package id="NServiceBus.DataBus.AzureBlobStorage" version="1.1.0" targetFramework="net461" />
<package id="NServiceBus.Hosting.Azure" version="7.1.2" targetFramework="net461" />
<package id="NServiceBus.Hosting.Azure.HostProcess" version="7.1.2" targetFramework="net461" />
<package id="NServiceBus.Log4Net" version="2.0.0" targetFramework="net461" />
<package id="NServiceBus.NHibernate" version="7.1.4" targetFramework="net461" />
<package id="NServiceBus.Persistence.AzureStorage" version="1.2.0" targetFramework="net461" />
<package id="WindowsAzure.ServiceBus" version="4.0.0" targetFramework="net461" />
<package id="WindowsAzure.Storage" version="8.1.1" targetFramework="net461" />

Ok, the behavior is this. We have cloud services that send and receive messages via Azure Service Bus. Then, without us changing anything (no source code changes, nothing in the cloud config, etc.) the waworkerhost.exe process will drop to 0% CPU and stay there indefinitely. There are no error messages. Nothing in the event log. I’ve tried looking in dbgview to see if anything jumps out at me and nothing did (although there was a lot of noise in there. Cloud services chat up the diagnostic trace log it seems). I also tried using procdump to display first chance exceptions and didn’t see anything there either. I don’t have a tracking ID or anything I could use to generate a support ticket.

There is an underlying component to this in Azure because it hits all instances of the same service nearly simultaneously. Whatever is happening isn’t being handled within NServiceBus and processing halts. From what we’ve seen, it never comes back. The only “fix” we've found so far is to have an external process that watches queue sizes and measures CPU use on waworkerhost.exe. If it detects the problem it will kill waworkerhost.exe. Azure restarts it and we continue. This repeats, typically for a few days, when the problem goes away as mysteriously as it appeared. This works, but it is far from ideal.

I was chatting with Sean Feldman about an unrelated issue and he directed me to a page detailing Critical Error Behavior. I took the code from that page and modified one of our services in test and deployed it.

The same problem happened and the critical error behavior did not activate. Nothing was logged. It sat at 0% CPU.

Until today this only hit one service out of a dozen. That service uses sagas, timeouts, and azure service bus deduplication, so we weren’t 100% sure if this issue was with SQL Azure, or the bus, or was somehow related to timeouts. However, it seemed like it was a bus issue because we stopped dequeueing. Today I saw the same problem hit a different service. That one is pretty simple. It processes messages and calls a Web API. No sagas. No dedup. No SQL Azure. The only common theme between the two is the service bus.

I managed to get a couple of screenshots of the service bus performance counters and this is what I see after it happens. Unfortunately I didn’t get a screenshot of the exact moment it occurred, but that one showed processing and then suddenly all trace lines dropped to 0. Then it appears to start back up, it may process a message or two, and then it stops again.

I have a feeling that this is related, in some way, to load, but that’s purely a guess. The system runs through tens of millions of messages a day. The two services that have manifested this behavior are involved in almost every message and we’re running 4 – 6 instances of those. The other services are far lower load (and typically 2 instances) and have not manifested this problem.

This is a screenshot of perf counters after it had stopped. The line that spiked is ReceiveMessageLatency. stops

This screenshot is after the first one. We go completely dead, then it looks like it tries to come back, but doesn't go anywhere. Then there are more ReceiveMessage Latency spikes and then it goes back down to 0 again. doesthisafteritstops

I'm at a loss here. I don't know how to investigate this. That I can't even get an error message of some kind makes this even more confusing.

daerath commented 7 years ago

Moving to a more appropriate project.