Particular / ServiceControl

Backend for ServiceInsight and ServicePulse
https://docs.particular.net/servicecontrol/

Under high load heartbeats can cause concurrency issues #639

Closed mikeminutillo closed 8 years ago

mikeminutillo commented 8 years ago

Symptom

Heartbeat processing starts to fall behind and exceptions are logged to the log file. Eventually ServiceControl performance degrades, making the monitoring side of ServiceControl inaccurate.

Who's affected

Systems with a large number of endpoints that have heartbeats enabled and a short heartbeat interval.

mikeminutillo commented 8 years ago

@MarcinHoppe this is the issue that @andreasohlund was talking about.

@mauroservienti @sfarmar @ramonsmits as you've had recent dealings with customers about SC I'd appreciate your input here

Heartbeats have another issue as well which is that they can bank up in the ServiceControl queue if the transport does not handle TTBR well. https://github.com/Particular/NServiceBus.SqlServer/issues/158

mauroservienti commented 8 years ago

I observed this happening last week, the failure rate was not really high but the log was filled with concurrency errors due to heartbeats.

Question: why are we updating a single document when dealing with incoming heartbeats, instead of relying on an async index to gather heartbeat stats? (If that is indeed what we do.)
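The concurrency errors in the log come from exactly this pattern: many heartbeat handlers doing a read-modify-write on the same status document, with the store rejected whenever another writer got in between the load and the store. Below is a minimal sketch of that failure mode, loosely modelled on RavenDB-style etag checks (the `StatusDocument` class and its `load`/`store` methods are illustrative, not the real ServiceControl or RavenDB API; ServiceControl itself is C#, this is a Python sketch):

```python
class ConcurrencyException(Exception):
    pass

class StatusDocument:
    """Single shared status document with an optimistic-concurrency check,
    loosely modelled on RavenDB's etag behaviour (illustrative only)."""
    def __init__(self):
        self.etag = 0

    def load(self):
        return self.etag

    def store(self, expected_etag):
        # The store is rejected if another writer got in since our load.
        if self.etag != expected_etag:
            raise ConcurrencyException("document changed since it was loaded")
        self.etag += 1

doc = StatusDocument()
# Fifty heartbeat handlers all load the document at the same version...
loaded = [doc.load() for _ in range(50)]
failures = 0
for etag in loaded:
    try:
        doc.store(etag)   # ...but only the first store wins the race
    except ConcurrencyException:
        failures += 1
print(failures)  # 49 of the 50 writes conflict
```

The shorter the heartbeat interval and the more endpoints there are, the more writers contend on the same document, which matches the "high load" symptom in the issue title.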

mikeminutillo commented 8 years ago

@mauroservienti I wonder then how much this is a real problem and how much it is just log noise. Would simply upping the log level to DEBUG resolve a lot of this?

Question: why are we updating a single document when dealing with incoming heartbeats, instead of relying on an async index to gather heartbeat stats? (If that is indeed what we do.)

I suspect due to the low latency nature of the feature. If we decide that an endpoint is down then we need to notify users immediately. We could still do that with an in memory solution and persist at state transitions though.

i.e. When we detect an endpoint has stopped heartbeating, update its status in memory, persist to Raven and (if monitored) raise an event. When we detect an endpoint has started beating, update its status, persist to Raven and (if monitored) raise an event. When a user tells us to start/stop monitoring an endpoint, update it in memory and persist in Raven. Every 2 minutes, persist in Raven.

Those persistence steps need to be serialized or we'll get concurrency exceptions again.
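The proposal above can be sketched roughly as follows: status lives in memory, a write to the store happens only when the status actually changes, and those writes are serialized so they can never conflict with each other. Everything here is illustrative (the `persist` and `raise_event` callbacks, the class name, and Python rather than C# are all assumptions, not ServiceControl's real API):

```python
import threading

class HeartbeatMonitor:
    """Sketch of the persist-on-transition idea: track endpoint status in
    memory and write to the store only when the status changes, with all
    writes serialized behind a lock (illustrative, not the real SC code)."""
    def __init__(self, persist, raise_event):
        self.persist = persist              # e.g. write the status document to Raven
        self.raise_event = raise_event      # e.g. publish an endpoint up/down event
        self.status = {}                    # endpoint name -> "alive" | "dead"
        self.write_lock = threading.Lock()  # serialize persistence: no write conflicts

    def heartbeat_received(self, endpoint):
        # Repeated heartbeats from an alive endpoint cause no write at all.
        if self.status.get(endpoint) != "alive":
            self.status[endpoint] = "alive"
            self._persist(endpoint, "alive")

    def heartbeat_missed(self, endpoint):
        if self.status.get(endpoint) != "dead":
            self.status[endpoint] = "dead"
            self._persist(endpoint, "dead")

    def _persist(self, endpoint, state):
        with self.write_lock:   # one writer at a time, so stores cannot conflict
            self.persist(endpoint, state)
            self.raise_event(endpoint, state)

persisted, events = [], []
m = HeartbeatMonitor(lambda e, s: persisted.append((e, s)),
                     lambda e, s: events.append((e, s)))
m.heartbeat_received("Sales")
m.heartbeat_received("Sales")   # no state change, so no second write
m.heartbeat_missed("Sales")
print(persisted)  # [('Sales', 'alive'), ('Sales', 'dead')]
```

The key property is that write volume is proportional to state transitions, not to heartbeat volume, so a short heartbeat interval no longer drives contention on the store.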

mikeminutillo commented 8 years ago

Realistically the every-2-minutes persistence is only needed for "Last Seen". How much value do we get from that? Is it shown in the UI? If we were to lose that during a restart, we'd either immediately be getting heartbeats from the live endpoint or we'd be getting no heartbeats from a dead endpoint. If we are down for 20 minutes, does it matter when we last saw an endpoint before this session?

andreasohlund commented 8 years ago

Should this be labeled as a bug? Should we patch this?

johnsimons commented 8 years ago

Yes we are patching it

andreasohlund commented 8 years ago

Yes we are patching it

Can assign a milestone to this one? (https://github.com/Particular/PlatformDevelopment/issues/596)

johnsimons commented 8 years ago

Yes, once we know when it is going out; dependent on #640 being complete.

mikeminutillo commented 8 years ago

@andreas the SC release process is quite a bit less strict than the Core one because we have no need to support older versions. As a result when you say "are we patching this" and John says yes, it just means that it will go into a release when it is ready.

andreasohlund commented 8 years ago

Good point, so how do we know that this is a prio?

vnext milestone? label? the "maintainers just know"? other idea?

mauroservienti commented 8 years ago

"maintainers just know"

Dangerous, IMO. Doesn't the waffle board sort order already define this? (Just asking; maybe the sort order is not even persisted.)

johnsimons commented 8 years ago

Fixed in https://github.com/Particular/ServiceControl/pull/664