Closed mikeminutillo closed 8 years ago
@MarcinHoppe this is the issue that @andreasohlund was talking about.
@mauroservienti @sfarmar @ramonsmits as you've had recent dealings with customers about SC I'd appreciate your input here
Heartbeats have another issue as well which is that they can bank up in the ServiceControl queue if the transport does not handle TTBR well. https://github.com/Particular/NServiceBus.SqlServer/issues/158
I observed this happening last week, the failure rate was not really high but the log was filled with concurrency errors due to heartbeats.
Question: why are we updating a single document when dealing with incoming heartbeats instead of relying on an asyn index to gather heartbeats stats? If this is the case.
@mauroservienti I wonder then how much this is a real problem and how much it is just log noise. Would simply upping the log level to DEBUG resolve a lot of this?
Question: why are we updating a single document when dealing with incoming heartbeats instead of relying on an asyn index to gather heartbeats stats? If this is the case.
I suspect due to the low latency nature of the feature. If we decide that an endpoint is down then we need to notify users immediately. We could still do that with an in memory solution and persist at state transitions though.
i.e. When we detect an endpoint has stopped heartbeating, update it's status in memory, persist to Raven and (if monitored) raise an event. When we detect an endpoint has started beating, update it's status, persist to Raven and (if monitored) raise an event. When a user tells us to start/stop monitoring an endpoint, update it in memory and persist in Raven. Every 2 minutes, persist in Raven.
Those persistence steps need to be serialized or we'll get concurrency exceptions again.
Realistically the every 2 minutes thing is only needed for "Last Seen". How much value do we get from that. Is it shown in the UI? If we were to lose that during a restart, we'd either immediately be getting heartbeats from the live endpoint or we'd be getting no heartbeats from a dead endpoint. If we are down for 20 minutes, does it matter when we last saw an endpoint before this session?
Should this be labeled as a bug? Should we patch this?
Yes we are patching it
Yes we are patching it
Can assign a milestone to this one? (https://github.com/Particular/PlatformDevelopment/issues/596)
Yes, once we know when is going out, dependent on #640 being complete
@andreas the SC release process is quite a bit less strict than the Core one because we have no need to support older versions. As a result when you say "are we patching this" and John says yes, it just means that it will go into a release when it is ready.
Good point, so how do we know that this is a prio?
vnext
milestone? label? the "maintainers just know"? other idea?
"maintainers just know"
Dangerous, IMO. Doesn't the waffle board sort order already define this? (Just asking, maybe the sort order us not even persisted)
Symptom
Heartbeats processing starts to fall behind. Exceptions are logged into the log file. Eventually ServiceControl performance starts degrading, making the monitoring side of ServiceControl to not be accurate.
Who's affected
Systems with large number of endpoints that have heartbeats enabled, with short heartbeat interval.