EventDispatcher still has hardcoded settings, but I wonder why a user would need to change them.
@Particular/servicecontrol-maintainers what do you think, can you see a reason to make these settings configurable?
The only good reason that comes to mind is that we have customers using brokers, such as RabbitMQ or SQL, on not-so-reliable infrastructure. We added the ability to configure the circuit breaker to overcome that issue, could this be affected?
@mauroservienti in that scenario I see us doing them a favor by pointing out issues with their own infrastructure. The current hardcoded times are IMO quite generous: we currently wait up to 2 mins before arming the breaker, which is IMO quite a bit of time to wait for a network to respond; increasing this number even further could have dangerous effects.
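The arming behaviour described above can be sketched roughly as follows. This is an illustrative model only, not ServiceControl's actual implementation, and all names here are hypothetical:

```python
import time

class CircuitBreaker:
    """Illustrative sketch of the behaviour described above (hypothetical,
    not ServiceControl's code): failures arm the breaker, but it only
    triggers once failures have persisted for the configured wait time."""

    def __init__(self, time_to_wait_before_triggering=120):
        self.time_to_wait = time_to_wait_before_triggering
        self.first_failure_at = None  # None means the breaker is disarmed

    def failure(self, now=None):
        """Record a failure; return True if the breaker should trigger."""
        now = time.monotonic() if now is None else now
        if self.first_failure_at is None:
            self.first_failure_at = now  # arm on the first failure
        # trigger only once failures have persisted long enough
        return (now - self.first_failure_at) >= self.time_to_wait

    def success(self):
        """Any success disarms the breaker and resets the clock."""
        self.first_failure_at = None
```

Under this model, making the wait time configurable is just a constructor parameter; the debate below is about whether exposing it is wise.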
Just on this subject: when communicating with a SQL Server cluster, some environments take time to fail over; 2 minutes might not be enough to recover.
@yossita what time do you think would be safer? Would 5 mins work?
Not sure, this is why we asked for it to be configurable :) We've seen different behaviours with different SQL Server versions, configurations, and environments. The major issue we had was that when the circuit breakers are triggered, a critical exception is thrown which causes the hosting service to crash.
The major issue we had was when the circuit breakers are triggered, a critical exception is thrown which causes the hosting service to crash.
@yossita that is actually on purpose, we want to raise awareness of it, so you can act on it. The issue is that if we cannot contact the transport, we are in essence broken and not processing messages, hence the monitoring capability is stopped! Maybe there is a different way of raising awareness?
Like logging an error?
Like logging an error?
I'm pretty sure we log errors, but those do not have enough impact and therefore get ignored. It could be that logging to a text file is not the right way of making an impact? Maybe we need to log to the Event Viewer? Or something else?
It is hard for me to vouch for what other customers are doing, but we route all NSB logs to a custom event log and to files (planning to ship them to Splunk eventually). When an error is logged it raises alerts. Maybe leaving it as is is best? Might as well push ops to stabilize the infra...
@Particular/servicecontrol-maintainers thoughts?
I think @andreasohlund and I already had this conversation when discussing the health page concept a while ago
To me the idea that a circuit breaker should completely take down SC is wrong. I'd like to see it take down the broken capability, not the whole system. We could then show the status of each component and allow the operator to restart it without requiring the intervention of a system admin. In production environments we should not assume the ServicePulse operator has rights to log on to a server and restart a service.
Regarding the above: obviously we'd still log to any and all mechanisms when a component has failed, not just show it in Pulse.
I'd like to see it take down the broken capability not the whole system
Yep, that is how Netflix works, and they have a very nice API for it: https://github.com/Netflix/Hystrix/wiki/Operations
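A minimal sketch of that per-component idea, with hypothetical names (this is neither Hystrix's nor ServiceControl's API): a tripped breaker stops only its own capability, and a health page exposes status and operator-initiated restarts:

```python
# Illustrative sketch of per-component degradation: a tripped breaker
# disables only that component, not the whole service. All names are
# hypothetical stand-ins for the health-page concept discussed above.

class Component:
    def __init__(self, name):
        self.name = name
        self.running = True

    def on_breaker_triggered(self):
        self.running = False  # stop just this capability

    def restart(self):
        self.running = True  # operator-initiated restart, e.g. from a health page


class HealthPage:
    def __init__(self, components):
        self.components = components

    def status(self):
        # what an operator would see, without needing server access
        return {c.name: ("running" if c.running else "stopped")
                for c in self.components}
```

The point of the design is that a transport outage stops the affected capability and surfaces in `status()`, while the rest of the service keeps working.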
Hey, again a circuit breaker issue :-)
+1 for making them configurable. Remember that when a Windows service is stopped it takes a minute before Windows service recovery kicks in.
I've also raised this in the NServiceBus repo; we should pause processing when infrastructure is down and continue when it's available again. Custom checks probe the infrastructure, and the result pauses/resumes message consumption.
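The pause/resume idea could look roughly like this; `check_fn` and the consumer here are hypothetical stand-ins for illustration, not NServiceBus APIs:

```python
# Sketch of the idea above: a periodic infrastructure check pauses message
# consumption while the dependency is down and resumes it once the check
# passes again, instead of crashing the host process.

class PausableConsumer:
    def __init__(self, check_fn):
        self.check_fn = check_fn  # returns True when infrastructure is healthy
        self.paused = False

    def poll_infrastructure(self):
        if self.check_fn():
            self.paused = False   # resume when infrastructure is back
        else:
            self.paused = True    # pause instead of raising a critical error

    def try_consume(self, message, handler):
        if self.paused:
            return False          # leave the message on the queue
        handler(message)
        return True
```

This trades the hard crash described earlier in the thread for degraded-but-recoverable operation, at the cost of needing a reliable health check.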
You could eventually quit, but as said, cluster failovers can take a while. A regular rollover is usually short, but when a node really goes down and a database has to recover from the journal at the peak of the day, it probably takes a bit longer...
I am increasing the default timeout as part of #722
The two circuit breaker settings are hardcoded in ServiceControl to these values:
timeToWaitBeforeTriggering: 2 mins
delayAfterFailure: 20 secs
They are also hardcoded in the heartbeat plugin to these values:
timeToWaitBeforeTriggering: 2 mins
delayAfterFailure: 1 sec (default from circuit breaker class)
This could be the case in other plugins as well.
They all need to be configurable, in code and/or via a config file.
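If these settings were exposed via a config file, it could look something like the app.config fragment below. The key names and the TimeSpan format are hypothetical illustrations, not existing ServiceControl settings:

```xml
<configuration>
  <appSettings>
    <!-- Hypothetical keys for illustration only; real names would be
         decided when this issue is implemented -->
    <add key="ServiceControl/TimeToWaitBeforeTriggering" value="00:05:00" />
    <add key="ServiceControl/DelayAfterFailure" value="00:00:20" />
  </appSettings>
</configuration>
```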