Particular / ServiceControl

Backend for ServiceInsight and ServicePulse
https://docs.particular.net/servicecontrol/

Circuit breaker settings are hard coded in Service Control and plugins #484

Closed · vellad closed this 8 years ago

vellad commented 9 years ago

The two circuit breaker settings are hard coded for ServiceControl to these values:

- timeToWaitBeforeTriggering: 2 minutes
- delayAfterFailure: 20 seconds

They are also hard coded in the heartbeat plugin to these values:

- timeToWaitBeforeTriggering: 2 minutes
- delayAfterFailure: 1 second (the default from the circuit breaker class)

This could be the case in other plugins as well.

They should all be configurable, via code and/or a config file.
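For illustration, here is a minimal sketch of what a configurable version could look like, assuming the overrides live in app.config appSettings; the key names and the class are hypothetical, not an existing ServiceControl API:

```csharp
// Hypothetical sketch: read both timeouts from app settings and fall back to
// the current hard-coded defaults when no override is present.
using System;
using System.Configuration;

static class CircuitBreakerSettings
{
    public static TimeSpan TimeToWaitBeforeTriggering =>
        Read("ServiceControl/TimeToWaitBeforeTriggering", TimeSpan.FromMinutes(2));

    public static TimeSpan DelayAfterFailure =>
        Read("ServiceControl/DelayAfterFailure", TimeSpan.FromSeconds(20));

    static TimeSpan Read(string key, TimeSpan defaultValue)
    {
        var raw = ConfigurationManager.AppSettings[key];
        TimeSpan parsed;
        return TimeSpan.TryParse(raw, out parsed) ? parsed : defaultValue;
    }
}
```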

johnsimons commented 8 years ago

EventDispatcher still has hardcoded settings, but I wonder why a user would need to change them. @Particular/servicecontrol-maintainers, what do you think? Can you see a reason to make these settings configurable?

mauroservienti commented 8 years ago

The only good reason that comes to mind is that we have customers using brokers, such as RabbitMQ or SQL, on not-so-reliable infrastructure. We added the ability to configure the circuit breaker to overcome that issue; could this be affected?

johnsimons commented 8 years ago

@mauroservienti in that scenario I see us doing them a favor by pointing out issues with their own infrastructure. The current hardcoded times are IMO quite generous: we currently wait up to 2 minutes before arming the breaker, which is quite a bit of time to wait for a network to respond. Increasing this number even further could have dangerous effects.

yta1 commented 8 years ago

Just on this subject: when communicating with a SQL Server cluster, some environments take time to fail over, and 2 minutes might not be enough to recover.

johnsimons commented 8 years ago

@yossita what time do you think would be safer? Would 5 minutes work?

yta1 commented 8 years ago

Not sure, that's why we asked for it to be configurable :) We've seen different behaviours with different SQL Server versions, configurations, and environments. The major issue we had was that when the circuit breakers are triggered, a critical exception is thrown, which causes the hosting service to crash.

johnsimons commented 8 years ago

> The major issue we had was when the circuit breakers are triggered, a critical exception is thrown which causes the hosting service to crash.

@yossita that is actually on purpose; we want to raise awareness of it so you can act on it. The issue is that if we cannot contact the transport, we are in essence broken and not processing messages, hence the monitoring capability is stopped! Maybe there is a different way of raising awareness?
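For context, here is a minimal sketch of the kind of breaker being described (the names are assumptions, not ServiceControl's actual class): it arms on the first failure, disarms on success, and, once the configured wait expires, escalates to a critical-error callback that stops the host.

```csharp
using System;
using System.Threading;

class RepeatedFailuresCircuitBreaker : IDisposable
{
    readonly TimeSpan timeToWaitBeforeTriggering;
    readonly Timer timer;
    Exception lastException;
    int armed;

    public RepeatedFailuresCircuitBreaker(TimeSpan timeToWaitBeforeTriggering, Action<Exception> onCriticalError)
    {
        this.timeToWaitBeforeTriggering = timeToWaitBeforeTriggering;
        // If the breaker stays armed for the full wait period, escalate.
        timer = new Timer(_ => onCriticalError(lastException));
    }

    public void Failure(Exception exception)
    {
        lastException = exception;
        if (Interlocked.CompareExchange(ref armed, 1, 0) == 0)
        {
            // First failure after a healthy period: start the countdown.
            timer.Change(timeToWaitBeforeTriggering, Timeout.InfiniteTimeSpan);
        }
    }

    public void Success()
    {
        if (Interlocked.CompareExchange(ref armed, 0, 1) == 1)
        {
            // Infrastructure recovered before the wait expired: disarm.
            timer.Change(Timeout.Infinite, Timeout.Infinite);
        }
    }

    public void Dispose() => timer.Dispose();
}
```

The deliberate part is the escalation: once the critical-error callback fires, the host shuts down so the failure cannot go unnoticed.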

yta1 commented 8 years ago

Like logging an error?

johnsimons commented 8 years ago

> Like logging an error?

I'm pretty sure we log errors, but those do not have enough impact and therefore get ignored. It could be that logging to a text file is not the right way of making an impact? Maybe we need to log to the event viewer? Or something else?
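As an illustration of that option, a hedged sketch of writing the trigger to the Windows Event Log in addition to the text log; the source name is an assumption, and registering an event source requires administrative rights:

```csharp
using System.Diagnostics;

static class CriticalEventReporter
{
    const string Source = "ServiceControl"; // assumed source name, not necessarily what the installer registers

    public static void Report(string message)
    {
        // Creating an event source requires admin rights; typically done at install time.
        if (!EventLog.SourceExists(Source))
        {
            EventLog.CreateEventSource(Source, "Application");
        }

        EventLog.WriteEntry(Source, message, EventLogEntryType.Error);
    }
}
```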

yta1 commented 8 years ago

It is hard for me to vouch for what other customers are doing, but we route all NSB logs to a custom event log and to files (we're planning to ship them to Splunk eventually). When an error is logged it raises alerts. Maybe leaving it as is is best? Might as well push ops to stabilize the infra...

johnsimons commented 8 years ago

@Particular/servicecontrol-maintainers thoughts?

gbiellem commented 8 years ago

I think @andreasohlund and I already had this conversation when discussing the health page concept a while ago.

To me, the idea that a circuit breaker should completely take down ServiceControl is wrong. I'd like to see it take down the broken capability, not the whole system. We could then show the status of each component and allow the operator to restart it without requiring the intervention of a system admin. In production environments we should not assume the ServicePulse operator has rights to log on to a server and restart a service.

gbiellem commented 8 years ago

Regarding the above: obviously we'd still log to any and all mechanisms when a component has failed, not just show it in Pulse.

johnsimons commented 8 years ago

> I'd like to see it take down the broken capability not the whole system

Yep, that is how Netflix works, and they have a very nice API for it: https://github.com/Netflix/Hystrix/wiki/Operations
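A rough sketch of that per-capability idea (all names here are illustrative, not an existing ServiceControl or Hystrix API): when a component's breaker trips, only that component is marked as failed and its state is exposed so ServicePulse could display it and offer a restart, instead of the whole host going down.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

class ComponentHealthRegistry
{
    // component name -> failure reason; absence of a key means the component is healthy
    readonly ConcurrentDictionary<string, string> failures = new ConcurrentDictionary<string, string>();

    public void ReportFailure(string component, Exception cause)
    {
        failures[component] = cause.Message;
    }

    public void ReportRecovery(string component)
    {
        string ignored;
        failures.TryRemove(component, out ignored);
    }

    // Snapshot that a health endpoint could serialize for ServicePulse.
    public IDictionary<string, string> FailedComponents()
    {
        return new Dictionary<string, string>(failures);
    }
}
```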

ramonsmits commented 8 years ago

Hey, another circuit breaker issue :-)

+1 for making them configurable. Remember that when a Windows service is stopped, it takes a minute before Windows service recovery kicks in.

I've also raised this in the NServiceBus repo: we should pause processing when infrastructure is down and continue when it's available again. Custom checks probe the infrastructure and the result pauses/resumes message consumption.

You could eventually quit, but as said, cluster failovers can take a while. A regular rollover is usually 'short', but when a node really goes down and a database has to recover from the journal at the peak of the day, it probably takes a bit longer...
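A hedged sketch of that pause/resume idea, assuming a pump abstraction that can be paused and resumed; IMessagePump and the health-check delegate are placeholders, not real NServiceBus or ServiceControl APIs:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

interface IMessagePump
{
    Task Pause();
    Task Resume();
}

class InfrastructureWatchdog
{
    public static async Task Run(IMessagePump pump, Func<Task<bool>> infrastructureIsHealthy, CancellationToken token)
    {
        var paused = false;

        while (!token.IsCancellationRequested)
        {
            var healthy = await infrastructureIsHealthy();

            if (!healthy && !paused)
            {
                await pump.Pause();   // stop pulling messages while e.g. a SQL failover is in progress
                paused = true;
            }
            else if (healthy && paused)
            {
                await pump.Resume();  // infrastructure is back, continue processing
                paused = false;
            }

            await Task.Delay(TimeSpan.FromSeconds(20));
        }
    }
}
```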

johnsimons commented 8 years ago

I am increasing the default timeout as part of #722