App-vNext / Polly

Polly is a .NET resilience and transient-fault-handling library that allows developers to express policies such as Retry, Circuit Breaker, Timeout, Bulkhead Isolation, and Fallback in a fluent and thread-safe manner. From version 6.0.1, Polly targets .NET Standard 1.1 and 2.0+.
https://www.thepollyproject.org
BSD 3-Clause "New" or "Revised" License

Will the circuit breaker trip if enough minor transient errors happen? Should frequency of exceptions or interval between exceptions be a factor? #41

Closed TimGebhardt closed 8 years ago

TimGebhardt commented 8 years ago

Hi,

The circuit breaker doesn't handle the scenario where we get a small number of errors over an extended period of time. The circuit breaker state keeps just a simple count and breaks for a hard-set amount of time once it's seen the set number of exceptions.

Say we're using Polly to wrap a call to an external web service that errors out once an hour. After three hours we'll lock up hard, even though we probably don't want to break the circuit in this scenario.

Does it make sense to have some sort of expiration on when an exception happens so that their effect dampens over time?

michael-wolfenden commented 8 years ago

Agreed, I was thinking that you could specify another timespan that would reset the count if an exception hadn't occurred in that period of time.
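
Roughly this (just a sketch of the idea, with made-up names - not a proposed API):

```csharp
using System;

// Hypothetical sketch only: reset the consecutive-failure count if no exception
// has been seen within 'countResetInterval'. Names are illustrative, not Polly API.
class DampenedFailureCounter
{
    private readonly TimeSpan _countResetInterval;
    private int _count;
    private DateTime _lastFailureUtc = DateTime.MinValue;

    public DampenedFailureCounter(TimeSpan countResetInterval)
    {
        _countResetInterval = countResetInterval;
    }

    public int RecordFailure(DateTime nowUtc)
    {
        if (nowUtc - _lastFailureUtc > _countResetInterval) _count = 0; // quiet period: start counting afresh
        _lastFailureUtc = nowUtc;
        return ++_count;
    }
}
```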

Thoughts?

bunceg commented 8 years ago

Good idea, however could you have some internal checks on the handled exception to get round it? You could argue that, to keep it simple, you should implement it yourself in that layer instead of adding the override to the circuit breaker?

Tbh, I would like to see the custom-implementation-of-circuit-breaker fork merged instead, if that is in line with your design, as that sort of covers this (i.e. do it yourself in a custom implementation) and also allows us to handle circuit breaks in a central place across several out-of-process services... Though maybe my solution to Tim's request is the same for me... Haven't thought it through fully yet 😏

reisenberger commented 8 years ago

I'm not sure from @TimGebhardt's original question, if he was assuming there were successes in-between the failures across the three-hour period or not. My reading of the codebase was that the breaker state is Reset() after any successful call to the supplied action. (Have I got this right?). So, if [A] a service failed intermittently (with successes in-between) across three hours, the existing CircuitBreakerPolicy implementation wouldn't trip. If on the other hand [B] the service only received three calls across a three-hour-period and all failed, the existing CircuitBreakerPolicy would trip. (Seems to make sense - all calls have failed - but not sure how long one would choose to trip a circuit for on a service only averaging a call per hour?)

The question made me think tho about a frequency-based circuit-breaker (which Michael Nygard also hints at in his book). Something like: "If the policy experiences N exceptions in any period T, break the circuit for a given timespan" (eg "Break the circuit if we receive 3 exceptions in 30 seconds"). I have just firmed up a sketch I put together a few days ago (when I first saw Tim's q) and committed to https://github.com/reisenberger/Polly/tree/FrequencyBasedCircuitBreaker - is this an avenue worth pursuing at all?

It essentially maintains a FIFO queue of the timestamps of the last N exceptions, and uses this quite simply to determine if N have occurred within the given timespan.
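
In outline, the check looks something like the following (a simplified sketch only, not the exact code in the branch above):

```csharp
using System;
using System.Collections.Generic;

// Simplified sketch only (not the branch code): break when the last N exceptions
// all fall within the configured period T.
public class FrequencyBreakerSketch
{
    private readonly int _exceptionsAllowed;
    private readonly TimeSpan _withinTimespan;
    private readonly Queue<DateTime> _exceptionTimestamps = new Queue<DateTime>();

    public FrequencyBreakerSketch(int exceptionsAllowed, TimeSpan withinTimespan)
    {
        _exceptionsAllowed = exceptionsAllowed;
        _withinTimespan = withinTimespan;
    }

    // Returns true if the circuit should break, given an exception occurring at nowUtc.
    public bool OnException(DateTime nowUtc)
    {
        _exceptionTimestamps.Enqueue(nowUtc);
        if (_exceptionTimestamps.Count > _exceptionsAllowed) _exceptionTimestamps.Dequeue(); // keep only the last N

        // If the oldest of the last N exceptions is within T of now, then N exceptions occurred within T.
        return _exceptionTimestamps.Count == _exceptionsAllowed
            && nowUtc - _exceptionTimestamps.Peek() <= _withinTimespan;
    }
}
```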

reisenberger commented 8 years ago

Or for an alternative mathematical averaging approach? (doesn't maintain a FIFO queue) https://github.com/reisenberger/Polly/tree/FrequencyBasedCircuitBreaker2

(I can see merits and demerits to all of the alternatives ...)
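
For concreteness, one form such an averaging approach could take - illustrative only, not necessarily what that branch does - is an exponentially-decaying failure score:

```csharp
using System;

// Illustrative only (not necessarily what the FrequencyBasedCircuitBreaker2 branch does):
// an exponentially-decaying failure score, so older failures fade without keeping a queue.
public class DecayingFailureScoreSketch
{
    private readonly double _halfLifeSeconds;
    private double _score;
    private DateTime _lastUpdateUtc;

    public DecayingFailureScoreSketch(double halfLifeSeconds, DateTime startUtc)
    {
        _halfLifeSeconds = halfLifeSeconds;
        _lastUpdateUtc = startUtc;
    }

    // Returns the decayed score after recording a failure; the caller breaks the
    // circuit when the score exceeds some threshold.
    public double RecordFailure(DateTime nowUtc)
    {
        double elapsedSeconds = (nowUtc - _lastUpdateUtc).TotalSeconds;
        _score *= Math.Pow(0.5, elapsedSeconds / _halfLifeSeconds); // decay what came before
        _score += 1.0;                                              // add this failure
        _lastUpdateUtc = nowUtc;
        return _score;
    }
}
```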

bunceg commented 8 years ago

Regardless of the mathematics, I think your approach to the problem is fine, as it better matches what you would want a circuit breaker to do in practice.

However, to refer back to the other part of my note, all of these circuit breakers are great for the simple one-process approach. IMO, the type of things you are circuit breaking (external services) are highly likely to be used across processes. Unless I've completely missed it, I don't think Polly handles this scenario.

So, should Polly implement this? (I think not, as the backing store to hold this data, and the appropriate trigger logic for the breaker, is probably very system-specific.) But I'm not sure of the best way to go about implementing this outside of Polly and injecting the functionality into its fluent-based framework...

reisenberger commented 8 years ago

Hi @bunceg . If I understand you rightly, the scenario is that if services A, B and C (say) all use another service M, then if service A breaks its circuit to M, it could be useful for services B and C to know about this and break their calls to service M too? We went through a similar line of thinking, but concluded - is there a need? If service A cannot reach service M for a reason that would also prevent service B, won't service B just discover that of its own accord (and also circuit-break) in due course? Is the overhead of inter-process co-ordination (which might also fail) worth it? (And A could also fail to reach M for a reason which doesn't affect B ...)

We did, however, go through a similar line of thinking - that it could be useful to be able to inject state (as you say) into a CircuitBreaker - for manually breaking a circuit. Sam Newman's (highly recommended) book on microservices suggests one might manually break all circuits to a downstream service in order to upgrade that downstream service. I fretted for a while over the fact that we couldn't inject state into a Polly CircuitBreaker to achieve this. For now, however, we have gone with the same pragmatic logic as above - do we need to? If we build our processes robust enough to be able to survive downstream unavailability (whether unexpected or 'expected' eg due to upgrade), then that should be enough. But @michael-wolfenden - could being able to manually break a circuit (inject state) perhaps be a useful addition sometime?

bunceg commented 8 years ago

Hi,

Yes, you do understand, and we are following the same principle in that A, B or C will find out of their own accord eventually. I take on board your comment that the co-ordinator could fail, which is a real concern in our situation too, but IMO that is a decision for the implementing team based on the services involved and the consumers of those services.

However, your approach of injecting state may well solve the problem anyway - something else can check the shared state first and manually break the circuit, and I like that way of doing this. Our implementation uses a master switch to avoid calling the service anyway, but your option sounds a bit cleaner.

ps: thanks for the book recommendation, I'll add it to my collection with "release it!" :)

DarrellMozingo commented 8 years ago

How about using a MemoryCache for success/failure rates by default, and getting a failure ratio from that? We prefer to use ratios rather than hard numbers (and likewise, ratios when the circuit is in the half-open state). Then it's just a simple "x% of requests in n minutes have failed, trip it". TTL on the cache can easily give you the n minutes automatically.
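
Something along these lines (rough sketch only; names and structure are illustrative):

```csharp
using System;
using System.Linq;
using System.Runtime.Caching;

// Rough sketch of the MemoryCache idea: each result is cached with a TTL equal to the
// sampling window, so old results drop out of the ratio automatically.
public class CacheBackedFailureRatioSketch
{
    private readonly MemoryCache _results = new MemoryCache("circuit-results");
    private readonly TimeSpan _window;

    public CacheBackedFailureRatioSketch(TimeSpan window)
    {
        _window = window;
    }

    public void Record(bool success)
    {
        _results.Add(Guid.NewGuid().ToString(), success, DateTimeOffset.UtcNow.Add(_window));
    }

    // Trip the circuit when this exceeds x%. The cache evicts entries after the TTL,
    // so the ratio reflects (approximately) the last 'window' of calls.
    public double FailureRatio()
    {
        var results = _results.Select(kvp => (bool)kvp.Value).ToList();
        return results.Count == 0 ? 0.0 : results.Count(ok => !ok) / (double)results.Count;
    }
}
```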

Also agree re not having a centralised way of notifying other services about failures. What if the connection from just Service A -> M goes down/degrades, but the rest are OK? Now you're telling services B & C that they can't talk to M even though they can. Dangerous, and lots of code/systems overhead. They'll figure it out.

kristianhald commented 8 years ago

Hi @reisenberger. Thanks to the team for developing Polly. It is a nice library.

In regards to this issue, I concur it would be nice to have a feature that allows the circuit to break if N exceptions occur over a period T.

One example where I would like a circuit to break is the case where a client is communicating with a server that is overloaded. The server would once in a while successfully handle a request, but most of the time return timeouts (as it will not be able to handle a request within the allotted time). Breaking the circuit for a little while might help the server get itself back up again.

Another example is the one @TimGebhardt talked about, where, during slow periods (maybe nighttime), a single request is being retried and failing every once in a while, which breaks the circuit and can keep the circuit open when good requests begin to filter in.

I looked at both implementations you made and think that both would solve the two examples above. The first implementation is easier to read. The second implementation requires a bit more thought to be convinced that there isn't an outlying case that could result in improper behavior.

Is this a feature that the team wants into Polly? Anything I can help with?

reisenberger commented 8 years ago

@kristianhald Thank you for your interest in Polly - and offer of help!

More sophisticated approaches to circuit-breaking are very much on the Polly roadmap (see towards end of doc) (for reasons TimGebhardt and you mention, and others)

Right now AppvNext are reflecting further on the best forward path for circuit-breaker: whether to go for the frequency-based and proportion-based circuit-breakers already mooted in the roadmap, and/or something more sophisticated along the lines of Hystrix’s time-slicing approach (somewhat similar to the suggestion from @DarrellMozingo above). Community views?

(may set out further views/options for feedback, later in this thread)

reisenberger commented 8 years ago

( and @kristianhald - we may yet take you up on your offer of help! )

kristianhald commented 8 years ago

Looked through the source for Hystrix regarding their circuit breaker implementation. As I read it, they use buckets, where each bucket is a slice of the '_durationOfExceptionRelevance'. The precision is then defined by the size of the slice.

I think that the 'frequency-based' solution is the same as what Hystrix has implemented (except that they look at percentages instead of the total number within the given period). The hubris in me would state that the first implementation @reisenberger has made, with the queue of exception times, is time slicing where the window is very small and each window only contains a single piece of information: that an error occurred. :smiley:

I also like that the Hystrix implementation, before triggering the circuit breaker, checks if the number of requests received during the period is above a certain threshold (this is probably necessary, as otherwise a single error will have a large impact on the error percentage if the throughput is low).

// check if we are past the statisticalWindowVolumeThreshold
if (health.getTotalRequests() < properties.circuitBreakerRequestVolumeThreshold().get()) {
    // we are not past the minimum volume threshold for the statisticalWindow so we'll return false immediately and not calculate anything
    return false;
}

Personally, the few times I have used a circuit breaker, it has mostly been leaning towards the frequency-based approach. It is probably a personal preference, because I always like comparing what happened with when it happened.

reisenberger commented 8 years ago

@kristianhald Many thanks for your input!

The Polly circuit-breaker internals have evolved since last August, and I pushed some new prototypes in the last couple of days to https://github.com/reisenberger/Polly/tree/CircuitBreakerSpikeMarch16 . This includes a new sketch closer to the Hystrix approach. @kristianhald / wider community: very happy for any feedback on these. (New variants are also optimised for performance over the previous, and factor out the metrics controlling the circuit more clearly.)

ConsecutiveCountCircuitController: the existing Polly circuit-breaker logic.

FrequencyCircuitController: as discussed earlier in this thread and in the Polly roadmap. Triggers on >N exceptions in a period T.

ProportionCircuitController: as in the Polly roadmap. Triggers on >N exceptions in T calls.

Commentary: An advantage of the FrequencyCircuitController and ProportionCircuitController is that both are simple to understand. Some disadvantages:

... which brings us to...

TimesliceCircuitController: This is an early cut at a ratio-within-timeslice approach. @kristianhald, it brings in the minimum throughput threshold idea too.

Commentary: Initially harder to understand. But the approach deals well with both quiet periods and high densities (combining as it does elements of both ratio and frequency measures). The approach may be better suited to higher-throughput systems because of the time-slicing, tho the user has control over the duration of the slice.
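
To give a feel for the approach, here is a much-simplified sketch of the kind of check involved (not the spike code; the real implementation also handles half-open state, thread-safety, etc.):

```csharp
using System;

// Sketch of a ratio-within-timeslice check with a minimum-throughput guard.
// Illustrative only; not the spike implementation.
public class TimesliceHealthSketch
{
    private readonly TimeSpan _timesliceDuration;
    private readonly double _failureThreshold;   // eg 0.5 = break at 50% failures
    private readonly int _minimumThroughput;     // don't judge the circuit on too few calls
    private DateTime _timesliceStartUtc;
    private int _successes;
    private int _failures;

    public TimesliceHealthSketch(TimeSpan timesliceDuration, double failureThreshold, int minimumThroughput)
    {
        _timesliceDuration = timesliceDuration;
        _failureThreshold = failureThreshold;
        _minimumThroughput = minimumThroughput;
        _timesliceStartUtc = DateTime.UtcNow;
    }

    // Records a result and returns true if the circuit should break.
    public bool RecordResultAndCheckBreak(bool success, DateTime nowUtc)
    {
        if (nowUtc - _timesliceStartUtc >= _timesliceDuration)
        {
            // Timeslice rolled over: discard the old statistics (no smoothing buckets in this sketch).
            _timesliceStartUtc = nowUtc;
            _successes = 0;
            _failures = 0;
        }

        if (success) _successes++; else _failures++;

        int total = _successes + _failures;
        if (total < _minimumThroughput) return false;    // not enough calls in the timeslice to judge

        return _failures >= _failureThreshold * total;   // break on failure ratio
    }
}
```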

Community views on all these options welcome

reisenberger commented 8 years ago

@kristianhald If you are interested in helping/contributing, I can set out what would be needed to bring one or more of the above variants to production in Polly (wiring up the configuration; unit tests), and you can state what you might be interested in taking on? ... The unit tests for any of the above time-based policies will be quite fun :smiley_cat:, using Polly’s SystemClock concept to manipulate time to simulate scenarios.
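
For anyone not familiar with the SystemClock approach: tests can substitute the clock delegate and step it forward, roughly like this (a sketch only; the spike's tests show the exact usage):

```csharp
using System;
using Polly.Utilities;

// Sketch of the time-manipulation pattern: pin SystemClock.UtcNow to a variable the test
// controls, then advance that variable to simulate quiet periods or timeslice rollover.
public static class ClockManipulationExample
{
    public static void SimulateTimePassing()
    {
        DateTime simulatedTime = new DateTime(2016, 1, 1, 0, 0, 0, DateTimeKind.Utc);
        SystemClock.UtcNow = () => simulatedTime;

        // ... execute some actions through the policy at 'simulatedTime' ...

        simulatedTime = simulatedTime.Add(TimeSpan.FromSeconds(30)); // jump 30 seconds forward

        // ... execute further actions; the breaker now sees 30 seconds as having elapsed ...

        SystemClock.Reset(); // restore the real clock when the test finishes
    }
}
```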

(AppvNext efforts at the moment are focused on #14 (a major step to unlocking some of the other elements on the roadmap), and I will have extra time challenges from April.)

As a first step also, the wider AppvNext team must decide how many of the above variants to bring to market. As ever, balance to be struck between power/options and simplicity.

Community feedback welcome again: would some of the circuit-breaker variants in previous post be more useful than others?

reisenberger commented 8 years ago

Further development effort has been made on TimesliceCircuitBreaker. The https://github.com/reisenberger/Polly/tree/CircuitBreakerSpikeMarch16 spike now wires up the TimesliceCircuitBreaker syntax and adds unit tests. The implementation does not at this time subdivide the timeslice into further statistical buckets (in the Hystrix manner) for 'smoothing' of the statistics at timeslice rollover, though this could be added. Comment welcome.

kristianhald commented 8 years ago

Wow, nice progress!! A bit faster than I expected. Here I come and say that I want to help and then I am nowhere to be seen. I apologise for that. Enough self-pity.

I think the TimesliceCircuitController looks good. The fact that OnActionSuccess does not check whether the failureThreshold has been reached got me thinking. There might be a situation where 3 failures occur followed by a success, which would result in 75% errors with the throughputThreshold of 4 having been reached, which would mean that it should break. However, after a bit of thought, I came to the conclusion that it did not make sense to open the circuit on a success even if the failureThreshold has been reached. What do you think?

Also, I am a bit uncertain whether the rollover is necessary and how hard it will be to understand. Without it, the worst case is that it will take up to 2 * timesliceDuration before the circuit opens.

A scenario could be one where the timesliceDuration is set to 5 minutes and, 1 minute into the slice, failures begin to occur, but just few enough to stay below the failureThreshold. The slice resets and the failures continue to occur. I think it will depend on what the timesliceDuration is going to be set to (seconds, minutes or hours) and how fast the circuit breaker should react. For what I work on, rolling slices are not critical, as it will just mean that Polly reacts to the failure state a bit later. Being able to specify ratio and timesliceDuration already goes a long way.

I know I said it before, but I have more time on my hands now. If there is anything you need fixed before this can be merged, then please throw it to me.

reisenberger commented 8 years ago

@kristianhald Thanks for your interest and involvement with this feature! If you have time available, any contribution to add the rolling slices/buckets within timesliceDuration (making it more a sliding statistical window) would be valuable. Shout if you can work on that - and if you have any thoughts/questions around it. (We could operate initially without it, but it would improve the responsiveness of the breaker.)

Nice observation about OnActionSuccess! Will comment further shortly.

Re:

what the timesliceDuration is going to be set to (seconds, minutes or hours)

... I have thoughts exactly about this - about the most appropriate way to configure this kind of circuit-breaker. I will add thoughts either here or in draft documentation (review will be welcome...), in a few days' time.

Thanks!

reisenberger commented 8 years ago

@kristianhald Here are some specific thoughts around adding the rolling slices/buckets within timesliceDuration ... if you are able to work on it!

[1] The number of buckets that timesliceDuration is divided into internally, need not (I am thinking) be user-configurable at this stage. For example, we could just pick a number (4? 5? 8? 10?) which significantly improves the responsiveness/fidelity of the circuit-breaker to its configuration. eg 5 = rolling-throw-away 20% of the statistics at a time (a great improvement on 100%-throw-away; likely to cover most fidelity situations; strikes a balance between fidelity and extra computation). What do you think?

Regarding not user-configurable: Simplicity of configuration is a Polly goal. My feeling is the extra configurability of the number of internal buckets would not add significant feature benefit versus the extra comprehension load. Our aim is just to improve breaker fidelity-to-configuration to an acceptable tolerance; any sensible value that achieves this is good ... (tho open to community feedback if anyone can make a case for configurable buckets ...)

(Hystrix probably cares about configuring this more, because they emit stats at bucket frequencies)

[2] Should we consider some minimum timesliceDuration below which we do not subdivide into further buckets? Or (alternative/similar) some minimum size for a bucket? (ie use fewer buckets - or none at all - if they would be smaller than that). What do you think?

Rationale: At some point there will be diminishing returns in the trade-off between more fidelity/a more responsive circuit-breaker, versus higher computation. We could performance-measure to determine exactly that point ... it might even be quite small ... But keeping pragmatic, is there likely to be a need to fine-tune breaker responsiveness (what buckets affect) at sub-half-second or sub-quarter-second levels? (And if that fine-tuning is needed, reducing the overall timesliceDuration to those levels, without buckets, is probably equally effective.) I would be comfortable with a minimum half-second or quarter-second bucket-within-timeslice size.

(In summary ... Usually I do not like to impose decisions from within a library on the user. But for this already-quite-complex feature, I propose starting with some sensible/pragmatic decisions to reduce configuration complexity, and await community feedback (always welcome!) if refinement is needed... ... )

(EDIT: Configuration assumptions/limits/tolerances would also be documented.)

kristianhald commented 8 years ago

@reisenberger I can work on placing the health metrics in slices (buckets) to provide a rolling window for the circuit breaker.

The other day I made a quick POC based on the code you did, which includes a single test: https://github.com/kristianhald/Polly/commit/0b0bb45d28a0e7fe983a003c68ce6628740d792b The POC hardcodes the bucket size to 10, but only because it is a POC.

Would that be a way to go, or were you thinking of going in a different direction? Do you have existing tests that measure performance which I could use for inspiration?

kristianhald commented 8 years ago

eg 5 = rolling-throw-away 20% of the statistics at a time (a great improvement on 100%-throw-away; likely to cover most fidelity situations; strikes a balance between fidelity and extra computation). What do you think?

Any number higher than 1 would indeed improve responsiveness. My initial thought was to have a quick calculation (done only once, at the creation of the circuit breaker) which would give a larger bucket slice duration for larger windows, with a minimum of 0.x seconds per slice (and maybe also a maximum). Something like 4Log(windowDuration * 500), which would give the following results:

| Window duration | No. of buckets (rounded) | Bucket timeslice |
| --- | --- | --- |
| 10 seconds | 8 | 1.25 seconds |
| 1 minute | 13 | 4.62 seconds |
| 10 minutes | 23 | 26.087 seconds |
| 1 hour | 37 | 97.297 seconds |

The reason for choosing a calculation like the above is that if the window duration is long, then the responsiveness of the breaker is allowed to be lower. However, choosing a calculation that provides a good number for any input might be hard?

kristianhald commented 8 years ago

Regarding not user-configurable

I agree that the feature becomes harder to use if the developer has to choose the bucket size or the bucket duration. I always like a library providing me with reasonable defaults, as I believe the developers of the library have better knowledge of the area than I do. Also, if the default is based on the window duration provided, then a developer wanting more responsiveness can lower the window duration.

I do not believe it is necessary for a 1.0 version of this feature (because the feature goes a long way and the bucket duration can be controlled via the window duration), but it can be hard for a library developer to anticipate all usages of their library (having worked on a few software projects where a core library had to be built upon and extension was only possible at the hooks they provided), and therefore there might be cases where a developer will need to override the default.

[2] Should we consider some minimum timesliceDuration below which we do not subdivide into further buckets? Or (alternative/similar) some minimum size for a bucket?

I cannot imagine needing a bucket duration of less than a quarter- or half-second (not that someone might not need it). That would mean an application that needs to cut off as soon as enough errors are encountered and cannot wait a quarter- to half-second before opening the breaker. In most applications I do not think this is an issue.

We also have to remember that we are using DateTime, which only has a resolution of 1-16ms depending on hardware and OS settings. I believe talking about bucket durations that are so low is an advanced topic.

I still believe that in the first version the library should provide reasonable defaults, as these will be the settings mostly used. Also, adding this feature will require some thought about what would and would not be allowed.

reisenberger commented 8 years ago

Hi @kristianhald. Thanks for all this thorough and detailed thinking; all very useful!

I made a quick [buckets] POC

Thanks - will look at this shortly!

I made a draft documentation (wiki page) earlier today at https://gist.github.com/reisenberger/d0ed99101634b0388dd7e3b92fbfadac .

@kristianhald Re the proposed logarithmic calculation of bucket size from stat window duration, see my comments within the draft doco about the possible downside of this kind of circuit-breaker with long statistical windows: the responsiveness of the circuit-breaker seems in the worst case (do you agree?) essentially proportional to the stat window duration, however finely you divide it into buckets, because of the 'long tail of successes' problem described. Is the logarithmic calculation of bucket size still worthwhile in light of this? I cannot imagine anyone really wants a circuit-breaker that doesn't react for 5 minutes, 10 minutes, 30 mins, whatever, to a 100% failure problem.

Given the relatively narrow range of times this leaves for which the timesliceDuration probably makes sense (a few seconds to a minute or two), I wonder if it is adequate just to adopt a fixed number of buckets (unless makes buckets too small as prev discussed). (While I like the elegance of the logarithmic approach, we have to consider also that we would perhaps have to explain it in doco; that somebody has to maintain this in future, or ask/seek to understand why it was done this way, etc.) @kristianhald, perhaps really this is the same as you are also already saying: choosing a calculation that provides a good number for any input might be hard... :smiley:

@kristianhald You'll see in the doco that I have varied the terms slightly (AdvancedCircuitBreaker; samplingDuration). However, for now, in the forks we are working on, let's stick to the current class/variable names if poss - this will keep forks easier to merge while work-in-progress. When we are feature-complete, we can rename to the final naming before making the PR to master. Thanks! That said: feedback on the choice of terminology is all welcome!

reisenberger commented 8 years ago

@kristianhald To state more precisely part of my previous thinking: varying the size/number of buckets can increase the fidelity (fidelity-at-all-times) of the circuit-breaker to its configured thresholds, but cannot increase its responsiveness (overall speed of response) to the theoretical 'long-tail of successes' problem stated. (which also, after all, might not only be a theoretical problem in a lot of cases: some systems will behave exactly like this 100->0%! :smile_cat: )

I will continue thinking about this, but would be interested in your reaction to this problem.

Finally, regarding my comments about configuration values that are suitable / not-suitable: this is at this stage to share thinking and explore the contours of the problem, not suggesting yet to disallow values in code. However, we should indeed consider (later) the configuration limits for each parameter.

Again: very many thanks for the productive contributions!

reisenberger commented 8 years ago

Draft documentation (https://gist.github.com/reisenberger/d0ed99101634b0388dd7e3b92fbfadac) updated to be less prescriptive about which timesliceDurations make sense, because, as @kristianhald you rightly point out, we cannot anticipate all usages of the feature.

However, relationship between timesliceDuration and breaker responsiveness kept clear, because it is important that users seeking a responsive breaker do not fail to understand the possible implication of configuring a breaker in minute or hour periods.

kristianhald commented 8 years ago

@reisenberger Read the document and I entirely agree. I think that the naming is much better than what we have used before. I especially like samplingDuration.

This is because of the way a 'long tail' of successes affects statistics. Consider a circuit configured with a sampling duration of 5 minutes (not recommended), set to break at 50% failure threshold. Imagine everything has been working perfectly (100% success rate) for a while. 100% failures then start occurring (at roughly the same throughput). Such a circuit will take at least 2.5 minutes to reach 50% failure rate (to 'work off' the statistics from the 100% success era).

I did not take that into account when looking at the value given to the samplingDuration variable. It's a bit interesting what happens when both the failure threshold and the sampling duration can be changed. Setting the sampling duration low but the failure threshold high means that the breaker will break if there is a small hiccup in the network. However, setting the sampling duration high but the failure threshold low means small hiccups in the network do not affect the breaker, but issues like packet loss can trigger it. If both are set high, then it just means it should only break in case the other side is really down; setting both low means you are using a service you really do not want to hammer when something is wrong, and you want it to stop doing so quickly (reminds me somewhat of scraping protocols :smile:).

the responsiveness of the circuit-breaker seems in the worst case (do you agree?) essentially proportional to the stat window duration, however finely you divide it into buckets, because of the 'long tail of successes' problem described.

Yes.

I wonder if it is adequate just to adopt a fixed number of buckets (unless makes buckets too small as prev discussed).

I think that a fixed number of buckets would be sufficient for most cases, and if someone needs a more fine-grained or coarse-grained bucket number, then the easiest and probably best solution would be to overload the AdvancedCircuitBreaker syntax with an additional field for specifying the number of buckets. If a developer has this need, then it probably means the developer has enough knowledge about the circuit breaker to know the implications, I would assume.

To state more precisely part of my previous thinking: varying the size/number of buckets can increase the fidelity (fidelity-at-all-times) of the circuit-breaker to its configured thresholds, but cannot increase its responsiveness (overall speed of response) to the theoretical 'long-tail of successes' problem stated.

I was about to write that I didn't agree, but having thought through different examples, the only one I can think of is the case where no communication has gone through the breaker and then suddenly communication begins and fails. This is a very poor example.

I think that the number of buckets influences the circuit-breaker more or less depending on what the failure threshold and minimum threshold are set to (as you write). However, if the failure threshold is set low (< 50%) and the minimum threshold is set higher, then having only 1 bucket will reset the throughput counter every samplingDuration, requiring the throughput threshold to be met again before the circuit-breaker acts.

I believe that the minimum threshold is the one most affected by the number of buckets, more so than the other two thresholds. Is that correct?

kristianhald commented 8 years ago

However, relationship between timesliceDuration and breaker responsiveness kept clear, because it is important that users seeking a responsive breaker do not fail to understand the possible implication of configuring a breaker in minute or hour periods.

Agreed. Just thinking about the scenarios using the current three variables shows how difficult it is to select correct values for one's needs. Selecting a bad combination of numbers can result in the circuit-breaker opening either too soon or too late.

Would it be an idea to describe in the documentation what happens when varying the failure threshold and sampling duration? Thinking here about the four outer-case combinations (high, low), (low, high), (high, high) and (low, low), and what would happen in the circuit-breaker.

reisenberger commented 8 years ago

@kristianhald Lots more useful thoughts - thank-you! I think we are shaping up what are the important practical limits on each parameter. I will summarise these shortly for review.

Re the POC sketch, I see no problems with this (nice refactor). I can see possible points-to-check later around possible/minor micro-optimisations [to consider if necess] and naming, but that is best done after all the major conceptual and boundary issues (in discussion now) are resolved, implemented and proved through testing.

Thanks for all the great contribution!

kristianhald commented 8 years ago

@reisenberger Made a new implementation, which is a somewhat more cleaned-up version of the POC. It's located here: https://github.com/kristianhald/Polly/tree/CircuitBreakerSpikeMarch16_AddedWindows

I did not try to optimize the implementation; as you said, we can do that when the fundamentals are done.

I went from using 'Bucket' to using 'Window' instead, as I felt it was a better name. The implementation looks a lot like the POC. Also, I added a few tests to ensure that the windows are being used by the implementation correctly.

Do you see some test cases that are missing, which I should add?

I was thinking, in regards to the issue with a low sampling duration and windows, maybe for simplicity only use rolling windows if the sampling duration is set high enough. It could be documented by stating that if sampling duration is set to x or higher, then rolling windows are used else a single window is used for the entire sampling duration.

What do you think?

reisenberger commented 8 years ago

Hi @kristianhald

new implementation [...] https://github.com/kristianhald/Polly/tree/CircuitBreakerSpikeMarch16_AddedWindows

Super; I will review further in the next few days.

reisenberger commented 8 years ago

Hi @kristianhald. Re:

I was thinking, in regards to the issue with a low sampling duration and windows, maybe for simplicity only use rolling windows if the sampling duration is set high enough. It could be documented by stating that if sampling duration is set to x or higher, then rolling windows are used else a single window is used for the entire sampling duration.

Yes, let's do this. I suggest herewith some decisions to keep us moving (tho comments certainly welcome if anyone sees other angles). Let us declare in the TimesliceCircuitBreaker some internal consts such as:

internal const int DefaultNumberOfInternalBuckets = 10; // or 'windows' ... align to preferred terminology
internal const long ResolutionOfCircuitTimer = 20 * TimeSpan.TicksPerMillisecond; // ie TimeSpan.FromMilliseconds(20), expressed as a compile-time constant

and run some decision code as you suggest, something like

if (timesliceDuration.Ticks < ResolutionOfCircuitTimer * DefaultNumberOfBuckets) { /* don't divide into windows */ } else { /* do divide into windows */ }

Can the operational code be done so that it doesn't look too messy / branching depending on whether it's using further buckets/windows or not?

Rationale: Per DateTime documentation, DateTime resolution is around 15-16ms, as you commented earlier. Let's round this up to TimeSpan.FromMilliseconds(20). There isn't any point in creating slices (or windows within slices) smaller than this, as it will just lead to empty slices/windows. As we are defining as a well-named const ResolutionOfCircuitTimer, it'll be easy to change later if needed, and easily visible if any user later has higher resolution requirements and investigates the code.

Similarly, in TimesliceCircuitBreaker/Async, let's change:

if (timesliceDuration <= TimeSpan.Zero) throw new ArgumentOutOfRangeException("timesliceDuration", "Value must be greater than zero.");

for something like:

if (timesliceDuration.Ticks <= TimesliceCircuitBreaker.ResolutionOfCircuitTimer) throw new ArgumentOutOfRangeException("timesliceDuration", String.Format("Value must be greater than {0} milliseconds.  This is the minimum resolution of the CircuitBreaker timer.", TimesliceCircuitBreaker.ResolutionOfCircuitTimer / TimeSpan.TicksPerMillisecond));

These approaches could be refined later (for example, between 200ms and 20ms we could adopt an algorithm which provided the maximum number of buckets which kept bucket size over 20ms). But let's start instead from a premise 'keep things as simple as possible until we know they need to be more complicated'.

Regarding descending as far as TimeSpan.FromMilliseconds(20) rather than some arbitrary limit 1 second or 1/4 second, well: given we have the resolution, we may as well permit it: as you say, we can't foresee all uses of the library.

Does this sound like a good way to proceed?

Tomorrow hopefully I will add my promised comments/responses on the practical limits on each parameter and how they may interact. Have one more observation to add, but otherwise believe that is fairly thoroughly thought through now.

Thanks for the great collaboration!

EDIT: Draft documentation updated for these suggestions.

reisenberger commented 8 years ago

@kristianhald I made a timing test harness also for the circuit breakers here: https://gist.github.com/reisenberger/92dc8d73b4df127b296ed8daee3ed93d

The results on my current development machine are here: https://gist.github.com/reisenberger/a6fab34402731333a61600dc0f06d7b0

Later, we can use this for performance tuning, if/as necessary. At this stage, the intent was to determine that the impact of using-versus-not-using CircuitBreaker or TimesliceCircuitBreaker (both are tested) was negligible compared to the other tolerances we have been discussing. It is not a surprise to know this, but good to have it confirmed. The impact of using either circuit-breaker is (on my machine) about 3 to 6 millionths of a second per call (if anyone wants to double check the code/calculation I will be happy!). While other machines/servers etc may vary, this is clearly orders of magnitude different than typical network latency etc. And orders of magnitude faster than the dimensions we are setting on slices/buckets/windows (important to know no interference there).

[Results based on my original spike of the TimesliceCircuitBreaker; we should run this against the variants with buckets/windows also - but reassuring to know we are orders-of-magnitude insulated from interference at the mo]

If anyone can spot any thinking mistakes in the timing test harness, do shout!

Thanks

joelhulen commented 8 years ago

This is awesome stuff, you two! Sorry I've been unable to engage in the conversation, due to very tight commitments at the moment. I have been watching from the sidelines, but don't want to interject my opinions without thoroughly looking through the code and thinking through the scenarios.

However, I will take @reisenberger's tests for a spin and offer up the results, as well as any suggestions that could possibly be helpful.

You guys rock!

reisenberger commented 8 years ago

Addendum: my performance stats test only the synchronous versions of the circuit-breaker. We should also test the async versions, as there'll be the extra async/await overhead. It may be significant, but hopefully/probably not enough extra to touch the other configuration limits we have proposed.

reisenberger commented 8 years ago

@kristianhald Briefly re my previous comment:

Can the operational code be done so that it doesn't look too messy / branching depending on whether it's using further buckets/windows or not?

(BTW, this was just a general thought - not based on any reading of the code). Thanks!

reisenberger commented 8 years ago

@kristianhald I now feel relatively clear on the effects of setting different parameters near boundaries, as follows.

failureThreshold: Very low (eg 10%) will clearly cause the breaker to behave like a 'hair trigger', breaking on almost the slightest error. Very high (eg 90%+) will cause the breaker to break almost only when the underlying system is completely unresponsive - similar to the original Polly breaker, but with the added benefit of the samplingDuration cut-off. I am assuming users can deduce these self-evident effects of failureThreshold; in the interests of keeping documentation concise, I don't plan to document them.

samplingDuration: Low will hit circuit resolution (already documented). Proportional responsiveness of breaker means that high will cause slow responsiveness (already documented).

minimumThroughput: Low will cause coarse initial throughput resolution (already documented). Propose code forbids the value 1; minimum permitted is 2. @kristianhald A good observation where you alluded to the issue of possible miscalibration, with minimumThroughput set so high that it might struggle to be reached within the samplingDuration. Have documented this more generally as keeping minimum throughput away from typical throughput.

Effect of buckets/windows versus not: I believe the effects are much as we have discussed, but - given the very low 200ms boundary, which few users are likely to work near - my instinct is to keep things simple (thinking of the majority of users): state the boundary but not document it elaborately. Most circuits governing downstream network calls will likely be working in timescales (eg seconds) clear of the boundary. If higher-frequency requirements emerge, we can refine as-and-when.

As a general comment: @kristianhald as you say, we cannot predict all the ways that people will use the circuit-breaker. And the performance characteristics of the underlying actions in question (for example whether the failure pattern is more sssssssssssssssfffffffffffffffffffffffff [or] sssfsffsffsffsfsfssf) will also be a significant factor in the interaction between configuration and results. I think it is to be expected that users using a circuit of this sophistication (and with sufficient throughput to merit such a circuit) should expect to have to engage in some performance tuning in light of the real-life characteristics of their given system, and those characteristics are not something we can predict. However, we can warn users away from configurations which we know are likely to give unhelpful results.

This is my summary of configuration characteristics I see as worth documenting. Is this missing something major, some more complex interaction?

EDIT: To take the documentation away from too abstract discussion, I may add a 'suggested typical starting configuration' for downstream network calls.
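
For example, a starting configuration might look roughly like this (the figures are purely indicative and should be tuned per system; syntax follows the AdvancedCircuitBreaker naming in the draft doco, so treat it as provisional):

```csharp
// Indicative starting point only - every value here should be tuned to the real traffic
// profile of the system in question; final naming may differ from the draft doco.
var breaker = Policy
    .Handle<Exception>()
    .AdvancedCircuitBreaker(
        failureThreshold: 0.5,                       // break at >= 50% of actions failing ...
        samplingDuration: TimeSpan.FromSeconds(10),  // ... measured over a 10-second window ...
        minimumThroughput: 8,                        // ... provided at least 8 actions were processed in that window
        durationOfBreak: TimeSpan.FromSeconds(30));  // stay open for 30 seconds before trying again
```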

kristianhald commented 8 years ago

I have updated my fork with some additional features. Below I will go through the changes as they all relate to comments you have provided @reisenberger. The code can be found here: https://github.com/kristianhald/Polly/tree/CircuitBreakerSpikeMarch16_AddedWindows

Rationale: Per DateTime documentation, DateTime resolution is around 15-16ms, as you commented earlier. Let's round this up to TimeSpan.FromMilliseconds(20). There isn't any point in creating slices (or windows within slices) smaller than this, as it will just lead to empty slices/windows. As we are defining as a well-named const ResolutionOfCircuitTimer, it'll be easy to change later if needed, and easily visible if any user later has higher resolution requirements and investigates the code.

I have created the internal constants and updated the syntax to use the resolution of the circuit as the minimum allowed timesliceDuration (sampleDuration).

Can the operational code be done so that it doesn't look too messy / branching depending on whether it's using further buckets/windows or not?

I have done roughly the same as you did with the ICircuitController, using a strategy-based approach. I believe it is the cleanest option, and I thought that any performance impact of using an interface instead of the class directly could be measured before ruling it out.

There are currently two implementations (RollingHealthMetrics and SingleHealthMetrics). The selection is based on timesliceDuration < ResolutionOfCircuitTimer * DefaultNumberOfBuckets, selecting one or the other. There is a performance overhead for using interfaces; see the test results at the bottom of this comment.

At the moment the decision on using one or the other strategy is happening in the controller, but depending on your view, the choice can easily be made where the circuit controller is being created and then injected. I have a preference, but in this case I think it's better you make that decision :smiley:

These approaches could be refined later (for example, between 200ms and 20ms we could adopt an algorithm which provided the maximum number of buckets which kept bucket size over 20ms). But let's start instead from a premise 'keep things as simple as possible until we know they need to be more complicated'.

I agree. Should not be hard to add. If we keep the current implementation, then it's just another implementation of the interface plus a selection of when to use it (probably when the timeslice duration is between X * Resolution and Resolution * MaxNumberOfWindows).

@reisenberger @joelhulen Did a run with the code from the fork I am working on using my development host.

First result is where RollingHealthMetrics are used without inheriting from an interface: https://gist.github.com/kristianhald/534ed0540e8030060f1c49da04100e4a

Second result is still with RollingHealthMetrics, but it inherits an interface and the interface is used in the circuit controller(The strategy based implementation): https://gist.github.com/kristianhald/81d00afc036001afaac3c7d8927b6864

Using an interface decreases the performance, but only by around 10 ticks per iteration. I checked that on my machine there are about 10,000 ticks per millisecond, which means that a difference of 10 ticks is very low.

kristianhald commented 8 years ago

I now feel relatively clear on the effects of setting different parameters near boundaries, as follows.

I agree with every variable comment.

To take the documentation away from too abstract discussion, I may add a 'suggested typical starting configuration' for downstream network calls.

I looked at the draft documentation again, and the part about configuration recommendations for the sampling duration requires some thought. I think that beginning the documentation with a suggested configuration, which allows the user to quickly be up and running, and then having more detailed information later, is a very good way to go :smile_cat:

reisenberger commented 8 years ago

@kristianhald Re:

updated my fork with some additional features

Thank you for this extremely productive contribution! Hoping to review, and then we can pull this together for release very shortly (keen altogether to get this feature out this week or next, due to other commitments) (just need to find time to review :smile:).

reisenberger commented 8 years ago

@kristianhald

the part about configuration recommendations for the sampling duration requires some thought [...] beginning [...] with suggested configuration [...] then more detailed information later

Completely agree! Will adjust ...

reisenberger commented 8 years ago

@kristianhald Merged your work on RollingHealthMetrics to my local fork. Thanks for the great contribution!

@kristianhald @joelhulen Remaining to do (on my side) before this is ready to PR against master:

Hopefully in the next day or so ...

reisenberger commented 8 years ago

@kristianhald To offer some commentary on final decisions taken on previous points you raised:

At the moment the decision on using [RollingHealthMetrics and SingleHealthMetrics] is happening in the controller

Left this where you had it. Encapsulates the concern.

on my machine there are about 10,000 ticks per millisecond

See const TimeSpan.TicksPerMillisecond

after a bit of thought, I came to the conclusion that it did not make sense to open the circuit on a success even if the failureThreshold has been reached

I reviewed and agree with this. Breaking on a success seems counterintuitive. The circuit may have 100% recovered for now (sssssssssss...), in which case breaking would be counterproductive. If the circuit hasn't 100% recovered and is still intermitting (ssfsfsfsfsf), the circuit will receive a failure soon enough within the same period to break on.

@kristianhald Please feel free to review the final changes to https://github.com/reisenberger/Polly/tree/CircuitBreakerSpikeMarch16 if you have an interest. I intend to PR this to master later today.

kristianhald commented 8 years ago

See const TimeSpan.TicksPerMillisecond

Ahh, I did not know that the ticks per millisecond for DateTime and TimeSpan are constant. Nice to know. Normally ticks per millisecond are defined by the CPU and OS based on the frequency they work at, which is why I provided the number of ticks :smile:

kristianhald commented 8 years ago

Please feel free to review the final changes to https://github.com/reisenberger/Polly/tree/CircuitBreakerSpikeMarch16 if you have an interest

Did a quick look through the changes and I think they are good.

I intend to PR this to master later today

Cool. Out of curiosity, what is the procedure for nuget packaging the master branch?

reisenberger commented 8 years ago

what is the procedure for nuget packaging the master branch?

Merging to App-vNext/master (and earlier, creating the PR against it) automatically runs an AppVeyor verification build. We could push packages to NuGet automatically on merging any PR, but we opt to push them to NuGet manually. (Manual gives us the ability to occasionally merge PRs (eg readme doco fixes) without pushing a package and bumping the rev number.)

@joelhulen owns this process, and can correct me if any of that is wrong / out of date.

reisenberger commented 8 years ago

@kristianhald PR to master in process. Thanks for your awesome contributions to this feature in code and thought.

joelhulen commented 8 years ago

@kristianhald Yes, @reisenberger has it right. The AppVeyor build process generates release-level nuget packages when we merge a PR. We control the version number via a file named GitVersionConfig.yaml, which is used for applying the release version for the signed and unsigned versions of the packages. We also update the Polly.nuspec file to include the change notes that are added as part of the package, as well as on the nuget.org site. Once the files are generated by the build server as artifacts, I manually upload them to nuget.org. As Dylan rightly states, this is to control the deployment of the packages.

reisenberger commented 8 years ago

A further post to this closed issue just to document an issue that was discussed between @kristianhald and @reisenberger during merging down on to @reisenberger's fork. This post copies parts of the discussion from there (https://github.com/reisenberger/Polly/pull/1) in case that is ever lost:

[The issue has no bearing on the operation of the v4.2 AdvancedCircuitBreaker (proved by unit tests), but would come into play if the Polly AdvancedCircuitBreaker were to want to evolve to emit health statistics at regular intervals.]

@reisenberger wrote:

Hey @kristianhald. Here is a conceptual issue. I observe that, from our implementation, the windows we create within the timeslice are not necessarily adjacent. For instance, imagining windows of size 1, if the first action records at t = 0, and the second action at t = 1.5, we will be running two windows from 0->1, and 1.5->2.5, with a 0.5 gap in between them. Reasoning suggests this has no effect on the statistics, as the gaps only represent 'empty' times with no results.

I considered an outside case: imagining windows of size 1 in slice size 10, we could have a window at t = 0 and another at t = 9.9, this latter in theory extending +1 to t = 10.9 ... Again, reasoning suggests that if any statistic comes in at t >= 10, the t = 0 window will be dropped, so at no point will we be considering statistics for a period >10. However, it might be nice (for regression value, in case anyone refactors...) to have a test proving this.

It looks like this could be done by duplicating the test Should_not_open_circuit_if_failure_threshold_exceeded_but_throughput_threshold_not_met_before_timeslice_expires__even_if_timeslice_expires_only_exactly() and adding SystemClock.UtcNow = () => time.Add(timesliceDuration.Subtract(TimeSpan.FromTicks(1))); just before the third exception? Would this cover it? It should cause it to create a new bucket which overlaps into the following timeslice ... but the t = 0 window should be dropped when the fourth exception comes in?

In general: Do you agree that the non-adjacent windows issue has no effect on statistics, or am I missing something?

Perhaps controversially, my first instinct is not to code to make windows adjacent before merging. It adds complication and doesn't add anything to solving the problem at hand. Following the mantra 'Keep the code as simple as possible to deliver the job needed', we don't need to correct for this, and the current creation/disposal of window code reads intuitively. However, there would be future scenarios (eg emitting regular statistics, as Hystrix does) where it would become important. And interested in your thoughts...

@kristianhald :

@reisenberger Don't have such a test and I believe that you are correct that we should have one to ensure that it works as expected. I think that the test would ensure the outer case is valid. I will add it immediately.

[this test was added and proves correct operation of the current statistics]

@kristianhald :

I am under the impression, like you state, that as long as we remove buckets/windows that are older than Now() - timesliceDuration then it will not have an impact. Also I felt that the implementation would be smaller and easier to read.

Did a little thinking and it might actually be quite easy to add the logic for adjacent buckets/windows. If we need the buckets to align adjacent to each other, then I think that the easiest solution would be to decide on a 'start time' (maybe construction time of the class or just zero) and when we create a new bucket, we move the 'StartedAt' timestamp backwards in time to where the bucket should have been created.

Let's take the following scenario. Start time: could be decided at constructor time or just be zero (beginning of the universe :smile_cat:); let's say it is 23 seconds (from when the universe began). Bucket/window duration: 0.7 seconds.

At the current moment in time, DateTime.Now() reads 26.457 seconds and an error occurs. From 23 seconds forward until 26.457, the buckets' StartedAt values read: 23, 23.7, 24.4, 25.1, 25.8, 26.5

As the 26.5 bucket is in the future, it cannot be created. Therefore the bucket we are in is 25.8. The question is then: how do we get from 26.457 to 25.8 without having to create all the buckets from 23 seconds? The answer (I think :smile: - just throwing a quick idea here) is the following calculation: StartedAt = Now() - ((Now() - StartTime) MOD BucketDuration)

If we take the above numbers we should get: StartedAt = 26.457 - ((26.457 - 23) MOD 0.7) = 26.457 - (3.457 MOD 0.7) = 26.457 - 0.657 = 25.8

It will still have empty buckets as holes, but it should align the buckets to each other.

Question: Would aligning the buckets actually solve the statistics issue?

eg emitting regular statistics, as Hystrix does

When you say 'regular', do you mean bucket/window-duration regular, every-1-second regular, or a delegate call when an event occurs (success or error)?

@reisenberger:

At regular intervals (every call = too frequent). But the frequency of emitting these statistics would absolutely be aligned to either timesliceDuration or windowDuration. Definitely don't want some different system of timing to interact with ;~) Vague memory, Hystrix emits stats at one of these intervals I think.

Options for adding this (later) to Polly:

+ emit a HealthMetric when dequeueing it from the front of the queue; OR
+ (more timely, fractionally more code) emit a metric when it ceases being the _current

This essentially documents the issue, tho there is further implementation discussion at https://github.com/reisenberger/Polly/pull/1.

Implementation would be as @kristianhald suggested with: StartedAt = Now() - ((Now() - StartTime) MOD BucketDuration)

An alternative could be: StartedAt = Now() - ((Now() - StartTimeOfLastBucket) MOD BucketDuration) but this opens up complications across a circuit reset, when there isn't a last bucket.
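
Expressed as code, that calculation is roughly as follows (a sketch of the formula above, not the merged implementation):

```csharp
using System;

// Sketch of the alignment calculation: StartedAt = Now() - ((Now() - StartTime) MOD BucketDuration)
public static class BucketAlignmentSketch
{
    public static DateTime AlignedBucketStart(DateTime nowUtc, DateTime startTimeUtc, TimeSpan bucketDuration)
    {
        long elapsedTicks = (nowUtc - startTimeUtc).Ticks;
        long ticksIntoCurrentBucket = elapsedTicks % bucketDuration.Ticks; // how far we are into the current bucket
        return nowUtc.AddTicks(-ticksIntoCurrentBucket);                   // snap back to the bucket boundary
    }
}
```

With the worked example above (start time 23s, bucket duration 0.7s, now 26.457s), this returns the 25.8s boundary.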

reisenberger commented 8 years ago

Noting against this closed issue for future reference, possible optimisations that could be made in the AdvancedCircuitBreaker implementation in future:

Any optimisation decisions should be driven by need, and by running before-and-after performance stats, for example with the test harness here: https://gist.github.com/reisenberger/92dc8d73b4df127b296ed8daee3ed93d

Current performance:

Performance analysis of the AdvancedCircuitBreaker across one-hundred-thousand (100000) iterations, including frequent circuit-breaking, suggests introducing the AdvancedCircuitBreaker costs around 30 ticks (3 millionths of a second) per call. Performance for the original Polly CircuitBreaker broadly matches.

Possible optimisations:

[a] Replacing the use of Queue with an internal fixed-size array we manage ourselves. While this has little benefit in itself (the implementation of Queue is backed by a similar array...), it may open up [b].

[b] Avoid the generation (and subsequent garbage-collection once out of scope) of HealthCount instances. This could be done by retaining a fixed set of instances ourselves, in an array [a] we own, and blanking-and-reusing existing instances rather than letting them go out of scope and creating new.

[c] Maintain a running total of successes/failures across the timeslice, rather than re-summing it each failure (to be checked by performance analysis whether this optimises or not; it’s swings and roundabouts; the change would add a ++ to each success/fail; avoids larger summing each fail). A rough sketch of this idea follows below.

EDIT: As stated, it would have to be measured whether these optimisations add value. However, they are all items which Hystrix optimises away for its high-throughput implementation.
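
A rough sketch of optimisation [c] (illustrative only; whether it pays off must be measured first):

```csharp
// Sketch of optimisation [c] above: keep running totals, incrementing on each result and
// subtracting a whole bucket's counts when it expires, instead of re-summing all buckets
// on every failure. Illustrative only; not the current implementation.
public class RunningTotalsSketch
{
    private int _totalSuccesses;
    private int _totalFailures;

    public void OnResult(bool success)
    {
        if (success) _totalSuccesses++; else _totalFailures++;   // a ++ per call, no re-summing
    }

    public void OnBucketExpired(int bucketSuccesses, int bucketFailures)
    {
        _totalSuccesses -= bucketSuccesses;                       // drop the expired bucket's contribution
        _totalFailures -= bucketFailures;
    }

    public double FailureRatio()
    {
        int total = _totalSuccesses + _totalFailures;
        return total == 0 ? 0.0 : (double)_totalFailures / total;
    }
}
```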