failsafe-lib / failsafe

Fault tolerance and resilience patterns for the JVM
https://failsafe.dev
Apache License 2.0

Support accrual failure detection #346

Open jhalterman opened 1 year ago

jhalterman commented 1 year ago

As Failsafe already supports policies that are useful for networked operations, it would make sense to support phi accrural (or other accural algorithms) failure detection for situations where fixed timeouts don't adequately account for changing load conditions.

This could be implemented as a new policy which measures execution times over a number of executions to determine whether a threshold that represents a failure has been crossed. Phi accrual could be one strategy supported by the policy, but there could be others. When the threshold is crossed, a fallback-like function could be called, for example, to fail over from a node that has failed to another node. In that sense, the policy would be like a time-based fallback (rather than a result-based one), except that unlike a fallback it would be stateful.
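As a rough illustration of the phi accrual idea applied to execution times, here is a minimal sketch. It is not part of the Failsafe API: the class name `PhiAccrualSketch`, its methods, and its parameters are all hypothetical. It assumes execution durations are roughly normally distributed and uses a common logistic approximation of the normal CDF to estimate how unlikely the current elapsed time is.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch of phi accrual over execution times; not a Failsafe class. */
final class PhiAccrualSketch {
  private final Deque<Double> window = new ArrayDeque<>();
  private final int maxSamples;
  private final double minStdDevMillis;
  private double sum;
  private double sumOfSquares;

  PhiAccrualSketch(int maxSamples, double minStdDevMillis) {
    this.maxSamples = maxSamples;
    this.minStdDevMillis = minStdDevMillis;
  }

  /** Records the duration of a completed execution, evicting the oldest sample when full. */
  void record(double durationMillis) {
    if (window.size() == maxSamples) {
      double oldest = window.removeFirst();
      sum -= oldest;
      sumOfSquares -= oldest * oldest;
    }
    window.addLast(durationMillis);
    sum += durationMillis;
    sumOfSquares += durationMillis * durationMillis;
  }

  /**
   * Returns phi = -log10(P(duration > elapsed)) under a normal model of the
   * recorded durations. Higher values mean the elapsed time is less plausible.
   */
  double phi(double elapsedMillis) {
    if (window.isEmpty()) {
      return 0.0; // no history yet, so no suspicion
    }
    double mean = sum / window.size();
    double variance = Math.max(0.0, sumOfSquares / window.size() - mean * mean);
    // A floor on the standard deviation avoids division by zero for uniform samples.
    double stdDev = Math.max(Math.sqrt(variance), minStdDevMillis);
    double y = (elapsedMillis - mean) / stdDev;
    // Logistic approximation of the normal CDF.
    double e = Math.exp(-y * (1.5976 + 0.070566 * y * y));
    return elapsedMillis > mean
        ? -Math.log10(e / (1.0 + e))
        : -Math.log10(1.0 - 1.0 / (1.0 + e));
  }
}
```

A policy built on this could compare `phi(elapsed)` against a configurable threshold and invoke the fallback-like function once the threshold is crossed. An elapsed time equal to the mean gives a phi of about 0.3, and phi grows quickly as the elapsed time moves several standard deviations past the mean.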

Alternatively, this could be implemented as a Timeout option, where the timeout is stateful and adapts to execution time distributions.
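To make the Timeout variant concrete, here is a minimal sketch of a stateful timeout that adapts to the observed execution time distribution. Everything in it is an assumption for illustration: the class `AdaptiveTimeoutSketch`, the "mean plus k standard deviations" rule, and the min/max bounds are hypothetical and not an existing Failsafe option. It keeps running statistics over all recorded executions (Welford's algorithm) rather than a sliding window.

```java
import java.time.Duration;

/** Hypothetical sketch of a timeout that adapts to execution times; not a Failsafe class. */
final class AdaptiveTimeoutSketch {
  private final double k;
  private final Duration min;
  private final Duration max;
  private long count;
  private double mean; // running mean in milliseconds
  private double m2;   // running sum of squared deviations from the mean

  AdaptiveTimeoutSketch(double k, Duration min, Duration max) {
    this.k = k;
    this.min = min;
    this.max = max;
  }

  /** Records a completed execution's duration using Welford's online update. */
  synchronized void record(Duration duration) {
    double x = duration.toNanos() / 1_000_000.0;
    count++;
    double delta = x - mean;
    mean += delta / count;
    m2 += delta * (x - mean);
  }

  /** Returns mean + k * stdDev, clamped to [min, max]; falls back to max until two samples exist. */
  synchronized Duration currentTimeout() {
    if (count < 2) {
      return max;
    }
    double stdDev = Math.sqrt(m2 / (count - 1));
    long millis = Math.round(mean + k * stdDev);
    millis = Math.max(min.toMillis(), Math.min(max.toMillis(), millis));
    return Duration.ofMillis(millis);
  }
}
```

How such a value would be wired into a Timeout is left open here. One possibility, again an assumption, is to read `currentTimeout()` before each execution and build the timeout from it.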

One open question for this policy is, similar to a circuit breaker or rate limiter, at what point should it "reset" after triggering a failure, or should it even reset?

Any ideas for how this should work or what the policy should be named are welcome!

Tembrel commented 1 year ago

accural -> accrual

jhalterman commented 1 year ago

For some reason my fingers always struggle with that one :)

Tembrel commented 1 year ago

😂 and it's still not right!

On Sat, Sep 17, 2022, 6:18 PM Jonathan Halterman @.***> wrote:

For some reason my fingers always struggle with that one :)

jhalterman commented 1 year ago

This is definitely a sign that the new policy should not be named PhiAccrual :) I like the idea of thinking about a new policy more generally, as something that measures a series of execution times, where phi accrual is maybe just one strategy for determining if those times represent a failure.
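One way to picture that more general shape is a small strategy interface that the policy consults with the recorded execution times. The sketch below is purely illustrative: `ExecutionTimeStrategy`, `meanMultiple`, and `percentile` are hypothetical names, not proposed API, and a phi accrual strategy would simply be another implementation of the same interface.

```java
import java.util.Arrays;

/** Hypothetical sketch of pluggable strategies over execution times; not Failsafe API. */
final class ExecutionTimeStrategies {
  private ExecutionTimeStrategies() {}

  /** Decides whether an elapsed time represents a failure, given past execution times. */
  @FunctionalInterface
  public interface ExecutionTimeStrategy {
    boolean isFailure(double[] pastMillis, double elapsedMillis);
  }

  /** Treats an execution as failed once it exceeds a multiple of the mean past duration. */
  static ExecutionTimeStrategy meanMultiple(double factor) {
    return (past, elapsed) ->
        past.length > 0 && elapsed > factor * Arrays.stream(past).average().orElse(0.0);
  }

  /** Treats an execution as failed once it exceeds the given percentile (0..1) of past durations. */
  static ExecutionTimeStrategy percentile(double p) {
    return (past, elapsed) -> {
      if (past.length == 0) {
        return false; // no history, so nothing to compare against
      }
      double[] sorted = past.clone();
      Arrays.sort(sorted);
      // Nearest-rank percentile, clamped to a valid index.
      int index = (int) Math.ceil(p * sorted.length) - 1;
      index = Math.max(0, Math.min(sorted.length - 1, index));
      return elapsed > sorted[index];
    };
  }
}
```

With a shape like this, the policy itself only owns the window of samples and the reaction to a failure, which keeps its name independent of any single algorithm.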