App-vNext / Polly

Polly is a .NET resilience and transient-fault-handling library that allows developers to express policies such as Retry, Circuit Breaker, Timeout, Bulkhead Isolation, and Fallback in a fluent and thread-safe manner. From version 6.0.1, Polly targets .NET Standard 1.1 and 2.0+.
https://www.thepollyproject.org
BSD 3-Clause "New" or "Revised" License

Polly in Azure durable functions (requires replace SystemClock with time service to be deterministic?) #535

Closed george-polevoy closed 5 years ago

george-polevoy commented 6 years ago

This is a feature suggestion.

In Azure durable functions, I need a policy that is completely deterministic: the same code may run again while replaying past events, so the policy must yield a reproducible flow of execution. This means we can't use the system clock or entropy (or any other kind of kernel resource, for that matter).

For a policy to be deterministic and reusable inside the same process, we need to use an instance of a clock service rather than a static function such as SystemClock.SleepAsync, which uses the system clock internally.
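To illustrate the idea of an instance-based clock, here is a minimal sketch that uses only the .NET base class library. The names `IClock`, `FakeClock`, and `DeterministicRetry` are hypothetical and are not part of Polly's API. The retry loop is hand-rolled for illustration and is not a Polly policy.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical abstraction (an assumption for this sketch, not a Polly type):
// every time-related operation goes through an injected instance.
public interface IClock
{
    DateTime UtcNow { get; }
    Task DelayAsync(TimeSpan delay);
}

// Deterministic clock: advances virtual time instead of sleeping,
// and records every requested delay so a run can be compared with a replay.
public sealed class FakeClock : IClock
{
    public DateTime UtcNow { get; private set; }
    public List<TimeSpan> Delays { get; } = new List<TimeSpan>();

    public FakeClock(DateTime start) => UtcNow = start;

    public Task DelayAsync(TimeSpan delay)
    {
        Delays.Add(delay);
        UtcNow += delay;
        return Task.CompletedTask;
    }
}

// Hand-rolled retry helper that only ever waits via the injected clock.
public static class DeterministicRetry
{
    public static async Task<T> ExecuteAsync<T>(
        IClock clock, int maxRetries, TimeSpan baseDelay, Func<Task<T>> action)
    {
        for (int attempt = 0; ; attempt++)
        {
            try
            {
                return await action();
            }
            catch (Exception) when (attempt < maxRetries)
            {
                // Exponential backoff: baseDelay, 2 x baseDelay, 4 x baseDelay, ...
                await clock.DelayAsync(TimeSpan.FromTicks(baseDelay.Ticks << attempt));
            }
        }
    }
}
```

Because the helper never touches a static clock, two runs given the same `FakeClock` start time and the same sequence of failures request exactly the same delays, which is the reproducibility property described above.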

reisenberger commented 5 years ago

Thanks @george-polevoy for the question around Polly and durable functions (sorry for the delayed reply).

TL;DR You can use Polly in durable Activity functions as-is, but not in Orchestration functions.

My understanding is that the need for determinism in durable functions and the consequent time-provider recommendations apply only to durable Orchestrator functions as these are the type of durable function that is replayed. But the kinds of operation that Polly normally guards - any external I/O, any async HTTP calls etc - are specifically barred from use within orchestration functions.

On the other hand, durable Activity functions have no such constraints, and this is where I/O, HTTP calls etc should be placed. As activity functions are not replayed and don't have the same constraints, Polly should be perfectly safe to use (without modification) within Activity functions.
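As a rough sketch of what this could look like, the following places an ordinary Polly retry policy inside an activity function. It assumes the Polly NuGet package and version 2.x of the Durable Functions extension (`Microsoft.Azure.WebJobs.Extensions.DurableTask`). The function name `FetchData`, the class name, and the retry settings are illustrative assumptions, not taken from the discussion.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Polly;

public static class FetchActivity
{
    private static readonly HttpClient Http = new HttpClient();

    // Retry up to 3 times on HttpRequestException, waiting 2 s, 4 s, then 8 s.
    private static readonly IAsyncPolicy RetryPolicy = Policy
        .Handle<HttpRequestException>()
        .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

    // Activity functions are not replayed, so real I/O and real waiting are fine here.
    [FunctionName("FetchData")] // hypothetical activity name
    public static Task<string> FetchData([ActivityTrigger] string url)
    {
        return RetryPolicy.ExecuteAsync(() => Http.GetStringAsync(url));
    }
}
```

The policy is held in a static field so that it is built once and shared across invocations in the same host process.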


The Activity functions that the orchestrators coordinate could, of course, succeed or fail, so one could ask: is it possible to apply Polly-like controls to the way orchestrators execute activity functions? In most cases, durable functions provides equivalent functionality.

Retry: Durable functions provides its own in-built retry functionality including backoffs.

Timeout: Durable functions provides its own timeout pattern recommendation (example 1; example 2), analogous to Polly's pessimistic timeout.

Cache: Orchestrator functions already cache the results of activity functions.

Bulkhead isolation: is essentially a parallelism throttle. Durable functions provides its own concurrency throttles and scaling controls.

Circuit Breaker: is essentially about accumulated errors over time or across parallel processes. But orchestrator functions are isolated, single-threaded processes - the only way they could accumulate success/failure statistics cross-process would be in some external store, but orchestrators are not allowed to communicate externally anyway (except via Activity functions) ... so the problem becomes circular. Dealing with replays through a circuit-breaker in an orchestration function would also present a challenge: while the orchestration was in replay state, the circuit-breaker would need to record neither a success nor a failure, because both influence the statistics and the information would in any case be outdated. It's not impossible, but TL;DR it would be much simpler to push the circuit-breaker down into the Activity function. Even then, this is probably only useful once a distributed circuit-breaker becomes available.

Fallback and PolicyWrap would work in orchestrators - but on their own (without the other policies) they are not that relevant.
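For the Retry and Timeout items above, a hedged sketch of the orchestrator-side equivalents might look as follows. It assumes version 2.x of the Durable Functions extension. The orchestrator name, the activity name `FetchData`, and the specific intervals are hypothetical values chosen for illustration.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public static class FetchOrchestrator
{
    [FunctionName("FetchOrchestrator")] // hypothetical orchestrator name
    public static async Task<string> Run(
        [OrchestrationTrigger] IDurableOrchestrationContext context)
    {
        string url = context.GetInput<string>();

        // Retry: firstRetryInterval = 5 s, maxNumberOfAttempts = 3,
        // with the interval doubling via BackoffCoefficient.
        var retry = new RetryOptions(TimeSpan.FromSeconds(5), 3) { BackoffCoefficient = 2.0 };
        Task<string> activity =
            context.CallActivityWithRetryAsync<string>("FetchData", retry, url);

        // Timeout: a durable timer driven by the replay-safe orchestration clock
        // (context.CurrentUtcDateTime), never DateTime.UtcNow.
        using (var cts = new CancellationTokenSource())
        {
            DateTime deadline = context.CurrentUtcDateTime.AddMinutes(2);
            Task timer = context.CreateTimer(deadline, cts.Token);

            Task winner = await Task.WhenAny(activity, timer);
            if (winner == activity)
            {
                cts.Cancel(); // cancel the pending timer so the orchestration can complete
                return await activity;
            }

            throw new TimeoutException("FetchData did not complete before the deadline.");
        }
    }
}
```

As with the pessimistic timeout it is compared to above, the orchestrator stops waiting when the timer wins the race; the sketch does not attempt to cancel the activity itself.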


Hope this helps. Interested if anyone can see angles on Polly-type behaviour in durable functions that this brief assessment misses.

TL;DR Use Polly in the Activity functions rather than the Orchestration functions.

reisenberger commented 5 years ago

Closing this issue: no changes to Polly seem necessary to use Polly with Azure durable activity functions.

Noting for completeness, as a brief aside: changing Polly's SystemClock from an ambient context to an injected dependency could improve testability. However, this is a low priority, as there are no current production-side benefits to the change (and we wouldn't want such a change to clutter Policy configuration syntax unnecessarily).