BrunoBonacci / safely

Safely is a Clojure circuit-breaker library for handling retries in an elegant, declarative way.
https://cljdoc.org/d/com.brunobonacci/safely
Apache License 2.0

safely

The library offers out of the box:

Usage

Add the dependency into your project.clj.

;; stable version
[com.brunobonacci/safely "1.0.0"]

Require the namespace:

(ns foo.bar
  (:require [safely.core :refer [safely]]))

Then, make a call to a remote system:

;; wrap your critical calls
;; to external systems (api, db, etc)
;; into a `safely` block, and define
;; what to do in case of failures.

(safely
  (api-call "other-system")

  :on-error
  :max-retries 5
  :default   {:some :value})

This is a quick ref-card of all possible configurable options:


;;
;; all in one example
;;

(safely

 ;; code to execute
 (do (comment run something which can potentially blow))

 ;; exception handling
 :on-error

 ;; upon error return a default value
 :default "some value"

 ;; retry a number of times before
 ;; giving up or returning the default value;
 ;; use :forever for unlimited retries.
 :max-retries 5

 ;; between retries wait a fixed amount of time (not recommended)
 :retry-delay [:fix 3000] ;; 3s in millis

 ;; or wait a uniform random range between :min and :max
 :retry-delay [:random-range :min 1000 :max 3000]

 ;; or wait a random amount of time with +/- a random variation
 :retry-delay [:random 3000 :+/- 0.35]

 ;; or wait an exponential amount of time with a random variation
 :retry-delay [:random-exp-backoff :base 300 :+/- 0.50]
 :retry-delay [:random-exp-backoff :base 300 :+/- 0.35 :max 25000]

 ;; or wait a given list of times with a random variation
 :retry-delay [:rand-cycle [50 100 250 700 1250 2500] :+/- 0.50]

 ;; you can provide a predicate function which determines
 ;; which classes of errors are retryable. Just write a
 ;; function which takes an exception and returns something
 ;; truthy or falsey.
 :retryable-error? #(not (#{ArithmeticException NullPointerException} (type %)))

 ;; valid values: :original, :wrapped, :legacy, (fn [exception] true)
 ;; If an exception is thrown, this determines which exception is rethrown.
 ;; :wrapped refers to the ex-info exception thrown by safely
 ;; :original refers to the exception raised inside the block
 ;; :legacy maintains the behaviour of earlier versions (mix of the two)
 ;; You can provide a function to control the type of the exception thrown.
 :rethrow :legacy

 ;; you can provide a predicate function which determines
 ;; if the output of the body should be considered a failed response.
 ;; This can be useful when using safely with APIs which return a
 ;; status for errors instead of throwing exceptions. Two good examples are HTTP
 ;; status codes and polling APIs, in which you wish to slow down the polling
 ;; when the result of the previous poll doesn't contain records.
 :failed? #(not (<= 200 (:status %) 299))

 ;; to activate the circuit breaker just give a name to the operation
 :circuit-breaker :operation-name

 ;; *PLEASE NOTE*: the following options are ONLY used in conjunction with
 ;; a circuit breaker

 ;; control the thread pool size for this operation
 :thread-pool-size  10

 ;; control the thread pool queue size for this operation
 :queue-size        5

 ;; the number of request's outcome to be sampled for analysis
 :sample-size       100

 ;; the number of milliseconds to wait before giving up
 ;; NOTE: it can be used only in conjunction with circuit-breaker
 :timeout           30000 ;; (millis, default no timeout)

 ;; What to do with the request when the timeout time is
 ;; elapsed. :never, :if-not-running or :always
 :cancel-on-timeout :always

 ;; stats are collected about the outcome of the operations
 ;; this parameter controls the number of 1-sec buckets
 ;; to control.
 :counters-buckets  10

 ;; the strategy used to trip the circuit open
 :circuit-breaker-strategy :failure-threshold

 ;; the threshold of failing requests after which the circuit trips
 ;; open. This is only used when
 ;; :circuit-breaker-strategy is :failure-threshold
 :failure-threshold 0.5

 ;; when the circuit breaker is tripped open, no requests will
 ;; be allowed for a given period.
 :grace-period      3000 ;; millis

 ;; the strategy to decide which requests to let through
 ;; for evaluation before closing the circuit again.
 :half-open-strategy :linear-ramp-up

 ;; the number of millis during which time an increasing number
 ;; of requests will be let through for evaluation purposes.
 :ramp-up-period    5000

 ;; General options.
 ;; customize your error message for logs
 :message "a custom error message"

 ;; set to false if you don't want to log errors
 :log-errors false

 ;; or choose the logging level
 :log-level :warn

 ;; to disable the stacktrace reporting in the logs
 :log-stacktrace false

 ;; whether to enable or disable tracking.
 ;; values: `:enabled` or `:disabled` (default: `:enabled`)
 :tracking :enabled

 ;; and track the execution time and outcome with the following action name
 ;; if not provided it will attempt to record the location (line + source file)
 :track-as ::action-name

 ;; a vector of key/value pairs to include in the tracking event.
 ;; They are useful to give more context to the event,
 ;; so that when you read the event you have more info.
 ;; for example:
 :tracking-tags [:batch-size 30 :user user-id]

 ;; a function which receives the result of the evaluation
 ;; and captures some information from it.
 ;; This is useful, for example, if you want to capture the
 ;; http-status of a remote call.
 ;; It returns a map or `nil`; the returned map will be merged
 ;; with the tracking event.
 :tracking-capture (fn [r] {:http-status (:http-status r)})
 )

Examples and Case studies

Here is a collection of examples and case studies:

Exception handling

The macro safely will run the given code and in case an exception arises it will follow the policy described after the :on-error keyword.

Return default value

This is the simplest of the policies. If the given code throws an exception, a default value will be returned.

;; no error raised, so result is returned
(safely
 (/ 1 2)

 :on-error
 :default 1)
;;=> 1/2

;; an error is raised, but a default value is given
;; so the default value is returned
(safely
 ;; ArithmeticException Divide by zero
 (/ 1 0)

 :on-error
 :default 1)
;;=> 1

Automatic retry

In some cases, by retrying a failed operation you can get a successful outcome. For example, operations which involve network requests might time out or fail because of transitory network "glitches". Typically, before giving up, you want to retry some operations.

For example, let's assume you wish to retrieve the list of active users from a corporate RESTful webservice and you want to account for transitory failures. You could retry the operation a number of times before giving up.

The code could look as follows:

;; Automatic retry
(safely
  (http/get "http://user.service.local/users?active=true")
  :on-error
  :max-retries 3)

In this case :max-retries 3 means that there can be a maximum of 4 attempts in total. Between attempts the thread will sleep for a random amount of time. We will discuss retry delays later on.

If the first attempt succeeds, then the result of the web request is returned. However, if an error arises, safely will retry until one of the following conditions is reached: either the operation executes successfully, or :max-retries is reached.

Once :max-retries is reached, if a :default value has been provided it will be returned; otherwise the exception will be thrown up the stack.

;; Automatic retry with default value
(safely
  (http/get "http://user.service.local/users?active=true")
  :on-error
  :max-retries 3
  :default {:accounts [] :status "BUSY"})

In the previous case the HTTP GET operation may fail and it will be automatically retried for a maximum of 3 times, after which, the default value of {:accounts [] :status "BUSY"} is returned.

If the :default clause is omitted then a clojure.lang.ExceptionInfo will be thrown with the details of the number of attempts and the original cause.
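The retry semantics described above can be approximated with a small, library-free sketch. The name retry-sketch and its options map are hypothetical and are not part of safely; unlike safely, the sketch does not sleep between attempts, log, or track anything. It only shows that :max-retries 3 yields up to 4 attempts, and that the default is returned (or an ex-info is thrown) once they are exhausted.

```clojure
;; Hypothetical sketch, NOT safely's implementation.
(defn retry-sketch
  "Calls `thunk` up to (inc max-retries) times. Returns the first
  successful result; once the attempts are exhausted it returns
  :default if provided, otherwise throws an ex-info wrapping the cause."
  [thunk {:keys [max-retries default] :as opts}]
  (loop [attempt 1]
    (let [result (try
                   {:ok (thunk)}
                   (catch Exception e {:error e}))]
      (cond
        ;; success: return the value
        (contains? result :ok)    (:ok result)
        ;; failure, but retries are left
        (<= attempt max-retries)  (recur (inc attempt))
        ;; retries exhausted, default provided
        (contains? opts :default) default
        ;; retries exhausted, no default
        :else (throw (ex-info "all attempts failed"
                              {:attempts attempt}
                              (:error result)))))))

;; count how many times the failing operation is attempted
(def calls (atom 0))

(def result
  (retry-sketch
   (fn [] (swap! calls inc) (throw (Exception. "boom")))
   {:max-retries 3 :default :fallback}))

result  ;;=> :fallback
@calls  ;;=> 4
```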

Retry delays and randomization

Self-emergent Behaviour

In large distributed systems failures can produce strange behaviour because all participants act in exactly the same way. Consider a service failure where all the other services which depend on the failed one detect the failure and decide to retry after exactly the same amount of time. When the system comes back to life it will be flooded with retry requests from all the other services at the same time. If the number of client services is big enough, this can cause the service which is already struggling to die and reboot in a continuous cycle.

"Emergent behavior is that which cannot be predicted through analysis at any level simpler than that of the system as a whole. Emergent behavior, by definition, is what’s left after everything else has been explained" (Dyson and George 1997).

"Emergent behavior is also been defined as the action of simple rules combining to produce complex results" (Rollings and Adams 2003)

In the paper "Emergent Behavior in Systems of Systems" you can see more examples of emergent behaviour.

Retry policies

safely implements several randomization strategies to minimize the appearance of these large scale issues.

All delay strategies are randomized by default. The ones currently supported are :random-range, :random, :random-exp-backoff and :rand-cycle, plus the non-randomized :fix.

The default configuration is: [:random-exp-backoff :base 300 :+/- 0.50 :max 60000]

Now we will show how each strategy works with code samples.

:fix

In this example safely will retry for a maximum of 3 times with a delay of exactly 3 seconds (3000 milliseconds). This strategy is strongly discouraged in order to minimize self-emergent behaviour.

;; Automatic retry with fixed interval (NOT RECOMMENDED)
(safely
  (http/get "http://user.service.local/users?active=true")
  :on-error
  :max-retries 3
  :retry-delay [:fix 3000])

:random-range

In this example safely will retry for a maximum of 3 times with a delay of minimum 2 seconds (2000 milliseconds) and a maximum of 5 seconds (5000 milliseconds).

;; Automatic retry with random-range
(safely
  (http/get "http://user.service.local/users?active=true")
  :on-error
  :max-retries 3
  :retry-delay [:random-range :min 2000 :max 5000])

:random

In this example safely will retry for a maximum of 3 times with a delay of 3 seconds (3000 milliseconds), plus or minus an amount up to 50% of the base amount. This means that the waiting time could effectively be anything between 1500 millis (3000 - 50%) and 4500 millis (3000 + 50%).

;; Automatic retry with random delay
(safely
  (http/get "http://user.service.local/users?active=true")
  :on-error
  :max-retries 3
  :retry-delay [:random 3000 :+/- 0.50])
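As a rough illustration of the arithmetic above, the following library-free sketch computes the bounds of the waiting time and draws a delay within them. The names random-bounds and random-delay are hypothetical; safely's own implementation may differ in its details.

```clojure
;; Hypothetical sketch, NOT safely's implementation.
(defn random-bounds
  "Returns [lowest highest] for a base delay (millis) and a
  variation expressed as a fraction of the base (0.50 = 50%)."
  [base variation]
  (let [spread (long (* base variation))]
    [(- base spread) (+ base spread)]))

(defn random-delay
  "Picks a delay between the lower bound (inclusive) and the
  upper bound (exclusive)."
  [base variation]
  (let [[lowest highest] (random-bounds base variation)]
    (+ lowest (rand-int (- highest lowest)))))

(random-bounds 3000 0.50)
;;=> [1500 4500]

;; every execution returns different numbers within the bounds
(def samples (repeatedly 5 #(random-delay 3000 0.50)))
```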

:random-exp-backoff

In this example safely will retry for a maximum of 3 times with an exponential backoff delay starting from a base of 300 milliseconds, plus or minus a random 50% of the calculated wait time. This means that the first retry will happen after ~300 millis (+/- random variation), the second retry after ~600 millis (+/- random variation), etc.

;; Automatic retry with random exponential backoff
(safely
  (http/get "http://user.service.local/users?active=true")
  :on-error
  :max-retries 3
  :retry-delay [:random-exp-backoff :base  300 :+/- 0.50])

The exponential backoff typically follows this formula:

delay = base-delay * 2 ^ retry [+/- random-variation]

NOTE: The random variation is added in a second step.

For example, an exponential backoff with a base of 3000 millis (3 sec) would be:

retry:     0       1       2       3       4 ...
formula:  3*2^0   3*2^1   3*2^2   3*2^3   3*2^4
delay:     3s     6s      12s     24s     48s

So for example for a given base you have the number of milliseconds of each subsequent retry:

Base   Retry 1   Retry 2   Retry 3   Retry 4   Retry 5
  50        50       100       200       400       800
 100       100       200       400       800      1600
 200       200       400       800      1600      3200
2000      2000      4000      8000     16000     32000
3000      3000      6000     12000     24000     48000

If you wish to check the sequence for a given base you can try it on the REPL as follows:

(require 'safely.core)
(take 10 (#'safely.core/exponential-seq 2000))
;;=> (2000 4000 8000 16000 32000 64000 128000 256000 512000 1024000)

The randomization is applied after the exponential value has been calculated.

If you want to simulate the random variation as well, write as follows:

(require 'safely.core)
(->> (#'safely.core/exponential-seq 2000)
  (map #(safely.core/random % :+/- 0.50))
  (take 10))
;; => (2488 2152 6072 11159 46051 60235 65198 231233 573339 518515)

NOTE: Every execution will return different numbers.

:random-exp-backoff (with :max)

Additionally you can specify a maximum delay, beyond which the wait time stops growing and stays roughly constant.

;; Automatic retry with random exponential backoff and a max delay
(safely
  (http/get "http://user.service.local/users?active=true")
  :on-error
  :max-retries 10
  :retry-delay [:random-exp-backoff :base  3000 :+/- 0.50 :max 240000])

The above example sets a maximum delay of 4 minutes (240000 millis), beyond which safely won't back off exponentially any more; the delay will remain constant (with some random variation).

Example for the effect of :max 240000

(require 'safely.core)
;; without :max
(take 10 (#'safely.core/exponential-seq 3000))
;; => (3000 6000 12000 24000 48000 96000 192000 384000 768000 1536000)

;; with :max 240000
(take 10 (#'safely.core/exponential-seq 3000 240000))
;; => (3000 6000 12000 24000 48000 96000 192000 240000 240000 240000)
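The capping behaviour can be reproduced with plain Clojure sequences. This is only a sketch, under the assumption that the delay doubles at each step and is clamped with min; exp-delays is a hypothetical name, not safely's private exponential-seq function, and no randomization is applied.

```clojure
;; Hypothetical sketch, NOT safely's implementation.
(defn exp-delays
  "Lazy, infinite sequence of delays (millis) which doubles at every
  step, optionally capped at `max-delay`."
  ([base]
   (iterate #(* 2 %) base))
  ([base max-delay]
   (map #(min % max-delay) (exp-delays base))))

(take 10 (exp-delays 3000))
;;=> (3000 6000 12000 24000 48000 96000 192000 384000 768000 1536000)

(take 10 (exp-delays 3000 240000))
;;=> (3000 6000 12000 24000 48000 96000 192000 240000 240000 240000)
```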

:rand-cycle

If you don't like the exponential backoff, then you can specify a sequence of expected delays between each retry. safely will use these times (in milliseconds) and add randomization to compute the amount of delay between each retry. Once the last delay in the sequence is reached, safely will cycle back to the first number and repeat the sequence.

;; Automatic retry with random list of delays
(safely
  (http/get "http://user.service.local/users?active=true")
  :on-error
  :max-retries 6
  :retry-delay [:rand-cycle [1000 3000 5000 10000] :+/- 0.50])

In the above example I've specified the desired waiting times (with variation) of 1s, 3s, 5s and 10s. I've also specified that I would like safely to retry 6 times, but only 4 wait times were specified. safely will cycle back to the beginning of the sequence, producing effective waiting times of:

retry:     1     2     3     4      5     6
delay:   1000  3000  5000  10000  1000  3000
         |---------------------|  |---------...
   cycling back to the beginning of the sequence

In this way you can specify custom values which better suit your particular situation.
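The cycling itself can be illustrated with clojure.core/cycle. In this sketch cycle-delays is a hypothetical helper, not part of safely, and the random variation is left out so that the output is deterministic.

```clojure
;; Hypothetical sketch, NOT safely's implementation.
(defn cycle-delays
  "Returns the base delays (millis) used for `n` retries, cycling
  back to the beginning of `delays` when the list is exhausted."
  [delays n]
  (take n (cycle delays)))

(cycle-delays [1000 3000 5000 10000] 6)
;;=> (1000 3000 5000 10000 1000 3000)
```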

Errors logging

One common mistake is to have an empty catch block. In this case the exception is swallowed by the program without leaving any trace. There are very few occasions when this is a good idea; in most cases it is recommended to at least log the exception in a logging system. safely by default logs the exception with timbre. There are a few configurable options which you can leverage to make the message more suitable for your situation.

We have the :message, :log-errors, :log-level and :log-stacktrace options (see the quick ref-card above).

For example, this logs the exception with the given message and a log level of :info.

;; Customize logging
(safely
  (http/get "http://user.service.local/users?active=true")
  :on-error
  :message "Error while fetching active users"
  :log-level :info)

In this case we disable the error logging for the given block.

;; Disable logging
(safely
  (Thread/sleep 3000)
  :on-error
  :log-errors false)

It is possible to control the logging of the individual attempts by setting the :log-inner-* options. If no value is provided for a :log-inner-* option, it defaults to the value of the corresponding :log-* option. They are useful to reduce the log noise on individual attempts.

For example:

;; Customize logging
(safely
  (http/get "http://user.service.local/users?active=true")
  :on-error
  :message "Error while fetching active users"
  :max-retries 3
  :log-level :error
  :log-inner-level :debug)

This will log the errors of the individual attempts at :debug level, but should all the attempts up to :max-retries be exhausted, the final error is logged at :error level.

Automatic tracking (monitoring)

If you have (and you should) a monitoring system which tracks application metrics, then you can automatically track how many times a particular section protected by safely runs into errors.

Tracking is enabled by default, but if you wish to disable it, set :tracking :disabled.

If you want to track a particular section, all you need to do is to give a name to the section you are protecting with safely, using the :track-as option.

For example:

;; Automatic retry with random-range
(safely
  (http/get "http://user.service.local/users?active=true")
  :on-error
  :max-retries 3
  :retry-delay [:random-range :min 2000 :max 5000]
  :track-as ::fetch-active
  :circuit-breaker :fetch-active-users)

This will track the call events, providing several pieces of useful information about this single block, and publish them to a variety of monitoring systems.

For more information you can see the tracking page.

With μ/trace your safely expressions turn into traces which you can visualise with OpenZipkin compatible tracers.

Here is one example:

[image: mulog tracing]

Circuit breaker

The circuit breaker functionality (introduced in v0.5.0) implements a pattern popularised by M. T. Nygard's book "Release It!" (and its 2nd edition). There is already a good number of open-source libraries which offer quite good implementations of circuit breakers as defined by Nygard. The most popular is Hystrix from Netflix. However, over the years Hystrix became an unnecessarily large library. safely offers an implementation of the same ideas in a much simplified way and in 100% Clojure (for the JVM).

If you want to know more about the general idea behind the circuit breaker I would recommend the book "Release It!" mentioned above. Here I'm going to describe how safely's implementation works.

Internally the circuit breaker is a state machine which looks like this:

[image: circuit breaker state machine]

The state machine starts in the :closed state. As in an electrical circuit, a closed circuit is a working circuit in which the current can flow through.

:closed state

In this state the circuit breaker lets all the requests through. When a new request is issued, the circuit breaker retrieves the dedicated thread pool associated with this request type and enqueues the new request. Once the request is enqueued, an available thread will pick it up and process it. When the request completes, the circuit breaker updates its internal state, capturing the outcome of the request. In this case one of the following things can happen:

For any of the above outcomes the circuit breaker state machine updates a counter. Only counters for the last few seconds are kept and they are used by the state evaluation function to determine whether the circuit breaker should be tripped and move to the next state.

Currently the following strategies are available to trip the circuit breaker:

:open state

If the state evaluation function decides to trip the circuit because too many errors occurred, then the circuit breaker state machine goes into the :open state. In this state all incoming requests are rejected immediately with a :circuit-open error and the standard error path with retries is followed.

This is useful to immediately reduce the load on the target system. The circuit stays open for a few seconds (according to :grace-period) and then automatically transitions to the :half-open state.

:half-open state

The purpose of this state is to assess whether the target system is back to normal before closing the circuit again and allowing all the requests. For this purpose the circuit breaker allows only a few requests to pass and checks their outcome. If the system keeps failing then the circuit goes back to the :open state; if the requests are now successful and the issue seems to be resolved, then the circuit goes back to the :closed state. The same evaluation function used to trip the circuit open is used to evaluate whether the system is now back to normal.
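The transitions between the :closed, :open and :half-open states can be summarised with a small pure function. This is a simplified sketch, not safely's implementation: next-state, failure-rate and the shape of the counters map are hypothetical, only a failure-rate threshold is modelled, and the evaluation of the :half-open state is reduced to a single check of the sampled requests.

```clojure
;; Hypothetical sketch, NOT safely's implementation.
(defn failure-rate
  "Fraction of failed requests in the sampled counters."
  [{:keys [success failure]}]
  (let [total (+ success failure)]
    (if (zero? total) 0 (/ failure total))))

(defn next-state
  "Given the current state, the recent counters, the options and the
  time (millis) spent in the current state, returns the next state."
  [state counters {:keys [failure-threshold grace-period]} millis-in-state]
  (case state
    ;; too many failures trip the circuit open
    :closed    (if (>= (failure-rate counters) failure-threshold) :open :closed)
    ;; after the grace period a few requests are let through
    :open      (if (>= millis-in-state grace-period) :half-open :open)
    ;; the same evaluation decides whether to close or re-open
    :half-open (if (>= (failure-rate counters) failure-threshold) :open :closed)))

(def opts {:failure-threshold 0.5 :grace-period 3000})

(next-state :closed    {:success 4 :failure 6} opts 0)    ;;=> :open
(next-state :open      {:success 0 :failure 0} opts 3500) ;;=> :half-open
(next-state :half-open {:success 9 :failure 1} opts 0)    ;;=> :closed
```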

During the :half-open state, only a part of the incoming requests will be allowed. The number of the requests allowed depends on the :half-open-strategy.

These are the currently supported strategies:

How to use the circuit-breaker

To activate the circuit breaker function just add the :circuit-breaker option to your safely options:

;; activating circuit breaker
(safely
  (http/get "http://user.service.local/users?active=true")
  :on-error
  ;; give a name to the circuit-breaker
  :circuit-breaker :fetch-active-users
  ;; optionally set a timeout for this operation (millis)
  :timeout 30000)

That's it! In the background safely will create a thread pool named :fetch-active-users which will be in charge of processing the requests. You can use the circuit breaker in conjunction with all the other safely options such as retry strategies, logging and tracking.

NOTE: for every unique value passed to :circuit-breaker a number of resources need to be created in the system, namely the thread-pool and the circuit-breaker state machine. Therefore you must ensure that the values passed to the :circuit-breaker options are not randomly generated or high cardinality to avoid the risk of running out of memory in your system. Best practice is to name the circuit breaker after the operation that it is trying to accomplish.

Circuit breaker functions

shutdown-pools

For every named circuit breaker, safely will create its own dedicated thread pool. If you wish to shut down the pool programmatically, you can call the shutdown-pools function with a specific circuit breaker name, or without parameters to shut all of them down.

circuit-breaker-info

If you want to access the info stored in the state machine for monitoring purposes, you can use the circuit-breaker-info function, either with a circuit breaker name for the state of that specific circuit breaker, or without parameters for all of them.
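A minimal usage sketch of the two functions described above. It assumes that both are exposed by the safely.core namespace and that a circuit breaker named :fetch-active-users (the name used in the earlier example) has already been created. The shape of the returned info is not shown because it is not described here.

```clojure
;; ASSUMPTION: both functions are available in safely.core
(require '[safely.core :refer [circuit-breaker-info shutdown-pools]])

;; state of one specific circuit breaker (assumed to exist already)
(circuit-breaker-info :fetch-active-users)

;; state of all the circuit breakers
(circuit-breaker-info)

;; shut down the thread pool of one specific circuit breaker
(shutdown-pools :fetch-active-users)

;; shut down all the thread pools
(shutdown-pools)
```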

How to size the thread pool

You might think that a thread pool of 10 is very small for your system, and you might be tempted to increase this number by one order of magnitude. Although sometimes this is the correct thing to do, most of the time it won't be. The defaults are already set for large volume systems, so most of you won't need to change the size of the thread pool and/or the queue length. However, if you think you should change these values for your system, I would recommend using Little's Law (from Queueing Theory) to choose the correct size.

Little's Law says that the long term average number of items L in your system is equal to the average arrival rate λ multiplied by the long term average time W required to process an item, therefore:

L = λ * W

The interesting property of Little's Law is that it applies to the whole system as well as to its individual parts. This means that this law will apply to your system as a whole, meaning all the instances of your system in the cluster, as well as to the individual instances. Moreover, if your single instance has two possible paths with two different probabilities, it will apply to these sub-parts as well, with the parameters adjusted accordingly.

For example, if you have a system which processes 5000 requests/second as a whole, and you have 15 instances to serve these requests, and each request takes on average 25 milliseconds, then we can reason as follows:
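With the figures given above, the reasoning can be worked through with Little's Law. The sketch below uses exact rationals to avoid rounding; littles-law and the var names are illustrative only.

```clojure
;; Little's Law: L = lambda * W
(defn littles-law
  "Average number of in-flight requests, given the arrival rate
  (requests/second) and the average processing time (seconds)."
  [arrival-rate avg-time-sec]
  (* arrival-rate avg-time-sec))

(def system-rate  5000)          ;; requests/second for the whole system
(def instances    15)            ;; number of instances
(def avg-time-sec (/ 25 1000))   ;; 25 milliseconds, as an exact ratio

;; concurrent requests in the whole system
(def system-load (littles-law system-rate avg-time-sec))
system-load
;;=> 125

;; arrival rate of a single instance (about 333.33 requests/second)
(def instance-rate (/ system-rate instances))

;; concurrent requests in a single instance (Li)
(def instance-load (littles-law instance-rate avg-time-sec))
instance-load
;;=> 25/3
```

With about 8.33 concurrent requests per instance (25/3), the load of each instance fits within a thread pool of 10.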

As you can see, although your system as a whole processes a lot of requests per second, the concurrent load of an individual instance, Li, will be within the range of the thread pool. If we size the thread pool a bit larger to cope with bursts of requests, and we add a small queue (typically 30%-50% of the thread pool size), we can ensure that occasional hiccups and bursts of requests are handled properly without causing the circuit breaker to trip.

I hope this small guide helps you to correctly size your system. In any case, always use measurements (tracking, monitoring) to compute the right size, and verify your changes against your assumptions to see if the change had the effect you hoped for.

Macro vs function

safely is a Clojure macro which wraps your code with a try/catch and offers an elegant declarative approach to error management. However, in many cases a macro can't be used easily; for this reason we provide a function as well.

Everything you can do with the macro safely you can do with the function safely-fn, which takes a thunk (a function with zero arguments) and the same options which safely takes after the :on-error clause.

So for example this is the use of the macro you have seen so far:

;; Automatic retry with random-range
(safely
  (http/get "http://user.service.local/users?active=true")
  :on-error
  :max-retries 3
  :retry-delay [:random-range :min 2000 :max 5000])

This is the same example but with the safely-fn instead:

;; Automatic retry with random-range
(safely-fn
  (fn []
    (http/get "http://user.service.local/users?active=true"))

  :max-retries 3
  :retry-delay [:random-range :min 2000 :max 5000])

Note the use of the thunk to wrap the code and the absence of the :on-error keyword.

Testing and the sleepless-mode

If you are writing automated tests and you don't want to wait, you can enable the sleepless-mode in order to skip the waiting times between retries. For example:

This might wait up to 40s before returning "".

;; this might wait up to 40s before returning ""
(safely
  (slurp "/not/existing/file")
  :on-error
  :max-retries 5
  :default "")

This one does the same number of retries but doesn't sleep and it returns immediately (same code path, but no sleep).

;; This one does the same number of retries but doesn't sleep
(binding [safely.core/*sleepless-mode* true]
  (safely
    (slurp "/not/existing/file")
    :on-error
    :max-retries 5
    :default ""))

License

Copyright © 2015-2024 Bruno Bonacci

Distributed under the Apache License v 2.0 (http://www.apache.org/licenses/LICENSE-2.0)