JuliaParallel / Dagger.jl

A framework for out-of-core and parallel execution

Thunk retry options and other supervisors #323

Open chris-b1 opened 2 years ago

chris-b1 commented 2 years ago

This may or may not make sense at the Dagger level, but for consideration: retry options for thunks. As an example, borrowing Prefect's keywords:

import Dates
import Dagger

function flaky_function()
    # accesses some network resource that could fail
end

res = Dagger.@spawn max_retries=3 retry_delay=Dates.Minute(1) flaky_function()

jpsamaroo commented 2 years ago

This is something we should have somewhere, but not necessarily in Dagger's core, since we may want more "supervisory actions" than just retries and delay-based retry. For example, we might want to trigger a retry based on an active signal (such as an error asynchronously delivered via a library API). Or we might want to retry with backoff, or do a more complicated set of failure recovery steps that depends on the state of multiple thunks.

Instead of building this in directly, this functionality could be implemented with a supervisor thunk which launches and monitors flaky_function:

import Dagger

function supervisor(f, args...)
    max_attempts = 3
    for i in 1:max_attempts
        try
            # Run f as its own thunk and wait for its result
            return fetch(Dagger.@spawn f(args...))
        catch err
            if i == max_attempts
                # Out of attempts: propagate the last failure
                rethrow()
            else
                @debug "Failed to execute $f on iteration $i, retrying in 1 second..."
                sleep(1)
            end
        end
    end
end

function flaky_function(x, y, z)
    if rand() < 0.5
        return x + y + z
    else
        error("Transient error")
    end
end

fetch(Dagger.@spawn supervisor(flaky_function, 1, 2, 3))

We could put such supervisor functions into their own package (which could be a subpackage of this repo), perhaps DaggerSupervisors.jl?
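
The retry-with-backoff case mentioned above could follow the same pattern. Here is a minimal sketch in plain Julia, assuming a hypothetical helper named retry_with_backoff (it is not part of Dagger) whose keyword names max_attempts, base_delay and factor are chosen only for illustration. The flaky function is a stand-in that fails twice and then succeeds:

```julia
# Hypothetical helper (not part of Dagger): call `f`, retrying with
# exponentially growing delays between failed attempts.
function retry_with_backoff(f; max_attempts::Int=3, base_delay::Real=1.0, factor::Real=2.0)
    delay = base_delay
    for attempt in 1:max_attempts
        try
            return f()
        catch
            # Out of attempts: propagate the last failure to the caller
            attempt == max_attempts && rethrow()
            @debug "Attempt $attempt failed, retrying in $delay seconds..."
            sleep(delay)
            delay *= factor  # wait longer before the next attempt
        end
    end
end

# Stand-in for a flaky task: fails on the first two calls, then succeeds.
calls = Ref(0)
function flaky()
    calls[] += 1
    calls[] < 3 && error("Transient error")
    return calls[]
end

result = retry_with_backoff(flaky; max_attempts=5, base_delay=0.01)
```

Inside a supervisor thunk, the function passed in would presumably be a closure such as () -> fetch(Dagger.@spawn flaky_function(1, 2, 3)), so that each attempt launches a fresh thunk, as in the supervisor above.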