FluxML / FluxTraining.jl

A flexible neural net training library inspired by fast.ai
https://fluxml.ai/FluxTraining.jl
MIT License

Schedule callbacks to run asynchronously and in parallel #85

Open lorenzoh opened 3 years ago

lorenzoh commented 3 years ago

Using the callback dependency graph, it's possible to determine which callbacks access what state and which callbacks must run before others.

Hence it should be possible to run callbacks that access unrelated state in parallel or asynchronously (or both). For example, an integration with an experiment tracking backend like Weights & Biases needs to perform network requests; these could run in the background so as not to slow down the training loop (see the sketch below).
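To make the idea concrete, here is a minimal sketch (not FluxTraining API) of a logging callback that performs its slow network requests on a background task, so the training step only pays the cost of enqueueing a payload. `RemoteLogger`, `log!`, and `send_metrics` are hypothetical names for illustration:

```julia
# Hypothetical sketch: enqueue metrics and send them from a background task.
struct RemoteLogger
    queue::Channel{Dict{Symbol,Any}}
    worker::Task
end

function RemoteLogger(send_metrics)
    queue = Channel{Dict{Symbol,Any}}(128)    # buffer of pending log payloads
    worker = @async for payload in queue      # drains the queue in the background
        send_metrics(payload)                 # the slow network request happens here
    end
    RemoteLogger(queue, worker)
end

# Called from the (synchronous) training loop: just enqueue and return.
log!(logger::RemoteLogger, payload) = put!(logger.queue, payload)

# Usage: `sleep` stands in for an HTTP call to the tracking backend.
logger = RemoteLogger(p -> (sleep(0.1); @info "sent" p))
log!(logger, Dict{Symbol,Any}(:epoch => 1, :loss => 0.42))
```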

In practice, using the information from the dependency graph and whether each state access is a read or a write, a Dagger.jl DAG could be constructed and run asynchronously. Callbacks that write state, like ToGPU, would still run synchronously.
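A rough sketch of that split, assuming the dependency graph has already partitioned callbacks into writers (which must run serially, in topological order) and read-only callbacks (which can run concurrently once the writers are done). `run_callback` is a hypothetical helper, not part of the library:

```julia
using Dagger

function handle_event(write_cbs, read_cbs, state)
    for cb in write_cbs
        run_callback(cb, state)    # state-mutating callbacks run serially, in order
    end
    # read-only callbacks now see a consistent state and can run in parallel
    tasks = [Dagger.@spawn run_callback(cb, state) for cb in read_cbs]
    foreach(fetch, tasks)          # or skip fetching to let them run fully async
end
```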

There already exists an extension interface, FluxTraining.CallbackExecutor, that controls how callbacks are executed. The default (and so far only) implementation performs a topological sort and executes the callbacks serially.
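A parallel executor could instead group the sorted graph into "levels": callbacks in the same level have no dependencies between each other, so each level can run in parallel while the levels themselves stay ordered. This self-contained sketch (not the real FluxTraining internals) assumes `deps[cb]` maps each callback to the callbacks that must run before it:

```julia
# Group callbacks into dependency levels via repeated topological peeling.
function level_schedule(callbacks, deps)
    remaining = Set(callbacks)
    done = Set{eltype(callbacks)}()
    levels = Vector{Vector{eltype(callbacks)}}()
    while !isempty(remaining)
        # everything whose dependencies are all satisfied can run in this level
        level = [cb for cb in remaining if all(d -> d in done, deps[cb])]
        isempty(level) && error("dependency cycle among callbacks")
        push!(levels, level)
        union!(done, level)
        setdiff!(remaining, level)
    end
    return levels
end

# Run each level on worker threads; `wait` acts as a barrier between levels.
function run_parallel(levels, run)
    for level in levels
        tasks = [Threads.@spawn run(cb) for cb in level]
        foreach(wait, tasks)
    end
end
```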