JuliaImageRecon / RegularizedLeastSquares.jl

MIT License

State-based multi-threading #90

Closed nHackel closed 1 month ago

nHackel commented 1 month ago

So far RegularizedLeastSquares.jl offers support for two types of multi-threading which are both transparent to the package.

The highest-level of multi-threading is just when a user creates multiple solvers in parallel (potentially with different parameters):

# Preallocate the results to avoid a data race from push! across threads;
# Threads.@threads also does not support iterating over enumerate directly.
results = Vector{Any}(undef, length(operators))
Threads.@threads for i in eachindex(operators)
  solver = createLinearSolver(CGNR, operators[i], iterations = ...)
  results[i] = solve!(solver, b[:, i])
end

Then we also support low-level multi-threading within the linear operators given to the package. This level is completely transparent to us. MRIReco, for example, uses both of these multi-threading options (with @floop instead of the Threads version).

This PR takes advantage of the split between a solver and its state to add a middle level of multi-threading. This level applies the same solver (and its parameters) to multiple measurement vectors, i.e. a measurement matrix B.

Solvers and their state are currently defined as follows:

mutable struct Solver{matT, ...}
  A::matT
  # Other "static" fields
  state::AbstractSolverState{<:Solver}
end

mutable struct SolverState{T, tempT} <: AbstractSolverState{Solver}
  x::tempT
  rho::T
  # ...
  iteration::Int64
end

While both the solver and its state are mutable structs, the solver struct is intended to be immutable during a solve! call. This allows us to simply copy the state, initialize each copy with a slice of B, and then perform iterations on each state separately. Since the solver holds an AbstractSolverState field, we can exchange it for a new state that in turn holds the usual states, one per slice. While this is technically a type instability, the iterate method always dispatches on the state type as well, so our hot loops are not affected by it.
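The state-swap pattern can be sketched with a minimal, self-contained toy (this is not the package's actual implementation; the names `ToySolver`, `SingleState`, `MultiState`, and `step!` are made up for illustration):

```julia
abstract type AbstractState end

mutable struct ToySolver
  state::AbstractState          # abstractly-typed field, swapped out at init! time
end

mutable struct SingleState <: AbstractState
  x::Vector{Float64}
end

mutable struct MultiState <: AbstractState
  states::Vector{SingleState}   # one inner state per slice of B
end

# Dispatching on the concrete state type keeps the inner loop type-stable
step!(solver::ToySolver, state::SingleState) = (state.x .+= 1)
step!(solver::ToySolver, state::MultiState) = foreach(s -> step!(solver, s), state.states)

# A single dynamic dispatch at the top; everything below is type-stable
step!(solver::ToySolver) = step!(solver, solver.state)
```

The one type-unstable access (`solver.state`) happens once per call, outside the hot loop, which is why the instability is harmless in practice.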

To make this feature optional and hackable for users, I've introduced a new keyword scheduler to the solve! call, which is passed on to the init! call if a user provides a measurement matrix. The scheduler defaults to a simple sequential implementation without multi-threading. Out of the box we also support multi-threading via Threads.@threads with scheduler = MultiThreadingState. The scheduler is treated like a callable that is invoked with a copy of all necessary states.
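For illustration, using the built-in scheduler might look like this (a sketch; B is assumed to be a matrix whose columns are the individual measurement vectors, and the `iterations` value is arbitrary):

```julia
solver = createLinearSolver(CGNR, A, iterations = 10)

# Default: sequential processing of the columns of B
X_seq = solve!(solver, B)

# Opt-in: one state per column, iterated via Threads.@threads
X_thr = solve!(solver, B; scheduler = MultiThreadingState)
```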

As an example, I sketch how to implement a custom @floop scheduler:

# Struct definition: we track both the states and whether each is still active.
# Since most solvers have convergence criteria, they can finish at different iteration numbers.
# At the moment, the generic functions expect scheduler structs to have `states` and `active` properties.
mutable struct FloopState{S, ST <: AbstractSolverState{S}} <: AbstractMatrixSolverState{S}
  states::Vector{ST}
  active::Vector{Bool}
  FloopState(states::Vector{ST}) where {S, ST <: AbstractSolverState{S}} = new{S, ST}(states, fill(true, length(states)))
end

# To hook into the existing init! code, we only have to supply a method that gets a copyable "vector" state.
# This will invoke our FloopState constructor with copies of the given state.
prepareMultiStates(solver::AbstractLinearSolver, state::FloopState, b::AbstractMatrix) = prepareMultiStates(solver, first(state.states), b)

using FLoops

# This iterate function is called with the indices of the still-active states
function iterate(solver::AbstractLinearSolver, state::FloopState, activeIdx)
  @floop for i in activeIdx
    res = iterate(solver, state.states[i])
    if isnothing(res)
      state.active[i] = false
    end
  end
  return state.active, state
end

# To use this scheduler, we now simply call
solver = createLinearSolver(CGNR, A, ...)
solve!(solver, B; scheduler = FloopState)

A solver is also not locked into multi-threading once it has been passed a matrix. It seamlessly switches back to normal execution when given a vector.
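Sketched against the API above (the shapes of b and B are assumptions), switching between the two modes is just a matter of what is passed to solve!:

```julia
# Matrix input: the state is swapped for a multi-state, one per column
X = solve!(solver, B; scheduler = MultiThreadingState)

# Vector input on the same solver: back to a single, sequential state
x = solve!(solver, b)
```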

Note that switching out the state, as is done here, also allows one to specify special variants of a solver, as is done in this PR.

Furthermore, we can now also specify special algorithm variants that work directly on B rather than on slices of it. An example of this could be an implementation of Kaczmarz as described in this work.