JuliaMath / QuadGK.jl

adaptive 1d numerical Gauss–Kronrod integration in Julia
MIT License

Parallelization of integrand evaluations #80

Closed lxvm closed 1 year ago

lxvm commented 1 year ago

This is a draft PR to parallelize quadgk and quadgk! over evaluation points. Following a procedure similar to the one discussed by Gladwell in "Vectorisation of One-Dimensional Quadrature Codes" (https://doi.org/10.1007/978-94-009-3889-2_24), I remove all the segments in the heap that need to be refined and then evaluate the integrand at all the Kronrod points of all the new segments in parallel. I went with a simple Threads.@threads approach, since I wasn't sure whether something more sophisticated such as FLoops.jl would provide much additional benefit. Here is an example showing the speedup on a simulated costly integrand:

julia> using QuadGK

julia> s1(x) = sin(13x)     # integrand requiring subdivisions
s1 (generic function with 1 method)

julia> s2(x) = (sleep(0.005); s1(x))  # costly integrand
s2 (generic function with 1 method)

julia> Threads.nthreads()  # show threads
6

julia> p = QuadGK.Parallel(Float64, Float64, Float64) # construct parallelization buffer

julia> @time quadgk(s1, 0, 1)        # very fast
  0.000007 seconds (2 allocations: 368 bytes)
(0.007119478349984942, 8.85874706924028e-13)

julia> @time quadgk(s1, 0, 1, parallel=p)  # slower because of threading overhead
  0.000239 seconds (336 allocations: 34.984 KiB)
(0.007119478349984942, 8.85874706924028e-13)

julia> @time quadgk(s2, 0, 1)      # slow when sequential and costly
  0.663973 seconds (531 allocations: 16.312 KiB)
(0.007119478349984942, 8.85874706924028e-13)

julia> @time quadgk(s2, 0, 1, parallel=p)   # faster when parallel and costly
  0.056169 seconds (819 allocations: 50.031 KiB)
(0.007119478349984942, 8.85874706924028e-13)
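
For reference, the refinement step described above can be sketched roughly as follows. This is a simplified illustration, not the PR's actual code: the helpers `pop_worst!`, `total_error`, and `evalrule` are hypothetical stand-ins for QuadGK's heap operations and Gauss–Kronrod rule evaluation.

```julia
# Sketch: pop every segment that still exceeds the tolerance,
# bisect each one, then evaluate all new halves in parallel.
function refine_parallel!(f, heap, atol)
    halves = NTuple{2,Float64}[]
    while !isempty(heap) && total_error(heap) > atol
        s = pop_worst!(heap)               # segment with the largest error
        m = (s.a + s.b) / 2
        push!(halves, (s.a, m), (m, s.b))  # bisect it
    end
    results = Vector{Any}(undef, length(halves))
    Threads.@threads for i in eachindex(halves)
        a, b = halves[i]
        results[i] = evalrule(f, a, b)     # Gauss–Kronrod rule on each half
    end
    foreach(r -> push!(heap, r), results)  # re-insert the refined segments
    return heap
end
```

The key point is that all integrand evaluations for a whole batch of bisected segments happen inside one parallel loop, so a costly integrand amortizes the threading overhead.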

TODO:

  1. Decide on multithreading framework
  2. Reduce allocations in parevalrule for InplaceIntegrand
codecov[bot] commented 1 year ago

Codecov Report

Patch coverage: 99.31% and project coverage change: +0.28%

Comparison is base (cb745d4) 97.87% compared to head (d560001) 98.16%.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master      #80      +/-   ##
==========================================
+ Coverage   97.87%   98.16%   +0.28%
==========================================
  Files           5        6       +1
  Lines         471      599     +128
==========================================
+ Hits          461      588     +127
- Misses         10       11       +1
```

| Files Changed | Coverage Δ |
|---|---|
| src/QuadGK.jl | 100.00% <ø> (ø) |
| src/batch.jl | 99.15% <99.15%> (ø) |
| src/adapt.jl | 96.66% <100.00%> (+0.27%) |


stevengj commented 1 year ago

I think we just want to provide the caller with the array of points to evaluate and let them parallelize it (or not) using threads, the GPU, distributed-memory, or whatever.

One way to expose this API would be to define a wrapper around the integrand, e.g.:

struct BatchIntegrand{F,X,Y}
    f!::F # in-place function f!(y, x) that takes an array of x values and outputs an array of results in-place
    max_batch::Int # maximum number of x to supply in parallel (defaults to typemax(Int))
    x::Vector{X}
    y::Vector{Y}
end
BatchIntegrand(f!::F, ::Type{X}, ::Type{Y}; max_batch::Integer=typemax(Int)) where {F,X,Y} =
    BatchIntegrand{F,X,Y}(f!, max_batch, X[], Y[])
BatchIntegrand(f!::F, x::Vector{X}, y::Vector{Y}; max_batch::Integer=typemax(Int)) where {F,X,Y} =
    BatchIntegrand{F,X,Y}(f!, max_batch, x, y)
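
Under this design, the user decides how (or whether) to parallelize the batch inside f!. A usage sketch against the constructor proposed above (not a final API):

```julia
using QuadGK

# The caller owns the parallelization; here, threads over the batch.
function f!(y, x)
    Threads.@threads for i in eachindex(x, y)
        y[i] = sin(13 * x[i])   # stand-in for an expensive integrand
    end
end

bf = QuadGK.BatchIntegrand(f!, Float64, Float64)  # X and Y types supplied up front
I, E = quadgk(bf, 0, 1)
```

The same f! could just as easily dispatch the batch to a GPU kernel or a distributed map, which is the point of exposing the whole array of evaluation points.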

It's a little annoying that it doesn't get X from the type of the endpoints. One possibility would be to have a constructor:

BatchIntegrand(f!::F, ::Type{Y}; max_batch::Integer=typemax(Int)) where {F,Y} =
    BatchIntegrand{F,Nothing,Y}(f!, max_batch, Nothing[], Y[])

and then if the quadgk function sees a BatchIntegrand{F,Nothing,Y}, it can replace it with a BatchIntegrand{F,X,Y} where X is taken from the endpoint type.
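
That promotion step might look something like the following sketch, assuming the BatchIntegrand struct above and using Nothing as the placeholder point type (the helper name `fill_x_type` is illustrative, not an actual QuadGK function):

```julia
# Hypothetical sketch: fill in the point type X from the endpoints
# when the caller did not specify it in the constructor.
function fill_x_type(bf::BatchIntegrand{F,Nothing,Y}, a, b) where {F,Y}
    X = promote_type(typeof(a), typeof(b))       # endpoint type, e.g. Float64
    return BatchIntegrand{F,X,Y}(bf.f!, bf.max_batch, X[], bf.y)
end
```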

lxvm commented 1 year ago

I wrote a draft implementation of BatchIntegrand that can be tested with the following example integrand

julia> using QuadGK, Unitful

julia> f(x) = exp(-(x/(1.0u"m"))^2)
f (generic function with 1 method)

julia> quadgk(f, 0.0u"m", Inf*u"m")
(0.8862269254527579 m, 8.595134596701424e-9 m)

julia> f!(y, x) = y .= f.(x)
f! (generic function with 1 method)

julia> quadgk(QuadGK.BatchIntegrand(f!, Float64, typeof(0.0u"m")), 0.0u"m", Inf*u"m")
(0.8862269254527579 m, 8.595134596694648e-9 m)

Introducing BatchIntegrand also changes the default behavior of quadgk with relative tolerances, since it can simultaneously adapt all of the segments in the heap that exceed the "current" tolerance. Previously the "current" tolerance was updated after each individual segment was refined; now I changed the default behavior to match BatchIntegrand. I expect the computational effort to be the same when the caller performs singularity separation on their integrand. Otherwise, for integrands with multiple peaks in an interval, previous versions would sometimes resolve only one dominating peak, whereas now more or all of the peaks can be captured. For example:

julia> f(x)    = sin(x)/(cos(x)+im*1e-5)   # peaked "nice" integrand
f (generic function with 1 method)

julia> imf(x)  = imag(f(x))                # peaked difficult integrand
imf (generic function with 1 method)

julia> imf2(x) = imf(x)^2                  # even more peaked
imf2 (generic function with 1 method)

julia> x0 = 0.1    # arbitrary offset between peaks
0.1

julia> using QuadGK

julia> quadgk(x -> imf2(x) + imf2(x-x0), 0, 2pi, rtol = 1e-5)   # previous versions
(235619.45750214785, 0.8317222055900457)

julia> quadgk(x -> imf2(x) + imf2(x-x0), 0, 2pi, rtol = 1e-5)   # with this pr
(628318.5307008666, 3.1850136117936505)

The implementation of BatchIntegrand is a stateful iterator yielding segments, because the state lives in the buffer of integrand evaluations, which may contain points corresponding to fewer or more than one segment. The heap also acts as a stateful iterator that pops off segments one at a time until the error drops below the tolerance. To reuse as much existing code as I could (which isn't much), I try to hide the fact that these iterators are stateful by collecting their results into tuples. I do this with recursion, but I believe it leads to a lot of code generation, and there is still considerable type instability within the iterators that I attribute to the variables holding the iterator state. I would appreciate any ideas for mitigating the type instability (and thus allocations), as well as for iterating only once over the stateful iterators so they need not be stored in tuples.

lxvm commented 1 year ago

I simplified the implementation of BatchIntegrand by requiring that max_batch >= (4*order+2) and by updating the integral estimate after each batch of Gauss–Kronrod rules is evaluated. The batch integrator now refines as many segments as the current tolerance requires, up to the limit of how many fit into the batch buffer at once. Since this can affect numerical results obtained with a relative tolerance, I added a docstring example on the effect of max_batch.
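
The arithmetic behind the max_batch >= (4*order+2) requirement, for the default rule order:

```julia
order = 7                        # quadgk's default rule order
kronrod_points = 2order + 1      # 15 Kronrod evaluation points per segment
per_bisection = 2kronrod_points  # bisecting one segment yields two new
                                 # segments, i.e. 30 evaluations
@assert per_bisection == 4order + 2
# so max_batch must hold at least 4order + 2 points: one full bisection
```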

When refactoring the original adaptation code, I also had to change the return type of adapt to handle type-unstable integrands, so I also put the final re-summation of the integral into its own routine.

If more simplifications are possible, I can attempt to implement them. Otherwise, it might be possible to reduce allocations in the batching code, though some may be unavoidable due to resize! of the buffers or the reshuffling of elements in the heap (which could probably be done better).

stevengj commented 1 year ago

This is looking good! Thanks for working on this!

Note that we will also want to add a new section to the manual on this, giving some explanation/motivation and a short example.

lxvm commented 1 year ago

Thanks! I'll start writing a page in the manual and will follow up on the review in another commit.

lxvm commented 1 year ago

Examples were added to the manual and the constructors were modified as previously discussed. I also added BatchIntegrand to the exported API. Thank you for reviewing this PR!

lxvm commented 1 year ago

Perhaps the only thing this PR leaves to be desired is a batched, vector-valued integrand combining both BatchIntegrand and InplaceIntegrand. It would be great if this could be done by adding additional dimensions to the output array y in f!(y,x) (e.g. with https://github.com/JuliaArrays/ElasticArrays.jl), but it makes sense to do that in a future PR.

lxvm commented 1 year ago

I was able to implement a batched, in-place integrand by adding additional dimensions to the y buffer. This mostly reuses the existing code by defining a method quadgk!(::BatchIntegrand, ...) that internally creates a BatchIntegrand(InplaceIntegrand(...)). I started this on a branch (https://github.com/lxvm/QuadGK.jl/tree/batchinplace), but perhaps it would be best as a package extension, since it would really only work with ElasticArrays.jl.

stevengj commented 1 year ago

Upon reflection, I think it would be more useful to make max_batch act like max_evals — it should be a rough limit on the batch size, but not a hard limit. That way, the user doesn't have to worry about accidentally making max_batch slightly too small.

(The main purpose of max_batch is to not allow the buffer allocation to be unbounded, and for this purpose I don't think the user cares if you allocate a slightly bigger buffer. Whereas having quadgk throw an exception if you accidentally forget to change max_batch when you change the order or the minimum number of segments is a huge pain for the user.)

lxvm commented 1 year ago

I agree that making max_batch behave as a soft limit is convenient. We can implement it by always evaluating the minimum number of segments needed for any call, which amounts to changing the inner loop in refine from

while len > nsegs && 2m*(nsegs+1) <= f.max_batch && E > tol && numevals < maxevals
    ...
end

to

while len > nsegs && 2m*nsegs < f.max_batch && E > tol && numevals < maxevals
    ...
end

and removing the errors thrown for small values of max_batch.

lxvm commented 1 year ago

@stevengj I rebased this PR on master; I think it is ready if it also looks good to you.

lxvm commented 1 year ago

@stevengj Do you think this is finished?

lxvm commented 1 year ago

@stevengj I updated the docs and made the changes you suggested, so it should be ready to go.

There is also a straightforward path to extending this PR to in-place, batched integrands. If the y buffer is a VectorOfSimilarArrays wrapping an ElasticArray whose last dimension is the batch dimension, then the behavior of BatchIntegrand is identical, except that the integrand values are mutable array views. We could make this work with a quadgk!(f!::BatchIntegrand, result, a, b) method that internally creates a BatchIntegrand(InplaceIntegrand(f!, ...), ...), plus a few if-statement branches to ensure in-place operations are used on the integrand values.
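
A sketch of how that y buffer could be built, assuming ElasticArrays.jl (resizable last dimension) and ArraysOfArrays.jl (which provides `nestedview` for viewing slices as a vector of arrays); this is an illustration of the idea, not code from the PR:

```julia
using ElasticArrays   # arrays resizable along their last dimension
using ArraysOfArrays  # nestedview: vector-of-slices view of an array

n = 3                 # length of each vector-valued integrand result
batch = 4             # number of quadrature points in this batch
buf = ElasticArray{Float64}(undef, n, 0)
resize!(buf, n, batch)     # grow only the batch (last) dimension
y = nestedview(buf)        # y[i] is a view of column i of buf

# In-place, batched integrand: each y[i] is written without allocating.
function f!(y, x)
    for i in eachindex(x, y)
        y[i] .= (x[i], x[i]^2, x[i]^3)
    end
end
```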

stevengj commented 1 year ago

Thanks for keeping at it!