JuliaMath / HCubature.jl

pure-Julia multidimensional h-adaptive integration

slow hcubature for large vector-valued integrands #30

Open mzaffalon opened 3 years ago

mzaffalon commented 3 years ago

I am integrating L(f) = \int K.(f .- g(x)) dx, where f is a vector of 100–1000 elements, g and K are scalar functions, and the integration is over R^3. The integrand has the same dimension as f because it is easy to vectorize the computation of K.

HCubature does much worse than Cubature here: its memory allocation is about 100 times that of Cubature (SVectors do not perform well for large vectors, although the problem may have nothing to do with SVector) and it is about 20 times slower. I am not sure whether part of the problem is also that Cubature lets the integrand copy its result in place.

Here is a MWE with the timings on my machine. Is there a workaround, or would you consider an integrand signature with an extra argument that accepts an in-place return value?

using Cubature
using HCubature
using StaticArrays: SVector
using BenchmarkTools

function test_HCubature(sf)
    cnt = 0  # number of integrand evaluations

    function integr(pt)
        cnt += 1
        # vector-valued integrand: one entry per element of sf, allocated each call
        inv.(1 .+ (sf .+ pt[1]*cos(pt[2])*sin(pt[3])).^2)
    end
    xmin, xmax = [0.0, 0, -2], [0.5, 2π, +2]
    r, e = HCubature.hcubature(integr, xmin, xmax; rtol=1e-1, atol=1e-1)
    #println("iter: $cnt")
    r, e
end

function test_Cubature(f)
    cnt = 0  # number of integrand evaluations
    function integr(pt, res)
        cnt += 1
        # in-place integrand: write the result into the pre-allocated res
        res .= inv.(1 .+ (f .+ pt[1]*cos(pt[2])*sin(pt[3])).^2)
    end
    xmin, xmax = [0.0, 0, -2], [0.5, 2π, +2]
    r, e = Cubature.hcubature(length(f), integr, xmin, xmax; reltol=1e-1, abstol=1e-1)
    #println("iter: $cnt")
    r, e
end

f = collect(range(-3,3,length=101))
sf = SVector{length(f)}(f)

julia> @benchmark test_HCubature($sf)
BenchmarkTools.Trial:
  memory estimate:  1.18 MiB
  allocs estimate:  60213
  --------------
  minimum time:     2.439 ms (0.00% GC)
  median time:      2.578 ms (0.00% GC)
  mean time:        2.689 ms (1.55% GC)
  maximum time:     4.703 ms (34.03% GC)
  --------------
  samples:          1858
  evals/sample:     1

julia> @benchmark test_Cubature($f)
BenchmarkTools.Trial:
  memory estimate:  8.94 KiB
  allocs estimate:  160
  --------------
  minimum time:     15.500 μs (0.00% GC)
  median time:      16.800 μs (0.00% GC)
  mean time:        18.832 μs (2.05% GC)
  maximum time:     2.109 ms (98.33% GC)
  --------------
  samples:          10000
  evals/sample:     1

EDIT: timing with @benchmark.

stevengj commented 3 years ago

Yes, Cubature works much more in-place for vector-valued integrands than HCubature does. It's not just a matter of providing an in-place integrand function, however; a lot of other parts of the code would need to be reworked to operate in place.

@ChrisRackauckas, do you have any advice here from your experience with DifferentialEquations? .= doesn't work for scalars or other immutables, so I'm not sure of the best way to write code that is in-place for vectors but also works for scalars.
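
For illustration, a minimal editorial sketch (not from the thread) of why .= forces a separate scalar path:

out = zeros(3)
out .= [1.0, 2.0, 3.0]   # fine: overwrites the existing array in place

x = 0.0
# x .= 1.0               # errors: a Float64 is immutable and cannot be mutated
x = 1.0                  # scalars must be rebound instead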

(A lot of realistic problems involve more expensive integrands, so that the overhead of the extra allocations matters less.)

ChrisRackauckas commented 3 years ago

I don't have good advice, but I can tell you how we do it in DifferentialEquations.jl. It's not really possible to handle in-place and out-of-place in a single set of functions, so in the internals of the time stepping we have both versions: one optimized for the static-array and reverse-mode AD case (reverse mode is another case where allocating array operations are helpful), and the other a fully non-allocating, mutating version. For example, look at the implementation of Dormand-Prince:

https://github.com/SciML/OrdinaryDiffEq.jl/blob/master/src/perform_step/low_order_rk_perform_step.jl#L728-L819

My idea here is that, in the library, we might as well code the fastest thing we can until the compiler can handle it better, which it can't right now. There are too many edge cases like if inplace; @. x = y; else; x = @. y; end. And you cannot determine it from type information: for example, with reverse-mode AD on Array you want to use out-of-place instead of in-place, so it needs to be a user choice. And in a higher-order function you need to switch between dx = f(x) and f!(dx, x) (in theory you could have an API of dx = f!(dx, x), but that sounds bug-prone IMO).
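
As an editorial sketch of that dual-path pattern (hypothetical names, far simpler than the OrdinaryDiffEq code linked below):

# out-of-place path: SVector / reverse-mode AD case; allocates the result
euler_step(f, u, dt, ::Val{false}) = u + dt * f(u)

# in-place path: fully mutating and non-allocating
function euler_step!(du, f!, u, dt, ::Val{true})
    f!(du, u)          # RHS writes its output into du
    u .+= dt .* du     # broadcasted in-place update, no allocation
    return u
end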

The next thing to address then is how to automatically detect this. What we do is look at the current method table:

https://github.com/SciML/DiffEqBase.jl/blob/master/src/utils.jl#L1-L46

and then use this to place a type-parameter in the problem construction:

https://github.com/SciML/DiffEqBase.jl/blob/master/src/problems/ode_problems.jl#L72

So after that point it's type information. That lets the user override it if necessary, and lets the user supply it as a literal to make things type-stable. While it feels a little hacky, in practice we've been doing this for 4 years and no one has really complained. The breaking scenario is a user who defines f(u,p,t), then f(du,u,p,t), and then wants the out-of-place behavior but is unwilling to write ODEProblem{false}(...). In theory that could happen, but it has not shown up in user issues (because normally it's f vs f!, etc.), so it has generally worked with only very light documentation of the fact that it can be overridden.
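
A condensed editorial sketch of that detect-then-parameterize idea (DiffEqBase's isinplace is considerably more careful than this):

struct Problem{iip,F}   # iip ("is in-place") is a Boolean type parameter
    f::F
end

# crude arity check: an in-place f!(du, u) has two arguments;
# Method.nargs counts the function itself, hence the 3
isinplace(f) = any(m -> m.nargs == 3, methods(f))

Problem(f) = Problem{isinplace(f),typeof(f)}(f)          # automatic detection
Problem{iip}(f) where {iip} = Problem{iip,typeof(f)}(f)  # explicit user override

solve(p::Problem{true})  = "take the mutating path"
solve(p::Problem{false}) = "take the out-of-place path"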

One other semi-important detail is that we pair this with a separate broadcast implementation (@..) that forces @simd ivdep for the internal uses where we know the arrays do not alias. This lets all of the internal operations be broadcasts, and thus support GPUs as well, without losing optimality on CPUs.
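
For context, @.. is used like Base's @. (a sketch; the macro nowadays lives in the FastBroadcast.jl package):

using FastBroadcast     # provides @..

du, u, k = zeros(4), rand(4), rand(4)
@.. du = u + 2.0 * k    # fused, mutating broadcast that may assume no aliasing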

That's probably not as elegant an answer as you were hoping for, but I think it's the only real way to get the most optimal dispatches for the two cases until there's better lifetime management and array freezing in the compiler.

mzaffalon commented 3 years ago

I understand only the first part of your answer, but I can try to write a PR that does the in-place mutation.

stevengj commented 3 years ago

The basic code that would have to be duplicated is the rule evaluation, e.g. here.

We could use a type parameter in the rule construction to say whether we want an in-place rule or not.

I don't think we need to look at the method table. Probably we could just have an interface hcubature! where the user passes a pre-allocated array to store the integrals, and in this interface we could assume they also passed an in-place function f!(result, x), and then pass a parameter to the low-level _hcubature routine to construct this version of the rule.
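
A hypothetical sketch of that interface (the hcubature! and _hcubature signatures here are illustrative only, not existing HCubature API):

# the user pre-allocates `result`; f! has the signature f!(result, x)
function hcubature!(f!, result::AbstractVector, a, b; kws...)
    # a flag (or type parameter) tells the internals to build the
    # in-place variant of the rule and accumulate into buffers
    return _hcubature(Val(true), f!, result, a, b; kws...)
end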

ChrisRackauckas commented 3 years ago

Yes, I agree that in this use case the method-table reading is overkill. (It is probably overkill in DiffEq too, but by now that interface is set in stone.)

mzaffalon commented 3 years ago

If I understand correctly, there should be a method (g::GenzMalik{n,T})(result::AbstractVector{T}, f, a::AbstractVector{T}, b::AbstractVector{T}, norm=norm) where {T} that duplicates the current code but mutates result in place.

stevengj commented 3 years ago

Yes, probably by adding a type parameter to GenzMalik indicating whether it is in place.
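
For concreteness, a rough editorial sketch of how such a type parameter could select between the two rule evaluations (a hypothetical Rule stands in for GenzMalik; the real struct and the Genz-Malik evaluation are more involved):

using StaticArrays: SVector

struct Rule{n,T,inplace}     # hypothetical stand-in for GenzMalik{n,T}
    x::Vector{SVector{n,T}}  # cubature points
    w::Vector{T}             # cubature weights
end

# out-of-place rule: each f(x) and the sum allocate (or stack-allocate)
(g::Rule{n,T,false})(f) where {n,T} =
    sum(g.w[i] * f(g.x[i]) for i in eachindex(g.w))

# in-place rule: accumulates into pre-allocated result/tmp buffers
function (g::Rule{n,T,true})(result, f!, tmp) where {n,T}
    fill!(result, zero(eltype(result)))
    for i in eachindex(g.w)
        f!(tmp, g.x[i])           # integrand writes into tmp
        result .+= g.w[i] .* tmp  # weighted accumulation, no allocation
    end
    return result
end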