SymbolicML / DynamicQuantities.jl

Efficient and type-stable physical quantities in Julia
https://symbolicml.org/DynamicQuantities.jl/dev/
Apache License 2.0

Unexpected sum Performance Delta vs Unitful #55

Closed: mikeingold closed this issue 8 months ago

mikeingold commented 9 months ago

I was running some tests to see if I could boost performance in a package of mine by converting from Unitful, but I ran into some unexpected benchmark results where Unitful seems to perform about an order of magnitude faster when summing vectors of Quantities. Here's a MWE with benchmark results from my system (versioninfo at bottom). I'm basically just instantiating a 1000-length vector of Cartesian length vectors and then summing it. Any ideas about what's going on here?

I first defined a generic struct for Cartesian vectors:

using Unitful
using DynamicQuantities
using BenchmarkTools
using StaticArrays

struct CoordinateCartesian{L}
    x::L
    y::L
    z::L
end

Base.:+(u::CoordinateCartesian, v::CoordinateCartesian) = CoordinateCartesian(u.x+v.x, u.y+v.y, u.z+v.z)

# Alias for each package's @u_str for a unit meter
m_uf = Unitful.@u_str("m")
m_dq = DynamicQuantities.@u_str("m")

Using the CoordinateCartesian struct with Unitful Quantities:

u = CoordinateCartesian(1.0m_uf, 2.0m_uf, 3.0m_uf)
u_arr = repeat([u], 1000)
    """
    1000-element Vector{CoordinateCartesian{Unitful.Quantity{Float64, 𝐋, Unitful.FreeUnits{(m,), 𝐋, nothing}}}}
    """

@benchmark sum($u_arr)
    """
    BenchmarkTools.Trial: 10000 samples with 259 evaluations.
     Range (min … max):  286.486 ns … 563.320 ns  β”Š GC (min … max): 0.00% … 0.00%
     Time  (median):     298.069 ns               β”Š GC (median):    0.00%
     Time  (mean Β± Οƒ):   298.142 ns Β±  12.318 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

      ▁▆▂ β–ƒβ–‚β–‚ β–‚β–ˆβ–„              ▁                                    ▁
      β–ˆβ–ˆβ–ˆβ–†β–ˆβ–ˆβ–ˆβ–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–†β–‡β–ˆβ–‡β–ˆβ–…β–‡β–…β–†β–ˆβ–ˆβ–ˆβ–„β–ƒβ–β–„β–„β–β–…β–…β–…β–β–β–ƒβ–ƒβ–β–ƒβ–β–β–„β–ƒβ–ƒβ–ƒβ–ƒβ–…β–…β–β–…β–β–‡β–ƒβ–‡β–†β–„β–†β–† β–ˆ
      286 ns        Histogram: log(frequency) by time        363 ns <

     Memory estimate: 0 bytes, allocs estimate: 0.
    """

Using the CoordinateCartesian struct with DynamicQuantities Quantities is roughly an order-of-magnitude slower:

v = CoordinateCartesian(1.0m_dq, 2.0m_dq, 3.0m_dq)
v_arr = repeat([v], 1000)
    """
    1000-element Vector{CoordinateCartesian{DynamicQuantities.Quantity{Float64, Dimensions{DynamicQuantities.FixedRational{Int32, 25200}}}}}
    """

@benchmark sum($v_arr)
    """
    BenchmarkTools.Trial: 10000 samples with 9 evaluations.
    Range (min … max):  2.333 ΞΌs …   7.833 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
    Time  (median):     2.344 ΞΌs               β”Š GC (median):    0.00%
    Time  (mean Β± Οƒ):   2.354 ΞΌs Β± 141.301 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

    β–‡β–ˆ                                                          ▁
    β–ˆβ–ˆβ–β–†β–β–ˆβ–‡β–β–…β–β–…β–β–…β–…β–β–…β–β–„β–β–„β–…β–β–…β–β–„β–β–β–„β–β–β–β–…β–…β–β–…β–β–„β–β–…β–ƒβ–β–…β–β–‡β–β–ˆβ–†β–β–…β–β–„β–β–†β–„β–β–ƒβ–β–…β–… β–ˆ
    2.33 ΞΌs      Histogram: log(frequency) by time      2.71 ΞΌs <

    Memory estimate: 0 bytes, allocs estimate: 0.
    """

Next I tried replacing the struct with the built-in QuantityArray type, which was much slower still, apparently due to allocations.

a = DynamicQuantities.QuantityArray([1.0m_dq, 2.0m_dq, 3.0m_dq])
a_arr = repeat([a], 1000)
    """
    1000-element Vector{QuantityArray{Float64, 1, Dimensions{DynamicQuantities.FixedRational{Int32, 25200}}, DynamicQuantities.Quantity{Float64, Dimensions{DynamicQuantities.FixedRational{Int32, 25200}}}, Vector{Float64}}}
    """

@benchmark sum($a_arr)
    """
    BenchmarkTools.Trial: 10000 samples with 1 evaluation.
     Range (min … max):  151.800 ΞΌs …  1.280 ms  β”Š GC (min … max): 0.00% … 85.76%
     Time  (median):     159.400 ΞΌs              β”Š GC (median):    0.00%
     Time  (mean Β± Οƒ):   162.371 ΞΌs Β± 43.999 ΞΌs  β”Š GC (mean Β± Οƒ):  1.06% Β±  3.42%

                 β–β–ƒβ–ƒβ–…β–‡β–ˆβ–†β–„β–„β–ƒβ–ƒβ–ƒβ–‚β–‚β–‚β–‚β–‚                 ▁  ▁            β–‚
      β–„β–β–„β–„β–†β–„β–†β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–‡β–ˆβ–‡β–‡β–†β–†β–…β–†β–†β–‡β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–‡β–†β–†β–†β–…β–†β–… β–ˆ
      152 ΞΌs        Histogram: log(frequency) by time       180 ΞΌs <

     Memory estimate: 78.05 KiB, allocs estimate: 999.
    """

How about a simple StaticVector of Quantities? This gets us back to the neighborhood of the struct with DynamicQuantities, but still slower than the Unitful version of the same.

b = SVector(1.0m_dq, 2.0m_dq, 3.0m_dq)
b_arr = repeat([b], 1000)
    """
    1000-element Vector{SVector{3, DynamicQuantities.Quantity{Float64, Dimensions{DynamicQuantities.FixedRational{Int32, 25200}}}}}
    """

@benchmark sum($b_arr)
    """
    BenchmarkTools.Trial: 10000 samples with 9 evaluations.
    Range (min … max):  2.333 ΞΌs …   8.067 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
    Time  (median):     2.344 ΞΌs               β”Š GC (median):    0.00%
    Time  (mean Β± Οƒ):   2.407 ΞΌs Β± 188.850 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

    β–ˆβ–ˆ                                        ▁▅▃ ▁   ▁         β–‚
    β–ˆβ–ˆβ–†β–†β–‡β–ƒβ–„β–„β–…β–„β–…β–β–…β–„β–…β–…β–…β–…β–…β–…β–…β–„β–ƒβ–β–†β–…β–…β–ˆβ–‡β–…β–…β–…β–…β–…β–…β–β–„β–‡β–†β–…β–…β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–β–…β–‡β–ˆβ–ˆβ–†β–„β–ˆβ–‡β–ˆβ–…β–„ β–ˆ
    2.33 ΞΌs      Histogram: log(frequency) by time      2.93 ΞΌs <

    Memory estimate: 0 bytes, allocs estimate: 0.
    """

Again, how does this SVector version compare to one with Unitful?

c = SVector(1.0m_uf, 2.0m_uf, 3.0m_uf)
c_arr = repeat([c], 1000)
    """
    1000-element Vector{SVector{3, Unitful.Quantity{Float64, 𝐋, Unitful.FreeUnits{(m,), 𝐋, nothing}}}}
    """

@benchmark sum($c_arr)
    """
    BenchmarkTools.Trial: 10000 samples with 259 evaluations.
     Range (min … max):  298.069 ns … 495.367 ns  β”Š GC (min … max): 0.00% … 0.00%
     Time  (median):     298.456 ns               β”Š GC (median):    0.00%
     Time  (mean Β± Οƒ):   301.102 ns Β±  12.656 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

      β–ˆβ–‚β–    ▁     ▁▂                                               ▁
      β–ˆβ–ˆβ–ˆβ–‡β–‡β–†β–ˆβ–ˆβ–‡β–ˆβ–‡β–‡β–ˆβ–ˆβ–ˆβ–†β–ƒβ–„β–„β–„β–†β–‡β–†β–†β–„β–„β–„β–„β–„β–„β–„β–„β–„β–ƒβ–„β–β–ƒβ–„β–„β–„β–„β–„β–„β–„β–„β–ƒβ–ƒβ–„β–ƒβ–β–…β–ƒβ–…β–†β–„β–„β–ƒβ–„β–„β–ƒβ–„ β–ˆ
      298 ns        Histogram: log(frequency) by time        384 ns <

     Memory estimate: 0 bytes, allocs estimate: 0.
    """

System info:

julia> versioninfo()
  Julia Version 1.9.3
  Commit bed2cd540a (2023-08-24 14:43 UTC)
  Build Info:
    Official https://julialang.org/ release
  Platform Info:
    OS: Windows (x86_64-w64-mingw32)
    CPU: 32 Γ— AMD Ryzen 9 3950X 16-Core Processor
    WORD_SIZE: 64
    LIBM: libopenlibm
    LLVM: libLLVM-14.0.6 (ORCJIT, znver2)
    Threads: 16 on 32 virtual cores
  Environment:
    JULIA_NUM_THREADS = 16

(@v1.9) pkg> st
  Status `C:\Users\mikei\.julia\environments\v1.9\Project.toml`
    [6e4b80f9] BenchmarkTools v1.3.2
    [06fc5a27] DynamicQuantities v0.7.1
    [90137ffa] StaticArrays v1.6.5
    [1986cc42] Unitful v1.17.0
gaurav-arya commented 9 months ago

Next I tried replacing the struct with the built-in QuantityArray type, which was much slower still, apparently due to allocations.

In my reading, you didn't try out the array-of-arrays approach with Unitful, right? That would be interesting to see, but right now it seems like the direct point of comparison between Unitful and DynamicQuantities here is just the factor of 10 between the first two cases.

Edit: you did indeed do this with StaticArrays, I missed that, but I don't think there's a benchmark for Unitful with regular arrays.
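
For reference, a sketch of that missing data point might look something like this (reusing the m_uf alias from the original post; the variable names here are just illustrative):

uf_inner = [1.0m_uf, 2.0m_uf, 3.0m_uf]   # ordinary Vector of Unitful Quantities
uf_arr = repeat([uf_inner], 1000)        # outer Vector, analogous to the array-of-QuantityArray case

@benchmark sum($uf_arr)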

MilesCranmer commented 9 months ago

@mikeingold rather than an array of QuantityArray, could you try a QuantityArray of arrays, or better yet a 2D QuantityArray?

One option is for us to write a custom sum that simply performs sum on the base array and wraps it with the units again. But I feel like that’s just hiding a potential issue here.

MilesCranmer commented 9 months ago

I think the issue is that you are calling the repeat outside of the benchmark. So the compiler doesn’t know it can inline the unit propagation and skip the dimension checks, because it literally doesn’t know that the quantities all have the same units within the sum. Unitful gets to skip that part here because that info is indicated in the type which was already computed prior to the sum (via a promotion).

I think a fair comparison would be to include the repeat within the benchmarked function, or use a 2D QuantityArray, or force the vector to be ::Any for element types in Unitful. Does that make sense?
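
For example, a sketch of the first option, reusing the CoordinateCartesian struct and the m_uf / m_dq aliases from the original post (the function name is just illustrative):

# Build the array inside the benchmarked function, so the setup cost and the
# unit propagation are both part of the measurement.
function sum_with_setup(m)
    x = CoordinateCartesian(1.0m, 2.0m, 3.0m)
    arr = repeat([x], 1000)
    sum(arr)
end

@benchmark sum_with_setup($m_uf)   # Unitful
@benchmark sum_with_setup($m_dq)   # DynamicQuantities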

mikeingold commented 9 months ago

The original intent here was just to test the performance of some basic operations, e.g. what if I have a collection of vectors that I need to sum? In that sense it didn’t feel right to time the repeat since it was β€œjust” being used to set up the test, but I do suppose it’s a fair counterpoint that maybe the more fundamental performance traits of DynamicQuantities live in that space, rather than in the basic operations themselves. I’ll try to take another look at these tests tonight to see if there’s a more direct apples-to-apples test to be run.

I still do think it's interesting, though, that the performance delta of sum was an order of magnitude. I'm not sure how much of that is accounted for by some of the heavy lifting being performed by the compiler itself, or whether it's just an artifact of how much of the vector can fit into cache.

mikeingold commented 9 months ago

I've done some more testing and am still getting similar results. Constructing vectors of structs inside the benchmark closed the gap somewhat, but there's still about a 3.5x difference between Unitful and DynamicQuantities. At this point, I suspect the reason is that the Unitful types in this problem are already sufficiently constrained that type inference and allocations aren't the big tent-pole to begin with. If that's the case, I definitely hadn't expected such a significant delta in what seems to be the performance floor between the two packages, but I suppose it makes sense (units being handled at compile time vs at run time).

using DynamicQuantities
using Unitful
using BenchmarkTools

# Cartesian Coordinate with Quantity types
struct CoordinateCartesian{L}
    x::L
    y::L
    z::L
end

Base.:+(u::CoordinateCartesian, v::CoordinateCartesian) = CoordinateCartesian(u.x+v.x, u.y+v.y, u.z+v.z)

# Test: Sum an N-length vector of CoordinateCartesian using default DynamicQuantities
function test_DynamicQuantities(N)
    arr = [CoordinateCartesian(DynamicQuantities.Quantity(rand(), length=1),
                               DynamicQuantities.Quantity(rand(), length=1),
                               DynamicQuantities.Quantity(rand(), length=1) ) for i in 1:N]
    sum(arr)
end

# Test: Sum an N-length vector of CoordinateCartesian using compact DynamicQuantities
function test_DynamicQuantities_R8(N)
    R8 = Dimensions{DynamicQuantities.FixedRational{Int8,6}}
    arr = [CoordinateCartesian(DynamicQuantities.Quantity(rand(), R8, length=1),
                               DynamicQuantities.Quantity(rand(), R8, length=1),
                               DynamicQuantities.Quantity(rand(), R8, length=1) ) for i in 1:N]
    sum(arr)
end

# Test: Sum an N-length vector of CoordinateCartesian using Unitful
function test_Unitful(N)
    arr = [CoordinateCartesian(Unitful.Quantity(rand(), Unitful.@u_str("m")),
                               Unitful.Quantity(rand(), Unitful.@u_str("m")),
                               Unitful.Quantity(rand(), Unitful.@u_str("m")) ) for i in 1:N]
    sum(arr)
end

bench_dq    = @benchmark test_DynamicQuantities($1000) evals=100
bench_dq_r8 = @benchmark test_DynamicQuantities_R8($1000) evals=100
bench_uf    = @benchmark test_Unitful($1000) evals=100

Results

julia> bench_dq
BenchmarkTools.Trial: 969 samples with 100 evaluations.
 Range (min … max):  24.643 ΞΌs … 81.427 ΞΌs  β”Š GC (min … max):  0.00% … 35.44%
 Time  (median):     48.954 ΞΌs              β”Š GC (median):     0.00%
 Time  (mean Β± Οƒ):   51.681 ΞΌs Β± 10.649 ΞΌs  β”Š GC (mean Β± Οƒ):  12.43% Β± 16.48%

                     β–„   β–‚ ▁▂  β–β–ˆ
  β–‚β–β–β–‚β–ƒβ–ƒβ–‚β–‚β–β–‚β–ƒβ–β–β–β–‚β–‚β–ƒβ–ƒβ–†β–ˆβ–‡β–„β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–ˆβ–ˆβ–ˆβ–„β–ƒβ–ƒβ–‚β–ƒβ–‚β–‚β–β–‚β–‚β–ƒβ–ƒβ–ƒβ–ƒβ–„β–ƒβ–„β–„β–…β–„β–ƒβ–ƒβ–…β–„β–ƒβ–ƒβ–„ β–ƒ
  24.6 ΞΌs         Histogram: frequency by time        76.2 ΞΌs <

 Memory estimate: 117.23 KiB, allocs estimate: 2.

julia> bench_dq_r8
BenchmarkTools.Trial: 146 samples with 100 evaluations.
 Range (min … max):  329.198 ΞΌs … 382.411 ΞΌs  β”Š GC (min … max): 0.00% … 4.22%
 Time  (median):     345.723 ΞΌs               β”Š GC (median):    4.60%
 Time  (mean Β± Οƒ):   345.099 ΞΌs Β±   8.225 ΞΌs  β”Š GC (mean Β± Οƒ):  2.77% Β± 2.48%

             β–‚ β–…β–…β–†           β–ƒβ–ˆβ–‚β–…β–…β–ƒβ–‚ β–…     β–ƒ
  β–…β–…β–β–„β–‡β–β–β–ˆβ–„β–β–β–ˆβ–…β–ˆβ–ˆβ–ˆβ–‡β–…β–…β–„β–„β–β–β–‡β–β–…β–„β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–β–ˆβ–‡β–…β–β–ˆβ–β–ˆβ–„β–ˆβ–β–‡β–‡β–„β–…β–„β–„β–„β–β–„β–β–β–„β–„β–β–β–„ β–„
  329 ΞΌs           Histogram: frequency by time          364 ΞΌs <

 Memory estimate: 250.16 KiB, allocs estimate: 7005.

julia> bench_uf
BenchmarkTools.Trial: 3537 samples with 100 evaluations.
 Range (min … max):   7.580 ΞΌs … 53.050 ΞΌs  β”Š GC (min … max):  0.00% … 75.62%
 Time  (median):     13.517 ΞΌs              β”Š GC (median):     0.00%
 Time  (mean Β± Οƒ):   14.124 ΞΌs Β±  6.536 ΞΌs  β”Š GC (mean Β± Οƒ):  10.01% Β± 15.33%

  β–‡β–‚  ▁ β–ƒβ–‡β–‡β–ˆβ–ˆβ–…β–‚β–‚β–β–‚β–‚                                           β–‚
  β–ˆβ–ˆβ–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–β–β–β–β–β–β–β–β–β–β–ƒβ–β–…β–†β–…β–†β–†β–‡β–‡β–†β–„β–…β–†β–†β–„β–ƒβ–„β–β–„β–„β–ƒβ–„β–‡β–†β–‡β–†β–ƒβ–†β–‡β–ˆβ–‡ β–ˆ
  7.58 ΞΌs      Histogram: log(frequency) by time      45.8 ΞΌs <

 Memory estimate: 23.48 KiB, allocs estimate: 2.

Summary: The Unitful version of this test ran about 3.5x faster than the vanilla DynamicQuantities version. Running the DynamicQuantities version with a more compact type was apparently an order-of-magnitude slower.

MilesCranmer commented 9 months ago

Writing it like DynamicQuantities.Quantity(rand(), length=1) is a bit slower than doing the @u_str stuff like you did for Unitful:

import DynamicQuantities as DQ  # DQ alias assumed; not shown in the original snippet

function test_dq_naive(N)
    arr = [
        CoordinateCartesian(
            rand() * DQ.@u_str("m"),
            rand() * DQ.@u_str("m"),
            rand() * DQ.@u_str("m")
        )
        for _ in 1:N
    ]
    sum(arr)
end

bench_dq_naive = @benchmark test_dq_naive(1000) evals=100

For me this gives a min time of 17.264 us compared to Unitful's 7.581 us, so a 2.3x difference.

(This is just because the @u_str macro computes the Dimensions object at compile time.)

This is obviously still not great. However this is really tricky because

struct CoordinateCartesian{L}
    x::L
    y::L
    z::L
end

is optimal for Unitful.jl (the L stores the unit information), but not really for DynamicQuantities.jl. So again, with Unitful, the compiler knows that all the units are the same => sum is fast. But with DynamicQuantities.jl, the compiler has no idea what the units are => sum is slow.

To make things faster for DynamicQuantities, you need to wrap stuff in QuantityArray. That way you can tell Julia: "these quantities all share the same units", and it can avoid the dimension checks it has to do (right now it has to compare the units of every single element in the sum, whereas Unitful can skip them).
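
For instance, a minimal sketch for a flat vector of lengths (variable names are illustrative):

vals = rand(1000)
# One Dimensions object is stored for the whole array, so sum doesn't have to
# compare units element by element.
qa = DynamicQuantities.QuantityArray(vals, DynamicQuantities.@u_str("m"))

@benchmark sum($qa)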

mikeingold commented 9 months ago

Writing it like DynamicQuantities.Quantity(rand(), length=1) is a bit slower than doing the @u_str stuff like you did for Unitful ... (This is just because the macro => computes the Dimensions object at compile time)

Interesting. I'd actually guessed the opposite, that calling the constructor function directly would've been less complicated than a macro, but that makes sense.

To make things faster for DynamicQuantities, you need to wrap stuff in QuantityArray. That way you can tell Julia: "these quantities all share the same units", and it can avoid the dimension checks it has to do (right now it has to compare the units of every single element in the sum, whereas Unitful can skip them).

Is the following a fair implementation of this advice, adapting from the prior pattern? These results are roughly in the neighborhood of what I'm seeing for Vector{Quantity} and similar implementations, and definitely still much slower than structs or StaticVectors.

# Test: Sum an N-length vector of QuantityArray's
function test_QuantityArray(N)
    arr = [DynamicQuantities.QuantityArray([rand(),rand(),rand()], DynamicQuantities.@u_str("m")) for _ in 1:N]
    sum(arr)
end

"""
BenchmarkTools.Trial: 219 samples with 100 evaluations.
 Range (min … max):  211.921 ΞΌs … 279.226 ΞΌs  β”Š GC (min … max): 0.00% … 5.85%
 Time  (median):     231.468 ΞΌs               β”Š GC (median):    6.02%
 Time  (mean Β± Οƒ):   228.822 ΞΌs Β±   9.042 ΞΌs  β”Š GC (mean Β± Οƒ):  3.93% Β± 3.10%

              ▂▁ β–‚                         β–‚      β–ˆ β–‡ β–ƒ β–„
  β–…β–„β–β–β–„β–β–…β–β–„β–ƒβ–„β–„β–ˆβ–ˆβ–‡β–ˆβ–‡β–†β–†β–‡β–ƒβ–†β–…β–†β–β–β–ƒβ–β–ƒβ–ƒβ–β–β–β–β–β–β–†β–†β–…β–ƒβ–†β–ˆβ–β–‡β–β–ˆβ–„β–ˆβ–ˆβ–‡β–ˆβ–‡β–ˆβ–…β–ˆβ–„β–†β–†β–ƒβ–ƒβ–„ β–„
  212 ΞΌs           Histogram: frequency by time          240 ΞΌs <

 Memory estimate: 273.41 KiB, allocs estimate: 3001.
"""

The more I'm looking at this, the more it seems like I got lucky and intuited my existing Unitful code into a relatively optimal state where the types are constrained/defined enough to avoid a lot of the big performance pitfalls.

The only real remaining "Issue" in my mind is just how surprising it was that there seems to be a delta in what you might call the performance floor of each package, i.e. that it is possible for Unitful to be faster in very specific situations. I'm planning to do some more testing with more complicated expressions to see if the type constraints continue to hold the line, or at what point they break down. Are you on the Julia Discourse @MilesCranmer? Maybe it would be better to migrate this topic there vs it being tracked as a concrete "Issue" here?

MilesCranmer commented 9 months ago

Is the following a fair implementation of this advice, adapting from the prior pattern?

Almost but not quite. Since you are summing coordinates along the sample axis, the QuantityArray needs to also wrap the sample axis (in order for Julia to remove the dimensional analysis). So e.g., all the x coordinates should be stored in a QuantityArray, and all the y in another QuantityArray. Right now you are storing each x, y, z in a single QuantityArray which I don’t think will help since you never sum x + y + z.
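
A rough sketch of that layout (variable names are just illustrative):

N = 1000
xs = DynamicQuantities.QuantityArray(rand(N), DynamicQuantities.@u_str("m"))
ys = DynamicQuantities.QuantityArray(rand(N), DynamicQuantities.@u_str("m"))
zs = DynamicQuantities.QuantityArray(rand(N), DynamicQuantities.@u_str("m"))

# Summing along the sample axis only ever touches one shared Dimensions per array.
sum(xs), sum(ys), sum(zs)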

Unitful gets around this via a sophisticated set of promotion rules. When you create an array of types parametrized by L, the units, it sees at compile time that all the elements have that type, so the entire array is Array{...{L}} and the sum doesn't need to do dimensional analysis.
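
To illustrate, reusing the m_uf alias from the original post:

arr = [1.0m_uf, 2.0m_uf, 3.0m_uf]
eltype(arr)  # Quantity{Float64, 𝐋, FreeUnits{(m,), 𝐋, nothing}}: the unit lives in the type
sum(arr)     # 6.0 m, with no per-element dimension checks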

So to get closer in performance you need to also tell Julia that all the units are the same, which you can do with a QuantityArray.

Julia discourse

Sounds good to me, it should be a nice way to get additional performance tips.

MilesCranmer commented 8 months ago

@mikeingold I wrote a little example in #49 for how I think you should actually do this (once that PR merges):

First, make the coords:

struct Coords
    x::Float64
    y::Float64
end

# Define arithmetic operations on Coords
Base.:+(a::Coords, b::Coords) = Coords(a.x + b.x, a.y + b.y)
Base.:-(a::Coords, b::Coords) = Coords(a.x - b.x, a.y - b.y)
Base.:*(a::Coords, b::Number) = Coords(a.x * b, a.y * b)
Base.:*(a::Number, b::Coords) = Coords(a * b.x, a * b.y)
Base.:/(a::Coords, b::Number) = Coords(a.x / b, a.y / b)

We can then build a GenericQuantity out of this:

coord1 = GenericQuantity(Coords(0.3, 0.9), length=1)
coord2 = GenericQuantity(Coords(0.2, -0.1), length=1)

and perform operations on these:

coord1 + coord2 |> uconvert(us"cm")
# (Coords(50.0, 80.0)) cm

The nice part about this is it only stores a single Dimensions (or SymbolicDimensions) for the entire struct!

Then, we can build an array like so:

function test_QuantityArray(N)
    coord_array = QuantityArray([GenericQuantity(Coords(rand(), rand()), length=1) for i=1:N])
    sum(coord_array)
end

This QuantityArray already indicates that all the dimensions are the same, and thus the summation should be faster! The only remaining reason why it wouldn't be as fast is due to the compiler not inlining enough.

MilesCranmer commented 8 months ago

I'm getting near-identical performance to a regular array now!!

julia> @btime sum(coord_array) setup=(N=1000; coord_array=QuantityArray([GenericQuantity(Coords(rand(), rand()), length=1) for i=1:N]))
  1.113 ΞΌs (0 allocations: 0 bytes)
(Coords(501.7717111461543, 494.36328730797095)) m

julia> @btime sum(array) setup=(N=1000; array=[Coords(rand(), rand()) for i=1:N])
  1.087 ΞΌs (0 allocations: 0 bytes)
Coords(505.4496129866645, 507.2903371535713)
mikeingold commented 8 months ago

Sorry for the delay, haven't had as much time lately to work on this. I'm excited to try out the new update when it's available on General! At this point I'd propose using PR #49 as justification to close this Issue.

MilesCranmer commented 8 months ago

Cool! Closing with v0.8.0.