henry2004y opened this issue 3 months ago
Thanks for the report @henry2004y. I'll have a bit more to say about this, but is the line triangle_vals = itp(_x, _y; method=Triangle()) intentional in your timings? You seem to be calling itp(...) twice.

Some thoughts:

- With derivatives=true you are also timing derivative generation, which Matplotlib wouldn't be doing (I've never used Matplotlib, so correct me if I'm wrong).
- @time includes compilation in its outputs, so BenchmarkTools gives more reliable numbers.

With this in mind, here's a script showing what I've looked into for this and how I've rewritten your benchmarks.
using NaturalNeighbours
using PyPlot
using BenchmarkTools
# MATLAB/NumPy-style meshgrid: X[i, j] = x[j], Y[i, j] = y[i]
function meshgrid(x, y)
X = [x for _ in y, x in x]
Y = [y for y in y, _ in x]
X, Y
end
f = (x, y) -> sin(x * y) - cos(x - y) * exp(-(x - y)^2)
const x = vec([(i - 1) / 9 for i in (1, 3, 4, 5, 8, 9, 10), j in (1, 2, 3, 5, 6, 7, 9, 10)])
const y = vec([(j - 1) / 9 for i in (1, 3, 4, 5, 8, 9, 10), j in (1, 2, 3, 5, 6, 7, 9, 10)])
const z = f.(x, y)
const xg = LinRange(0, 1, 100)
const yg = LinRange(0, 1, 100)
function nn_benchmark()
itp = interpolate(x, y, z; derivatives=false)
X, Y = meshgrid(xg, yg)
_x, _y = vec(X), vec(Y)
return itp(_x, _y; method=Triangle(), parallel=false)
end
function mpl_benchmark()
triang = matplotlib.tri.Triangulation(x, y)
interpolator = matplotlib.tri.LinearTriInterpolator(triang, z)
X, Y = meshgrid(xg, yg)
return interpolator(X, Y)
end
@benchmark nn_benchmark()
#=
julia> @benchmark nn_benchmark()
BenchmarkTools.Trial: 297 samples with 1 evaluation.
Range (min … max): 15.082 ms … 95.031 ms ┊ GC (min … max): 0.00% … 46.52%
Time (median): 15.459 ms ┊ GC (median): 0.00%
Time (mean ± σ): 16.848 ms ± 7.385 ms ┊ GC (mean ± σ): 4.69% ± 7.69%
█▄
██▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▅▄▄▄▅ ▅
15.1 ms Histogram: log(frequency) by time 48.8 ms <
Memory estimate: 1.85 MiB, allocs estimate: 48720.
=#
@benchmark mpl_benchmark()
#=
julia> @benchmark mpl_benchmark()
BenchmarkTools.Trial: 2832 samples with 1 evaluation.
Range (min … max): 1.030 ms … 14.261 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.822 ms ┊ GC (median): 0.00%
Time (mean ± σ): 1.759 ms ± 417.625 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▄ ▄▅▆█▆▅▇▄▅▄▂▁▂▁▁▁
▂▄▅▆███▇▅▆▄▃▃▃▄▃▂▄▃▄▅▅▄▅▄▅▆▆█████████████████▆▆▅▅▄▄▃▃▂▂▂▂▂▂ ▄
1.03 ms Histogram: frequency by time 2.46 ms <
Memory estimate: 237.89 KiB, allocs estimate: 74.
=#
# Your observed 16x difference is real. What if we use threads?
function nn_benchmark_threads() # I'm using 10 threads
itp = interpolate(x, y, z; derivatives=false)
X, Y = meshgrid(xg, yg)
_x, _y = vec(X), vec(Y)
return itp(_x, _y; method=Triangle(), parallel=true)
end
@benchmark nn_benchmark_threads()
#=
julia> @benchmark nn_benchmark_threads()
BenchmarkTools.Trial: 1322 samples with 1 evaluation.
Range (min … max): 2.352 ms … 94.722 ms ┊ GC (min … max): 0.00% … 80.89%
Time (median): 2.645 ms ┊ GC (median): 0.00%
Time (mean ± σ): 3.770 ms ± 7.382 ms ┊ GC (mean ± σ): 19.06% ± 9.26%
█▁
██▅▁▄▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▆▇ ▇
2.35 ms Histogram: log(frequency) by time 43.7 ms <
Memory estimate: 1.84 MiB, allocs estimate: 47709.
=#
So, without multithreading, your observed differences in timings are real. With threading, the results are a bit closer, but still not where I'd like them. I'm quite impressed with whatever Matplotlib must be doing to get this so fast. The number of allocations I'm getting with NaturalNeighbours.jl suggests there is a type instability somewhere in the code. I'll do some digging.
I did find one issue (#35), and another that can't be fixed without a new major release, but they don't make much difference. I'm honestly not sure where the main difference is coming from, and I can't tell exactly what Matplotlib is doing. This appears to be the relevant method: https://github.com/matplotlib/matplotlib/blob/be56634d682bed257cb941369d8d3600635ddadf/lib/matplotlib/tri/_triinterpolate.py#L232
I probably can't look at that code for ideas, depending on Matplotlib's license; I'd have to double-check. I was thinking, though: I've had to do something similar previously in FiniteVolumeMethod.jl, which essentially uses this interpolation, and there was a bit of a difference between (1) storing the barycentric coordinates (the coordinates (a, b, c) defining the interpolant fT(x, y) = a*x + b*y + c in a triangle T) for each triangle to re-use, or (2) computing them on demand each time. I ended up deciding to store them (https://github.com/SciML/FiniteVolumeMethod.jl/blob/9dc2612219b95eb49b4a81d8cace441f10b6ded4/src/geometry.jl#L21).
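To illustrate the quantity being cached: the coefficients of the linear interpolant over a single triangle can be recovered by solving a small linear system from the three interpolation conditions. This is just a sketch of the idea; the helper name is made up, not FiniteVolumeMethod.jl's or NaturalNeighbours.jl's API.

```julia
# Sketch: coefficients (a, b, c) of the linear interpolant
# f_T(x, y) = a*x + b*y + c over a triangle with vertices p1, p2, p3
# carrying data z1, z2, z3. Hypothetical helper, not the package's API.
function plane_coefficients(p1, p2, p3, z1, z2, z3)
    A = [p1[1] p1[2] 1.0
         p2[1] p2[2] 1.0
         p3[1] p3[2] 1.0]
    a, b, c = A \ [z1, z2, z3]  # interpolation conditions f_T(p_i) = z_i
    return a, b, c
end

a, b, c = plane_coefficients((0.0, 0.0), (1.0, 0.0), (0.0, 1.0), 1.0, 2.0, 3.0)
```

Storing (a, b, c) per triangle trades a little memory for skipping this solve on every query that lands in the same triangle.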
Maybe matplotlib stores it as well? I wonder if I should add some sort of cache into NaturalNeighboursInterpolant
for storing results relating to a certain interpolant, e.g. something like
struct InterpolantCache{T1<:TriangleCache,T2<:SibsonCache}
triangle::T1
sibson::T2 # not sure if sibson needs a cache actually
... other methods ...
end
struct TriangleCache{T<:Triangulation}
coefficients::Dict{NTuple{3,Int},NTuple{3,Float64}}
end
...
that the interpolant could then leverage if the user asks to store this (they might not always want to, e.g. for larger problems where the data would be a bit much). The caches would be empty until a user calls itp with the specific method.
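A minimal version of that lazy-filling cache could use get!, computing a triangle's coefficients only the first time it is hit. The names here are illustrative, not the actual package internals:

```julia
# Sketch: memoize per-triangle coefficients in a Dict keyed by the triangle's
# vertex indices, filling entries lazily on first use. Illustrative only.
cache = Dict{NTuple{3,Int},NTuple{3,Float64}}()

# stand-in for the real coefficient computation
compute_coefficients(tri) = (1.0, 2.0, 3.0)

# get! computes and stores the value only if the key is missing
coeffs(cache, tri) = get!(() -> compute_coefficients(tri), cache, tri)

c1 = coeffs(cache, (1, 2, 3))  # computed and stored
c2 = coeffs(cache, (1, 2, 3))  # served from the cache
```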
Well, I got around to looking at caching - see https://github.com/DanielVandH/NaturalNeighbours.jl/tree/triangle. It doesn't really make things much better unfortunately.
julia> @benchmark nn_benchmark()
BenchmarkTools.Trial: 443 samples with 1 evaluation.
Range (min … max): 9.076 ms … 426.486 ms ┊ GC (min … max): 0.00% … 59.54%
Time (median): 9.986 ms ┊ GC (median): 0.00%
Time (mean ± σ): 11.203 ms ± 20.125 ms ┊ GC (mean ± σ): 6.15% ± 4.00%
▃▇▅▂▃▄█▂
▃▃▃▃▇████████▇█▅▃▃▂▃▁▃▂▁▃▁▁▁▁▁▁▃▂▂▃▁▁▁▂▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂ ▃
9.08 ms Histogram: frequency by time 14.8 ms <
Memory estimate: 1.84 MiB, allocs estimate: 47159.
julia> @benchmark mpl_benchmark()
BenchmarkTools.Trial: 2906 samples with 1 evaluation.
Range (min … max): 1.079 ms … 263.383 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.594 ms ┊ GC (median): 0.00%
Time (mean ± σ): 1.714 ms ± 4.880 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▄▄██▇█▆▅▃▃▆▂▅▄▄▃▅▅▆▆█▄▅▅▄▂▁
▁▁▂███████████████████████████████▆▆▄▅▃▄▄▃▃▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁ ▅
1.08 ms Histogram: frequency by time 2.6 ms <
Memory estimate: 237.89 KiB, allocs estimate: 74.
julia> function nn_benchmark_threads() # I'm using 10 threads
itp = interpolate(x, y, z; derivatives=false)
X, Y = meshgrid(xg, yg)
_x, _y = vec(X), vec(Y)
return itp(_x, _y; method=Triangle(; allow_cache = true), parallel=true)
end
nn_benchmark_threads (generic function with 1 method)
julia> @benchmark nn_benchmark_threads()
BenchmarkTools.Trial: 1296 samples with 1 evaluation.
Range (min … max): 2.343 ms … 186.348 ms ┊ GC (min … max): 0.00% … 59.95%
Time (median): 2.646 ms ┊ GC (median): 0.00%
Time (mean ± σ): 3.845 ms ± 11.402 ms ┊ GC (mean ± σ): 18.90% ± 6.32%
▁█▇▂
▃████▆▄▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▂ ▃
2.34 ms Histogram: frequency by time 7.45 ms <
Memory estimate: 1.87 MiB, allocs estimate: 47579.
I'm not going to commit the cached Triangle
work in the linked branch since the difference isn't too significant and it's a bit clunky.
The only other difference could come from DelaunayTriangulation.jl's jump_and_march (i.e. how point location is done to find the triangle containing the query point). Unfortunately, jump_and_march is a bit of a headache to work with, as the function is extremely complicated, so I'm not sure what more to do there.
One idea to improve the jump_and_march calls inside itp is to come up with a better way to sort the points in space so that nearby points are queried in sequence, meaning jump_and_march starts from a point very close to the triangle containing the query. This is a bit complicated to do, especially with multithreading (unless we use e.g. a spatial indexing tree), so I'm not sure I want to commit to looking into that (and I probably don't have the time).
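For reference, the kind of spatial sorting I mean could be as simple as ordering the query points along a Z-order (Morton) curve before evaluation, so that consecutive queries are spatially close and the point-location walk gets a nearby starting triangle. A rough sketch of this hypothetical pre-processing step (not something NaturalNeighbours.jl does):

```julia
# Sketch: Z-order (Morton) key for a point in [0, 1]^2, built by quantizing
# each coordinate to nbits bits and interleaving the bits.
function morton_key(x, y; nbits=16)
    m = (1 << nbits) - 1
    xi = round(UInt32, clamp(x, 0.0, 1.0) * m)
    yi = round(UInt32, clamp(y, 0.0, 1.0) * m)
    key = UInt64(0)
    for b in 0:nbits-1  # interleave the bits of xi and yi
        key |= (UInt64((xi >> b) & 1) << (2b)) | (UInt64((yi >> b) & 1) << (2b + 1))
    end
    return key
end

xs, ys = rand(1000), rand(1000)
order = sortperm(morton_key.(xs, ys))
# evaluate itp at xs[order], ys[order], then un-permute the results
```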
All that said, if you are happy with the performance with parallel=true compared to Matplotlib, as demonstrated above, feel free to close this. Otherwise, we can leave this open in case improvements can be made someday. I suspect it's an issue of DelaunayTriangulation.jl using exact predicates while Matplotlib probably doesn't.
For evaluating at many points, NaturalNeighbours.jl does seem to win with multithreading at least. You can run the above tests with
const xg = LinRange(0, 1, 250)
const yg = LinRange(0, 1, 250)
and the multithreading benchmark is 2x faster than Matplotlib, although 10x worse without.
Edit: Actually, it's 2x slower without caching. Okay, I'm going to push that caching. You should use allow_cache=true in your Triangle() calls once v1.4.0 is released.
Thanks for your effort! I would suggest leaving this open as a reminder in case someone intends to refactor the code and speed it up~
The caching work should be registered soon (~15 minutes) as v1.3.4. By default, it will now cache the necessary coordinates for Triangle()
(so you don't actually need allow_cache=true
as I suggested above) and hopefully give you some better performance! Thanks for the report @henry2004y.
@henry2004y With DelaunayTriangulation 1.1.0 you can more easily change how predicates are computed by using the predicates
keyword, which might also help your performance. e.g.
julia> function nn_benchmark(kernel)
itp = interpolate(x, y, z; derivatives=false)
X, Y = meshgrid(xg, yg)
_x, _y = vec(X), vec(Y)
return itp(_x, _y; method=Triangle(), parallel=true, predicates = kernel)
end
nn_benchmark (generic function with 2 methods)
julia> @benchmark nn_benchmark($FastKernel())
BenchmarkTools.Trial: 270 samples with 1 evaluation.
Range (min … max): 12.527 ms … 27.177 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 18.385 ms ┊ GC (median): 0.00%
Time (mean ± σ): 18.495 ms ± 1.480 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▃▄▂▆▇█▄▇▂
▃▁▁▁▃▁▁▁▁▁▁▃▃▁▁▃▃▁▁▃▁▃▁▄▆▆▇██████████▇▇▅▆▄▄▅▃▃▄▁▃▃▁▃▁▃▁▁▁▃▃ ▃
12.5 ms Histogram: frequency by time 23.2 ms <
Memory estimate: 485.34 KiB, allocs estimate: 889.
julia> @benchmark nn_benchmark($AdaptiveKernel()) # the new default
BenchmarkTools.Trial: 170 samples with 1 evaluation.
Range (min … max): 24.946 ms … 40.503 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 29.362 ms ┊ GC (median): 0.00%
Time (mean ± σ): 29.488 ms ± 2.992 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▂ ▂ ▃▅█ ▃▂▃▅ ▂ ▅ ▆█▅▅▃▂ ▃
█▄█████▁▄███████▇▄████████▅▄▅▄█▅▄▅▅▇▅▅▄▄▁▄▇▁▁▄▁▄▁▄▁▁▁▁▁▁▁▁▄ ▄
24.9 ms Histogram: frequency by time 38.9 ms <
Memory estimate: 484.62 KiB, allocs estimate: 889.
julia> @benchmark nn_benchmark($ExactKernel()) # the old default
BenchmarkTools.Trial: 163 samples with 1 evaluation.
Range (min … max): 24.614 ms … 300.534 ms ┊ GC (min … max): 0.00% … 47.87%
Time (median): 28.723 ms ┊ GC (median): 0.00%
Time (mean ± σ): 31.088 ms ± 21.944 ms ┊ GC (mean ± σ): 3.57% ± 4.94%
▃ ▇▁█▇▃▃▃▇▇▆▁ ▁▂
▃▃████████████████▇▄▄▃▇▆▃▆▃▄▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▃
24.6 ms Histogram: frequency by time 46.6 ms <
Memory estimate: 1.38 MiB, allocs estimate: 35231.
Thank you! I'll give it a try~
Hi! Thanks for the great package!
I'd like to report a performance comparison I found against Matplotlib:
This is running with 1 thread, but NaturalNeighbours.jl is ~16 times slower. I don't know if Matplotlib automatically uses multithreading, but the performance difference is noticeable. I don't trust the allocation statistics reported above for Matplotlib, but the measured times should be fine.