henry2004y opened this issue 3 months ago
Thanks for the report @henry2004y. I'll have a bit more to say about this, but is the line triangle_vals = itp(_x, _y; method=Triangle()) intentional in your timings? You seem to be calling itp(...) twice.

Some thoughts:

- With derivatives=true you are also timing derivative generation, which Matplotlib wouldn't be doing (I've never used Matplotlib, so correct me if I'm wrong).
- @time includes compilation in its outputs, so BenchmarkTools gives more reliable numbers.

With this in mind, here's a script showing what I've looked into for this and how I've rewritten your benchmarks.
using NaturalNeighbours
using PyPlot
using BenchmarkTools
# MATLAB/NumPy-style meshgrid: X[i, j] = x[j], Y[i, j] = y[i]
function meshgrid(x, y)
X = [x for _ in y, x in x]
Y = [y for y in y, _ in x]
X, Y
end
f = (x, y) -> sin(x * y) - cos(x - y) * exp(-(x - y)^2)
const x = vec([(i - 1) / 9 for i in (1, 3, 4, 5, 8, 9, 10), j in (1, 2, 3, 5, 6, 7, 9, 10)])
const y = vec([(j - 1) / 9 for i in (1, 3, 4, 5, 8, 9, 10), j in (1, 2, 3, 5, 6, 7, 9, 10)])
const z = f.(x, y)
const xg = LinRange(0, 1, 100)
const yg = LinRange(0, 1, 100)
function nn_benchmark()
itp = interpolate(x, y, z; derivatives=false)
X, Y = meshgrid(xg, yg)
_x, _y = vec(X), vec(Y)
return itp(_x, _y; method=Triangle(), parallel=false)
end
function mpl_benchmark()
triang = matplotlib.tri.Triangulation(x, y)
interpolator = matplotlib.tri.LinearTriInterpolator(triang, z)
X, Y = meshgrid(xg, yg)
return interpolator(X, Y)
end
@benchmark nn_benchmark()
#=
julia> @benchmark nn_benchmark()
BenchmarkTools.Trial: 297 samples with 1 evaluation.
Range (min … max): 15.082 ms … 95.031 ms ┊ GC (min … max): 0.00% … 46.52%
Time (median): 15.459 ms ┊ GC (median): 0.00%
Time (mean ± σ): 16.848 ms ± 7.385 ms ┊ GC (mean ± σ): 4.69% ± 7.69%
█▄
██▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▅▄▄▄▅ ▅
15.1 ms Histogram: log(frequency) by time 48.8 ms <
Memory estimate: 1.85 MiB, allocs estimate: 48720.
=#
@benchmark mpl_benchmark()
#=
julia> @benchmark mpl_benchmark()
BenchmarkTools.Trial: 2832 samples with 1 evaluation.
Range (min … max): 1.030 ms … 14.261 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.822 ms ┊ GC (median): 0.00%
Time (mean ± σ): 1.759 ms ± 417.625 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▄ ▄▅▆█▆▅▇▄▅▄▂▁▂▁▁▁
▂▄▅▆███▇▅▆▄▃▃▃▄▃▂▄▃▄▅▅▄▅▄▅▆▆█████████████████▆▆▅▅▄▄▃▃▂▂▂▂▂▂ ▄
1.03 ms Histogram: frequency by time 2.46 ms <
Memory estimate: 237.89 KiB, allocs estimate: 74.
=#
# Your observed 16x difference is real. What if we use threads?
function nn_benchmark_threads() # I'm using 10 threads
itp = interpolate(x, y, z; derivatives=false)
X, Y = meshgrid(xg, yg)
_x, _y = vec(X), vec(Y)
return itp(_x, _y; method=Triangle(), parallel=true)
end
@benchmark nn_benchmark_threads()
#=
julia> @benchmark nn_benchmark_threads()
BenchmarkTools.Trial: 1322 samples with 1 evaluation.
Range (min … max): 2.352 ms … 94.722 ms ┊ GC (min … max): 0.00% … 80.89%
Time (median): 2.645 ms ┊ GC (median): 0.00%
Time (mean ± σ): 3.770 ms ± 7.382 ms ┊ GC (mean ± σ): 19.06% ± 9.26%
█▁
██▅▁▄▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▆▇ ▇
2.35 ms Histogram: log(frequency) by time 43.7 ms <
Memory estimate: 1.84 MiB, allocs estimate: 47709.
=#
So, without multithreading, your observed differences in timings are real. With threading, the results are a bit closer, but still not where I'd like them. I'm quite impressed with whatever Matplotlib must be doing to get this so fast. The number of allocations I'm getting with NaturalNeighbours.jl suggests there is a type instability somewhere in the code. I'll do some digging.
I did find one issue (#35), and another that can't be fixed without a new major release, but they don't make much difference. I'm honestly not sure where the main difference is coming from, and I can't tell exactly what Matplotlib is doing. This appears to be the relevant method: https://github.com/matplotlib/matplotlib/blob/be56634d682bed257cb941369d8d3600635ddadf/lib/matplotlib/tri/_triinterpolate.py#L232
I probably can't look at that code for ideas, depending on Matplotlib's license; I'd have to double-check. I was thinking, though: I've had to do something similar previously in FiniteVolumeMethod.jl, which essentially uses this interpolation, and there was a bit of a difference between (1) storing the barycentric coordinates (the coordinates (a, b, c) defining the interpolant fT(x, y) = a*x + b*y + c in a triangle T) for each triangle to re-use, or (2) computing them on demand each time. I ended up deciding to store them (https://github.com/SciML/FiniteVolumeMethod.jl/blob/9dc2612219b95eb49b4a81d8cace441f10b6ded4/src/geometry.jl#L21).
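To illustrate the quantity being cached: the coefficients of the linear interpolant over a single triangle can be recovered by solving a small linear system from the three interpolation conditions. This is just a sketch of the idea; the helper name is made up, not FiniteVolumeMethod.jl's or NaturalNeighbours.jl's API.

```julia
# Sketch: coefficients (a, b, c) of the linear interpolant
# f_T(x, y) = a*x + b*y + c over a triangle with vertices p1, p2, p3
# carrying data z1, z2, z3. Hypothetical helper, not the package's API.
function plane_coefficients(p1, p2, p3, z1, z2, z3)
    A = [p1[1] p1[2] 1.0
         p2[1] p2[2] 1.0
         p3[1] p3[2] 1.0]
    a, b, c = A \ [z1, z2, z3]  # interpolation conditions f_T(p_i) = z_i
    return a, b, c
end

a, b, c = plane_coefficients((0.0, 0.0), (1.0, 0.0), (0.0, 1.0), 1.0, 2.0, 3.0)
```

Storing (a, b, c) per triangle trades a little memory for skipping this solve on every query that lands in the same triangle.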
Maybe matplotlib stores it as well? I wonder if I should add some sort of cache into NaturalNeighboursInterpolant
for storing results relating to a certain interpolant, e.g. something like
struct InterpolantCache{T1<:TriangleCache,T2<:SibsonCache}
triangle::T1
sibson::T2 # not sure if sibson needs a cache actually
... other methods ...
end
struct TriangleCache{T<:Triangulation}
coefficients::Dict{NTuple{3,Int},NTuple{3,Float64}}
end
...
that the interpolant could then leverage if the user asks to store this (they might not always want to, e.g. for larger problems where the data would be a bit much). The caches would be empty until a user calls itp with the specific method.
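A minimal version of that lazy-filling cache could use get!, computing a triangle's coefficients only the first time it is hit. The names here are illustrative, not the actual package internals:

```julia
# Sketch: memoize per-triangle coefficients in a Dict keyed by the triangle's
# vertex indices, filling entries lazily on first use. Illustrative only.
cache = Dict{NTuple{3,Int},NTuple{3,Float64}}()

# stand-in for the real coefficient computation
compute_coefficients(tri) = (1.0, 2.0, 3.0)

# get! computes and stores the value only if the key is missing
coeffs(cache, tri) = get!(() -> compute_coefficients(tri), cache, tri)

c1 = coeffs(cache, (1, 2, 3))  # computed and stored
c2 = coeffs(cache, (1, 2, 3))  # served from the cache
```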
Well, I got around to looking at caching - see https://github.com/DanielVandH/NaturalNeighbours.jl/tree/triangle. It doesn't really make things much better unfortunately.
julia> @benchmark nn_benchmark()
BenchmarkTools.Trial: 443 samples with 1 evaluation.
Range (min … max): 9.076 ms … 426.486 ms ┊ GC (min … max): 0.00% … 59.54%
Time (median): 9.986 ms ┊ GC (median): 0.00%
Time (mean ± σ): 11.203 ms ± 20.125 ms ┊ GC (mean ± σ): 6.15% ± 4.00%
▃▇▅▂▃▄█▂
▃▃▃▃▇████████▇█▅▃▃▂▃▁▃▂▁▃▁▁▁▁▁▁▃▂▂▃▁▁▁▂▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂ ▃
9.08 ms Histogram: frequency by time 14.8 ms <
Memory estimate: 1.84 MiB, allocs estimate: 47159.
julia> @benchmark mpl_benchmark()
BenchmarkTools.Trial: 2906 samples with 1 evaluation.
Range (min … max): 1.079 ms … 263.383 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.594 ms ┊ GC (median): 0.00%
Time (mean ± σ): 1.714 ms ± 4.880 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▄▄██▇█▆▅▃▃▆▂▅▄▄▃▅▅▆▆█▄▅▅▄▂▁
▁▁▂███████████████████████████████▆▆▄▅▃▄▄▃▃▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁ ▅
1.08 ms Histogram: frequency by time 2.6 ms <
Memory estimate: 237.89 KiB, allocs estimate: 74.
julia> function nn_benchmark_threads() # I'm using 10 threads
itp = interpolate(x, y, z; derivatives=false)
X, Y = meshgrid(xg, yg)
_x, _y = vec(X), vec(Y)
return itp(_x, _y; method=Triangle(; allow_cache = true), parallel=true)
end
nn_benchmark_threads (generic function with 1 method)
julia> @benchmark nn_benchmark_threads()
BenchmarkTools.Trial: 1296 samples with 1 evaluation.
Range (min … max): 2.343 ms … 186.348 ms ┊ GC (min … max): 0.00% … 59.95%
Time (median): 2.646 ms ┊ GC (median): 0.00%
Time (mean ± σ): 3.845 ms ± 11.402 ms ┊ GC (mean ± σ): 18.90% ± 6.32%
▁█▇▂
▃████▆▄▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▂ ▃
2.34 ms Histogram: frequency by time 7.45 ms <
Memory estimate: 1.87 MiB, allocs estimate: 47579.
I'm not going to commit the cached Triangle
work in the linked branch since the difference isn't too significant and it's a bit clunky.
The only other difference could come from DelaunayTriangulation.jl's jump_and_march (i.e. how point location is done to find the triangle containing the query point). Unfortunately, jump_and_march is a bit of a headache to work with, as the function is extremely complicated, so I'm not sure what more to do there.
One idea to improve the jump_and_march calls inside itp is to come up with a better way to sort the points in space so that nearby points are queried in sequence, meaning jump_and_march starts from a point very close to the triangle containing the query. This is a bit complicated to do, especially with multithreading (unless we use e.g. a spatial indexing tree), so I'm not sure I want to commit to looking into that (and I probably don't have the time).
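For reference, the kind of spatial sorting I mean could be as simple as ordering the query points along a Z-order (Morton) curve before evaluation, so that consecutive queries are spatially close and the point-location walk gets a nearby starting triangle. A rough sketch of this hypothetical pre-processing step (not something NaturalNeighbours.jl does):

```julia
# Sketch: Z-order (Morton) key for a point in [0, 1]^2, built by quantizing
# each coordinate to nbits bits and interleaving the bits.
function morton_key(x, y; nbits=16)
    m = (1 << nbits) - 1
    xi = round(UInt32, clamp(x, 0.0, 1.0) * m)
    yi = round(UInt32, clamp(y, 0.0, 1.0) * m)
    key = UInt64(0)
    for b in 0:nbits-1  # interleave the bits of xi and yi
        key |= (UInt64((xi >> b) & 1) << (2b)) | (UInt64((yi >> b) & 1) << (2b + 1))
    end
    return key
end

xs, ys = rand(1000), rand(1000)
order = sortperm(morton_key.(xs, ys))
# evaluate itp at xs[order], ys[order], then un-permute the results
```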
All that said, if you are happy with the performance with parallel=true compared to Matplotlib, as demonstrated above, feel free to close this. Otherwise, we can leave this open in case improvements can be made someday. I suspect it's an issue of DelaunayTriangulation.jl using exact predicates while Matplotlib probably doesn't.
For evaluating at many points, NaturalNeighbours.jl does seem to win with multithreading at least. You can run the above tests with
const xg = LinRange(0, 1, 250)
const yg = LinRange(0, 1, 250)
and the multithreading benchmark is 2x faster than Matplotlib, although 10x worse without.
Edit: Actually, it's 2x slower without caching. Okay, I'm going to push that caching. You should use allow_cache=true in your Triangle() calls once v1.4.0 is released.
Thanks for your effort! I would suggest leaving this open as a reminder in case someone intends to refactor the code and speed it up~
The caching work should be registered soon (~15 minutes) as v1.3.4. By default, it will now cache the necessary coordinates for Triangle()
(so you don't actually need allow_cache=true
as I suggested above) and hopefully give you some better performance! Thanks for the report @henry2004y.
@henry2004y With DelaunayTriangulation 1.1.0 you can more easily change how predicates are computed by using the predicates
keyword, which might also help your performance. e.g.
julia> function nn_benchmark(kernel)
itp = interpolate(x, y, z; derivatives=false)
X, Y = meshgrid(xg, yg)
_x, _y = vec(X), vec(Y)
return itp(_x, _y; method=Triangle(), parallel=true, predicates = kernel)
end
nn_benchmark (generic function with 2 methods)
julia> @benchmark nn_benchmark($FastKernel())
BenchmarkTools.Trial: 270 samples with 1 evaluation.
Range (min … max): 12.527 ms … 27.177 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 18.385 ms ┊ GC (median): 0.00%
Time (mean ± σ): 18.495 ms ± 1.480 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▃▄▂▆▇█▄▇▂
▃▁▁▁▃▁▁▁▁▁▁▃▃▁▁▃▃▁▁▃▁▃▁▄▆▆▇██████████▇▇▅▆▄▄▅▃▃▄▁▃▃▁▃▁▃▁▁▁▃▃ ▃
12.5 ms Histogram: frequency by time 23.2 ms <
Memory estimate: 485.34 KiB, allocs estimate: 889.
julia> @benchmark nn_benchmark($AdaptiveKernel()) # the new default
BenchmarkTools.Trial: 170 samples with 1 evaluation.
Range (min … max): 24.946 ms … 40.503 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 29.362 ms ┊ GC (median): 0.00%
Time (mean ± σ): 29.488 ms ± 2.992 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▂ ▂ ▃▅█ ▃▂▃▅ ▂ ▅ ▆█▅▅▃▂ ▃
█▄█████▁▄███████▇▄████████▅▄▅▄█▅▄▅▅▇▅▅▄▄▁▄▇▁▁▄▁▄▁▄▁▁▁▁▁▁▁▁▄ ▄
24.9 ms Histogram: frequency by time 38.9 ms <
Memory estimate: 484.62 KiB, allocs estimate: 889.
julia> @benchmark nn_benchmark($ExactKernel()) # the old default
BenchmarkTools.Trial: 163 samples with 1 evaluation.
Range (min … max): 24.614 ms … 300.534 ms ┊ GC (min … max): 0.00% … 47.87%
Time (median): 28.723 ms ┊ GC (median): 0.00%
Time (mean ± σ): 31.088 ms ± 21.944 ms ┊ GC (mean ± σ): 3.57% ± 4.94%
▃ ▇▁█▇▃▃▃▇▇▆▁ ▁▂
▃▃████████████████▇▄▄▃▇▆▃▆▃▄▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▃
24.6 ms Histogram: frequency by time 46.6 ms <
Memory estimate: 1.38 MiB, allocs estimate: 35231.
Thank you! I'll give it a try~
Hi! Thanks for the great package!
I'd like to report a performance comparison I found against Matplotlib:
This is running with 1 thread, but NaturalNeighbours.jl is ~16 times slower. I don't know if Matplotlib automatically uses multithreading, but the performance difference is noticeable. I don't trust the allocation statistics reported above for Matplotlib, but the measured times should be fine.