Hi, thanks for using it! You can set max_tasks=1 when using CPU arrays like Vector, which is effectively a serial scheduler; the code path is different to max_tasks>1: it launches no Task, does not allocate any additional memory for the tasks / scheduler, and runs on the main Julia process. Or is there anything else you'd need from a :serial scheduler?
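For illustration, a minimal sketch of the call (the array and kernel here are just placeholders; the same max_tasks keyword appears in the copy benchmark further down):

import AcceleratedKernels as AK

x = zeros(Float64, 10)

# With max_tasks=1 on a CPU array, the loop body runs directly on the
# calling task: no Task is spawned and no scheduler state is allocated
AK.foreachindex(x, max_tasks=1) do i
    x[i] = 2i
end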
Your package looks really nice - we are very interested in acceleration data structures for Lagrangian simulations. I see you've benchmarked your code against CellListMap and NVIDIA Warp's HashGrid; I would be very interested in a benchmark against ImplicitBVH.jl, which we are currently using (we're still prototyping a simulation code with it, but it will be open-sourced; to the grief of our business advisors, I believe the science should be transparent - commercial projects using that science should be, well, commercial).
I am not sure how to use the code in your benchmarks/report folder, and I don't have a CUDA machine on hand - would you mind trying ImplicitBVH.jl on the same machine as your other benchmarks? Try the version on GitHub with pkg> add ImplicitBVH#main, with something like:
using ImplicitBVH
using ImplicitBVH: BBox, BSphere
import AcceleratedKernels as AK
function get_interacting_pairs(particle_centers::AbstractMatrix, cutoff::AbstractFloat)
    # Construct a bounding sphere of radius `cutoff` around each particle
    num_particles = size(particle_centers, 2)
    bounding_volumes = similar(particle_centers, BSphere{Float32}, num_particles)
    AK.foreachindex(bounding_volumes) do i
        bounding_volumes[i] = BSphere{Float32}(
            (particle_centers[1, i], particle_centers[2, i], particle_centers[3, i]),
            cutoff,
        )
    end

    # Construct the BVH, merging BSpheres into BBoxes and using 32-bit Morton indices
    bvh = BVH(bounding_volumes, BBox{Float32}, UInt32, default_start_level(num_particles))

    # Traverse the BVH - this returns a BVHTraversal
    contacting_pairs = traverse(bvh)

    # Return the Vector{Tuple{Int32, Int32}} of particle index pairs
    contacting_pairs.contacts
end
# Use the particles in your benchmarks - I assume they're column-ordered in memory
particle_centers = rand(3, 1000)
interacting_pairs = get_interacting_pairs(particle_centers, 0.05)
# Example output:
# julia> @show interacting_pairs
# interacting_pairs = Tuple{Int32, Int32}[(369, 667), (427, 974), ...]
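And to get a rough timing on your machine, something along these lines (the array size and cutoff below are placeholders - substitute the particle sets from your benchmarks):

using BenchmarkTools

# Hypothetical problem size, just to show the call
particle_centers = rand(Float32, 3, 100_000)
@btime get_interacting_pairs($particle_centers, 0.01f0)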
Your comment made me want to check our serial execution performance. To see that foreachindex with a single task is equivalent to serial execution, you can test it with the simplest operation, an elementwise copy:
using BenchmarkTools
import AcceleratedKernels as AK
using Random
Random.seed!(0)
function simplecopyto!(x)
    v = similar(x)
    for i in eachindex(x)
        @inbounds v[i] = x[i]
    end
end

function akcopyto!(x, scheduler)
    v = similar(x)
    AK.foreachindex(x, scheduler=scheduler, max_tasks=1) do i
        @inbounds v[i] = x[i]
    end
end
x = rand(Int32, 1_000_000)

println("\nSimple serial copy:")
display(@benchmark(simplecopyto!($x)))

println("\nAcceleratedKernels foreachindex :polyester copy:")
display(@benchmark(akcopyto!($x, :polyester)))

println("\nAcceleratedKernels foreachindex :threads copy:")
display(@benchmark(akcopyto!($x, :threads)))
For which I get:
Simple serial copy:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 47.333 μs … 4.463 ms ┊ GC (min … max): 0.00% … 91.66%
Time (median): 188.167 μs ┊ GC (median): 0.00%
Time (mean ± σ): 270.899 μs ± 217.843 μs ┊ GC (mean ± σ): 27.27% ± 24.65%
▁▆▆█▅▄▄▃▃▂▂ ▁▁▁▁▂▂▂▁ ▁▁▁ ▂
▅▁▁▁▁▁▁██████████████▇▆▆▅▄▃▃▃▃▄▃▁▅▇█▇██████████▇▇▆▆▇███████▇▇ █
47.3 μs Histogram: log(frequency) by time 883 μs <
Memory estimate: 3.81 MiB, allocs estimate: 2.
AcceleratedKernels foreachindex :polyester copy:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 45.167 μs … 5.364 ms ┊ GC (min … max): 0.00% … 94.59%
Time (median): 190.875 μs ┊ GC (median): 0.00%
Time (mean ± σ): 285.839 μs ± 258.745 μs ┊ GC (mean ± σ): 28.57% ± 24.72%
▃▆█▆▄▄▃▃▂▁ ▁▃▂▂▁▁ ▁▁ ▁ ▂
▅▁▁▁▁███████████▇▆▅▄▁▅▅▁▅▇████████████████▆▇▆▆▅▆▆▆▅▅▆▄▅▄▅▅▄▄▃ █
45.2 μs Histogram: log(frequency) by time 1.24 ms <
Memory estimate: 3.81 MiB, allocs estimate: 3.
AcceleratedKernels foreachindex :threads copy:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 46.000 μs … 3.680 ms ┊ GC (min … max): 0.00% … 94.46%
Time (median): 187.875 μs ┊ GC (median): 0.00%
Time (mean ± σ): 275.786 μs ± 222.267 μs ┊ GC (mean ± σ): 28.56% ± 25.10%
▅█▆▅▄▃▂▂▂▁ ▃▃▂▁ ▁▁▁ ▂
▆▃▁▁▁▁▆██████████▇▆▄▅▃▄▁▄▃▁▁▃▁▁▅▇▇▇▇███████▇▇██████▇▆▆▅▆▇▆▅▃▅ █
46 μs Histogram: log(frequency) by time 1.02 ms <
Memory estimate: 3.81 MiB, allocs estimate: 3.
Note that the times are effectively the same, as is the memory used.
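If you want to isolate the scheduler's own allocations from the similar buffer, a quick sketch (the copyonly! helper is made up for this test):

import AcceleratedKernels as AK

x = rand(Int32, 1_000_000)
v = similar(x)   # preallocate the output so only per-call allocations show up

copyonly!(v, x) = AK.foreachindex(i -> (@inbounds v[i] = x[i]), x, max_tasks=1)

copyonly!(v, x)             # warm up / compile first
@allocated copyonly!(v, x)  # bytes allocated by the call itself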
Hi, thank you for your detailed replies!
I'll probably be a bit slow with answering all the follow-ups, so a quick reply first:
Setting max_tasks = 1 is great, and I must have overlooked in the code that it takes a different code path. Perfect.
Thanks for pointing me to your package ImplicitBVH.jl! With these tools, I feel that Julia is really catching up to Warp & Co! That's great.
I will definitely try to benchmark it, and maybe in the process also make it a bit easier to run independent benchmarks! I'm not sure how quickly I can do that, as I'm running a bit behind on other unrelated projects here... But I am very curious.
(PS: I hope that the open source community will benefit you. I would be happy to contribute in the future to AcceleratedKernels or ImplicitBVH if I am able to!)
[I closed the issue as the main question seems to be answered. Anyway, happy to stay in contact!]
Thanks for the amazing package.
I'm currently thinking of using it in https://github.com/SteffenPL/SpatialHashTables.jl for all the parallelisation.
One question I faced was that it would be nice to have a single-threaded :serial scheduler, for benchmarking and also in cases where, depending on the context, a single-threaded version is preferable.

PS: I saw that your start-up also deals with particle simulations - maybe spatial hash tables are also useful for you.
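Back to the :serial question - a rough sketch of the kind of switch I had in mind (the parallel_foreach helper is hypothetical, just to illustrate):

import AcceleratedKernels as AK

# Hypothetical wrapper: fall back to a plain loop when serial execution is
# requested, otherwise hand the indices to AK.foreachindex
function parallel_foreach(f, itr; serial::Bool = false)
    if serial
        for i in eachindex(itr)
            f(i)
        end
    else
        AK.foreachindex(f, itr)
    end
    nothing
end

# e.g. force serial execution while benchmarking a single query
x = zeros(100)
parallel_foreach(i -> x[i] = i, x; serial = true)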