JuliaGPU / AcceleratedKernels.jl

Cross-architecture parallel algorithms for Julia's CPU and GPU backends. Targets multithreaded CPUs, and GPUs via Intel oneAPI, AMD ROCm, Apple Metal, Nvidia CUDA.
https://juliagpu.github.io/AcceleratedKernels.jl/
MIT License

Support for a `:serial` scheduler #7

Closed: SteffenPL closed this issue 2 weeks ago

SteffenPL commented 2 weeks ago

Thanks for the amazing package.

I'm currently thinking of using it in https://github.com/SteffenPL/SpatialHashTables.jl for all the parallelisation.

One question I had: would it be possible to add a single-threaded :serial scheduler? It would be useful for benchmarking, and also in cases where, depending on the context, a single-threaded version is preferable.

PS: I saw that your start-up also deals with particle simulations; maybe spatial hash tables are useful for you as well.

anicusan commented 2 weeks ago

Hi, thanks for using it! You can set max_tasks=1 when using CPU arrays like Vector, which is effectively a serial scheduler; that code path is different from max_tasks>1: it launches no Tasks, does not allocate any additional memory for the tasks / scheduler, and runs on the main Julia process. Or is there anything else you'd need from a :serial scheduler?
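For example, a minimal sketch with a plain Vector (max_tasks is the same keyword used in the benchmark further down):

import AcceleratedKernels as AK

x = zeros(Float32, 1_000_000)

# With max_tasks=1 on a CPU array, foreachindex runs the body as a plain loop
# on the calling task: no extra Tasks are spawned and no scheduler memory is allocated
AK.foreachindex(x, max_tasks=1) do i
    @inbounds x[i] = Float32(i)
end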

Your package looks really nice - we are very interested in acceleration data structures for Lagrangian simulations. I see you've benchmarked your code against CellListMap and Nvidia Warp's HashGrid; I would be very interested in a benchmark against ImplicitBVH.jl, which we are currently using (we're still prototyping a simulation code with it, but it will be open-sourced; to the grief of our business advisors, I believe the science should be transparent - commercial projects using that science should be, well, commercial).

I am not sure how to use the code in your benchmarks/report folder, and I don't have a CUDA machine on hand - would you mind trying ImplicitBVH.jl on the same machine as your other benchmarks? Try the GitHub version (pkg> add ImplicitBVH#main) with something like:

using ImplicitBVH
using ImplicitBVH: BBox, BSphere
import AcceleratedKernels as AK

function get_interacting_pairs(particle_centers::AbstractMatrix, cutoff::AbstractFloat)

    # Construct bounding sphere around each particle of `cutoff` radius
    num_particles = size(particle_centers, 2)
    bounding_volumes = similar(particle_centers, BSphere{Float32}, num_particles)
    AK.foreachindex(bounding_volumes) do i
        bounding_volumes[i] = BSphere{Float32}(
            (particle_centers[1, i], particle_centers[2, i], particle_centers[3, i]),
            cutoff,
        )
    end

    # Construct BVH, merging BSpheres into BBoxes, and using 32-bit Morton indices
    bvh = BVH(bounding_volumes, BBox{Float32}, UInt32, default_start_level(num_particles))

    # Traverse BVH - this returns a BVHTraversal
    contacting_pairs = traverse(bvh)

    # Return Vector{Tuple{Int32, Int32}} of particle index pairs
    contacting_pairs.contacts
end

# Use the particles in your benchmarks - I assume they're column-ordered in memory
particle_centers = rand(3, 1000)
interacting_pairs = get_interacting_pairs(particle_centers, 0.05)

# Example output: 
#   julia> @show interacting_pairs
#   interacting_pairs = Tuple{Int32, Int32}[(369, 667), (427, 974), ...]
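
Depending on how an interaction is defined in your benchmarks, these candidate pairs (overlapping bounding spheres) may still need a narrow-phase check on the actual centre distances; a hypothetical post-processing sketch (filter_by_distance is not part of ImplicitBVH):

# Hypothetical helper: keep only candidate pairs whose centres lie within
# `cutoff` of each other; the BVH traversal returns a superset of these
function filter_by_distance(particle_centers, pairs, cutoff)
    filter(pairs) do (i, j)
        d2 = sum(abs2, @view(particle_centers[:, i]) .- @view(particle_centers[:, j]))
        d2 <= cutoff^2
    end
end

close_pairs = filter_by_distance(particle_centers, interacting_pairs, 0.05)
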
anicusan commented 2 weeks ago

Your comment made me want to check our serial execution performance. To see that foreachindex with a single task is equivalent to serial execution, you can test it with the simplest operation, an elementwise copy:

using BenchmarkTools
import AcceleratedKernels as AK

using Random
Random.seed!(0)

function simplecopyto!(x)
    v = similar(x)
    for i in eachindex(x)
        @inbounds v[i] = x[i]
    end
end

function akcopyto!(x, scheduler)
    v = similar(x)
    # max_tasks=1 gives single-task (serial) execution for either scheduler
    AK.foreachindex(x, scheduler=scheduler, max_tasks=1) do i
        @inbounds v[i] = x[i]
    end
end

x = rand(Int32, 1_000_000)

println("\nSimple serial copy:")
display(@benchmark(simplecopyto!(x)))

println("\nAcceleratedKernels foreachindex :polyester copy:")
display(@benchmark(akcopyto!(x, :polyester)))

println("\nAcceleratedKernels foreachindex :threads copy:")
display(@benchmark(akcopyto!(x, :threads)))

For which I get:

Simple serial copy:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   47.333 μs …   4.463 ms  ┊ GC (min … max):  0.00% … 91.66%
 Time  (median):     188.167 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   270.899 μs ± 217.843 μs  ┊ GC (mean ± σ):  27.27% ± 24.65%

         ▁▆▆█▅▄▄▃▃▂▂                    ▁▁▁▁▂▂▂▁         ▁▁▁    ▂
  ▅▁▁▁▁▁▁██████████████▇▆▆▅▄▃▃▃▃▄▃▁▅▇█▇██████████▇▇▆▆▇███████▇▇ █
  47.3 μs       Histogram: log(frequency) by time        883 μs <

 Memory estimate: 3.81 MiB, allocs estimate: 2.

AcceleratedKernels foreachindex :polyester copy:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   45.167 μs …   5.364 ms  ┊ GC (min … max):  0.00% … 94.59%
 Time  (median):     190.875 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   285.839 μs ± 258.745 μs  ┊ GC (mean ± σ):  28.57% ± 24.72%

       ▃▆█▆▄▄▃▃▂▁            ▁▃▂▂▁▁ ▁▁   ▁                      ▂
  ▅▁▁▁▁███████████▇▆▅▄▁▅▅▁▅▇████████████████▆▇▆▆▅▆▆▆▅▅▆▄▅▄▅▅▄▄▃ █
  45.2 μs       Histogram: log(frequency) by time       1.24 ms <

 Memory estimate: 3.81 MiB, allocs estimate: 3.

AcceleratedKernels foreachindex :threads copy:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   46.000 μs …   3.680 ms  ┊ GC (min … max):  0.00% … 94.46%
 Time  (median):     187.875 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   275.786 μs ± 222.267 μs  ┊ GC (mean ± σ):  28.56% ± 25.10%

         ▅█▆▅▄▃▂▂▂▁                   ▃▃▂▁       ▁▁▁            ▂
  ▆▃▁▁▁▁▆██████████▇▆▄▅▃▄▁▄▃▁▁▃▁▁▅▇▇▇▇███████▇▇██████▇▆▆▅▆▇▆▅▃▅ █
  46 μs         Histogram: log(frequency) by time       1.02 ms <

 Memory estimate: 3.81 MiB, allocs estimate: 3.

Note the times are effectively the same, as well as the memory used.
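
If a literal :serial entry point would still be convenient (e.g. for toggling in benchmarks), it can be layered on top of this in user code; a hypothetical wrapper sketch, not part of AcceleratedKernels (for_each_index is a made-up name):

import AcceleratedKernels as AK

# Hypothetical wrapper: treat :serial as a plain loop, forward anything else
# to AK.foreachindex with the given scheduler and task count
function for_each_index(f, x; scheduler=:threads, max_tasks=Threads.nthreads())
    if scheduler === :serial
        for i in eachindex(x)
            f(i)
        end
    else
        AK.foreachindex(f, x; scheduler=scheduler, max_tasks=max_tasks)
    end
    return nothing
end

# Usage, e.g.:
# for_each_index(v, scheduler=:serial) do i
#     @inbounds v[i] = 2 * v[i]
# end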

SteffenPL commented 2 weeks ago

Hi, thank you for your detailed replies!

I'll probably be a bit slow answering all the follow-ups, so here's a quick reply first:

I will definitely try to benchmark it, and maybe in the process also make it a bit easier to run independent benchmarks! I'm not sure how quickly I can do that, as I'm running a bit behind on other unrelated projects here... But I am very curious.

(PS: I hope that the open-source community will benefit you. I would be happy to contribute to AcceleratedKernels or ImplicitBVH in the future if I am able to!)

[I closed the issue as the main question seems to be answered. Anyway, happy to stay in contact!]