JuliaGPU / KernelAbstractions.jl

Heterogeneous programming in Julia

On CPU always use `NoDynamicCheck()`, just finish the last partial workgroup with `DynamicCheck()` #449

Open rafaqz opened 10 months ago

rafaqz commented 10 months ago

Given that `DynamicCheck()` breaks SIMD, this can be an order of magnitude faster for some inexpensive tasks.
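To make the proposal concrete, here is a minimal plain-Julia sketch of the split, not KernelAbstractions.jl internals. The names `full_group!`, `checked_group!` and `split_launch!` are hypothetical, and the sketch assumes 1-based arrays of equal length: full workgroups run without a per-item check, and only the last partial workgroup pays for one.

```julia
# Hypothetical helpers for illustration only; not the KernelAbstractions.jl API.

# One full workgroup: no per-item check (the `NoDynamicCheck()` case),
# so the loop body is branch-free and can be vectorized.
function full_group!(out, a, b, start, groupsize)
    @inbounds @simd for i in start:(start + groupsize - 1)
        out[i] = a[i] + b[i]
    end
    return out
end

# The last, partial workgroup: every item is checked against the
# real length `n` (the `DynamicCheck()` case).
function checked_group!(out, a, b, start, groupsize, n)
    for i in start:(start + groupsize - 1)
        if i <= n  # the branch that gets in the way of SIMD
            @inbounds out[i] = a[i] + b[i]
        end
    end
    return out
end

function split_launch!(out, a, b; groupsize=64)
    n = length(out)
    length(a) == length(b) == n || throw(DimensionMismatch("lengths must match"))
    nfull = div(n, groupsize)  # number of complete workgroups
    for g in 1:nfull
        full_group!(out, a, b, (g - 1) * groupsize + 1, groupsize)
    end
    if nfull * groupsize < n   # finish the remainder with the checked path
        checked_group!(out, a, b, nfull * groupsize + 1, groupsize, n)
    end
    return out
end

a = rand(Float32, 1000)
b = rand(Float32, 1000)
out = similar(a)
split_launch!(out, a, b)  # 15 full groups of 64, then one checked group of 40
```

With this layout only the remainder pays for the per-item branch, which is the behaviour the issue title asks for on CPU.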

I'll write up a better MWE, but this is the scale of it: a single-threaded game of life in DynamicGrids.jl (basically summing a 3x3 window over `Bool`) is about 2x faster than an 8-core KernelAbstractions.jl simulation, pretty much just because of `DynamicCheck()`:

```julia
julia> using DynamicGrids, BenchmarkTools

julia> init = rand(Bool, 1000, 1000);

julia> output = ResultOutput(init; tspan=1:200);

julia> @btime sim!($output, Life(); proc=SingleCPU());
  338.058 ms (6459 allocations: 3.25 MiB)

julia> @btime sim!($output, Life(); proc=CPUGPU());
  652.198 ms (18401 allocations: 4.63 MiB)
```

rafaqz commented 10 months ago

It seems `DynamicCheck` is only half the problem. Removing it helps a lot, but something else is also preventing the compiler from constant-propagating size information (it's like a sized array) from the type through the KernelAbstractions kernel, which it is able to do in the single-threaded version.
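For readers unfamiliar with the "sized array" idea, the following is a rough sketch of what carrying size information in the type means. `SizedGrid` and `count_alive` are hypothetical names made up for this illustration; they are not the DynamicGrids.jl types, and the sketch says nothing about where the information is lost inside a kernel.

```julia
# Hypothetical wrapper for illustration: the array size lives in the
# type parameters R and C rather than only in a runtime field.
struct SizedGrid{R,C,A<:AbstractMatrix}
    data::A
end
SizedGrid(a::AbstractMatrix) = SizedGrid{size(a, 1),size(a, 2),typeof(a)}(a)

# `size` is answered from the type alone, so inside a method specialized
# on SizedGrid{R,C} the loop bounds are compile-time constants.
Base.size(::SizedGrid{R,C}) where {R,C} = (R, C)

function count_alive(g::SizedGrid)
    R, C = size(g)  # constants after specialization
    total = 0
    @inbounds for j in 1:C, i in 1:R
        total += g.data[i, j]
    end
    return total
end

grid = SizedGrid(rand(Bool, 8, 8))
count_alive(grid)
```

If the size reaches the inner loop as a constant, the compiler can unroll or vectorize against known bounds; if it arrives as an ordinary runtime value, that opportunity is gone.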

I will have to fix it to find out what the problem is, so I will probably submit a PR sometime.