JuliaGPU / oneAPI.jl

Julia support for the oneAPI programming toolkit.
https://juliagpu.org/oneapi/

Launch configuration: use ZE_extension_kernel_max_group_size_properties #430

Closed: maleadt closed this issue 3 months ago

maleadt commented 4 months ago

With prime-sized inputs the suggested group size always consists of only a single thread:

julia> k = @oneapi launch=false identity(nothing)

julia> oneL0.suggest_groupsize(k.fun, 521)
oneAPI.oneL0.ZeDim3(1, 1, 1)

julia> oneL0.suggest_groupsize(k.fun, 7877)
oneAPI.oneL0.ZeDim3(1, 1, 1)

julia> oneL0.suggest_groupsize(k.fun, 7919)
oneAPI.oneL0.ZeDim3(1, 1, 1)

But even with non-prime-sized inputs, the suggested configuration looks highly suboptimal:

julia> oneL0.suggest_groupsize(k.fun, 8000)
oneAPI.oneL0.ZeDim3(64, 1, 1)

(on this system, this kernel can be launched with groups of up to 512 threads)

Maybe I'm misinterpreting the purpose of this API? I thought it was the counterpart of the CUDA occupancy API (cuOccupancyMaxPotentialBlockSize), suggesting a group size that achieves reasonable occupancy.
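
For reference, this is roughly how the CUDA.jl counterpart gets used to pick a launch configuration (a sketch; vadd, its arguments, and N are placeholders, not from this issue):

using CUDA

k = @cuda launch=false vadd(a, b, c)     # compile without launching
config = launch_configuration(k.fun)     # occupancy-based suggestion
threads = min(N, config.threads)         # don't exceed the problem size
blocks = cld(N, threads)                 # cover N, possibly overshooting
k(a, b, c; threads, blocks)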

maleadt commented 4 months ago

Filed upstream: https://github.com/intel/compute-runtime/issues/725

maleadt commented 4 months ago

As noted by upstream, this is expected: the suggested launch configuration exactly covers the input space, which means the group size must evenly divide the number of work items (for a prime-sized input that leaves only a single thread per group). Since we don't need exact coverage (our kernels perform bounds checks at run time), we can use a more relaxed launch configuration. A workaround is implemented in https://github.com/JuliaGPU/oneAPI.jl/pull/431, but once there's a new driver release we should instead use the ZE_extension_kernel_max_group_size_properties Level Zero extension to query the maximum group size for a given kernel.
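
A minimal sketch of that relaxed pattern, assuming a bounds-checked vadd kernel and an input length N (illustrative only; the actual heuristic in the PR queries the limit rather than hard-coding 512):

function vadd(a, b, c)
    i = get_global_id()              # global index (1-based in oneAPI.jl)
    if i <= length(c)                # bounds check: excess threads just return
        @inbounds c[i] = a[i] + b[i]
    end
    return
end

k = @oneapi launch=false vadd(a, b, c)
max_items = 512                      # assumed group-size limit on this system
items = min(N, max_items)            # items no longer needs to divide N
groups = cld(N, items)               # round up; the bounds check covers the rest
k(a, b, c; items, groups)

The point is that groups * items may exceed N; correctness comes from the bounds check, so the group size can be chosen for occupancy rather than exact coverage.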

maleadt commented 3 months ago

Fixed by https://github.com/JuliaGPU/oneAPI.jl/pull/431