Some kernels have tunings with the same name that are not implemented the same way in the Base and RAJA variants. Either change the implementations to match or add tunings so that there is an apples-to-apples comparison between the Base and RAJA variants.
Affected kernels/algorithms:
- [x] INDEXLIST_3LOOP - Base variants read outcomes of scans but RAJA variants use reductions (sketched after this list) #370
- [x] Reducers - Base reducers do a block reduction then an atomic per block to finalize the reduction, but RAJA reducers do a block reduction and then the last block finalizes the reduction (sketched after this list) #393
- [x] LCALS_FIRST_MIN - Base reducers are finalized on the host but RAJA reducers are finalized in the last block #398
- [ ] Reducers - Base reducers' block atomics go into a contiguous buffer, so they suffer false sharing, but RAJA reducers' block atomics go into different buffers, so they may avoid it (see the padding note in the reducer sketch below)
- [x] Reducers - Base reducers use device memory and explicit memory copies but RAJA reducers use pinned memory (sketched after this list) #392
- [x] MEMSET/MEMCPY - Base used stream 0 but RAJA used a different stream #296
- [x] HALOEXCHANGE_FUSED - Base uses direct dispatch but RAJA uses indirect function call dispatch (sketched after this list) #260
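For INDEXLIST_3LOOP, a minimal host-side C++ sketch of the difference, with made-up flag data; the device variants differ in the same way, this only shows where the list length comes from:

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

int main() {
  // Flags marking which elements belong in the index list.
  std::vector<int> flags = {1, 0, 1, 1, 0, 1};
  std::vector<int> pos(flags.size());

  // Base-style: scan the flags, then read the scan outcome both for each
  // element's output position and for the total list length.
  std::exclusive_scan(flags.begin(), flags.end(), pos.begin(), 0);
  int len_from_scan = pos.back() + flags.back();

  // RAJA-style (per #370): the list length comes from a reduction instead.
  int len_from_reduction = (int)std::count(flags.begin(), flags.end(), 1);

  return (len_from_scan == len_from_reduction) ? 0 : 1;
}
```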
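For the reducer items, a minimal HIP sketch of the two finalization strategies, assuming a fixed block size of 256 and illustrative names (this is not RAJAPerf or RAJA code):

```cpp
#include <hip/hip_runtime.h>

// Base-style finalization: each block does a tree reduction in shared
// memory, then one atomic per block into a single result in device memory.
// (double atomicAdd is assumed supported on the target GPU.)
__global__ void sum_atomic_finalize(const double* x, double* result, int n) {
  __shared__ double s[256];  // assumes blockDim.x == 256
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  s[threadIdx.x] = (i < n) ? x[i] : 0.0;
  __syncthreads();
  for (int w = blockDim.x / 2; w > 0; w /= 2) {
    if (threadIdx.x < w) s[threadIdx.x] += s[threadIdx.x + w];
    __syncthreads();
  }
  if (threadIdx.x == 0) atomicAdd(result, s[0]);  // one atomic per block
}

// RAJA-style finalization: each block writes its partial to a grid-sized
// buffer; the last block to finish reduces the partials to the final value.
// The counter must be zeroed before launch.
__global__ void sum_last_block_finalize(const double* x, double* partials,
                                        unsigned* count, double* result,
                                        int n) {
  __shared__ double s[256];
  __shared__ bool is_last;
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  s[threadIdx.x] = (i < n) ? x[i] : 0.0;
  __syncthreads();
  for (int w = blockDim.x / 2; w > 0; w /= 2) {
    if (threadIdx.x < w) s[threadIdx.x] += s[threadIdx.x + w];
    __syncthreads();
  }
  if (threadIdx.x == 0) {
    partials[blockIdx.x] = s[0];
    __threadfence();  // make the partial visible before bumping the counter
    is_last = (atomicInc(count, gridDim.x - 1) == gridDim.x - 1);
  }
  __syncthreads();
  if (is_last && threadIdx.x == 0) {  // serial finalize, for brevity
    double total = 0.0;
    for (unsigned b = 0; b < gridDim.x; ++b) total += partials[b];
    *result = total;
  }
}

// False-sharing note (the open item above): if several reducers' atomic
// targets are packed contiguously (e.g. double results[NUM_REDUCERS]), their
// atomics contend for the same cache line. Padding each target to its own
// cache line, or allocating each reducer's target separately, avoids that:
struct alignas(128) PaddedTarget { double value; };
```

The last-block scheme trades the per-block atomic on the result for a grid-sized partials buffer plus a counter, which is why the two strategies can behave differently under contention.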
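For the memory-strategy item, a minimal HIP sketch of the two ways to get a reduction result back to the host (error checking and kernel launches omitted for brevity):

```cpp
#include <hip/hip_runtime.h>

void get_result_device_memory(hipStream_t stream) {
  // Base-style: the result lives in device memory and is copied back
  // explicitly after the kernel runs.
  double* d_result;
  double h_result;
  hipMalloc(&d_result, sizeof(double));
  hipMemsetAsync(d_result, 0, sizeof(double), stream);
  // ... launch reduction kernel writing to d_result ...
  hipMemcpyAsync(&h_result, d_result, sizeof(double),
                 hipMemcpyDeviceToHost, stream);
  hipStreamSynchronize(stream);
  hipFree(d_result);
}

void get_result_pinned_memory(hipStream_t stream) {
  // RAJA-style: the result lives in pinned host memory that the device
  // writes directly, so no explicit copy is needed, only a synchronize.
  double* p_result;
  hipHostMalloc(&p_result, sizeof(double), hipHostMallocDefault);
  *p_result = 0.0;
  // ... launch reduction kernel writing to p_result ...
  hipStreamSynchronize(stream);
  double h_result = *p_result;
  (void)h_result;
  hipHostFree(p_result);
}
```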
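For HALOEXCHANGE_FUSED, a minimal HIP sketch of direct versus indirect dispatch, with a stand-in packing body (not the RAJAPerf packing code):

```cpp
#include <hip/hip_runtime.h>

// A stand-in loop body; the real fused kernels pack/unpack halo buffers.
struct PackBody {
  double* dst;
  const double* src;
  __device__ void operator()(int i) const { dst[i] = src[i]; }
};

// Direct dispatch: the body is a template parameter, so the compiler sees
// its definition and can inline it into the fused kernel.
template <typename Body>
__global__ void fused_direct(Body body, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) body(i);
}

// Indirect dispatch: the body is reached through a device function pointer,
// which the compiler generally cannot inline.
using BodyFn = void (*)(int i, double* dst, const double* src);

__device__ void pack_body(int i, double* dst, const double* src) {
  dst[i] = src[i];
}
__device__ BodyFn d_pack_body = pack_body;

__global__ void fused_indirect(BodyFn fn, double* dst, const double* src,
                               int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) fn(i, dst, src);
}

void launch_both(double* dst, const double* src, int n) {
  int blocks = (n + 255) / 256;
  fused_direct<<<blocks, 256>>>(PackBody{dst, src}, n);

  // Fetch the device-side address of pack_body to pass as a kernel argument.
  BodyFn h_fn = nullptr;
  hipMemcpyFromSymbol(&h_fn, HIP_SYMBOL(d_pack_body), sizeof(BodyFn));
  fused_indirect<<<blocks, 256>>>(h_fn, dst, src, n);
}
```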
Other things affecting performance:
- HALOEXCHANGE_FUSED - RAJA variants have dynamic scratch memory usage; lower `hipLimitStackSize` or set the env var `HSA_SCRATCH_SINGLE_LIMIT=240000000` (MI250X) to avoid dynamic scratch memory allocation (see the sketch after this list)
- Reducers - RAJA variants don't always inline; use the compiler flags from hipcc (`-mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false`) or increase the inline threshold (`-fgpu-inline-threshold=100000`), as shown after this list
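A sketch of the scratch-memory workaround in code; the limit value here is an arbitrary assumption chosen only to illustrate the call:

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
  // Query and lower the per-thread stack size so the fused kernels stay
  // within preallocated scratch instead of triggering dynamic scratch
  // memory allocation.
  size_t stack_size = 0;
  hipDeviceGetLimit(&stack_size, hipLimitStackSize);
  std::printf("default stack size: %zu bytes\n", stack_size);
  hipDeviceSetLimit(hipLimitStackSize, 1024);  // 1024 is an assumed value
  return 0;
}
```

Alternatively, set the environment variable for the run, e.g. `HSA_SCRATCH_SINGLE_LIMIT=240000000 ./raja-perf.exe` (binary name assumed).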
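For the inlining flags, a hypothetical hipcc invocation showing where they go (file names are placeholders; use either the `-mllvm` pair or the raised threshold, as described above): `hipcc -O3 -mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false -fgpu-inline-threshold=100000 -c kernel.cpp -o kernel.o`.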