Loop Blocking for fn GPU Backend

fthaler commented 4 months ago

Implements loop blocking for the GPU fn backend. Thread block size (that is, CUDA/HIP threads per block) and loop block size (that is, loop iterations per CUDA/HIP thread) can now be specified as template parameters.
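As a rough illustration of the two parameters, the sketch below emulates on the CPU how a 1D iteration range could be split when each thread handles a fixed number of loop iterations. The names `blocks_needed`, `for_each_blocked` and `visit_counts` are hypothetical and not part of GridTools. The 1D layout with consecutive iterations per thread is also an assumption made for brevity; the actual backend may block differently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not GridTools API): number of GPU thread blocks needed
// to cover n iterations when every thread handles LoopBlockSize iterations.
template <int ThreadBlockSize, int LoopBlockSize>
constexpr std::size_t blocks_needed(std::size_t n) {
    return (n + std::size_t(ThreadBlockSize) * LoopBlockSize - 1) /
           (std::size_t(ThreadBlockSize) * LoopBlockSize);
}

// CPU emulation of the index mapping: the two outer loops stand in for
// blockIdx.x and threadIdx.x, the inner loop is the per-thread loop block.
template <int ThreadBlockSize, int LoopBlockSize, class F>
void for_each_blocked(std::size_t n, F &&f) {
    static_assert(ThreadBlockSize > 0 && LoopBlockSize > 0, "sizes must be positive");
    constexpr std::size_t threads = ThreadBlockSize;
    constexpr std::size_t loop_block = LoopBlockSize;
    const std::size_t blocks = blocks_needed<ThreadBlockSize, LoopBlockSize>(n);
    for (std::size_t block = 0; block < blocks; ++block) {
        for (std::size_t thread = 0; thread < threads; ++thread) {
            // Assumption: each thread owns loop_block consecutive iterations.
            const std::size_t first = (block * threads + thread) * loop_block;
            for (std::size_t k = 0; k < loop_block; ++k) {
                const std::size_t i = first + k;
                if (i < n) // guard the partially filled last block
                    f(i);
            }
        }
    }
}

// Counts how often each index in [0, n) is visited by the blocked loop.
template <int ThreadBlockSize, int LoopBlockSize>
std::vector<int> visit_counts(std::size_t n) {
    std::vector<int> counts(n, 0);
    for_each_blocked<ThreadBlockSize, LoopBlockSize>(n, [&](std::size_t i) {
        assert(i < counts.size());
        ++counts[i];
    });
    return counts;
}
```

With a thread block size of 4 and a loop block size of 3, for example, each emulated block covers 12 iterations, so 25 iterations need 3 blocks and the last block is only partially filled.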

Further changes:

Set __launch_bounds__ in the fn GPU kernel based on the thread block size.
Activate vertical loop blocking in the fn nabla kernels on newer CUDA versions that support GT_PROMISE.
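To illustrate the `__launch_bounds__` change, here is a minimal sketch of deriving the bound from a compile-time thread block shape. The type `thread_block_shape`, the kernel `scale_kernel` and the `SKETCH_*` macros are hypothetical names introduced for this example only, and taking the product of the three block dimensions is an assumption about how the bound might be computed. The macros fall back to plain C++ so the snippet also compiles without nvcc or hipcc.

```cpp
#include <cassert>
#include <cstddef>

#ifdef __CUDACC__
#define SKETCH_GLOBAL __global__
#define SKETCH_LAUNCH_BOUNDS(n) __launch_bounds__(n)
#else
// Plain C++ fallback: the qualifiers expand to nothing.
#define SKETCH_GLOBAL
#define SKETCH_LAUNCH_BOUNDS(n)
#endif

// Hypothetical compile-time thread block shape (not the GridTools type).
template <int X, int Y, int Z>
struct thread_block_shape {
    static_assert(X > 0 && Y > 0 && Z > 0, "dimensions must be positive");
    // Assumption: the launch bound is the total number of threads per block.
    static constexpr int threads = X * Y * Z;
    static_assert(threads <= 1024, "exceeds the usual CUDA limit of 1024 threads per block");
};

// The kernel promises the compiler that it is never launched with more than
// Shape::threads threads per block, which can influence register allocation.
template <class Shape>
SKETCH_GLOBAL void SKETCH_LAUNCH_BOUNDS(Shape::threads)
scale_kernel(float *data, std::size_t n, float factor) {
#ifdef __CUDACC__
    // Assumes a 1D launch with Shape::threads threads per block.
    const std::size_t i =
        blockIdx.x * static_cast<std::size_t>(Shape::threads) + threadIdx.x;
    if (i < n)
        data[i] *= factor;
#else
    // Host fallback so the sketch can be exercised without a GPU.
    assert(data != nullptr || n == 0);
    for (std::size_t i = 0; i < n; ++i)
        data[i] *= factor;
#endif
}
```

Because the bound is a template-dependent constant, every thread block size instantiates its own kernel with its own limit. That fits the observation that the performance effect varies by benchmark and domain size.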

Performance changes:

__launch_bounds__ affects performance of the fn_cartesian_vertical_advection benchmark significantly (positively or negatively, depending on domain size).
Performance of fn nabla benchmarks improves significantly on newer CUDA versions.
Performance on Daint is currently reduced because its CUDA version is too old.

gridtoolsjenkins commented 4 months ago

Hi there, this is jenkins continuous integration... Do you want me to verify this patch?

fthaler commented 4 months ago

launch jenkins

havogt commented 4 months ago

launch perftests

havogt commented 4 months ago

launch jenkins

fthaler commented 4 months ago

launch jenkins

fthaler commented 4 months ago

launch jenkins

fthaler commented 4 months ago

launch perftest

fthaler commented 4 months ago

launch jenkins

fthaler commented 4 months ago

launch perftest

fthaler commented 4 months ago

launch perftest

fthaler commented 4 months ago

All tests passed, apart from ault/HIP, which is offline.

fthaler commented 1 month ago

launch perftest

fthaler commented 1 month ago

launch perftest

fthaler commented 1 month ago

launch perftest

fthaler commented 1 month ago

launch perftest

fthaler commented 1 month ago

launch jenkins

fthaler commented 1 month ago

launch perftest

fthaler commented 2 weeks ago

launch perftest

fthaler commented 2 weeks ago

launch jenkins

fthaler commented 2 weeks ago

launch jenkins

fthaler commented 2 weeks ago

launch perftest

GridTools / gridtools

Loop Blocking for fn GPU Backend #1787