DerNils-git opened this issue 2 months ago
We actually have versions of the `ParallelFor` functions that were added with this in mind: https://github.com/AMReX-Codes/amrex/blob/b752027c1aebdfb4be339b1e30932b4108286a7a/Src/Base/AMReX_GpuLaunchFunctsG.H#L239. Here, `Gpu::KernelInfo` is defined at https://github.com/AMReX-Codes/amrex/blob/aba9891ec5edbf4c6fbd23fb072662c3bb33223e/Src/Base/AMReX_GpuKernelInfo.H#L7. We could add a non-explicit constructor to `Gpu::KernelInfo`. Then we could do

```cpp
ParallelFor("MyKernelName", box, [=] ....);
```
But we still need to decide what to do with the string. Note that by default the kernel launch is asynchronous. So if we use a TinyProfiler region inside `ParallelFor`, we will have to set `tiny_profiler.device_synchronize_around_region=true` to make the kernel launch synchronous, but that might hurt performance. Alternatively, we can just call nvtx's and roctx's `RangePush` and `RangePop`; the vendor's profiling tools will then pick up the names.
@DerNils-git Since you mentioned Kokkos: do you know what they do?
If I understand the approach of the above `ParallelFor` correctly, a developer has to decide at implementation time whether a kernel should be profiled or not.
For a performance analysis it would be nice to get information on all kernels by default, rather than depending on the developer's decision whether a kernel should be profiled. I do not necessarily know which kernels are invoked at all during a simulation, so getting information on all invoked kernels at runtime by default would be nice.
I am only a user of the Kokkos profiling tools, so I can only guess a bit. There, whether profiling tools are used is decided at runtime, not at compile time. If I interpret the code correctly, the profiling library has the option to synchronize kernels, but it depends on the particular profiler whether kernels are actually synchronized. That is not a lot of information, but looking closer into the implementation details would be a bit too much overhead for now, sorry.
The most expensive regions of a simulation are (probably) wrapped in an `amrex::ParallelFor`. The overhead of wrapping an `amrex::ParallelFor` in a tiny profiling call is admittedly not very high, but it could simplify quick profiling of AMReX-based codes if a label could be added directly to an `amrex::ParallelFor`.

Enforcing the label would probably require too much effort for existing codes to adapt to the new interface. But `ParallelFor` could be extended by an optional string as the last argument, which would be forwarded to the tiny profiling calls if the label is provided.

Current usage:

Extended usage:
A possible default argument could be an empty label, so that a `ParallelFor` is only profiled if a label is given explicitly. Alternatively, `ParallelFor` regions without a label could still be profiled under a generic name, so that the measurement always provides insight into the time spent in the kernels.