DerNils-git opened this issue 2 months ago
We actually have versions of the `ParallelFor` functions that were added with this in mind: https://github.com/AMReX-Codes/amrex/blob/b752027c1aebdfb4be339b1e30932b4108286a7a/Src/Base/AMReX_GpuLaunchFunctsG.H#L239. Here, `Gpu::KernelInfo` is defined at https://github.com/AMReX-Codes/amrex/blob/aba9891ec5edbf4c6fbd23fb072662c3bb33223e/Src/Base/AMReX_GpuKernelInfo.H#L7. We could add a non-explicit constructor to `Gpu::KernelInfo`. Then we could do

```cpp
ParallelFor("MyKernelName", box, [=] ....);
```
But we still need to decide what to do with the string. Note that by default the kernel launch is asynchronous. So if we use a TinyProfiler region inside `ParallelFor`, we will have to set `tiny_profiler.device_synchronize_around_region=true` to make the kernel launch synchronous, but that might hurt performance. Alternatively, we can just call nvtx's and roctx's `RangePush` and `RangePop`; the vendor's profiling tools will then pick up the names.
@DerNils-git Since you mentioned Kokkos: do you know what they do?
If I understand the approach of the above `ParallelFor` correctly, a developer has to decide at implementation time whether a kernel should be profiled or not.
For a performance analysis it would be nice to get information on all kernels by default, rather than depending on the developer's decision whether a kernel should be profiled. I do not necessarily know which kernels are invoked at all during a simulation, so getting information on all invoked kernels at runtime by default would be nice.
I am only a user of the Kokkos profiling tools, so I can only guess a bit. There, whether profiling tools are used is decided at runtime, not at compile time. If I interpret the code correctly, the profiling library has the option to synchronize kernels, but it depends on the particular profiler whether kernels are actually synchronized. That is not a lot of information, but looking closer into the implementation details would be a bit too much overhead for now, sorry.
The most expensive regions of a simulation are (probably) wrapped in an `amrex::ParallelFor`. The overhead of wrapping an `amrex::ParallelFor` in a tiny profiling call is admittedly not very high, but it could simplify quick profiling of AMReX-based codes if a label could be added directly to an `amrex::ParallelFor`.

Enforcing the label would probably require too much effort for existing codes to adapt to the new interface. But `ParallelFor` could be extended by an optional string as the last argument, which would be forwarded to the tiny profiling calls if the label is provided.

Current usage:

Extended usage:
A possible default argument could be an empty label, so that a `ParallelFor` is only profiled if a label is given explicitly. Alternatively, `ParallelFor` regions without a label could still be profiled under a generic name, so that the measurement always provides insight into the time spent in the kernels.