ecmwf-ifs / fiat

The Fortran IFS and Arpege Toolkit
Apache License 2.0
9 stars 24 forks source link

Add NVTX support via DrHook #23

Open pmarguinaud opened 4 months ago

pmarguinaud commented 4 months ago

This PR enables NVTX instrumentation using DrHook (PGI only). This is enabled with the two following environment variables :

export DR_HOOK=1
export DR_NVTX=1

This code has been provided courtesy of Louis Stuber (NVIDIA) and Lukas Mosimann (NVIDIA).

Example of an ARPEGE nsys profile with NVTX enabled :

nvtx-color

wdeconinck commented 4 months ago

Hi Philippe thanks for pushing this, looks like a great feature!!!

A comment perhaps is to rename the module names and subroutines (C,Fortran) with prefix "my..." to something more unambiguous like "fiat", or better "drhook" ?

Another suggestion is to keep this feature optional even when using the NVIDIA compiler:

ecbuild_add_option( DR_HOOK_NVTX  ... )
# or instead : ecbuild_add_option( NVTX  ... )

The DEFAULT could be ON for this option when NVIDIA compiler is detected if you wish. The nvtx files should then be compiled only when this feature is ON (HAVE_DR_HOOK_NVTX or HAVE_NVTX). Then add the definition:

if( HAVE_DR_HOOK_NVTX OR HAVE_NVTX )
   list( APPEND FIAT_DEFINITIONS DR_HOOK_NVTX )
endif()

And guard the specific code that is currently guarded with __PGI with DR_HOOK_NVTX

A slightly larger refactoring, but useful, however is that drhook is also to be used from C code. This means not to call from dr_hook_util.F90, but from within drhook.c Then we should be enabling this via the environment variable DR_HOOK_OPT=NVTX,...,... which is currently used to enable/disable DR_HOOK options. The Fortran drivers for nvtx are then perhaps not even be required, unless you want to expose this outside of DR_HOOK as well.

@ioanhadade , @marsdeno could you also review please?

pmarguinaud commented 4 months ago

Hello Willem,

I followed your advice and renamed my* to dr_nvtx, then added the DR_NVTX option and I now use the HAVE_DR_NVTX macro everywhere instead of __PGI.

All of this does not work with nvhpc/21.9 (I tried it on our cluster), but it works with nvhpc/24.3; I tried changing that in ./.github/tools/install-nvhpc.sh, but github tests fail with "No space left on device".

What should I do ?

reuterbal commented 4 months ago

I have seen a similar issue with nvhpc in Github actions. For CLOUDSC we use this "hack": https://github.com/ecmwf-ifs/dwarf-p-cloudsc/blob/95125c267e5baed113ef7671c9f346979bd84029/.github/workflows/build.yml#L110-L127

pmarguinaud commented 3 months ago

I followed Balthasar advice, but it did not work. So CI is failing for nvhpc builds and I do not know what to do about it.

wdeconinck commented 3 months ago

I followed Balthasar advice, but it did not work. So CI is failing for nvhpc builds and I do not know what to do about it.

@pmarguinaud I will have a look at making this work. In the mean time, please, could you look into getting nvtx to work via drhook.c ? We could then e.g. use DR_HOOK_OPT=NVTX also for C/C++ programs using DR_HOOK. And also a unit-test would be great, at least to catch catastrophic errors.

wdeconinck commented 3 months ago

Hi @pmarguinaud , in the mean time I have updated the develop branch to use nvhpc-24.3 in github actions. It also sets "NVHPC_ROOT" environment variable which you need for the find_package(NVHPC) to work.

You can now rebase this branch, minus the added GitHub actions changes, on develop.

pmarguinaud commented 3 months ago

I have merged develop and added a test case for NVTX.

Instrumenting C code is not a priority for me, but you can do it yourself if it is important for you.

wdeconinck commented 1 month ago

Hi @Andrew-Beggs-ECMWF discussing offline with @ioanhadade, @ioanhadade suggested for you to have a look at the possibility of integrating this within drhook itself.

Andrew-Beggs-ECMWF commented 3 weeks ago

I've now finished my patch and have it as a branch on my own fork of fiat. Do we want to have it merged into pmarguinaud's branch (which should then make it part of this pr once it's merged) or start a new pr directly to ecmwf-ifs:develop?

The outcome is the same, just a matter of how it's merged : )

wdeconinck commented 2 weeks ago

Let's make a new PR for it please.