RRZE-HPC / likwid

Performance monitoring and benchmarking suite
https://hpc.fau.de/research/tools/likwid/
GNU General Public License v3.0

[FeatureRequest] Add domain for HWthreads closest to GPUs #534

Closed. JanLJL closed this issue 12 hours ago.

JanLJL commented 1 year ago

Is your feature request related to a problem? Please describe.
Often, GPUs are not closest to the NUMA domain a human might assume (e.g., GPU 3 may be closest to NUMA domain 0). Not every user remembers to run likwid-topology first to look up the NUMA domains corresponding to their GPU(s).

Describe the solution you'd like
Add an affinity domain to likwid-pin and likwid-perfctr, e.g., G, for placing HW threads close to a GPU. For example, pinning 10 HW threads closest to GPU 1:

likwid-pin -C G1:0-9 ./run_app

stdweird commented 3 months ago

I am also interested in being able to support CPU pinning in combination with GPU usage. What is the current best practice w.r.t. likwid-pin?

@JanLJL you mentioned likwid-topology, but what is the proper workflow a user should follow? I am also interested in whether likwid-pin supports a hierarchy: if a parent process uses a GPU, make sure its children are also pinned to cores in the same NUMA domain.

A very recent issue we had was people running torchrun with Python code doing dataloading + training, plus separate dataloaders. The dataload+train process is what nvidia-smi reports as using the GPU; the remaining dataloaders are child processes of that process. torchrun does a poor job of pinning correctly, so we are looking for a way to "help" it. likwid-pin would be a good candidate for this, but it is unclear how one would invoke it.

TomTheBear commented 3 months ago

Hello,

Thanks for raising the priority of this feature request.

The current workflow is to run likwid-topology to get the NUMA node the GPU is attached to. Then you use likwid-pin -c Mx:y-z (x = NUMA domain ID, y-z = the range of HW threads to use within that domain).
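A minimal sketch of that workflow, assuming likwid-topology reports the GPU as attached to NUMA domain 1 (the domain ID and thread range are placeholders):

# look up which NUMA domain the GPU is attached to
likwid-topology
# pin 10 HW threads taken from NUMA domain 1 and launch the application
likwid-pin -c M1:0-9 ./run_app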

One big question for this feature request is whether likwid-pin should also enforce that the application runs on the selected GPU(s). I have not found a portable solution for that yet. The CUDA_VISIBLE_DEVICES environment variable is fine on exclusive systems, but inside e.g. shared-node SLURM jobs that each get a GPU, this approach no longer works. Each SLURM job sees CUDA_VISIBLE_DEVICES=0 but, under the hood, they are using different GPUs. My guess is that this is enforced through cgroups, but I haven't found out how yet.
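For the exclusive-node case, a rough sketch of combining the two, again assuming GPU 1 is attached to NUMA domain 1:

# select GPU 1 via the CUDA runtime and pin to HW threads of its NUMA domain
CUDA_VISIBLE_DEVICES=1 likwid-pin -c M1:0-9 ./run_app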

I have never tried likwid-pin with PyTorch. Other difficulties might come up there (e.g., shepherd processes).

Hierarchies are currently not supported, but they are also not needed. likwid-pin works on a single process, so either that process uses a GPU or it does not. Hierarchies would be more interesting for likwid-mpirun, where one MPI process could use a GPU while the others do not. There is currently no way to do that because likwid-mpirun does not yet support what I call the colon syntax: mpirun <global opts> <local opts> <exec> <args1> : <local opts> <exec> <args2> : .... With the colon syntax, hierarchies should be doable.
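For reference, a hypothetical instance of that colon (MPMD) syntax with plain mpirun, using placeholder executable names:

# one rank runs the GPU-enabled binary, three ranks run the CPU-only binary
mpirun -np 1 ./gpu_worker : -np 3 ./cpu_worker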