RRZE-HPC / likwid

Performance monitoring and benchmarking suite
https://hpc.fau.de/research/tools/likwid/
GNU General Public License v3.0

[BUG] LIKWID_GPUMARKER_CLOSE on example instruction #547

Closed: andy295 closed this issue 1 year ago

andy295 commented 1 year ago

I installed LIKWID 5.2.2 on a system with an Nvidia A100 GPU. After the installation I tried some basic commands, for example likwid-topology with different parameters, and everything seems to work correctly.

After that, following the reference page, I tried to check whether I can collect data from the GPU.

I ran the following command: sudo likwid-perfctr -G 0 -W FLOPS_DP -m likwid-nvidia/test/triadCU and got the following error message:

CPU name: AMD EPYC 7742 64-Core Processor
CPU type: AMD K17 (Zen2) architecture
CPU clock: 2.25 GHz
Failed to execute command: likwid-nvidia/test/triadCU
GPU Marker API result file does not exist. This may happen if the application has not called LIKWID_GPUMARKER_CLOSE.

I'm not sure if the problem is related to the installation or if there is another problem.

TomTheBear commented 1 year ago

Thanks for reporting. I just tried it on our Alex cluster (8 A100 per node) and it works.

First, it does not find the executable:

Failed to execute command: likwid-nvidia/test/triadCU

The error is on the wiki page: the folder likwid-nvidia is wrong; use only test/triadCU. It is not mentioned explicitly, but you have to compile triadCU first: make -C test triadCU.

Moreover, sudo commonly empties the LD_LIBRARY_PATH and other environment variables to avoid security issues.

I updated the page (path to triadCU and the make triadCU step).

andy295 commented 1 year ago

I tried to run it without sudo, but I get the following error message:

ERROR - [./src/includes/nvmon_perfworks.h:nvmon_perfworks_addEventSet:1620] Success. Function (*cuptiProfilerGetCounterAvailabilityPtr)(&getCounterAvailabilityParams) failed with error 35
ERROR - [./src/includes/nvmon_perfworks.h:nvmon_perfworks_addEventSet:1620] Success. CUPTI_ERROR_INSUFFICIENT_PRIVILEGES
ERROR - [./src/nvmon.c:nvmon_addEventSet:468] Bad address. Failed to add event set for GPU 0

I read the documentation about the CUPTI_ERROR_INSUFFICIENT_PRIVILEGES issue, and basically NVIDIA suggests using sudo.

I had to compile triadCU manually, from inside the test folder, with the following command:

nvcc -Xcompiler -fopenmp -O3 -I. -I/usr/local/include -L/usr/local/lib -DLIKWID_NVMON triad.cu -o triadCU -lm -llikwid

The make -C test triadCU command fails with the following error message:

/usr/bin/ld: /tmp/tmpxft_00399609_00000000-11_triad.o: in function `main':
tmpxft_00399609_00000000-6_triad.cudafe1.cpp:(.text.startup+0x54): undefined reference to `omp_get_thread_num'

TomTheBear commented 1 year ago

The sudo fix only grants temporary permissions. There are other ways to do it permanently, see Enable access permanently. With sudo, this should work: sudo PATH="$PATH" HOME="$HOME" LD_LIBRARY_PATH="$LD_LIBRARY_PATH" likwid-perfctr ...
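The idea behind that command line is that each variable is handed to sudo explicitly, so it survives sudo's environment reset. A minimal Python sketch that builds such a command from the current environment is shown below. The helper name sudo_command is hypothetical, and whether sudo accepts VAR=value arguments depends on the local sudoers policy, so treat this as an illustration rather than a guaranteed fix.

```python
import os
import shlex


def sudo_command(command, keep=("PATH", "HOME", "LD_LIBRARY_PATH"), env=None):
    """Prefix `command` with sudo and explicit VAR=value assignments.

    Only variables that are actually set in `env` are forwarded.
    """
    env = os.environ if env is None else env
    assignments = [f"{name}={env[name]}" for name in keep if name in env]
    return ["sudo", *assignments, *command]


if __name__ == "__main__":
    # Prints the command instead of running it, so nothing is executed as root.
    cmd = sudo_command(
        ["likwid-perfctr", "-G", "0", "-W", "FLOPS_DP", "-m", "test/triadCU"]
    )
    print(shlex.join(cmd))
```

Printing the command first makes it easy to see which variables would be forwarded before anything runs with elevated privileges.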

The Makefile is missing the -Xcompiler -fopenmp options for the compilation. Not sure why. I changed it in the master branch: https://github.com/RRZE-HPC/likwid/commit/6a2421b6b31d25ae1a2b18a050ded0acbb291675

andy295 commented 1 year ago

I had to change the CUDA version from 11.6 to 11.8, so I removed LIKWID and installed it again with the new path.

If I run the command I obtain the following error message:

CUDA runtime library libcudart.so not found.ERROR - [./src/topology_gpu.c:topology_gpu_init:226] Cannot open CUDA library to fill GPU topology
CUDA runtime library libcudart.so not found.ERROR - [./src/topology_gpu.c:topology_gpu_init:226] Cannot open CUDA library to fill GPU topology
CPU name: AMD EPYC 7742 64-Core Processor
CPU type: AMD K17 (Zen2) architecture
CPU clock: 2.25 GHz
/usr/local/bin/likwid-lua: /usr/local/bin/likwid-perfctr:742: attempt to get length of a nil value (global 'gpulist')
stack traceback:
    /usr/local/bin/likwid-perfctr:742: in main chunk
    [C]: in ?

However, if I look for the libcudart.so file with find, I get this result:

/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudart.so

CUDA_HOME is set to /usr/local/cuda-11.8/ and LD_LIBRARY_PATH is set to /usr/local/cuda-11.8/:/usr/local/cuda-11.8/lib64:/usr/local/cuda-11.1/targets/x86_64-linux/lib

Here is the last part of the LIKWID config.mk file:

CUDAINCLUDE = $(CUDA_HOME)/include
CUPTIINCLUDE = $(CUDA_HOME)/extras/CUPTI/include
BUILDAPPDAEMON=true

TomTheBear commented 1 year ago

Your LD_LIBRARY_PATH does not contain the path to libcudart.so: it lists /usr/local/cuda-11.1/... but it should be /usr/local/cuda-11.8/...
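A quick way to spot this kind of mismatch is to check each LD_LIBRARY_PATH entry for the file. A minimal Python sketch follows; the helper name dirs_containing is hypothetical. It only inspects the entries of the given search path, while the dynamic loader also consults other locations such as the ld.so cache, so an empty result here is a hint and not proof that the load will fail.

```python
import os


def dirs_containing(library, search_path):
    """Return the entries of a colon-separated search path that contain `library`."""
    hits = []
    for entry in search_path.split(":"):
        # Skip empty entries such as the one produced by a doubled colon.
        if entry and os.path.isfile(os.path.join(entry, library)):
            hits.append(entry)
    return hits


if __name__ == "__main__":
    path = os.environ.get("LD_LIBRARY_PATH", "")
    hits = dirs_containing("libcudart.so", path)
    print(hits or "libcudart.so not found in any LD_LIBRARY_PATH entry")
```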

Nvidia sometimes makes breaking changes in minor releases, but that does not seem to be the case for CUDA 11.6 -> 11.8.

TomTheBear commented 1 year ago

I just tested the topology component with CUDA 12.0.1 and 12.1.1 as well and it works.

andy295 commented 1 year ago

That was a mistake for sure, thanks for that. I fixed it with the correct path, now I have

LD_LIBRARY_PATH=/usr/local/cuda-11.8/lib64:/usr/local/cuda-11.8/targets/x86_64-linux/lib

However, I obtain the same error message.

Is there something else I can do to check whether LIKWID is looking in the right place?

TomTheBear commented 1 year ago

LIKWID itself does not perform the library search. It's all done by dlopen. This is the code to load libcudart.so: https://github.com/RRZE-HPC/likwid/blob/master/src/topology_gpu.c#L117 . Unfortunately, I did not add the output of dlerror() there; it might give valuable hints now.
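The loader's own error text can also be seen without patching LIKWID, by attempting the same load from a small script. A minimal Python sketch follows, assuming Linux, where ctypes.CDLL goes through dlopen and puts the loader's error message into the OSError it raises. The helper name try_load is hypothetical and this is not LIKWID's code.

```python
import ctypes


def try_load(library):
    """Try to load `library`; return None on success or the loader's error text."""
    try:
        ctypes.CDLL(library)
    except OSError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    for name in ("libcudart.so", "libcupti.so"):
        error = try_load(name)
        print(f"{name}: {'loaded' if error is None else error}")
```

Running it in the same shell, and with the same sudo setup, as likwid-perfctr keeps the search environment comparable.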

You could probably use something like strace, it should show the list of library opening attempts.

andy295 commented 1 year ago

Ok, I changed the PATHs in this way:

export CUDA_HOME=/usr/local/cuda
export PATH="/usr/local/cuda/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cudau/lib64:$LD_LIBRARY_PATH"

and now that error seems to be gone.

Unfortunately now I have this one here:

CPU name: AMD EPYC 7742 64-Core Processor
CPU type: AMD K17 (Zen2) architecture
CPU clock: 2.25 GHz
ERROR - [./src/nvmon.c:nvmon_init:204] No such file or directory. Cannot create device 0

TomTheBear commented 1 year ago

Please run it with debug output (-V 3) and attach the log file (if possible).

What Nvidia GPU(s) do you have in the system?

andy295 commented 1 year ago

Here is the output of the likwid-perfctr -V 3 -G 0 -W FLOPS_DP -m test/triadCU command: Output.txt

I'm using eight NVIDIA A100 GPUs.

TomTheBear commented 1 year ago

The output says that it cannot find one of the libraries, libcupti.so. Please check where it is located in your CUDA installation and add the path to LD_LIBRARY_PATH. It's commonly ${CUDA_HOME}/extras/CUPTI/lib but it might have changed in recent versions of CUDA.
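Locating the library amounts to searching the CUDA installation tree for files whose names start with libcupti.so. A minimal Python sketch follows; the helper name library_dirs and the fallback root /usr/local/cuda are assumptions. os.walk does not follow symlinked subdirectories by default, so a directory reachable only through a symlink is not reported.

```python
import os


def library_dirs(root, prefix="libcupti.so"):
    """Return the sorted directories under `root` that hold files named `prefix*`."""
    found = set()
    for directory, _subdirs, files in os.walk(root):
        if any(name.startswith(prefix) for name in files):
            found.add(directory)
    return sorted(found)


if __name__ == "__main__":
    # CUDA_HOME is used when set; the fallback path is only an assumption.
    root = os.environ.get("CUDA_HOME", "/usr/local/cuda")
    for directory in library_dirs(root):
        print(directory)
```

Each printed directory is a candidate to add to LD_LIBRARY_PATH.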

andy295 commented 1 year ago

I found it in /usr/local/cuda/extras/CUPTI/lib64 and added that path to LD_LIBRARY_PATH, but nothing has changed.

Here is the final part of the output file:

DEBUG - [nvmon_init:182] Device 0 runs with CUPTI Profiling API backend
DEBUG - [nvmon_perfworks_createDevice:857] link_perfworks_libraries in createDevice
DEBUG - [link_perfworks_libraries:377] LD_LIBRARY_PATH=(null)
DEBUG - [link_perfworks_libraries:378] CUDA_HOME=(null)
DEBUG - [link_perfworks_libraries:468] CUpti library libcupti.so not found
ERROR - [./src/nvmon.c:nvmon_init:204] No such file or directory.

Why are LD_LIBRARY_PATH and CUDA_HOME null?

If I type env, I can see them:

SHELL=/bin/bash
COLORTERM=truecolor
TERM_PROGRAM_VERSION=1.81.0
KRB5CCNAME=FILE:/tmp/krb5cc_115698_hXzggd
XDG_SESSION_TYPE=tty
/bin/6445d93c81ebe42c4cbd7a60712e0b17d9463e97/node
MOTD_SHOWN=pam
LANG=en_US.UTF-8
VSCODE_GIT_ASKPASS_EXTRA_ARGS=
XDG_SESSION_CLASS=user
TERM=xterm-256color
VSCODE_GIT_IPC_HANDLE=/run/user/115698/vscode-git-a05f8bf795.sock
SHLVL=1
XDG_SESSION_ID=96
LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64
XDG_RUNTIME_DIR=/run/user/115698
SSH_CLIENT=10.236.252.103 54388 22
CUDA_HOME=/usr/local/cuda
XDG_DATA_DIRS=/usr/local/share:/usr/share:/var/lib/snapd/desktop
PATH=/usr/local/cuda/bin::/opt/go/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/cuda-11.8/bin:/usr/local/MATLAB/R2022a/bin
DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/115698/bus
TERM_PROGRAM=vscode
VSCODE_IPC_HOOKCLI=/run/user/115698/vscode-ipc-18d4053c-45dd-44fb-9cfc-b15c6cf172ff.sock
_=/usr/bin/env

TomTheBear commented 1 year ago

Are you still using sudo or have you configured permanent permissions?

andy295 commented 1 year ago

Now I have enabled the permanent permission and the command works. Thanks for the help.

One more question: does the ENERGY performance group work with the GPU? And should it work with the same triadCU example?

TomTheBear commented 1 year ago

OK, maybe I should document that the sudo way has its difficulties.

In general, you can use any available events and groups with the same code. That's the benefit of configuring LIKWID "from the outside": instrument and compile once, then measure whatever you want. Unfortunately, there is no ENERGY group for Nvidia GPUs yet. It requires a different library (NVML). There are some code fragments but nothing usable yet.

TomTheBear commented 1 year ago

I added some information here: https://github.com/RRZE-HPC/likwid/wiki/likwid-perfctr#nvidia-gpu-permissions

andy295 commented 1 year ago

I'm probably going off topic, I'm sorry, but I would like to understand whether what I have in mind is possible, otherwise I'm wasting my time.

Basically I'm going to train an RNN model, and by using PyLikwid I would like to measure the power consumption of specific parts of the training.

If you say that the ENERGY group is not available for the Nvidia GPUs, I assume that I cannot do this kind of experiment. Is my assumption correct?

TomTheBear commented 1 year ago

You learn something, so no time wasted.

Unfortunately, you cannot do that with LIKWID at the moment, and it will be challenging in general. As far as I remember, the Nvidia libraries do not provide energy measurements (joules), only power (watts). So you cannot take one sample at the beginning of a training part and one at the end and compute the energy consumed by your kernel from the difference. Please check yourself to be sure.
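Why two power readings are not enough can be sketched in a few lines: power is a rate, so the energy of a region is the integral of power over time, which has to be approximated from many samples taken while the region runs. The Python sketch below applies the trapezoidal rule to (time, power) pairs. The function name energy_joules and the sample values are hypothetical, and how the samples are obtained from the GPU is deliberately left out.

```python
def energy_joules(samples):
    """Estimate energy in joules from (time_s, power_w) samples, sorted by time.

    Uses the trapezoidal rule: each interval contributes its mean power
    multiplied by its duration.
    """
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total


if __name__ == "__main__":
    # Hypothetical readings: 100 W for the first second, rising to 200 W.
    readings = [(0.0, 100.0), (1.0, 100.0), (2.0, 200.0)]
    print(energy_joules(readings))  # prints 250.0
```

The estimate improves with the sampling rate; short kernels that fit between two samples are essentially invisible to this approach.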

TomTheBear commented 1 year ago

Nsight Perf SDK uses CUPTI or Cupti/PerfWorks under the hood. It seems Nsight Perf is more for the graphics pipelines while NVIDIA Nsight Compute is for computing. Since those are vendor tools, they should provide you with all functionality.

No, the Nvidia NVML library with the power readings is not yet supported.