Closed andy295 closed 1 year ago
Thanks for reporting. I just tried it on our Alex cluster (8 A100 per node) and it works.
First, it does not find the executable:
Failed to execute command: likwid-nvidia/test/triadCU
The error is on the wiki page: the folder likwid-nvidia is wrong, use only test/triadCU. It's not mentioned explicitly, but you have to compile triadCU first: make -C test triadCU.
Moreover, sudo commonly empties LD_LIBRARY_PATH and other environment variables to avoid security issues.
I updated the page (path to triadCU and the make triadCU step).
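The environment scrubbing described above can be reproduced without root. This is a minimal sketch, assuming a POSIX shell at /bin/sh. Here env -i stands in for sudo's environment reset: it clears everything, whereas the set of variables sudo drops depends on the sudoers configuration. The library path is only an example value.

```shell
#!/bin/sh
# Example value only; use whatever your CUDA installation needs.
LD_LIBRARY_PATH="/usr/local/cuda/lib64"
export LD_LIBRARY_PATH

# Child started with an empty environment: the variable is gone.
scrubbed=$(env -i /bin/sh -c 'echo "${LD_LIBRARY_PATH:-unset}"')

# Child started with the variable passed explicitly on the command line,
# the same pattern as: sudo LD_LIBRARY_PATH="$LD_LIBRARY_PATH" <command>
passed=$(env -i LD_LIBRARY_PATH="$LD_LIBRARY_PATH" /bin/sh -c 'echo "${LD_LIBRARY_PATH:-unset}"')

echo "scrubbed: $scrubbed"
echo "passed:   $passed"
```

If the first line reports unset while your own shell shows the variable, the child process never saw it, which is what a library lookup inside a sudo-started tool would experience.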
I tried to run without sudo, but I obtain the following error message:
ERROR - [./src/includes/nvmon_perfworks.h:nvmon_perfworks_addEventSet:1620] Success. Function (*cuptiProfilerGetCounterAvailabilityPtr)(&getCounterAvailabilityParams) failed with error 35
ERROR - [./src/includes/nvmon_perfworks.h:nvmon_perfworks_addEventSet:1620] Success. CUPTI_ERROR_INSUFFICIENT_PRIVILEGES
ERROR - [./src/nvmon.c:nvmon_addEventSet:468] Bad address. Failed to add event set for GPU 0
I read the documentation about the CUPTI_ERROR_INSUFFICIENT_PRIVILEGES issue and basically NVIDIA suggests using the sudo command.
I had to compile triadCU manually, from inside the test folder, with the following command:
nvcc -Xcompiler -fopenmp -O3 -I. -I/usr/local/include -L/usr/local/lib -DLIKWID_NVMON triad.cu -o triadCU -lm -llikwid
The make -C test triadCU command returns the following error message:
/usr/bin/ld: /tmp/tmpxft_00399609_00000000-11_triad.o: in function `main':
tmpxft_00399609_00000000-6_triad.cudafe1.cpp:(.text.startup+0x54): undefined reference to `omp_get_thread_num'
The sudo fix only grants temporary permissions. There are other ways to do it permanently, see Enable access permanently. This should work:
sudo PATH="$PATH" HOME="$HOME" LD_LIBRARY_PATH="$LD_LIBRARY_PATH" likwid-perfctr ...
The Makefile is missing the -Xcompiler -fopenmp options for the compilation. Not sure why. I changed it in the master branch: https://github.com/RRZE-HPC/likwid/commit/6a2421b6b31d25ae1a2b18a050ded0acbb291675
I had to change the CUDA version from 11.6 to 11.8, so I removed Likwid and installed it again with the new path.
If I run the command I obtain the following error message:
CUDA runtime library libcudart.so not found.
ERROR - [./src/topology_gpu.c:topology_gpu_init:226] Cannot open CUDA library to fill GPU topology
CUDA runtime library libcudart.so not found.
ERROR - [./src/topology_gpu.c:topology_gpu_init:226] Cannot open CUDA library to fill GPU topology
CPU name: AMD EPYC 7742 64-Core Processor
CPU type: AMD K17 (Zen2) architecture
CPU clock: 2.25 GHz
/usr/local/bin/likwid-lua: /usr/local/bin/likwid-perfctr:742: attempt to get length of a nil value (global 'gpulist')
stack traceback:
/usr/local/bin/likwid-perfctr:742: in main chunk
[C]: in ?
However, if I search for the libcudart.so file with find, I obtain this result:
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudart.so
CUDA_HOME is equal to /usr/local/cuda-11.8/ and LD_LIBRARY_PATH is equal to /usr/local/cuda-11.8/:/usr/local/cuda-11.8/lib64:/usr/local/cuda-11.1/targets/x86_64-linux/lib
Below is the last part of the Likwid config.mk file:
CUDAINCLUDE = $(CUDA_HOME)/include
CUPTIINCLUDE = $(CUDA_HOME)/extras/CUPTI/include
BUILDAPPDAEMON=true
Your LD_LIBRARY_PATH does not contain the path to libcudart.so: there is /usr/local/cuda-11.1/... but it should be /usr/local/cuda-11.8/...
Nvidia sometimes makes breaking changes in minor releases, but that does not seem to be the case for CUDA 11.6 -> 11.8.
I just tested the topology component with CUDA 12.0.1 and 12.1.1 as well and it works.
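The path check described above can be done mechanically. This is a minimal sketch in POSIX shell; find_lib_in_path is a hypothetical helper and not part of LIKWID. It only inspects the entries of the given search path, while the dynamic loader also consults its cache and default directories, so a miss here is a hint rather than proof.

```shell
#!/bin/sh
# Print the first entry of a colon-separated search path that contains the
# given library file; return non-zero if no entry does.
# find_lib_in_path is a hypothetical helper, not part of LIKWID.
find_lib_in_path() {
    lib=$1
    search_path=$2
    old_ifs=$IFS
    IFS=:
    for dir in $search_path; do
        if [ -n "$dir" ] && [ -e "$dir/$lib" ]; then
            IFS=$old_ifs
            echo "$dir/$lib"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}

if ! find_lib_in_path libcudart.so "${LD_LIBRARY_PATH:-}"; then
    echo "libcudart.so is not in any LD_LIBRARY_PATH entry" >&2
fi
```

Run it in the same shell (and with the same privileges) that starts likwid-perfctr, since the variable may differ between environments.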
That was a mistake for sure, thanks for that. I fixed it with the correct path, now I have
LD_LIBRARY_PATH=/usr/local/cuda-11.8/lib64:/usr/local/cuda-11.8/targets/x86_64-linux/lib
However, I obtain the same error message.
Is there something else I can do to check whether Likwid is looking in the right place?
LIKWID itself does not perform the library search. It's all done by dlopen. This is the code to load libcudart.so: https://github.com/RRZE-HPC/likwid/blob/master/src/topology_gpu.c#L117 . Unfortunately, I missed adding the output of dlerror() there; it might give valuable hints now.
You could probably use something like strace, it should show the list of library opening attempts.
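The strace suggestion could look roughly like this. It is a sketch, assuming strace is installed and that the loader opens libraries through the openat system call (as current glibc does); trace_lib_opens and filter_lib_opens are hypothetical helper names.

```shell
#!/bin/sh
# Keep only open attempts that mention the given library name.
filter_lib_opens() {
    grep -E 'open(at)?\(' | grep -F -- "$1"
}

# Hypothetical wrapper: run a command under strace and list its attempts
# to open the library. -f follows child processes.
trace_lib_opens() {
    lib=$1
    shift
    strace -f -e trace=openat "$@" 2>&1 | filter_lib_opens "$lib"
}

# Usage (command taken from this thread):
#   trace_lib_opens libcudart.so likwid-perfctr -G 0 -W FLOPS_DP -m test/triadCU
```

Each printed line shows one directory the loader tried and whether the open succeeded or failed, for example with ENOENT. No matching lines at all would suggest the search never reached a directory containing the library.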
Ok, I changed the PATHs in this way:
export CUDA_HOME=/usr/local/cuda
export PATH="/usr/local/cuda/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda/lib64:$LD_LIBRARY_PATH"
and now that error seems to be gone.
Unfortunately now I have this one here:
CPU name: AMD EPYC 7742 64-Core Processor
CPU type: AMD K17 (Zen2) architecture
CPU clock: 2.25 GHz
ERROR - [./src/nvmon.c:nvmon_init:204] No such file or directory. Cannot create device 0
Please run it with debug output (-V 3) and attach the log file (if possible).
What Nvidia GPU(s) do you have in the system?
Below is the output of the likwid-perfctr -V 3 -G 0 -W FLOPS_DP -m test/triadCU command.
Output.txt
I'm using eight NVIDIA A100 GPUs.
The output says that it cannot find the library libcupti.so. Please check where it is located in your CUDA installation and add the path to LD_LIBRARY_PATH. It's commonly ${CUDA_HOME}/extras/CUPTI/lib but it might have changed in recent versions of CUDA.
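One way to locate the library is to search below the CUDA installation root. This is a sketch in POSIX shell; lib_dirs_under is a hypothetical helper, and the fallback root /usr/local/cuda is an assumption used only when CUDA_HOME is unset.

```shell
#!/bin/sh
# List the directories below a root that contain a file whose name starts
# with the given library name (this also matches versioned file names).
# lib_dirs_under is a hypothetical helper, not part of LIKWID.
lib_dirs_under() {
    find "$1" -name "$2*" 2>/dev/null | while IFS= read -r file; do
        dirname "$file"
    done | sort -u
}

# CUDA_HOME is assumed to point at the CUDA installation root.
lib_dirs_under "${CUDA_HOME:-/usr/local/cuda}" libcupti.so
```

Any directory it prints is a candidate to append to LD_LIBRARY_PATH.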
I found it in /usr/local/cuda/extras/CUPTI/lib64 and added it to LD_LIBRARY_PATH, but nothing has changed.
Here is the final part of the output file:
DEBUG - [nvmon_init:182] Device 0 runs with CUPTI Profiling API backend
DEBUG - [nvmon_perfworks_createDevice:857] link_perfworks_libraries in createDevice
DEBUG - [link_perfworks_libraries:377] LD_LIBRARY_PATH=(null)
DEBUG - [link_perfworks_libraries:378] CUDA_HOME=(null)
DEBUG - [link_perfworks_libraries:468] CUpti library libcupti.so not found
ERROR - [./src/nvmon.c:nvmon_init:204] No such file or directory.
Why are LD_LIBRARY_PATH and CUDA_HOME null?
If I type env I can see them:
SHELL=/bin/bash
COLORTERM=truecolor
TERM_PROGRAM_VERSION=1.81.0
KRB5CCNAME=FILE:/tmp/krb5cc_115698_hXzggd
XDG_SESSION_TYPE=tty
/bin/6445d93c81ebe42c4cbd7a60712e0b17d9463e97/node
MOTD_SHOWN=pam
LANG=en_US.UTF-8
VSCODE_GIT_ASKPASS_EXTRA_ARGS=
XDG_SESSION_CLASS=user
TERM=xterm-256color
VSCODE_GIT_IPC_HANDLE=/run/user/115698/vscode-git-a05f8bf795.sock
SHLVL=1
XDG_SESSION_ID=96
LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64
XDG_RUNTIME_DIR=/run/user/115698
SSH_CLIENT=10.236.252.103 54388 22
CUDA_HOME=/usr/local/cuda
XDG_DATA_DIRS=/usr/local/share:/usr/share:/var/lib/snapd/desktop
PATH=/usr/local/cuda/bin::/opt/go/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/cuda-11.8/bin:/usr/local/MATLAB/R2022a/bin
DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/115698/bus
TERM_PROGRAM=vscode
VSCODE_IPC_HOOKCLI=/run/user/115698/vscode-ipc-18d4053c-45dd-44fb-9cfc-b15c6cf172ff.sock
_=/usr/bin/env
Are you still using sudo or have you configured permanent permissions?
Now I have enabled the permanent permission and the command works. Thanks for the help.
One more question: does the ENERGY performance group work with the GPU? And should it work with the same triadCU example?
OK, maybe I should document that the sudo way has its difficulties.
In general, you can use any event(s) and groups available with the same code. That's the benefit of configuring LIKWID "from the outside": instrument&compile once and measure whatever you want. Unfortunately, there is no ENERGY group for Nvidia GPUs yet. It requires a different library (NVML). There are some code fragments but nothing usable yet.
I added some information here: https://github.com/RRZE-HPC/likwid/wiki/likwid-perfctr#nvidia-gpu-permissions
I'm probably going off topic, I'm sorry, but I would like to understand whether what I have in mind is possible, otherwise I'm wasting my time.
Basically I'm going to train an RNN model, and by using PyLikwid I would like to measure the power consumption of specific parts of the training.
If you say that the ENERGY group is not available for the Nvidia GPUs, I assume that I cannot do this kind of experiment. Is my assumption correct?
You learn something, so no time wasted.
Unfortunately, you cannot do that at the moment with LIKWID, and it will be challenging in general. As far as I remember, you do not get energy measurements (Joules) from the Nvidia libraries, only power (Watt). So you cannot take one sample at the beginning of your training parts and one at the end to calculate the energy consumed by your kernel. Please check yourself to be sure.
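To illustrate the difference: with only power readings, the energy of a code region has to be approximated by sampling power repeatedly during the region and integrating over time. This is a minimal sketch using the trapezoidal rule. integrate_power is a hypothetical helper, the samples are made up, and how the samples are collected is left open. Accuracy depends on how often the power value is sampled.

```shell
#!/bin/sh
# Integrate power samples into energy with the trapezoidal rule.
# Input on stdin: one sample per line, "<time in seconds> <power in watts>".
# Output: energy in joules. integrate_power is a hypothetical helper.
integrate_power() {
    awk '
        NR > 1 { energy += (prev_power + $2) / 2 * ($1 - prev_time) }
               { prev_time = $1; prev_power = $2 }
        END    { printf "%.1f\n", energy }
    '
}

# Three made-up samples: 100 W at t=0 s, then 200 W at t=1 s and t=2 s.
printf '0 100\n1 200\n2 200\n' | integrate_power
```

For the made-up samples this prints 350.0, that is 150 J for the first second plus 200 J for the second one.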
Nsight Perf SDK uses CUPTI or Cupti/PerfWorks under the hood. It seems Nsight Perf is more for the graphics pipelines while NVIDIA Nsight Compute is for computing. Since those are vendor tools, they should provide you with all functionality.
No, the Nvidia NVML library with the power readings is not yet supported.
I installed LIKWID 5.2.2 on a system that has the Nvidia GPU A100. After the installation I tried some basic commands, for example likwid-topology with different parameters, and everything seems to work correctly. After that, following the reference page, I tried to check whether I'm able to collect data from the GPU.
I tried the following command:
sudo likwid-perfctr -G 0 -W FLOPS_DP -m likwid-nvidia/test/triadCU
and I obtained the following error message:
I'm not sure if the problem is related to the installation or if there is another problem.