Closed: MarcelFerrari closed this issue 4 months ago.
@MarcelFerrari,
Here are some steps that could help us identify the problem:
Enable debug output in your script:
PS4='$LINENO: ' # to see line numbers
set -x # enable debugging
set -v # to see actual lines and not just side effects
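If several ranks run on the same node, it can also help to keep each rank's trace separate. A minimal sketch, assuming a SLURM launch where SLURM_PROCID is set (the log file name and this particular mechanism are assumptions, not something from the original script):

```bash
# Send the set -x trace to a per-rank file so traces do not interleave.
exec {trace_fd}> "jobreport_trace_${SLURM_PROCID:-0}.log"
BASH_XTRACEFD=$trace_fd   # bash >= 4.1: xtrace output goes to this fd
PS4='$LINENO: '           # prefix each traced command with its line number
set -xv
```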
Collect debug logs from nv-hostengine:
# Re-run the nv-hostengine process with the following arguments
sudo nv-hostengine -f host.debug.log --log-level debug
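For completeness, a hedged sketch of the full restart sequence, assuming nv-hostengine was started manually (if it runs as a systemd service on your nodes, stop that service instead; the unit name depends on the installation):

```bash
sudo nv-hostengine -t                                    # terminate the running host engine
sudo nv-hostengine -f host.debug.log --log-level debug   # restart it with debug logging
```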
Determine whether a job with failing stats runs on the same GPU or if GPUs differ each time. This can be ascertained from the nv-hostengine debug logs if you gather them for multiple attempts.
It's important to determine whether the same GPU ID could be assigned to different DCGMI groups. These groups are essentially just lists; if that happens, removing statistics through one group will affect all GPUs in that group, regardless of whether you create another group with the same GPU(s). The purpose of the groups is simply to avoid having to list all GPUs/entities every time; they have no specific logic attached to them.
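For illustration only (group IDs 10 and 11 are made up, and the flags are the same ones used later in this thread), this is the kind of overlap to rule out:

```bash
dcgmi group -c job_A               # suppose DCGM assigns group ID 10
dcgmi group -g 10 -a 0,1
dcgmi group -c job_B               # suppose DCGM assigns group ID 11
dcgmi group -g 11 -a 0,1           # the same GPUs, now listed in a second group

dcgmi stats -g 10 -e
dcgmi stats -g 10 -s job_A -u 100
dcgmi stats -g 11 -e
dcgmi stats -g 11 -s job_B -u 100

# Removing job_A's statistics acts on GPUs 0 and 1 themselves, so the data
# still being collected for job_B on those same GPUs can be affected too.
dcgmi stats -r job_A
dcgmi group -d 10
```

Note that this can only happen between jobs talking to the same nv-hostengine instance; groups created on different nodes live in different host engines.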
@nikkon-dev thank you for the quick reply.
I have done some more testing and found the following.
665: dcgmi group -c job_11941_proc_0
70: dcgmi group -g 292 -a 0,1,2,3
665: dcgmi group -c job_11941_proc_1
70: dcgmi group -g 189 -a 0,1,2,3
71: dcgmi stats -g 189 -e
72: dcgmi stats -g 189 -s job_11941_proc_1 -u 100
71: dcgmi stats -g 292 -e
72: dcgmi stats -g 292 -s job_11941_proc_0 -u 100
90: dcgmi stats -v -j job_11941_proc_0
91: dcgmi stats -r job_11941_proc_0
92: dcgmi group -d 292
90: dcgmi stats -v -j job_11941_proc_1
91: dcgmi stats -r job_11941_proc_1
92: dcgmi group -d 189
We can see that both processes create a group named 'job_11941_proc_x' and add GPU IDs 0,1,2,3. These two processes sit on different nodes, so in this case there is no way that any GPU was mistakenly added to a different group.
I'm very curious to hear what you think.
Thank you very much again.
@nikkon-dev there seems to be a problem specifically with the dcgmi utility. I rewrote the program in C++ using the C++ API and it works great. I will close the issue for now; however, the CLI tool is still broken.
Hi all,
I am writing a tiny jobreport script based on what is written here. The idea is to give users of our center the ability to gain some insight into the performance of their GPU workloads with minimal effort.
The script I have right now looks like this:
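(A minimal sketch of what such a wrapper can look like, assuming a SLURM launch; variable names such as JOB_NAME and GPU_IDS, the use of SLURM_JOB_ID/SLURM_PROCID, and the crude group-ID parsing are assumptions on my part. The dcgmi calls themselves mirror the debug trace earlier in this thread.)

```bash
#!/bin/bash
set -euo pipefail

JOB_NAME="job_${SLURM_JOB_ID}_proc_${SLURM_PROCID}"
GPU_IDS="0,1,2,3"          # GPUs visible to this task on the local node

# Create a group for this process and grab the numeric group ID from the
# confirmation message (the ID is the last number dcgmi prints).
GROUP_ID=$(dcgmi group -c "$JOB_NAME" | grep -oE '[0-9]+' | tail -1)
dcgmi group -g "$GROUP_ID" -a "$GPU_IDS"

# Enable statistics and start recording under the job name.
dcgmi stats -g "$GROUP_ID" -e
dcgmi stats -g "$GROUP_ID" -s "$JOB_NAME" -u 100

# Run the actual workload passed on the command line.
"$@"

# Print the collected statistics, then clean up.
dcgmi stats -v -j "$JOB_NAME"
dcgmi stats -r "$JOB_NAME"
dcgmi group -d "$GROUP_ID"
```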
Users should be able to, e.g., simply run
srun -N xx --ntasks-per-node=yy --gpus-per-task=zz ./jobreport.sh my_workload
This script works fine: DCGM is able to create one group per process and assign all the necessary GPUs to it. After the workload is done, the output is also correctly generated.
The problem is that in some situations the statistics are completely wrong. For example, when running with 4 processes per node and 1 GPU per process, some nodes return the correct statistics while others simply return 0.
E.g., after running a simple Python script that performs a matrix multiplication on each GPU, I get:
For processes 0-3 sitting on node 1:
while for processes 4-7 sitting on node 2:
which is clearly wrong.
I should specify that I did check with nvitop and ensured that a process is running on all GPUs of node 2.
Also I am not using MIG devices.
If I run the job with 4 processes on node 2 only, I get the same wrong results. If I decrease the number of processes on node 2 to 1, I get correct results again.
What could be causing this? The same script is running on different nodes but I am getting completely different results.
Is there anything obviously wrong with my approach?
Thanks in advance