NordicHPC / sonar

Tool to profile usage of HPC resources by regularly probing processes using ps.
GNU General Public License v3.0

Generalize parsing of nvidia-smi pmon output #170

Closed lars-t-hansen closed 1 month ago

lars-t-hansen commented 2 months ago

The output format of nvidia-smi pmon has changed between driver versions, and unless we want to fix #87 we must handle multiple formats properly. Two examples:

ml1: NVIDIA System Management Interface -- v545.23.08

$ nvidia-smi pmon -c 1 -s u
# gpu         pid  type    sm    mem    enc    dec    command
# Idx           #   C/G     %      %      %      %    name
    0    1174916     C     88     54      -      -    python         
    0    1186862     C      -      -      -      -    python3        
    1    1174916     C     92     53      -      -    python         
    1    1223470     C      -      -      -      -    python3        
    2    1174916     C     89     53      -      -    python         
    2     941737     C      -      -      -      -    python3        

gpu-13.fox: NVIDIA System Management Interface -- v550.54.14

$ nvidia-smi pmon -c 1 -s u
# gpu         pid   type     sm    mem    enc    dec    jpg    ofa    command 
# Idx           #    C/G      %      %      %      %      %      %    name 
    0          -     -      -      -      -      -      -      -    -              
    1          -     -      -      -      -      -      -      -    -              
    2          -     -      -      -      -      -      -      -    -              
    3          -     -      -      -      -      -      -      -    -              

The sensible thing to do here seems to be to decode the # gpu header line and use its column names as keys into the data rows, rather than relying on fixed column positions. We could also try to detect malformed output and signal problems via the gpufail field.
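To make the idea concrete, here is a minimal sketch of header-keyed parsing. It is written in Python purely for illustration and is not sonar's actual code. The function name parse_pmon and the boolean failed flag (standing in for whatever would be reported through gpufail) are hypothetical. It also assumes that command is always the last column, so any extra whitespace-separated tokens on a row belong to the command name.

```python
def parse_pmon(text):
    """Parse `nvidia-smi pmon` output using the '# gpu ...' header as the key.

    Returns (rows, failed): rows is a list of dicts mapping column name to the
    raw string value, failed is True if any line could not be interpreted.
    Hypothetical helper, not part of sonar.
    """
    columns = None
    rows = []
    failed = False
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("#"):
            header = stripped[1:].split()
            # Only the line starting with "gpu" names the columns; the
            # following "# Idx ..." line holds units and is ignored.
            if header and header[0] == "gpu":
                columns = header
            continue
        if columns is None:
            failed = True  # data before any header: unknown format
            continue
        # Assumption: `command` is the last column, so limit the number of
        # splits and let the final field absorb any embedded spaces.
        fields = stripped.split(None, len(columns) - 1)
        if len(fields) != len(columns):
            failed = True  # too few fields for the declared header
            continue
        record = dict(zip(columns, fields))
        if record.get("pid") == "-":
            continue  # GPU is present but has no process on it
        rows.append(record)
    return rows, failed
```

Because rows are keyed by name, a consumer would ask for record["sm"] or record["mem"] and would not care whether jpg and ofa columns are present. Values are left as strings here, with "-" meaning the value is not available.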