NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs.
Apache License 2.0

Incorrect values reported by dcgm stats #171

Closed: MarcelFerrari closed this issue 4 months ago

MarcelFerrari commented 4 months ago

Hi all,

I am writing a tiny jobreport script based on what is written here. The idea is to give users of our center the ability to gain some insight into the performance of their GPU workloads with minimal effort.

The script I have right now looks like this:

#!/bin/bash
jobreport() {
    # Arguments to jobreport: all arguments passed to the function
    local workload_cmd=("$@")

    verbose=false
    # Enable verbose output if JOBREPORT_VERBOSE is set to any non-empty value
    if [ -n "$JOBREPORT_VERBOSE" ]; then
        verbose=true
    fi

    # Check if 'set -e' is enabled
    # We need to disable errexit to ensure that the script continues
    # to run even if the workload fails
    errexit_was_set=false
    if [[ $- == *e* ]]; then
        errexit_was_set=true

        if $verbose; then
            echo "Disabling errexit"
        fi

        set +e
    fi

    # Get SLURM variables
    job_id=$SLURM_JOB_ID
    proc_id=$SLURM_PROCID

    # Specify the output directory
    # Use JOBREPORT_DIR if it is set, otherwise fall back to job_report_<job_id>
    if [ -n "$JOBREPORT_DIR" ]; then
        mkdir -p "$JOBREPORT_DIR"
        output_dir=$(realpath "$JOBREPORT_DIR")
    else
        mkdir -p "job_report_${job_id}"
        output_dir=$(realpath "job_report_${job_id}")
    fi

    output_file="${output_dir}/job_report_${job_id}_${proc_id}.out"

    # Determine the GPU group name and which GPUs to allocate
    group_name="job_${job_id}_proc_${proc_id}"

    # Get the GPU IDs: use SLURM_STEP_GPUS if available, otherwise fall back
    # to proc_id modulo 4 (i.e. assume 4 GPUs per node)
    if [ -z "$SLURM_STEP_GPUS" ]; then
        gpu_ids=$((proc_id % 4))
    else
        gpu_ids=$SLURM_STEP_GPUS
    fi

    # Echo variables
    if $verbose; then
        echo "job_id: $job_id"
        echo "proc_id: $proc_id"
        echo "group_name: $group_name"
        echo "gpu_ids: $gpu_ids"
    fi

    # Create the group and parse its numeric ID from the dcgmi output
    group=$(dcgmi group -c "$group_name")
    group_create_status=$?
    groupid=$(echo "$group" | awk '{print $10}')

    # START PROFILER
    # Check the exit status of the group creation itself, not of the
    # echo | awk pipeline above
    if [ $group_create_status -eq 0 ]; then
        dcgmi group -g "$groupid" -a "$gpu_ids"
        dcgmi stats -g "$groupid" -e
        dcgmi stats -g "$groupid" -s "$group_name" -u 100
    else
        echo "Failed to create DCGM group or add GPUs to the group" >&2
        # Re-enable errexit if it was set before
        if $errexit_was_set; then
            set -e
        fi
        return 1
    fi

    # Execute the workload command - if it fails, continue to the next command
    "${workload_cmd[@]}"
    local workload_exit_code=$?

    # STOP PROFILER and collect stats
    # dcgmi stats -x $SLURM_JOB_ID
    echo "Process ID: ${proc_id}" > "${output_file}"
    dcgmi stats -x $group_name
    dcgmi stats -v -j $group_name | awk 'NR > 1 { print $0 }' >> "${output_file}"
    dcgmi stats -r $group_name
    dcgmi group -d $groupid

    # Re-enable errexit if it was set before
    if $errexit_was_set; then
        set -e
    fi

    return $workload_exit_code
}

# Call the function with all passed arguments
jobreport "$@"

Users should be able to, e.g., simply run srun -N xx --ntasks-per-node=yy --gpus-per-task=zz ./jobreport.sh my_workload.
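For instance, the two-node layout used in the tests below (4 processes per node, 1 GPU per process) would be launched with something like this, where ./my_workload stands in for the actual workload command:

    srun -N 2 --ntasks-per-node=4 --gpus-per-task=1 ./jobreport.sh ./my_workload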

This script works fine: DCGM is able to create one group per process and assign all the necessary GPUs to it. After the workload is done, the output is also correctly generated.

The problem is that in some situations the statistics are completely wrong. For example when running with 4 processes per node and 1 GPU per process, some nodes will return the correct statistics, while others will simply return 0.

E.g., after running a simple Python script that performs a matrix multiplication on each GPU, I get:

For processes 0-3 sitting on node 1:

Process ID: 1
+------------------------------------------------------------------------------+
| GPU ID: 1                                                                    |
+====================================+=========================================+
|-----  Execution Stats  ------------+-----------------------------------------|
| Start Time                         | Thu May 16 15:19:35 2024                |
| End Time                           | Thu May 16 15:19:56 2024                |
| Total Execution Time (sec)         | 20.48                                   |
| No. of Processes                   | 1                                       |
+-----  Performance Stats  ----------+-----------------------------------------+
| Energy Consumed (Joules)           | 0                                       |
| Power Usage (Watts)                | Avg: 247.541, Max: 247.541, Min: 247... |
| Max GPU Memory Used (bytes)        | 9934209024                              |
| SM Clock (MHz)                     | Avg: 1305, Max: 1305, Min: 1305         |
| Memory Clock (MHz)                 | Avg: 877, Max: 877, Min: 877            |
| SM Utilization (%)                 | Avg: 100, Max: 100, Min: 100            |
| Memory Utilization (%)             | Avg: 83, Max: 83, Min: 83               |
| PCIe Rx Bandwidth (megabytes)      | Avg: N/A, Max: N/A, Min: N/A            |
| PCIe Tx Bandwidth (megabytes)      | Avg: N/A, Max: N/A, Min: N/A            |
+-----  Event Stats  ----------------+-----------------------------------------+
| Single Bit ECC Errors              | 0                                       |
| Double Bit ECC Errors              | 0                                       |
| PCIe Replay Warnings               | 0                                       |
| Critical XID Errors                | 0                                       |
+-----  Slowdown Stats  -------------+-----------------------------------------+
| Due to - Power (%)                 | 0                                       |
|        - Thermal (%)               | 0                                       |
|        - Reliability (%)           | Not Supported                           |
|        - Board Limit (%)           | Not Supported                           |
|        - Low Utilization (%)       | Not Supported                           |
|        - Sync Boost (%)            | 0                                       |
+--  Compute Process Utilization  ---+-----------------------------------------+
| PID                                | 1837868                                 |
|     Avg SM Utilization (%)         | 100                                     |
|     Avg Memory Utilization (%)     | 79                                      |
+-----  Overall Health  -------------+-----------------------------------------+
| Overall Health                     | Healthy                                 |
+------------------------------------+-----------------------------------------+

while for processes 4-7 sitting on node 2:

Process ID: 4
+------------------------------------------------------------------------------+
| GPU ID: 0                                                                    |
+====================================+=========================================+
|-----  Execution Stats  ------------+-----------------------------------------|
| Start Time                         | Thu May 16 15:19:35 2024                |
| End Time                           | Thu May 16 15:19:57 2024                |
| Total Execution Time (sec)         | 21.33                                   |
| No. of Processes                   | 0                                       |
+-----  Performance Stats  ----------+-----------------------------------------+
| Energy Consumed (Joules)           | Not Specified                           |
| Power Usage (Watts)                | Avg: N/A, Max: N/A, Min: N/A            |
| Max GPU Memory Used (bytes)        | 0                                       |
| SM Clock (MHz)                     | Avg: 0, Max: 0, Min: 0                  |
| Memory Clock (MHz)                 | Avg: 0, Max: 0, Min: 0                  |
| SM Utilization (%)                 | Avg: 0, Max: 0, Min: 0                  |
| Memory Utilization (%)             | Avg: 0, Max: 0, Min: 0                  |
| PCIe Rx Bandwidth (megabytes)      | Avg: 0, Max: 0, Min: 0                  |
| PCIe Tx Bandwidth (megabytes)      | Avg: 0, Max: 0, Min: 0                  |
+-----  Event Stats  ----------------+-----------------------------------------+
| Single Bit ECC Errors              | 0                                       |
| Double Bit ECC Errors              | Not Specified                           |
| PCIe Replay Warnings               | Not Specified                           |
| Critical XID Errors                | 0                                       |
+-----  Slowdown Stats  -------------+-----------------------------------------+
| Due to - Power (%)                 | Not Supported                           |
|        - Thermal (%)               | Not Supported                           |
|        - Reliability (%)           | Not Supported                           |
|        - Board Limit (%)           | Not Supported                           |
|        - Low Utilization (%)       | Not Supported                           |
|        - Sync Boost (%)            | Not Specified                           |
+-----  Overall Health  -------------+-----------------------------------------+
| Overall Health                     | Healthy                                 |
+------------------------------------+-----------------------------------------+

which is clearly wrong.

I should mention that I did check with nvitop and confirmed that a process was running on every GPU of node 2.

Also, I am not using MIG devices.

If I run the job using 4 procs only on node 2, I get the same wrong results. If I decrease the number of procs on node 2 to 1, I get correct results again.

What could be causing this? The same script is running on different nodes but I am getting completely different results.

Is there anything obviously wrong with my approach?

Thanks in advance

nikkon-dev commented 4 months ago

@MarcelFerrari,

These are some steps that could help us to identify the problem:

  1. Enable debug output in your script

    PS4='$LINENO: ' # to see line numbers
    set -x # enable debugging
    set -v # to see actual lines and not just side effects
  2. Collect debug logs from nv-hostengine:

    # Re-run the nv-hostengine process with the following arguments
    > sudo nv-hostengine -f host.debug.log --log-level debug
  3. Determine whether a job with failing stats runs on the same GPU or if GPUs differ each time. This can be ascertained from the nv-hostengine debug logs if you gather them for multiple attempts.

  4. It's important to determine if the same GPU ID could potentially be assigned to different DCGMI groups. These groups are essentially just lists, and if this situation occurs, removing statistics from one group will affect all GPUs in that group, regardless of whether you create another group with the same GPU(s). The purpose of these groups is to avoid the need to list all GPUs/entities every time, and they do not have any specific logic attached to them.
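As a quick way to check for such overlap, you could list the groups that nv-hostengine currently knows about and look for GPU IDs that appear in more than one of them. This is only a sketch, assuming dcgmi group -l behaves on your DCGM version as it does here; <GROUP_ID> is a placeholder:

    # List all groups and the entities they contain; look for a GPU ID
    # that shows up in more than one group
    dcgmi group -l

    # Delete any stale groups left over from earlier tests
    dcgmi group -d <GROUP_ID>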

MarcelFerrari commented 4 months ago

@nikkon-dev thank you for the quick reply.

I have done some more testing and found the following.

  1. The GPUs that are failing are always the same. I know this because I am testing the script on a cluster with only 2 GPU nodes, which I allocate manually.
  2. I tried running the same job report script today and now it fails on all nodes instead of only on node 2. The only thing I did in between was remove some lingering GPU groups from old manual tests. It also fails for every combination of processes and GPUs per process that I tried.
  3. Using the DCGMReader class from the Python API, I am able to record metrics without any problem, so metric collection itself seems to work fine.
  4. I have collected some logs as you suggested. Here are the relevant dcgmi calls for an example of a job running 1 process per node with 4 GPUs per process:
    665: dcgmi group -c job_11941_proc_0
    70: dcgmi group -g 292 -a 0,1,2,3
    665: dcgmi group -c job_11941_proc_1
    70: dcgmi group -g 189 -a 0,1,2,3
    71: dcgmi stats -g 189 -e
    72: dcgmi stats -g 189 -s job_11941_proc_1 -u 100
    71: dcgmi stats -g 292 -e
    72: dcgmi stats -g 292 -s job_11941_proc_0 -u 100
    90: dcgmi stats -v -j job_11941_proc_0
    91: dcgmi stats -r job_11941_proc_0
    92: dcgmi group -d 292
    90: dcgmi stats -v -j job_11941_proc_1
    91: dcgmi stats -r job_11941_proc_1
    92: dcgmi group -d 189

    We can see that both processes create a group named 'job_11941_proc_x' and add GPU IDs 0,1,2,3. These two processes sit on different nodes, so in this case there is no way that any GPU is mistakenly added to two different groups. (A stripped-down, by-hand version of this sequence is included below.)
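For completeness, the per-process sequence reduces to the commands below, so it should be reproducible by hand on a single node without Slurm. This is just a stripped-down version of the calls above; test_group and test_job are placeholder names, and <GROUP_ID> is the ID printed by the group creation:

    dcgmi group -c test_group                      # prints the new group ID
    dcgmi group -g <GROUP_ID> -a 0                 # add GPU 0 to the group
    dcgmi stats -g <GROUP_ID> -e                   # enable stats recording for the group
    dcgmi stats -g <GROUP_ID> -s test_job -u 100   # start job stats collection
    # ... run a short GPU workload here ...
    dcgmi stats -x test_job                        # stop job stats collection
    dcgmi stats -v -j test_job                     # print the verbose job report
    dcgmi stats -r test_job                        # remove the job stats
    dcgmi group -d <GROUP_ID>                      # delete the group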

I'm very curious to hear what you think.

Thank you very much again

MarcelFerrari commented 4 months ago

@nikkon-dev there seems to be a problem specifically with the dcgmi utility. I rewrote the program in C++ using the C++ API and it works great. I will close the issue for now; however, the CLI tool still appears to be broken.