ROCm / rocm_smi_lib

ROCm SMI LIB
https://rocm.docs.amd.com/projects/rocm_smi_lib/en/latest/
MIT License

How weights and hops are calculated #108

Closed vitduck closed 8 months ago

vitduck commented 2 years ago

Hi,

Considering the following output:

======================= ROCm System Management Interface =======================
=========================== Weight between two GPUs ============================
       GPU0         GPU1         GPU2         GPU3         
GPU0   0            52           52           52           
GPU1   52           0            52           52           
GPU2   52           52           0            52           
GPU3   52           52           52           0            

============================ Hops between two GPUs =============================
       GPU0         GPU1         GPU2         GPU3         
GPU0   0            3            3            3            
GPU1   3            0            3            3            
GPU2   3            3            0            3            
GPU3   3            3            3            0            

========================== Link Type between two GPUs ==========================
       GPU0         GPU1         GPU2         GPU3         
GPU0   0            PCIE         PCIE         PCIE         
GPU1   PCIE         0            PCIE         PCIE         
GPU2   PCIE         PCIE         0            PCIE         
GPU3   PCIE         PCIE         PCIE         0            

================================== Numa Nodes ==================================
GPU[0]      : (Topology) Numa Node: 0
GPU[0]      : (Topology) Numa Affinity: 0
GPU[1]      : (Topology) Numa Node: 1
GPU[1]      : (Topology) Numa Affinity: 1
GPU[2]      : (Topology) Numa Node: 3
GPU[2]      : (Topology) Numa Affinity: 3
GPU[3]      : (Topology) Numa Node: 2
GPU[3]      : (Topology) Numa Affinity: 2
============================= End of ROCm SMI Log ==============================

I could not find any documentation explaining how hops and weights are calculated between AMD GPUs. From the source code, they appear to be sums of values assigned intrinsically to specific hardware.

In the case of NVIDIA, there is a clear hierarchy of topology connection types: NVLINK -> PIX -> PXB -> PHB -> NODE -> SYS -> X. So I can deduce the spatial relationship between GPUs from nvidia-smi.

With rocm-smi, it is not immediately clear to me how to interpret the weights and hops shown above. Any clarification would be much appreciated.

charis-poag-amd commented 8 months ago

Topology hops for XGMI/PCIE follow the path GPU -> CPU(0->N) -> GPU(0->N). The hardware verification section of the MI200 tuning guide has a good overview: https://rocm.docs.amd.com/en/latest/how_to/tuning_guides/mi200.html#hardware-verification-with-rocm

The first block of the output shows the distance between the GPUs similar to what the numactl command outputs for the NUMA domains of a system. The weight is a qualitative measure for the “distance” data must travel to reach one GPU from another one. While the values do not carry a special (physical) meaning, the higher the value the more hops are needed to reach the destination from the source GPU.

The second block is a matrix named “Hops between two GPUs”. A value of 1 means the two GPUs are directly connected with XGMI; 2 means both GPUs are linked to the same CPU socket, so GPU communication goes through the CPU; and 3 means the GPUs are linked to different CPU sockets, so communication goes through both CPU sockets. In the linked guide's example this number is 1 for all GPUs, since they are all connected to each other through Infinity Fabric links.
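As a rough illustration of reading the hops matrix with the definitions above, here is a minimal Python sketch. It parses a matrix copied from the output in the question and labels each GPU pair. The names `HOP_MEANING`, `parse_matrix`, and `describe` are hypothetical helpers, not part of rocm_smi_lib, and the mapping only restates the guide's wording; it is not derived from how KFD computes the values.

```python
# Hop-count meanings, restating the tuning guide's description.
# This table is an illustration, not something exported by rocm_smi_lib.
HOP_MEANING = {
    0: "same GPU",
    1: "directly connected via XGMI",
    2: "same CPU socket; traffic goes through the CPU",
    3: "different CPU sockets; traffic goes through both sockets",
}


def parse_matrix(block):
    """Parse one rocm-smi style matrix: a header row, then one row per GPU."""
    lines = [ln.split() for ln in block.strip().splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    matrix = {}
    for row in rows:
        src, values = row[0], row[1:]
        for dst, value in zip(header, values):
            matrix[(src, dst)] = int(value)
    return matrix


def describe(hops):
    """Return one human-readable line per unordered GPU pair."""
    out = []
    for (src, dst), n in sorted(hops.items()):
        if src < dst:  # the matrix is symmetric, so report each pair once
            meaning = HOP_MEANING.get(n, "unknown")
            out.append(f"{src} -> {dst}: {n} hops ({meaning})")
    return out


# Hops matrix copied from the output in the question.
SAMPLE_HOPS = """
       GPU0         GPU1         GPU2         GPU3
GPU0   0            3            3            3
GPU1   3            0            3            3
GPU2   3            3            0            3
GPU3   3            3            3            0
"""

if __name__ == "__main__":
    for line in describe(parse_matrix(SAMPLE_HOPS)):
        print(line)
```

Under the guide's definitions, every pair in the question's output falls into the 3-hop category, which is consistent with the PCIE link type and the distinct NUMA nodes reported for each GPU.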

You could also dig into how KFD creates these weights, but it's essentially what the documentation states.

Hope this helps!