Closed vitduck closed 8 months ago
Topology hops (XGMI/PCIe): GPU -> CPU(0->N) -> GPU(0->N). https://rocm.docs.amd.com/en/latest/how_to/tuning_guides/mi200.html#hardware-verification-with-rocm has a good overview.
The first block of the output shows the distance between the GPUs, similar to what the numactl command outputs for the NUMA domains of a system. The weight is a qualitative measure of the "distance" data must travel to reach one GPU from another. While the values carry no special (physical) meaning, the higher the value, the more hops are needed to reach the destination from the source GPU.
The second block is a matrix named "Hops between two GPUs": 1 means the two GPUs are directly connected via XGMI; 2 means both GPUs are attached to the same CPU socket, so GPU communication goes through that CPU; 3 means the GPUs are attached to different CPU sockets, so communication crosses both sockets. In this case the number is 1 for every pair, since all GPUs are connected to each other through Infinity Fabric links.
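To make the hop-count semantics concrete, here is a minimal sketch that maps a hop count to the kind of path it implies, following the rules above. The function name and structure are my own for illustration; this is not part of any ROCm API.

```python
# Illustrative sketch: interpret the "Hops between two GPUs" matrix.
# The mapping follows the explanation above; names are hypothetical.

def link_type(hops: int) -> str:
    """Map a hop count between two GPUs to the kind of path it implies."""
    if hops == 1:
        return "XGMI (direct Infinity Fabric link)"
    if hops == 2:
        return "PCIe through a shared CPU socket"
    if hops == 3:
        return "PCIe across two CPU sockets"
    return "unknown"

# Example: an 8-GPU node where every GPU pair is directly connected,
# as in the fully connected Infinity Fabric case described above.
hops_matrix = [[0 if i == j else 1 for j in range(8)] for i in range(8)]

for hops in sorted({h for row in hops_matrix for h in row if h}):
    print(hops, "->", link_type(hops))
```

On a fully connected node like the one discussed here, the only nonzero value in the matrix is 1, so every pair resolves to a direct XGMI link.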
You could also dig into how KFD creates these weights, but it's essentially what the docs state.
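As a hedged sketch of what "KFD creates these weights" means: the kernel assigns a fixed per-link weight and the reported weight of a path is the sum of the links it crosses. The constant names below mirror `KFD_CRAT_INTRA_SOCKET_WEIGHT` and `KFD_CRAT_XGMI_WEIGHT` from the kernel's `kfd_crat.h`; verify the values against your kernel tree, and note the function itself is illustrative, not actual kernel code.

```python
# Hedged sketch of how KFD-style weights accumulate along a path.
# Constant values taken from the kernel's kfd_crat.h at the time of
# writing -- check your own kernel source before relying on them.

KFD_CRAT_INTRA_SOCKET_WEIGHT = 13  # CPU<->GPU link within one socket
KFD_CRAT_XGMI_WEIGHT = 15          # direct GPU<->GPU XGMI link

def path_weight(link_weights):
    """Total weight of a path is the sum of its per-link weights."""
    return sum(link_weights)

# A direct XGMI hop between two GPUs:
print(path_weight([KFD_CRAT_XGMI_WEIGHT]))      # 15
# Two XGMI hops (GPU -> GPU -> GPU):
print(path_weight([KFD_CRAT_XGMI_WEIGHT] * 2))  # 30
```

This is consistent with the observation in the question that the numbers look like sums of intrinsically assigned per-hardware values.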
Hope this helps!
Hi,
Considering the following output:
I could not find any documentation explaining how hops and weights are calculated between AMD GPUs. From the source code, it seems that they are sums of values intrinsically assigned to specific hardware.
In the case of NVIDIA, there is a clear hierarchy of topology connections: NVLINK -> PIX -> PXB -> PHB -> NODE -> SYS -> X, so I can deduce the spatial relationship between GPUs from `nvidia-smi`. From `rocm-smi`, it is not immediately clear to me how to interpret the aforementioned weights and hops. Any clarification would be much appreciated.
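For reference, the NVIDIA labels encode an ordering from most to least local, which is the mental model I would like to recover for the AMD output. The sketch below is my own illustration; only the relative order of the labels comes from `nvidia-smi`, the numeric ranks are arbitrary.

```python
# Illustrative ranking of nvidia-smi topology labels, most local first.
# The list order reflects nvidia-smi's legend; the helper is hypothetical.
NV_LOCALITY = ["NVLINK", "PIX", "PXB", "PHB", "NODE", "SYS"]

def closer(a: str, b: str) -> str:
    """Return whichever of two nvidia-smi labels implies the shorter path."""
    return a if NV_LOCALITY.index(a) <= NV_LOCALITY.index(b) else b

print(closer("PIX", "SYS"))  # PIX
```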