ROCm / rocm_smi_lib

ROCm SMI LIB
https://rocm.docs.amd.com/projects/rocm_smi_lib/en/latest/
MIT License

Mapping between HIP device ID and rocm_smi #122

Closed al42and closed 7 months ago

al42and commented 1 year ago

I have a HIP app that uses hipSetDevice and related APIs. It might be run with ROCR_VISIBLE_DEVICES or HIP_VISIBLE_DEVICES set.

For a given HIP device, I want to query some info using rocm_smi_lib, e.g., rsmi_topo_get_numa_node_number.

What is the recommended way to map between a HIP device and a ROCm SMI device index?

Manually looping over results of rsmi_dev_pci_id_get for all devices and comparing with hipDeviceProp_t::pciBusID and friends seems like a possible solution, but I wonder if there's an easier / official way.

charis-poag-amd commented 7 months ago

> What is the recommended way to map between a HIP device and a ROCm SMI device index?
>
> Manually looping over results of rsmi_dev_pci_id_get for all devices and comparing with hipDeviceProp_t::pciBusID and friends seems like a possible solution, but I wonder if there's an easier / official way.

Comparing PCI IDs is a decent way to do it. RVS (ROCm Validation Suite) does something similar, except that it uses both hipDeviceProp_t::pciBusID and hipDeviceProp_t::pciDeviceID:

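  // Get the GPU device properties for the given HIP device index.
  hipDeviceProp_t props;
  hipGetDeviceProperties(&props, hip_index);
  // Pack the PCI bus (bits 15:8) and device (bits 7:3) into a 16-bit
  // location ID; the function bits (2:0) are left as zero.
  uint16_t hip_dev_location_id =
    ((((uint16_t) (props.pciBusID)) << 8) | (((uint16_t)(props.pciDeviceID)) << 3));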

See this patch in RVS ROCm 6.0 for an example. But you're right, hipDeviceProp_t::pciBusID is a great place to start. RVS just wants to validate that both sides refer to the same full PCIe BDF (Bus, Device, Function). Just don't forget the device part and, in some cases, the function part too, depending on which part of the physical device you are looking at.

The full PCIe address has the form BUS ID:DEVICE ID.FUNCTION.
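As a rough sketch of the comparison discussed above, the helpers below pack a bus/device/function triple into the same kind of 16-bit location ID that the RVS snippet builds, and then search a list of 64-bit BDF IDs for a match. The names pack_location_id, location_of_bdfid, domain_of_bdfid and find_smi_index are hypothetical and not part of HIP or ROCm SMI. The layout assumed for the 64-bit value (domain in the upper 32 bits, bus in bits 15:8, device in bits 7:3, function in bits 2:0) is an assumption that should be checked against the rsmi_dev_pci_id_get documentation for your ROCm version, since other bits may carry additional information. In a real program the list would be filled by calling rsmi_dev_pci_id_get for each SMI index, and the domain, bus and device would come from hipDeviceProp_t (pciDomainID, pciBusID, pciDeviceID).

```cpp
#include <cstddef>
#include <cstdint>

// Pack bus/device/function into a 16-bit location ID:
// bus in bits 15:8, device in bits 7:3, function in bits 2:0.
constexpr std::uint16_t pack_location_id(unsigned bus, unsigned device,
                                         unsigned function = 0) {
  return static_cast<std::uint16_t>(((bus & 0xffu) << 8) |
                                    ((device & 0x1fu) << 3) |
                                    (function & 0x7u));
}

// Low 16 bits of a 64-bit BDF ID (ASSUMED to hold bus/device/function).
constexpr std::uint16_t location_of_bdfid(std::uint64_t bdfid) {
  return static_cast<std::uint16_t>(bdfid & 0xffffu);
}

// Upper 32 bits of a 64-bit BDF ID (ASSUMED to hold the PCI domain).
constexpr std::uint32_t domain_of_bdfid(std::uint64_t bdfid) {
  return static_cast<std::uint32_t>(bdfid >> 32);
}

// Return the index of the first entry whose domain and location match,
// or -1 if there is none. The position in `bdfids` is meant to correspond
// to the ROCm SMI device index the value was queried for.
constexpr int find_smi_index(const std::uint64_t* bdfids, std::size_t count,
                             std::uint32_t domain, std::uint16_t location) {
  for (std::size_t i = 0; i < count; ++i) {
    if (domain_of_bdfid(bdfids[i]) == domain &&
        location_of_bdfid(bdfids[i]) == location) {
      return static_cast<int>(i);
    }
  }
  return -1;
}
```

With these helpers, the lookup for one HIP device would be roughly find_smi_index(ids, n, props.pciDomainID, pack_location_id(props.pciBusID, props.pciDeviceID)). Comparing the domain as well as the bus and device avoids false matches on systems with more than one PCI domain.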

You can double-check this on Linux by running readlink -f /sys/class/drm/card*/device/ or by using lspci.
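To compare those command-line results with values obtained in code, it can help to parse the textual address. The sketch below parses a string of the form DOMAIN:BUS:DEVICE.FUNCTION with hexadecimal fields (for example 0000:03:00.0), which is the form of the last path component printed by the readlink command and of the addresses shown by lspci -D. The PciAddress struct and the parse_pci_address function are hypothetical names used only for illustration; extracting the last path component from the readlink output is left to the caller.

```cpp
#include <cstdint>

// Parsed PCI address; `valid` is false if the input was malformed.
struct PciAddress {
  std::uint32_t domain;
  std::uint8_t bus;
  std::uint8_t device;
  std::uint8_t function;
  bool valid;
};

// Value of a single hexadecimal digit, or -1 if `c` is not a hex digit.
constexpr int hex_value(char c) {
  return (c >= '0' && c <= '9')   ? c - '0'
         : (c >= 'a' && c <= 'f') ? c - 'a' + 10
         : (c >= 'A' && c <= 'F') ? c - 'A' + 10
                                  : -1;
}

// Parse "DOMAIN:BUS:DEVICE.FUNCTION" where every field is hexadecimal.
constexpr PciAddress parse_pci_address(const char* s) {
  const PciAddress invalid{0, 0, 0, 0, false};
  std::uint32_t fields[4] = {0, 0, 0, 0};
  const char separators[3] = {':', ':', '.'};
  int field = 0;   // index of the field currently being read
  int digits = 0;  // number of digits read for the current field
  for (; *s != '\0'; ++s) {
    const int v = hex_value(*s);
    if (v >= 0) {
      if (++digits > 8) return invalid;  // would not fit in 32 bits
      fields[field] = fields[field] * 16 + static_cast<std::uint32_t>(v);
    } else if (field < 3 && digits > 0 && *s == separators[field]) {
      ++field;
      digits = 0;
    } else {
      return invalid;
    }
  }
  if (field != 3 || digits == 0) return invalid;
  // Bus is 8 bits, device is 5 bits, function is 3 bits.
  if (fields[1] > 0xffu || fields[2] > 0x1fu || fields[3] > 0x7u) return invalid;
  return PciAddress{fields[0], static_cast<std::uint8_t>(fields[1]),
                    static_cast<std::uint8_t>(fields[2]),
                    static_cast<std::uint8_t>(fields[3]), true};
}
```

The parsed bus and device can then be compared with hipDeviceProp_t::pciBusID and hipDeviceProp_t::pciDeviceID for the HIP device in question.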

Hope this helps.