ROCm / rocminfo

ROCm Application for Reporting System Info

Update rocm_agent_enumerator to better handle numerous parallel usages #47

Open jlgreathouse opened 3 years ago

jlgreathouse commented 3 years ago

rocm_agent_enumerator currently calls rocminfo to find what gfx architectures are available on the current system. This is used by, for instance, compilers that want to query what to natively build for if they are not provided with a gfxarch target.

However, rocminfo is a very heavyweight way to get the gfxarch. It queries a large amount of HSA topology information and opens /dev/kfd for various queries. This can make builds slow, since each call made just to get the gfxarch pays for the full query.

In addition, it's possible to do a large number of parallel builds (e.g. make -j, even when targeting the number of processors on large server systems). /dev/kfd has a limited number of concurrent users, meaning that it can quickly exhaust its resources. This can lead to incorrect compilations, because no gfxarch would be returned from rocminfo.

rocm_agent_enumerator is supposed to have a fallback path when rocminfo finds no GPUs. It uses lspci to find AMD GPU device IDs, then looks them up in a hard-coded table. However, this table is woefully out of date, and the call to lspci is broken anyway. So rocm_agent_enumerator would simply fail to return a gfxarch if rocminfo failed to return that gfxarch.
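To make the lspci path concrete, here is a minimal sketch of that kind of lookup. It is not the code from this patchset: the function names are hypothetical, the two table entries are an illustrative subset that should be checked against a real PCI ID list, and it assumes `lspci -n` prints lines of the form `<slot> <class>: <vendor>:<device>`.

```python
import subprocess

AMD_VENDOR_ID = "1002"

# Illustrative subset only (assumed mappings); a real table needs an
# entry for every supported device and must be kept up to date.
PCI_ID_TO_GFX = {
    "687f": "gfx900",  # Vega 10
    "66af": "gfx906",  # Vega 20
}


def parse_lspci(output, table=PCI_ID_TO_GFX):
    """Map `lspci -n` output lines to gfx names, skipping unknown devices."""
    targets = []
    for line in output.splitlines():
        fields = line.split()
        # Expected shape: "<slot> <class>: <vendor>:<device> [(rev xx)]"
        if len(fields) < 3 or ":" not in fields[2]:
            continue
        vendor, _, device = fields[2].partition(":")
        if vendor.lower() != AMD_VENDOR_ID:
            continue
        gfx = table.get(device.lower())
        if gfx is not None:
            targets.append(gfx)
    return targets


def read_from_lspci():
    """Run lspci for AMD devices; return [] if the tool is missing or fails."""
    try:
        result = subprocess.run(
            ["lspci", "-n", "-d", AMD_VENDOR_ID + ":"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return []
    return parse_lspci(result.stdout)
```

Keeping the parser separate from the subprocess call means the table lookup can be tested on captured output, without lspci or a GPU present. An unknown device ID yields no entry, which is the case where a caller would need another source.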

This patchset:

  1. Fixes the woefully out-of-date PCI ID table. It also fixes the call to lspci so that it actually works.
  2. Adds lspci to the dependency list so that we don't end up shipping Docker containers that don't include proper tools.
  3. Switches the order of device queries so that we call the lower-weight lspci first, and only fall back to the heavyweight rocminfo if our PCI ID list falls out of date.

fxkamd commented 3 years ago

On the most recent kernels we expose the Target version in the sysfs topology to eliminate the lookup in user mode. rocm_agent_enumerator could use that as the first choice when it's available. Look for gfx_target_version in /sys/class/kfd/kfd/topology/nodes/*/properties.
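A rough sketch of reading that property might look like the following. The function names are hypothetical, and two details are assumptions rather than anything stated in this thread: that the value is encoded in decimal as major * 10000 + minor * 100 + stepping, and that nodes without a GPU report 0.

```python
import glob
import re

KFD_NODES_GLOB = "/sys/class/kfd/kfd/topology/nodes/*/properties"
_TARGET_RE = re.compile(r"^gfx_target_version\s+(\d+)\s*$", re.MULTILINE)


def gfx_name_from_target_version(version):
    """Convert e.g. 90010 to 'gfx90a' (assumed decimal major/minor/stepping)."""
    major = version // 10000
    minor = (version // 100) % 100
    stepping = version % 100
    # Major is printed in decimal; minor and stepping in hexadecimal.
    return "gfx{:d}{:x}{:x}".format(major, minor, stepping)


def parse_properties(text):
    """Return the gfx name from one properties file, or None if absent."""
    match = _TARGET_RE.search(text)
    if match is None:
        return None  # older kernel: property not exposed
    version = int(match.group(1))
    if version == 0:
        return None  # assumed to mean a node without a GPU
    return gfx_name_from_target_version(version)


def read_from_kfd(pattern=KFD_NODES_GLOB):
    """Collect gfx names from every readable topology node."""
    targets = []
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as f:
                name = parse_properties(f.read())
        except OSError:
            continue
        if name is not None:
            targets.append(name)
    return targets
```

An empty list here means either no GPUs or a kernel that does not expose the property, so a caller would still want another method to fall back on.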

jlgreathouse commented 3 years ago

> On the most recent kernels we expose the Target version in the sysfs topology to eliminate the lookup in user mode. rocm_agent_enumerator could use that as the first choice when it's available. Look for gfx_target_version in /sys/class/kfd/kfd/topology/nodes/*/properties.

Glad you caught this -- I didn't know this had made it into KFD. None of my test systems had it when I wrote the other patches.

I just pushed a further patch to this PR that uses the KFD topology as the preferred method for finding the gfx arch. If that fails, it falls back to lspci, and then to rocminfo.
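The ordering described above (KFD topology, then lspci, then rocminfo) can be sketched as a chain of probes tried from cheapest to most expensive. This is only an illustration of the pattern, not the patch itself: `first_nonempty` and the three stand-in probes are hypothetical names, and the probes return canned values instead of touching the system.

```python
def first_nonempty(probes):
    """Run probes in order and return the first non-empty result."""
    for probe in probes:
        try:
            targets = probe()
        except Exception:
            # A failing probe should not abort the whole query.
            targets = []
        if targets:
            return targets
    return []


# Stand-in probes that record the order in which they are called.
calls = []


def from_kfd():
    calls.append("kfd")
    return []  # pretend the kernel does not expose gfx_target_version


def from_lspci():
    calls.append("lspci")
    return ["gfx906"]


def from_rocminfo():
    calls.append("rocminfo")
    return ["gfx906"]


targets = first_nonempty([from_kfd, from_lspci, from_rocminfo])
```

Because the loop returns on the first non-empty result, the heavyweight probe is never invoked when a cheaper one succeeds. In this run only the first two stand-ins are called.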