Open jlgreathouse opened 3 years ago
On the most recent kernels we expose the Target version in the sysfs topology to eliminate the lookup in user mode. rocm_agent_enumerator could use that as the first choice when it's available. Look for gfx_target_version in /sys/class/kfd/kfd/topology/nodes/*/properties.
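As a rough illustration of the lookup suggested above, the sketch below scans the sysfs properties files for `gfx_target_version` and turns the number into a gfx name. The decoding rule (major * 10000 + minor * 100 + stepping, with minor and stepping printed in hex) is an assumption about how the kernel encodes the value, and the function names are hypothetical, not taken from this PR.

```python
import glob
import re

# Path pattern quoted in the comment above.
KFD_PROPERTIES_GLOB = "/sys/class/kfd/kfd/topology/nodes/*/properties"


def decode_gfx_target_version(value):
    """Turn a numeric target version such as 90010 into a name like 'gfx90a'.

    Assumption: value = major * 10000 + minor * 100 + stepping, and the
    minor and stepping parts are written in hex in the gfx name.
    """
    major = (value // 10000) % 100
    minor = (value // 100) % 100
    stepping = value % 100
    return "gfx{:d}{:x}{:x}".format(major, minor, stepping)


def parse_properties(text):
    """Return the gfx name found in one properties file, or None."""
    match = re.search(r"^gfx_target_version\s+(\d+)\s*$", text, re.MULTILINE)
    if match is None:
        return None  # older kernel: the field is not exposed
    value = int(match.group(1))
    if value == 0:
        return None  # assumed to mark a node without a GPU (e.g. a CPU)
    return decode_gfx_target_version(value)


def read_from_kfd(pattern=KFD_PROPERTIES_GLOB):
    """Collect gfx names from every readable topology node."""
    targets = []
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as handle:
                name = parse_properties(handle.read())
        except OSError:
            continue  # unreadable node: skip it rather than fail
        if name is not None:
            targets.append(name)
    return targets
```

On a system without the KFD driver, or with a kernel that predates the field, `read_from_kfd()` returns an empty list, which is the signal to try another method.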
Glad you caught this -- I didn't know this had made it into KFD. None of my test systems had it when I wrote the other patches.
I just pushed a further patch to this PR that uses the KFD topology as the preferred method for finding the gfx arch. It falls back to lspci, and then to rocminfo.
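The fallback order described here can be sketched as a small helper that tries each source in turn and stops at the first one that reports a GPU. This is a minimal sketch, not the code in the patch: `first_nonempty` and the three stub sources are hypothetical names standing in for the KFD, lspci, and rocminfo readers.

```python
def first_nonempty(sources):
    """Call each zero-argument source in order; return the first non-empty list."""
    for source in sources:
        try:
            targets = source()
        except Exception:
            # A missing or broken tool should not block the next fallback.
            targets = []
        if targets:
            return targets
    return []


# Hypothetical stand-ins for the three methods, cheapest first.
calls = []


def kfd_source():
    calls.append("kfd")
    return []  # pretend the kernel does not expose gfx_target_version


def lspci_source():
    calls.append("lspci")
    return ["gfx906"]


def rocminfo_source():
    calls.append("rocminfo")
    return ["gfx906"]


result = first_nonempty([kfd_source, lspci_source, rocminfo_source])
```

In this run the heavyweight source is never called, because the second source already returned a result.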
`rocm_agent_enumerator` currently calls `rocminfo` to find what gfx architectures are available on the current system. This is used by, for instance, compilers that want to query what to natively build for if they are not provided with a gfxarch target.

However, `rocminfo` is a very heavyweight method of getting the gfxarch. It queries a large amount of HSA topology information, and opens up `/dev/kfd` for various querying purposes. This can make builds slow, because every query that only needs the gfxarch pays for the full topology scan.

In addition, it's possible to do a large number of parallel builds (e.g. `make -j`, even when targeting the number of processors on large server systems). `/dev/kfd` has a limited number of concurrent users, so parallel builds can quickly exhaust it. This can lead to incorrect compilations, because no gfxarch would be returned from `rocminfo`.

`rocm_agent_enumerator` is supposed to have a fallback path when `rocminfo` finds no GPUs. It uses `lspci` to find AMD GPU device numbers, then looks them up in a hard-coded table. However, this table is woefully out of date, and the call to `lspci` is broken anyway. So `rocm_agent_enumerator` would simply fail to return a gfxarch if `rocminfo` failed to return that gfxarch.

This patchset:
- Fixes the call to `lspci` so that it actually works.
- Adds `lspci` to the dependency list so that we don't end up shipping Docker containers that don't include proper tools.
- Tries `lspci` first, and only falls back to the heavyweight `rocminfo` if our PCI ID list falls out of date.
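To make the `lspci` path concrete, here is a minimal sketch of the table lookup, assuming the numeric output format of `lspci -n` (slot, class, then `vendor:device`). The three table entries are an illustrative subset that should be checked against a PCI ID database, and `parse_lspci` and `read_from_lspci` are hypothetical names, not the functions in this patchset.

```python
import re
import subprocess

AMD_VENDOR_ID = "1002"

# Illustrative subset only; the real table is much larger and must be kept current.
PCI_ID_TO_GFX = {
    "687f": "gfx900",
    "66a1": "gfx906",
    "738c": "gfx908",
}


def parse_lspci(text, table=PCI_ID_TO_GFX):
    """Map AMD device IDs found in `lspci -n` output to gfx names.

    Devices missing from the table (audio functions, unknown GPUs) are skipped.
    """
    pattern = r"\s" + AMD_VENDOR_ID + r":([0-9a-f]{4})\b"
    targets = []
    for device_id in re.findall(pattern, text.lower()):
        gfx = table.get(device_id)
        if gfx is not None:
            targets.append(gfx)
    return targets


def read_from_lspci():
    """Run lspci restricted to AMD devices; return [] if the tool is unavailable."""
    try:
        completed = subprocess.run(
            ["lspci", "-n", "-d", AMD_VENDOR_ID + ":"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return []
    return parse_lspci(completed.stdout)
```

An empty result here means either that no AMD GPU is present or that the table does not know the device, which is the case where falling back to `rocminfo` is still needed.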