gcongiu closed 1 year ago
Adding reproducer: issue_103_reproducer.txt
build:
gcc -o test issue_103_reproducer.c -I$ROCM_ROOT/include -I$ROCM_ROOT/include/hsa -I$ROCM_ROOT/include/rocprofiler -ldl -L$ROCM_ROOT/lib -lhsa-runtime64
run:
ROCP_LIB_PATH=$ROCM_ROOT/lib ./test
I encountered the same issue and did some debugging. The problem seems to occur here. GetGpuAgentInfo fails with the supplied agent index and the callback is never called.
The numbering for GPUs was changed in this commit. With the new numbering GetGpuAgentInfo fails since it still uses the old numbering. The PrintGpuAgents function might be affected by this too.
I was able to patch the issue with this:
#include <algorithm>
...
bool HsaRsrcFactory::GetGpuAgentInfo(uint32_t idx, const AgentInfo** agent_info) {
  // Iterate over Gpu list
  auto it = std::find_if(gpu_list_.cbegin(), gpu_list_.cend(), [idx](const AgentInfo* agent) {
    // Return first agent with matching id
    return agent->dev_index == idx;
  });
  if (it != gpu_list_.cend()) {
    *agent_info = *it;
    return true;
  } else {
    return false;
  }
}
This, however, breaks AMD's CLI, since it relies on passing NULL as an agent (link), which defaults to index 0. With the new numbering, an agent with the device index 0 may not be present.
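To make the mismatch concrete, here is a minimal, self-contained sketch of the two lookup strategies. It is not the rocprofiler code: AgentInfo is a simplified stand-in, FindByPosition and FindByDevIndex are hypothetical names, and the positional variant is only my assumption of what "the old numbering" looked like.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Simplified, hypothetical stand-in for the real AgentInfo structure.
struct AgentInfo {
  uint32_t dev_index;  // after the numbering change: the driver node id
};

// Positional lookup: idx is an offset into the GPU list.
// Assumption: this is roughly how the lookup behaved with the old numbering.
bool FindByPosition(const std::vector<const AgentInfo*>& gpus, uint32_t idx,
                    const AgentInfo** out) {
  if (idx >= gpus.size()) return false;
  *out = gpus[idx];
  return true;
}

// Lookup by dev_index, mirroring the patch above.
bool FindByDevIndex(const std::vector<const AgentInfo*>& gpus, uint32_t idx,
                    const AgentInfo** out) {
  auto it = std::find_if(gpus.cbegin(), gpus.cend(),
                         [idx](const AgentInfo* a) { return a->dev_index == idx; });
  if (it == gpus.cend()) return false;
  *out = *it;
  return true;
}
```

Suppose a machine has two GPUs whose dev_index values are 1 and 2 (made-up values, e.g. because node 0 is the CPU). FindByPosition with idx 2 is out of range and fails, which matches the failure described above. FindByDevIndex with idx 2 succeeds, but FindByDevIndex with idx 0 fails, and that is the NULL-agent default case the CLI depends on.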
I think the following might be the cause of the misbehavior:
74ecd34d uint32_t driver_node_id;
74ecd34d status = hsa_api_.hsa_agent_get_info(
74ecd34d agent,
74ecd34d static_cast<hsa_agent_info_t>(HSA_AMD_AGENT_INFO_DRIVER_NODE_ID),
74ecd34d &driver_node_id);
74ecd34d CHECK_STATUS("hsa_agent_get_info(gpu hsa_driver_node_id)", status);
74ecd34d agent_info->dev_index = driver_node_id;
I did a git blame and searched for code that was introduced in November last year. I have not confirmed this is the root cause though.
[EDIT] I reverted the commit above and that fixes the problem.
Hi @gcongiu and @sebinbash, sorry y'all are encountering these issues. Our research tool omnitrace was broken by this bug too (https://github.com/AMDResearch/omnitrace/issues/268) so I'll make sure it gets taken care of and I'll get back to y'all with an update.
Thanks for all the helpful research.
This issue was solved in rocm-5.5.0 RC5
PAPI uses rocprofiler_iterate_info to enumerate events from different agents in the system. This function seems to be broken since version 5.4.0 of rocprofiler.