ROCm / rocprofiler

ROC profiler library. Profiling with perf-counters and derived metrics.
https://rocm.docs.amd.com/projects/rocprofiler/en/latest/
MIT License
132 stars 49 forks source link

rocprofiler_iterate_info broken #103

Closed gcongiu closed 1 year ago

gcongiu commented 1 year ago

PAPI uses rocprofiler_iterate_info to enumerate events from different agents in the system. This function seems to be broken since version 5.4.0 of rocprofiler.

gcongiu commented 1 year ago

Adding reproducer: issue_103_reproducer.txt build:

gcc -o test issue_103_reproducer.c -I$ROCM_ROOT/include -I$ROCM_ROOT/include/hsa -I$ROCM_ROOT/include/rocprofiler -ldl -L$ROCM_ROOT/lib -lhsa-runtime64

run:

ROCP_LIB_PATH=$ROCM_ROOT/lib ./test
sebinbash commented 1 year ago

I encountered the same issue and did some debugging. The problem seems to occur here. GetGpuAgentInfo fails with the supplied agent index and the callback is never called.

The numbering for GPUs was changed in this commit. With the new numbering GetGpuAgentInfo fails since it still uses the old numbering. The PrintGpuAgents function might be affected by this too.

I was able to patch the issue with this:

#include<algorithm>

...

bool HsaRsrcFactory::GetGpuAgentInfo(uint32_t idx, const AgentInfo** agent_info) {
  // Iterate over Gpu list
  auto it = std::find_if(gpu_list_.cbegin(), gpu_list_.cend(), [idx](const AgentInfo* agent) {
        // Return first agent with matching id
        return agent->dev_index == idx;
  });

  if(it != gpu_list_.cend()) {
    *agent_info = *it;
    return true;
  } else {
    return false;
  }
}

This, however breaks AMDs CLI since they rely on passing NULL as an agent (link) which defaults to index 0. With the new numbering, an agent with the device index 0 may not be present.

gcongiu commented 1 year ago

I think the following might be the cause of the misbehave:

74ecd34d      uint32_t driver_node_id;
74ecd34d      status = hsa_api_.hsa_agent_get_info(
74ecd34d              agent,
74ecd34d              static_cast<hsa_agent_info_t>(HSA_AMD_AGENT_INFO_DRIVER_NODE_ID),
74ecd34d              &driver_node_id);
74ecd34d      CHECK_STATUS("hsa_agent_get_info(gpu hsa_driver_node_id)", status);
74ecd34d      agent_info->dev_index = driver_node_id;

I did a git blame and searched for code that was introduced in November last year. I have not confirmed this is the root cause though.

[EDIT] I reverted the commit above and that fixes the problem.

jrmadsen commented 1 year ago

Hi @gcongiu and @sebinbash, sorry y'all are encountering these issues. Our research tool omnitrace was broken by this bug too (https://github.com/AMDResearch/omnitrace/issues/268) so I'll make sure it gets taken care of and I'll get back to y'all with an update.

Thanks for all the helpful research.

gcongiu commented 1 year ago

This issue was solved in rocm-5.5.0 RC5