ROCm / roctracer

ROCm Tracer Callback/Activity Library for Performance tracing AMD GPUs
https://rocm.docs.amd.com/projects/roctracer/en/latest/

[Issue]: roctracer_record_t returned device_id are off by 2. Devices are enumerated 2 to 9 instead of 0 to 7. #98

Open aaronenyeshi opened 5 months ago

aaronenyeshi commented 5 months ago

Problem Description

Hi, we are using roctracer to capture GPU events via roctracer_record_t, with hcc_cb_properties.buffer_callback_fun = activity_callback;. However, we've found that the events carry device_id values from 2 to 9. When using hipGetDeviceProperties, we observe device ids from 0 to 7.

Why is this off by 2? Here is our workaround: https://github.com/pytorch/kineto/pull/925

Our Implementation:

Obtain roctracer_record_t and device_id here: https://github.com/pytorch/kineto/blob/cc24537ac461f08597fab3192e59a3952719d7a2/libkineto/src/RoctracerLogger.cpp#L313

Store as int type: https://github.com/pytorch/kineto/blob/cc24537ac461f08597fab3192e59a3952719d7a2/libkineto/src/RoctracerLogger.h#L179

Matches roctracer activity_record_s: https://github.com/ROCm/roctracer/blob/amd-master/inc/ext/prof_protocol.h#L83

Operating System

CentOS Stream 9

CPU

AMD EPYC 7713

GPU

AMD Instinct MI250

ROCm Version

ROCm 6.0.1

ROCm Component

roctracer

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

ppanchad-amd commented 1 month ago

@aaronenyeshi Internal ticket has been created to investigate this issue. Thanks!

schung-amd commented 1 month ago

Hi @aaronenyeshi, as you've noted in https://github.com/pytorch/kineto/pull/926, this is due to roctracer enumerating the CPU as well as the GPU devices. This is by design; roctracer is pulling the node ids provided by the kernel driver as it is the most convenient way to get unique device ids, while hipGetDeviceProperties is simply enumerating the GPUs as its function is to report information for the GPUs. However, this isn't clearly documented, and I can see how these device ids could be expected to match, so we're updating the docs to indicate this. Thanks for bringing this to our attention!
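For readers who need to line the two numbering schemes up, here is a minimal sketch of translating the node ids found in activity records into zero-based GPU indices. The class NodeIdToGpuIndex is hypothetical and is not part of roctracer, HIP, or kineto. It assumes the caller has already collected the node ids of the GPU agents only (for example from the Node field that rocminfo prints per agent), and that sorting those ids in ascending order reproduces the order in which HIP enumerates devices. That ordering is an assumption and may not hold on every system, for instance when device visibility is restricted through environment variables.

```cpp
#include <algorithm>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical helper, not part of roctracer, HIP, or kineto.
// Translates a driver node id (as seen in an activity record's device_id)
// into a zero-based GPU index.
class NodeIdToGpuIndex {
 public:
  // gpu_node_ids: node ids of GPU agents only; CPU nodes must be excluded.
  // Assumption: ascending node-id order matches HIP's device order.
  explicit NodeIdToGpuIndex(std::vector<int> gpu_node_ids) {
    std::sort(gpu_node_ids.begin(), gpu_node_ids.end());
    // Drop duplicates so the resulting indices stay contiguous.
    gpu_node_ids.erase(
        std::unique(gpu_node_ids.begin(), gpu_node_ids.end()),
        gpu_node_ids.end());
    for (std::size_t i = 0; i < gpu_node_ids.size(); ++i) {
      index_by_node_[gpu_node_ids[i]] = static_cast<int>(i);
    }
  }

  // Returns the zero-based GPU index, or -1 if node_id is not a known GPU
  // node (for example, a CPU node).
  int lookup(int node_id) const {
    auto it = index_by_node_.find(node_id);
    return it == index_by_node_.end() ? -1 : it->second;
  }

 private:
  std::unordered_map<int, int> index_by_node_;
};
```

With the ids reported in this issue, a table built from GPU nodes 2 through 9 maps node 2 to index 0 and node 9 to index 7, while CPU nodes 0 and 1 return -1. A lookup table is used here instead of subtracting a fixed offset because the number of CPU nodes can differ between machines.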