Open aaronenyeshi opened 5 months ago
@aaronenyeshi Internal ticket has been created to investigate this issue. Thanks!
Hi @aaronenyeshi, as you've noted in https://github.com/pytorch/kineto/pull/926, this is due to roctracer enumerating the CPU as well as the GPU devices. This is by design; roctracer is pulling the node ids provided by the kernel driver as it is the most convenient way to get unique device ids, while hipGetDeviceProperties is simply enumerating the GPUs as its function is to report information for the GPUs. However, this isn't clearly documented, and I can see how these device ids could be expected to match, so we're updating the docs to indicate this. Thanks for bringing this to our attention!
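To make the offset concrete, here is a minimal sketch of how a consumer might translate a roctracer-reported node id into a HIP device ordinal. It is only an illustration, not the code from the kineto workaround: nodeIdToHipOrdinal and cpuNodeCount are hypothetical names, and the sketch assumes the driver numbers the CPU nodes first, so that the GPU node ids are contiguous and start right after them (2 through 9 on the system in this report). Whether that ordering holds on other systems is an assumption that would need to be checked.

```cpp
// Hypothetical helper, not part of roctracer, HIP, or kineto.
// Assumption: CPU nodes are numbered first, so GPU node ids are contiguous
// and begin at `cpuNodeCount` (2 on the system described in this issue).
// Returns the zero-based HIP-style ordinal, or -1 if the id falls in the
// CPU node range or is negative.
int nodeIdToHipOrdinal(int nodeId, int cpuNodeCount) {
  if (cpuNodeCount < 0 || nodeId < cpuNodeCount) {
    return -1;  // not a GPU node under the stated assumption
  }
  return nodeId - cpuNodeCount;
}
```

With cpuNodeCount set to 2, a device_id of 2 maps to ordinal 0 and a device_id of 9 maps to ordinal 7, which matches the two ranges reported in the problem description. The CPU node count is passed in rather than hard-coded because it depends on the machine.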
Problem Description
Hi, we are using Roctracer to capture GPU events via roctracer_record_t and hcc_cb_properties.buffer_callback_fun = activity_callback;. However, we've found that the events carry device_id values from 2 to 9, whereas with hipGetDeviceProperties we observe ids from 0 to 7. Why is this off by 2? Here is our workaround: https://github.com/pytorch/kineto/pull/925
Our Implementation:
Obtain roctracer_record_t and device_id here: https://github.com/pytorch/kineto/blob/cc24537ac461f08597fab3192e59a3952719d7a2/libkineto/src/RoctracerLogger.cpp#L313
Store as int type: https://github.com/pytorch/kineto/blob/cc24537ac461f08597fab3192e59a3952719d7a2/libkineto/src/RoctracerLogger.h#L179
Matches roctracer activity_record_s: https://github.com/ROCm/roctracer/blob/amd-master/inc/ext/prof_protocol.h#L83
Operating System
CentOS Stream 9
CPU
AMD EPYC 7713
GPU
AMD Instinct MI250
ROCm Version
ROCm 6.0.1
ROCm Component
roctracer
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response