ROCm / rocm_smi_lib

ROCm SMI LIB
https://rocm.docs.amd.com/projects/rocm_smi_lib/en/latest/
MIT License

Fix [Not supported] status for get_compute_process_info_by_pid #155

Closed: vstempen closed this 3 weeks ago

vstempen commented 8 months ago

On some systems, [rocm-smi --showpids] reports:

get_compute_process_info_by_pid, Not supported on the given system
[PID] [PROCESS NAME] 1 UNKNOWN UNKNOWN UNKNOWN

get_compute_process_info_by_pid fails because, by design, the cu_occupancy debugfs method is not provided on some graphics cards and GFX revisions.

This PR proposes returning a success status when the cu_occupancy debugfs method is the only thing missing, and setting cu_occupancy to an invalidation value so that only this parameter is reported as UNKNOWN.
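
To illustrate the proposed behavior, here is a minimal Python sketch. It is not the library's actual code. The per-process directory layout, the file names (vram, sdma, cu_occupancy), the CU_OCCUPANCY_INVALID sentinel, and the function names are all hypothetical and chosen only for this example. It assumes the other fields stay mandatory and that only a missing cu_occupancy file is tolerated.

```python
import os
import tempfile
from dataclasses import dataclass

# Hypothetical sentinel meaning "cu_occupancy is not available on this GPU".
# The real library may use a different invalidation value.
CU_OCCUPANCY_INVALID = 0xFFFFFFFF


@dataclass
class ProcessInfo:
    pid: int
    vram_usage: int
    sdma_usage: int
    cu_occupancy: int


def _read_int(path: str) -> int:
    """Read a single integer from a text file."""
    with open(path) as f:
        return int(f.read().strip())


def get_process_info(proc_dir: str, pid: int) -> ProcessInfo:
    """Collect per-process counters from a (hypothetical) directory layout."""
    # Assumed to be mandatory: a missing file here still raises an error.
    vram = _read_int(os.path.join(proc_dir, "vram"))
    sdma = _read_int(os.path.join(proc_dir, "sdma"))
    # Optional by design on some GPUs: fall back to the sentinel
    # instead of failing the whole query.
    try:
        cu = _read_int(os.path.join(proc_dir, "cu_occupancy"))
    except FileNotFoundError:
        cu = CU_OCCUPANCY_INVALID
    return ProcessInfo(pid, vram, sdma, cu)


def format_cu_occupancy(info: ProcessInfo) -> str:
    """Render only the unavailable field as UNKNOWN."""
    if info.cu_occupancy == CU_OCCUPANCY_INVALID:
        return "UNKNOWN"
    return str(info.cu_occupancy)


if __name__ == "__main__":
    # Simulate a GPU that exposes vram and sdma but not cu_occupancy.
    with tempfile.TemporaryDirectory() as d:
        for name, value in (("vram", "1024"), ("sdma", "0")):
            with open(os.path.join(d, name), "w") as f:
                f.write(value)
        info = get_process_info(d, 1234)
        print(info.pid, info.vram_usage, info.sdma_usage, format_cu_occupancy(info))
```

With this shape, a caller still gets valid VRAM and SDMA values for the process, and only the CU occupancy column falls back to UNKNOWN.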

dmitrii-galantsev commented 8 months ago

Thanks for the change @vstempen !

Just FYI: all our changes go through our internal Gerrit and then get published to GitHub. GitHub PRs are OK but might be less visible.

dmitrii-galantsev commented 7 months ago

Merged internally; it should reach the develop branch within the next day. @bill-shuzhou-liu is asking: "Does this apply only to cu, or also to sdma and vram?"

ppanchad-amd commented 4 months ago

@dmitrii-galantsev Is this fix available in the latest ROCm 6.1.1? Thanks!

dmitrii-galantsev commented 2 months ago

Merged in 677433b. @ppanchad-amd Not sure. Please check your rocm-smi version with rocm-smi --version and see whether its commit is ahead of the one linked above.

yx-lamini commented 1 month ago

I still see this error on ROCm 6.2.

tcgu-amd commented 1 month ago

@yx-lamini would you be able to provide more details regarding your system configuration so we can reproduce the issue? Thanks!

yx-lamini commented 1 month ago

Yes, of course. What do you need? I am running rocm-smi on an 8-GPU MI300 server with the vanilla ROCm 6.2.0 runtime installed.

tcgu-amd commented 1 month ago

@yx-lamini I saw your comment here https://github.com/ROCm/ROCm/issues/2595. Is the problem you are experiencing related to that issue? (If so, I will close this PR and track the problem on the other issue). Thanks!

yx-lamini commented 1 month ago

Yes, that works. Sorry for spamming between multiple places.