ROCm / ROC-smi

ROC System Management Interface
https://github.com/RadeonOpenCompute/ROC-smi/blob/master/README.md
179 stars 55 forks source link

Added stats showing periodically #46

Closed Avatat closed 5 years ago

Avatat commented 6 years ago

I added --repeat SECONDS argument, to print updated stats every X seconds. Effect:

/home/bzieba/rocm-smi --repeat 1 -P

====================    ROCm System Management Interface    ====================
================================================================================
GPU[0]      : Average GPU Power: 12.151 W
================================================================================
GPU[0]      : Average GPU Power: 12.151 W
================================================================================
GPU[0]      : Average GPU Power: 12.151 W
================================================================================
GPU[0]      : Average GPU Power: 12.151 W
^C====================           End of ROCm SMI Log          ====================
kentrussell commented 5 years ago

I would prefer against this for a few reasons. Firstly, you can use "watch -n 1 rocm-smi -P" to get the same result. Secondly, and far more critically, if this gets merged, it's basically saying that I support using the watch/repeat functionality, which some users have wanted, but which has also caused a bunch of issues from people using the SMI improperly. The SMI was never and is never designed to be an active monitoring tool. It is designed to provide information as needed to a user via the command line. By using watch/repeat, this often results in a lot of unnecessary overhead, because the SMI was never designed to be optimally executed repeatedly. It's designed as a one-and-done, or maybe once every 5-10 seconds at the most frequent. (See https://github.com/RadeonOpenCompute/ROC-smi/issues/38 , https://github.com/RadeonOpenCompute/ROC-smi/issues/31 , plus the number of e-mails I get internally about trying to watch the SMI and the kernel issues that it causes)

While the SMI could be optimized, that was never its intent. If all you want is to watch an attribute via the SMI and don't want to watch the actual sysfs file, use "watch -n 1" . If you want to watch more than 1 thing, then you're creating a lot of kernel queues to poll the SMU, which is really not efficient, and will likely cause issues as seen above.

I appreciate the patch, but I fear that merging it is only going to cause a number of people to complain that their kernel is hanging since every "rocm-smi" call with no arguments creates 8 separate kernel queues to query 8 separate SMU registers to return 8 separate values. I am not officially supporting the "watch -n" option either, but users can use that if they want to.

TL;DR: I don't want to support constant polling of the SMU for things, when a Linux program exists for it and is available in all distros, and when the SMI was never meant to be used that way.

Avatat commented 5 years ago

@kentrussell, thank you for explaining that. I know watch command, but many tools have built-in auto-refresh at the specified interval (nvidia-smi, zfs, iostat, and many many others).

SMI design and queues/query problem are a good argument for avoiding periodic runs of rocm-smi, thanks!