ROCm / ROC-smi

ROC System Management Interface
https://github.com/RadeonOpenCompute/ROC-smi/blob/master/README.md

dmesg reports "DQM create queue failed" when running rocm-smi on ROCm 1.8.192 #38

Closed heero-yuy closed 6 years ago

heero-yuy commented 6 years ago

Hi,

I recently found an issue when running rocm-smi continuously with "watch -n 2 /opt/rocm/bin/rocm-smi": dmesg reports "DQM create queue failed", as shown in the attached screenshot. The older ROCm 1.8.1 did not have this problem. Could someone help check what is wrong with rocm-smi? Thanks!

[Screenshot: dqm_create_queue_failed]

kentrussell commented 6 years ago

Considering you're running it every 2 seconds, you're generating 8 kernel queues to get the information (most of which isn't changing). If too many queues are created at once, they can't all be processed, since the kernel is also trying to run whatever you're running, and some of those queues then fail to be created. Do you need to get all fields from the SMI, or only the temperature and power usage? If the latter, it would make more sense to just do a watch on those sysfs files instead.

If you use the latest SMI with the 1.7 kernel, does the issue persist? If so, it would be a kernel change, and not an SMI change.

Regardless, we can't guarantee results when you're using the SMI in a way that it was never intended to be used. It is not designed to be an efficient system monitor; it's designed to be a user-friendly way to take a snapshot look at the system information, and to make changes to power settings. For what you're looking for, a simple "watch -n 2 cat /sys/class/drm/card0/device/<whatever you're trying to monitor>" would be sufficient.
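
For anyone who wants more than a single value, the sysfs approach suggested above could be sketched in Python roughly as follows. This is only an illustration, not part of rocm-smi: the file names `temp1_input` (assumed to be millidegrees Celsius) and `power1_average` (assumed to be microwatts), and their location under a `hwmon/hwmon*` subdirectory, are assumptions about what the amdgpu driver exposes and may differ by kernel version and card. The helper names `read_hwmon_values` and `format_reading` are hypothetical.

```python
import glob
import os
import time

# Assumed default device directory, taken from the path quoted in the thread.
DEFAULT_DEVICE_DIR = "/sys/class/drm/card0/device"


def read_hwmon_values(device_dir, names=("temp1_input", "power1_average")):
    """Return {file name: integer value} for the hwmon files found under device_dir.

    Files that are missing or unreadable are silently skipped, since not every
    kernel or card exposes every file.
    """
    values = {}
    for name in names:
        # The hwmon index (hwmon0, hwmon1, ...) varies between systems, so glob for it.
        pattern = os.path.join(device_dir, "hwmon", "hwmon*", name)
        matches = sorted(glob.glob(pattern))
        if not matches:
            continue
        try:
            with open(matches[0]) as f:
                values[name] = int(f.read().strip())
        except (OSError, ValueError):
            continue
    return values


def format_reading(values):
    """Format raw values, assuming millidegrees Celsius and microwatts."""
    parts = []
    if "temp1_input" in values:
        parts.append("temp={:.1f} C".format(values["temp1_input"] / 1000.0))
    if "power1_average" in values:
        parts.append("power={:.1f} W".format(values["power1_average"] / 1000000.0))
    return " ".join(parts)


if __name__ == "__main__":
    # Poll every 2 seconds, mirroring the "watch -n 2" interval from the thread.
    while True:
        print(format_reading(read_hwmon_values(DEFAULT_DEVICE_DIR)))
        time.sleep(2)
```

Reading these files is plain file I/O against the driver's sysfs interface, so a loop like this does not invoke rocm-smi on every iteration.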

kentrussell commented 6 years ago

Closing, since it's a kernel issue and is caused by a misuse of the tool.

heero-yuy commented 5 years ago

I checked, and ROCm 1.8.2 on Ubuntu 16.04.5 with kernel version 4.15.0-34 does not trigger the "DQM create queue failed" message. Thanks!