Open ranocha opened 5 years ago
Another option would be to log the power/temperature of all devices, similar to the approach for Intel (packages 0 and 1).
When logging the power/temperature of all devices, you still need a way to correlate the OpenCL device in use with the NVML device. Given this, I would prefer the command line option.
In that case, one possibility might be to query the UUID via
$ nvidia-smi -L
GPU 0: GeForce GTX 1070 Ti (UUID: GPU-7350c62a-efab-c59a-a51f-f99f19ccbf6b)
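For illustration, the output above could be turned into a list of UUIDs with something like the following. This is only a sketch: `GpuEntry` and `parse_nvidia_smi_list` are hypothetical names that do not exist in toolkitICL, and the regular expression assumes exactly the one-line-per-device format shown above.

```cpp
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper type: one device line of `nvidia-smi -L`.
struct GpuEntry {
    unsigned index;    // index as printed by nvidia-smi
    std::string name;  // product name, e.g. "GeForce GTX 1070 Ti"
    std::string uuid;  // e.g. "GPU-7350c62a-..."
};

// Parse lines of the form "GPU <n>: <name> (UUID: <uuid>)".
// Lines that do not match this assumed format are skipped.
std::vector<GpuEntry> parse_nvidia_smi_list(const std::string& text) {
    static const std::regex line_re(R"(^GPU (\d+): (.+) \(UUID: ([^)]+)\)\s*$)");
    std::vector<GpuEntry> entries;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        std::smatch m;
        if (std::regex_match(line, m, line_re)) {
            entries.push_back(GpuEntry{
                static_cast<unsigned>(std::stoul(m[1].str())),
                m[2].str(),
                m[3].str()});
        }
    }
    return entries;
}
```

The resulting UUID strings are what a user would pass on the command line.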
Then, we can have the general calling syntax
$ toolkitICL -d 0 -nvidia_power 100 [optional uuids] -c config.h5
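A rough sketch of how the proposed syntax could be parsed is given below. It is not the existing argument handling in main.cpp: `NvidiaPowerOption` and `parse_nvidia_power` are hypothetical names, and the sketch assumes that every token after the numeric argument which does not start with `-` is a UUID.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result of parsing "-nvidia_power <value> [uuid ...]".
struct NvidiaPowerOption {
    bool enabled = false;
    int value = 0;                   // numeric argument after -nvidia_power
    std::vector<std::string> uuids;  // empty: fall back to the default device
};

// Returns true if the token looks like the start of another option.
static bool is_option(const std::string& token) {
    return !token.empty() && token[0] == '-';
}

NvidiaPowerOption parse_nvidia_power(const std::vector<std::string>& args) {
    NvidiaPowerOption opt;
    for (std::size_t i = 0; i < args.size(); ++i) {
        if (args[i] != "-nvidia_power") continue;
        opt.enabled = true;
        // The numeric argument is read only if the next token is not an option.
        if (i + 1 < args.size() && !is_option(args[i + 1])) {
            opt.value = std::stoi(args[++i]);
        }
        // Assumption: all following non-option tokens are UUIDs.
        while (i + 1 < args.size() && !is_option(args[i + 1])) {
            opt.uuids.push_back(args[++i]);
        }
    }
    return opt;
}
```

With no UUIDs given, `uuids` stays empty and the current default behavior could be kept.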
2.0.0 if we change this behavior.
We can use names such as power0, power1 to enumerate the devices (in the order used by nvml in case 1 or in the given order in case 2). The UUID (and possibly other data) could be added to the description.
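To make the naming scheme concrete, here is a minimal sketch. `make_power_labels` is a hypothetical helper, and the "UUID: ..." description format is just one possibility, not something already decided.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Build (name, description) pairs power0, power1, ... in the order given.
// The description carries the UUID so the log can be matched to a device.
std::vector<std::pair<std::string, std::string>>
make_power_labels(const std::vector<std::string>& uuids) {
    std::vector<std::pair<std::string, std::string>> labels;
    for (std::size_t i = 0; i < uuids.size(); ++i) {
        labels.emplace_back("power" + std::to_string(i), "UUID: " + uuids[i]);
    }
    return labels;
}
```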
In #39, @philipheinisch implemented a sensible default value for nvml. Maybe we want to enable additional logging of specifically chosen devices for a more general power logging library?
Up to now, the NVML device is hard coded in https://github.com/IANW-Projects/toolkitICL/blob/1160e7711819c0e249fb58a2a9fb24d91e8eec5e/src/main.cpp#L452 and https://github.com/IANW-Projects/toolkitICL/blob/1160e7711819c0e249fb58a2a9fb24d91e8eec5e/src/main.cpp#L481
We should add some command line option to choose another device or even other devices. A somewhat simple option would be to allow logging on only one device. However, I would prefer the ability to enable logging on an arbitrary number of devices.
As described here, it would be better to use nvmlDeviceGetHandleByUUID or nvmlDeviceGetHandleByPciBusId.
CC @Kostaszki