IANW-Projects / toolkitICL

toolkitICL: An open source tool for automated OpenCL kernel execution and profiling.

Enable Choice of NVML Device #38

Open ranocha opened 5 years ago

ranocha commented 5 years ago

Up to now, the NVML device is hard-coded in https://github.com/IANW-Projects/toolkitICL/blob/1160e7711819c0e249fb58a2a9fb24d91e8eec5e/src/main.cpp#L452 and https://github.com/IANW-Projects/toolkitICL/blob/1160e7711819c0e249fb58a2a9fb24d91e8eec5e/src/main.cpp#L481.

We should add some command line option to choose another device or even other devices. A somewhat simple option would be to allow logging on only one device. However, I would prefer the ability to enable logging on an arbitrary number of devices.

As described here, it would be better to use nvmlDeviceGetHandleByUUID or nvmlDeviceGetHandleByPciBusId.

CC @Kostaszki

ranocha commented 5 years ago

Another option would be to log the power/temperature of all devices, similar to the approach for Intel (packages 0 and 1).

Kostaszki commented 5 years ago

When logging the power/temperature of all devices, you still need an option to correlate the used OpenCL device with the NVML device. Considering this, I would prefer the command line option.

ranocha commented 5 years ago

In that case, one possibility might be to query the UUID via

$ nvidia-smi -L
GPU 0: GeForce GTX 1070 Ti (UUID: GPU-7350c62a-efab-c59a-a51f-f99f19ccbf6b)

Then, we can have the general calling syntax toolkitICL -d 0 -nvidia_power 100 [optional uuids] -c config.h5.

  1. If no UUID is specified, we can/should log all devices. If we consider the current behavior as a bug, that's okay for a new release. Otherwise, we would have to go to version 2.0.0 if we change this behavior.
  2. If at least one UUID is specified, the devices having these UUIDs should be used for logging.

We can use names such as power0, power1 to enumerate the devices (in the order used by nvml in case 1 or in the given order in case 2). The UUID (and possibly other data) could be added to the description.
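A minimal sketch of how the proposed syntax and naming scheme could fit together, assuming that every token after the -nvidia_power value that does not start with '-' is a UUID. The names NvidiaPowerOption, parse_nvidia_power, and name_channels are hypothetical and not part of toolkitICL. The list of all UUIDs is passed in by the caller, since in a real implementation it would come from NVML enumeration.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical container for the parsed option; not part of toolkitICL.
struct NvidiaPowerOption {
    bool enabled = false;            // true if -nvidia_power was given with a value
    int interval = 0;                // value following -nvidia_power
    std::vector<std::string> uuids;  // optional UUIDs, in the order given
};

// Returns true if the token looks like the start of another option.
static bool is_option(const std::string& token) {
    return !token.empty() && token[0] == '-';
}

// Parse "-nvidia_power <value> [uuid ...]" from an argument list.
// Assumption: all non-option tokens after the value are UUIDs.
// std::stoi throws if the value is not a number; a real parser
// would report that error to the user instead.
NvidiaPowerOption parse_nvidia_power(const std::vector<std::string>& args) {
    NvidiaPowerOption opt;
    for (std::size_t i = 0; i + 1 < args.size(); ++i) {
        if (args[i] != "-nvidia_power") continue;
        opt.enabled = true;
        opt.interval = std::stoi(args[i + 1]);
        for (std::size_t j = i + 2; j < args.size() && !is_option(args[j]); ++j) {
            opt.uuids.push_back(args[j]);
        }
        break;
    }
    return opt;
}

// Assign channel names power0, power1, ... to the devices to be logged.
// Case 1: no UUID given, so all devices are used in enumeration order.
// Case 2: UUIDs given, so exactly those are used in the given order.
// Each pair holds (channel name, UUID); the UUID could go into the description.
std::vector<std::pair<std::string, std::string>> name_channels(
    const NvidiaPowerOption& opt, const std::vector<std::string>& all_uuids) {
    const std::vector<std::string>& selected =
        opt.uuids.empty() ? all_uuids : opt.uuids;
    std::vector<std::pair<std::string, std::string>> channels;
    for (std::size_t i = 0; i < selected.size(); ++i) {
        channels.emplace_back("power" + std::to_string(i), selected[i]);
    }
    return channels;
}
```

With this sketch, each selected UUID would then be passed to nvmlDeviceGetHandleByUUID to obtain the device handle for logging. A UUID that itself starts with '-' would be misread as an option, which should not occur for the GPU-... format printed by nvidia-smi -L.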

ranocha commented 5 years ago

In #39, @philipheinisch implemented a sensible default value for nvml. Maybe we want to enable additional logging of specifically chosen devices for a more general power logging library?