Open adrianco opened 5 months ago
Scaphandre has some discussion and a TODO for GPU measurement https://github.com/hubblo-org/scaphandre/issues/24
NVIDIA power data is available for later-model and datacenter-class GPUs, but not for some desktop models. This data source is reported as available on NVIDIA-based cloud instances on AWS. https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceQueries.html#group__nvmlDeviceQueries_1g7ef7dff0ff14238d08a19ad7fb23fc87
The data is reported as an integer number of milliwatts, averaged over a one-second interval.
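For anyone who wants to try this, here is a minimal sketch of reading that value from Python and turning a series of samples into energy. It assumes the `pynvml` bindings (the `nvidia-ml-py` package) and an NVIDIA driver are installed; the helper names `read_power_mw` and `energy_joules` are made up for illustration, and the one-second sampling interval is an assumption taken from the averaging window described above.

```python
def read_power_mw(gpu_index=0):
    """Return the current power draw of one GPU in milliwatts, via NVML."""
    # Lazy import so the rest of this sketch works on machines without a GPU.
    import pynvml  # assumption: provided by the nvidia-ml-py package

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
        # NVML reports power as an integer number of milliwatts.
        return pynvml.nvmlDeviceGetPowerUsage(handle)
    finally:
        pynvml.nvmlShutdown()


def energy_joules(samples_mw, interval_s=1.0):
    """Approximate energy from power samples taken every interval_s seconds.

    Each sample (milliwatts) is treated as the average power over its
    interval, so energy is the sum of power * time.
    """
    return sum(samples_mw) / 1000.0 * interval_s
```

Polling `read_power_mw()` once per second and passing the collected list to `energy_joules` would give a rough energy figure for a workload; GPUs that do not expose power readings will raise an NVML error instead.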
Kepler currently supports NVIDIA GPUs (through both NVML and DCGM) and is also working on Intel Gaudi GPU support.
We have a recent tutorial on using Kepler to measure LLM energy consumption and evaluate sustainability in terms of tokens per watt.
As @rootfs mentioned, in Kepler we collect data on both the per-process GPU utilization and the total GPU power consumption using the NVML library. We then distribute the total GPU power consumption among all processes using the GPU, in proportion to their utilization. In Multi-Instance GPU (MIG) scenarios, the calculation method varies a little bit: Kepler uses the DCGM metrics to determine MIG slice utilization and distributes the total GPU power among the MIG slices accordingly.
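To make the attribution step concrete, here is a rough sketch of splitting a total GPU power reading across processes in proportion to their utilization. This is not Kepler's actual code; the function name `attribute_gpu_power` and the input shape (a dict of PID to utilization percentage) are assumptions for illustration only.

```python
def attribute_gpu_power(total_power_mw, utilization_by_pid):
    """Split total GPU power among processes in proportion to utilization.

    total_power_mw: total GPU power reading in milliwatts.
    utilization_by_pid: hypothetical mapping of PID -> GPU utilization (any
    consistent unit, e.g. percent).
    Returns a mapping of PID -> attributed power in milliwatts.
    """
    total_util = sum(utilization_by_pid.values())
    if total_util == 0:
        # Nothing is using the GPU, so no power is attributed to any process.
        # (How idle power should be shared is a policy choice left open here.)
        return {pid: 0.0 for pid in utilization_by_pid}
    return {
        pid: total_power_mw * util / total_util
        for pid, util in utilization_by_pid.items()
    }
```

For example, with a 200 W reading and two processes at 60% and 20% utilization, the first would be attributed three quarters of the power and the second one quarter. The same proportional split could be applied to MIG slices instead of PIDs.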
Hello! I've stumbled on this issue from the Scaphandre repository.
After trying to extend Scaphandre to support GPUs, I eventually started "from scratch" and designed a new measurement tool (though Alumet is not "just" a tool for measuring energy consumption). As the Kepler team mentioned, NVML can report the energy consumption of most NVIDIA GPUs, as well as information on the GPU utilization by different processes, and it works quite well. It's better to measure than to rely on TDP-based estimations anyway. IMO that should be enough to start building some models :)
Link to Alumet added to the Miro. It appears that NVIDIA power monitoring is well understood. The next step is to figure out the interfaces for Intel, AMD, Google TPU, AWS Inferentia, etc.
Outline Action Item Details
We have a reasonable handle on CPU energy use by taking CPU utilization and mapping it to an energy curve driven by the Thermal Design Power (TDP) of a package, which is sometimes the only public data available. GPUs are becoming more common and have a higher TDP than CPUs, but we don't have an easy or standard way to measure the utilization of the GPUs in a system. I propose reaching out to contacts at NVIDIA to see if we can find some answers and encourage them to join GSF.
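As an illustration of the CPU approach described above, here is a minimal sketch of a TDP-driven estimate. It assumes a simple linear curve between an idle floor and full TDP; the linear shape, the function name `estimate_cpu_power_watts`, and the default `idle_fraction` of 0.1 are all placeholder assumptions, not measured values or an agreed model.

```python
def estimate_cpu_power_watts(utilization, tdp_watts, idle_fraction=0.1):
    """Estimate package power from utilization using a linear TDP curve.

    utilization: CPU utilization as a fraction between 0.0 and 1.0.
    tdp_watts: Thermal Design Power of the package in watts.
    idle_fraction: assumed share of TDP drawn at idle (placeholder value).
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0.0 and 1.0")
    idle_watts = idle_fraction * tdp_watts
    # Interpolate linearly between idle power and TDP.
    return idle_watts + (tdp_watts - idle_watts) * utilization
```

A real energy curve is usually not linear, so in practice the interpolation would be replaced with whatever curve is available for the package; the same structure could be reused for GPUs once a reliable utilization signal exists.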
Issue dependency with other WGs Groups
No response