Green-Software-Foundation / real-time-cloud


How should GPU energy use be estimated? #37

Open adrianco opened 5 months ago

adrianco commented 5 months ago

Outline Action Item Details

We have a reasonable handle on CPU energy use: take CPU utilization and map it onto an energy curve scaled by the Thermal Design Power (TDP) of the package, which is sometimes the only public data available. GPUs are becoming more common and have a higher TDP than CPUs, but we don't have an easy or standard way to measure GPU utilization in a system. I propose reaching out to contacts at NVIDIA to see if we can find some answers, and encouraging them to join GSF.
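As a rough illustration of the utilization-to-TDP mapping described above, here is a minimal Python sketch that interpolates along a piecewise-linear curve. The curve points and the names `ASSUMED_CURVE` and `estimate_power_watts` are hypothetical placeholders, not values or code from this project or any vendor.

```python
from bisect import bisect_right

# Hypothetical curve: (utilization %, fraction of TDP). Illustrative values only.
ASSUMED_CURVE = [(0.0, 0.10), (10.0, 0.30), (50.0, 0.75), (100.0, 1.00)]

def estimate_power_watts(utilization_pct, tdp_watts, curve=ASSUMED_CURVE):
    """Estimate package power by linear interpolation on the curve."""
    u = min(max(utilization_pct, 0.0), 100.0)  # clamp to the curve's range
    xs = [x for x, _ in curve]
    i = bisect_right(xs, u)  # index of the first curve point above u
    if i >= len(curve):
        return curve[-1][1] * tdp_watts  # at or beyond the last point
    (x0, y0), (x1, y1) = curve[i - 1], curve[i]
    frac = y0 + (y1 - y0) * (u - x0) / (x1 - x0)
    return frac * tdp_watts

# Example: a 200 W TDP package at 30% utilization.
print(estimate_power_watts(30, 200))
```

The same shape of estimate could be applied to a GPU only if a utilization figure and a credible curve were available, which is the gap this issue is about.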

Issue dependency with other WGs Groups

No response

adrianco commented 5 months ago

Scaphandre has some discussion and a TODO for GPU measurement: https://github.com/hubblo-org/scaphandre/issues/24

NVIDIA power data is available for later-model and datacenter-class GPUs, but not for some desktop models. This data source is reported to be available on NVIDIA-based cloud instances on AWS. https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceQueries.html#group__nvmlDeviceQueries_1g7ef7dff0ff14238d08a19ad7fb23fc87

The value is reported as an integer number of milliwatts, averaged over a one-second interval.
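A minimal sketch of how such readings might be turned into energy, assuming samples are taken at a fixed interval. `energy_joules` and `read_power_mw` are hypothetical helper names; the NVML calls go through the `pynvml` Python bindings, which are an assumption here (the thread only links the C API), and they need an NVIDIA driver to run.

```python
def energy_joules(samples_mw, interval_s=1.0):
    """Sum milliwatt samples taken every interval_s seconds into joules."""
    return sum(samples_mw) * interval_s / 1000.0

def read_power_mw(gpu_index=0):
    """Read current GPU power in milliwatts via NVML."""
    import pynvml  # imported lazily so the rest of the sketch runs without a GPU
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
        return pynvml.nvmlDeviceGetPowerUsage(handle)  # integer milliwatts
    finally:
        pynvml.nvmlShutdown()

# Example with made-up readings (no GPU required): three one-second samples.
print(energy_joules([150000, 152000, 148000]))
```

On a machine with a supported GPU, calling `read_power_mw()` once per second and passing the collected values to `energy_joules` would give an approximate energy total for that window.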

rootfs commented 4 months ago

Kepler currently supports NVIDIA GPUs (through both NVML and DCGM) and is also working on Intel Gaudi GPU support.

We have a recent tutorial on using Kepler to measure LLM energy consumption and evaluate sustainability in terms of tokens per watt.

marceloamaral commented 4 months ago

As @rootfs mentioned, in Kepler we collect both the GPU utilization of each process and the total GPU power consumption using the NVML library. We then distribute the total GPU power consumption among all processes using the GPU, in proportion to their utilization. In Multi-Instance GPU (MIG) scenarios, the calculation varies a little: Kepler uses the DCGM metrics to determine MIG slice utilization and distributes the total GPU power among the MIG slices accordingly.
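The proportional split described above can be sketched as follows. This is an illustration of the idea only, not Kepler's actual code; the function name and the handling of the zero-utilization case are assumptions.

```python
def attribute_gpu_power(total_power_w, utilization_by_pid):
    """Split total GPU power across processes in proportion to GPU utilization."""
    total_util = sum(utilization_by_pid.values())
    if total_util == 0:
        # Assumption: with no measured activity, attribute nothing to any process.
        return {pid: 0.0 for pid in utilization_by_pid}
    return {
        pid: total_power_w * util / total_util
        for pid, util in utilization_by_pid.items()
    }

# Example with made-up numbers: 200 W shared by two processes at 60% and 20%.
print(attribute_gpu_power(200.0, {101: 60, 202: 20}))
```

The same function could be applied to MIG slices by keying the dictionary on slice identifiers instead of process IDs.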

TheElectronWill commented 4 months ago

Hello! I've stumbled on this issue from the Scaphandre repository.

After trying to extend Scaphandre to support GPUs, I eventually started "from scratch" and designed a new measurement tool, Alumet (though it is not "just" a tool for measuring energy consumption). As the Kepler team mentioned, NVML can report the energy consumption of most NVIDIA GPUs, as well as per-process GPU utilization, and it works quite well. It's better to measure than to rely on TDP-based estimates anyway. IMO that should be enough to start building some models :)

adrianco commented 2 months ago

A link to Alumet has been added to the Miro. It appears that NVIDIA power monitoring is well understood. The next step is to figure out the interfaces for Intel, AMD, Google TPU, AWS Inferentia, etc.