Open tpatki opened 3 months ago
@tpatki How does `prev_delta=false` handle counter rollover? Are we guaranteed to have a thread that's sampling in the background often enough to detect that?
Hi @rountree That's a great question, and it will vary by the underlying architecture. See details below.
(I am hoping these notes will also help @dbo understand the challenges at our end and why supporting this will take some time.)
On IBM systems, there are no MSRs/counters, so there is no issue of rollover. The hardware does not report energy directly, so we are already sampling instantaneous power using the OPAL file system interface every 250ms at the moment. On a related note, my recent experiments with Caliper tell me we should sample faster than that; IBM recommends 100ms and up. I will create an issue and update this sampling rate soon.
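For readers following along, here is a minimal sketch of how periodic instantaneous power samples can be turned into an energy estimate. It is not the code in Variorum's IBM port: the function name `energy_from_power_samples` and the choice of trapezoidal integration are assumptions made purely for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a Variorum API): estimate energy in joules from
 * power samples in watts taken at a fixed interval, using the trapezoidal
 * rule between consecutive samples. */
double energy_from_power_samples(const double *watts, size_t n, double interval_s)
{
    assert(interval_s > 0.0);
    double joules = 0.0;
    for (size_t i = 1; i < n; i++) {
        /* Average power over the interval times the interval length. */
        joules += 0.5 * (watts[i - 1] + watts[i]) * interval_s;
    }
    return joules;
}
```

With five samples of a constant 100 W taken 250ms apart, this covers four intervals (1 second) and yields 100 J. A shorter interval mainly helps when power changes quickly between samples.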
On GPUs (NVIDIA, AMD and Intel), energy values are reported directly from the underlying APIs (NVML, RSMI, APMI). See here as an example. These typically report energy values since the GPU driver was last loaded. GPU vendors do not expose the counters directly to us, so we don't have to deal with overflows ourselves and can rely on their APIs. Currently, Variorum v0.8 does not support GPU energy reporting -- mostly due to lack of time/resources at our end. We have some WIP PRs on this; see #559 and #563.
On AMD CPUs (Milan and up), AMD provides the ESMI library and the `amd_energy` kernel module (they also support `msr-safe` 👍). Here too, we do not interact with MSRs directly and rely on AMD's open source ESMI APIs for the processor, e.g. `esmi_socket_energy_get`, which are part of the port for Variorum that AMD contributed.
That leaves us with our most complex scenario on Intel CPUs. Here, we are directly using low-level MSR interfaces. We already have the infrastructure for `MSR_PKG_ENERGY_STATUS` and `MSR_DRAM_ENERGY_STATUS` in Variorum that we utilize to `print_energy` (and the same for JSON APIs). This is where I believe we will need to add a new thread, like we do with IBM, and sample often enough, say every 50ms. I could be wrong though, and we may be able to do this without adding explicit sampling in a separate thread. This support will be the greatest lift at our end but should be do-able.
I haven't looked at ARM as we haven't added support for print/get energy APIs there yet.
Let me know if I answered that in enough detail, happy to have a meeting next week to discuss.
@slabasan @rountree Checking in if you had more questions or feedback here. If not, I can take a stab at a PR so we can test this out in Caliper (and also add better support for our Kokkos users).
@tpatki
So yes, I'd prefer to have the general case be sampling occasionally unless the vendor documentation we have makes rollover a once-per-decade thing. But I'm not implementing this, so it's just a preference.
Thanks @rountree.
Given that we have limited resources for Variorum at the moment, at least for the first cut at this, I am going to lean toward telling the user that we are passing along data that the vendor libraries are providing us (ESMI, RSMI, NVML, etc) and trusting that these vendor APIs take care of rollovers.
On some architectures (e.g. all GPUs), the low-level registers are not accessible at all, and we have no choice but to trust that APIs such as `nvmlDeviceGetTotalEnergyConsumption` will do the right thing -- which I believe they do. GPU APIs report energy values based on when the driver was last loaded -- see here, so they would have to take care of any rollovers (although we've not explicitly tested this).
Intel CPUs are the only exception to this situation, where we read directly from `MSR_PKG_ENERGY_STATUS` or `MSR_DRAM_ENERGY_STATUS` in our code. These hold 32 bits of energy data and are updated every millisecond, resulting in a wraparound every few minutes.
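As a rough back-of-the-envelope check on the wraparound interval, the sketch below computes how long a 32-bit energy counter lasts at a constant power draw. The helper name `rapl_wrap_seconds` is hypothetical, and the energy status unit exponent is an input here; on real hardware it should be read from `MSR_RAPL_POWER_UNIT` rather than assumed.

```c
#include <assert.h>

/* Hypothetical helper: seconds until a 32-bit energy counter wraps at a
 * constant power draw. esu is the energy-status-unit exponent, meaning one
 * counter increment equals 1 / 2^esu joules. */
double rapl_wrap_seconds(unsigned esu, double watts)
{
    assert(esu < 32 && watts > 0.0);
    double joules_per_count = 1.0 / (double)(1ULL << esu);
    double joules_per_wrap = 4294967296.0 * joules_per_count; /* 2^32 counts */
    return joules_per_wrap / watts;
}
```

Assuming an exponent of 16 (about 15.3 microjoules per count), the counter covers 65536 J, so `rapl_wrap_seconds(16, 100.0)` gives roughly 655 seconds, or about 11 minutes at 100 W. Higher power draw shortens this proportionally.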
Looking at our port, I realized that we are already taking care of wraparounds for these registers in the Intel port when we calculate deltas, as we need to do this for reporting power on these systems too. Take a look here.
My understanding is that we will be reporting the correct values for energy with the current Intel port if we choose to do deltas (no sampling will be needed if I am understanding the code correctly, but I haven't refreshed my memory on this port enough yet). I will have to test this explicitly when I start working on this PR. I believe @slabasan has tested these wraparounds before; she may be able to comment as well.
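To make the wraparound handling concrete, here is a minimal sketch of a wrap-safe delta on a 32-bit counter. It is not copied from the Intel port; the helper names `energy_counter_delta` and `counts_to_joules` are hypothetical and only illustrate the general technique.

```c
#include <stdint.h>

/* Hypothetical helper: counts elapsed between two reads of a 32-bit energy
 * counter. The result is reduced modulo 2^32, so a single wraparound between
 * the two reads still produces the correct difference. */
uint32_t energy_counter_delta(uint32_t prev, uint32_t curr)
{
    return curr - prev;
}

/* Hypothetical helper: convert raw counts to joules, where one count equals
 * 1 / 2^esu joules (esu is the energy-status-unit exponent). */
double counts_to_joules(uint32_t counts, unsigned esu)
{
    return (double)counts / (double)(1ULL << esu);
}
```

For example, `energy_counter_delta(0xFFFFFFF0u, 0x10u)` returns `0x20`. The caveat is that this only holds when the two reads are less than one full wrap apart; two or more wraps between reads cannot be distinguished from fewer, which is where the question about read frequency comes in.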
TLDR: Let's try to get a first cut at this while trusting the vendor APIs (and our Intel port). Let's document this well and explain to the users this decision. And let's leave an issue open to test for rollovers on each architecture, so we can fix these if we run into them or if any users run into them.
@tpatki Sounds good.
Update the API to support nested calls in general, especially in Caliper-like tools. This might be useful for the Kokkos-update as well.
Merge #559 and #563 first, and then add a new flag to the API.
Current suggestion: `variorum_get_energy_json(char** s)` will be updated to `variorum_get_energy_json(char** s, bool prev_delta)`. Setting `prev_delta` to `true` will return the accumulated energy since the previous call to the `variorum_get_energy_json` function from the application/tool's context. Setting this to `false` will return the accumulated energy since the first call to `variorum_get_energy_json`.

I will work on an initial WIP PR as soon as I can, hoping to get this merged in by end of August. Happy to take any feedback and suggestions on this.
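To illustrate the intended semantics of the flag, here is a small sketch of the bookkeeping involved. None of this is Variorum code: the `energy_tracker` struct and `energy_since` function are hypothetical, and passing the cumulative reading in as `now_j` is an assumption that keeps the example independent of any hardware.

```c
#include <stdbool.h>

/* Hypothetical per-caller state for the proposed prev_delta flag. */
typedef struct {
    bool initialized;
    double first_j; /* cumulative reading at the first call */
    double prev_j;  /* cumulative reading at the most recent call */
} energy_tracker;

/* now_j is the current cumulative energy reading in joules.
 * prev_delta == true:  energy since the previous call.
 * prev_delta == false: energy since the first call. */
double energy_since(energy_tracker *t, double now_j, bool prev_delta)
{
    if (!t->initialized) {
        t->initialized = true;
        t->first_j = now_j;
        t->prev_j = now_j;
        return 0.0;
    }
    double base = prev_delta ? t->prev_j : t->first_j;
    t->prev_j = now_j;
    return now_j - base;
}
```

With readings of 1000, 1010, 1025 and 1030 J, the calls return 0, then 10 (`true`), then 25 (`false`), then 5 (`true`). Keeping one tracker per caller is one possible way to stop nested callers from disturbing each other's "previous" reading, though the actual design is still open.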