Support a choice between delta from the first call versus previous call in `variorum_get_energy_json`

tpatki commented 3 months ago

Update the API to support nested calls in general, especially in Caliper-like tools. This might be useful for the Kokkos-update as well.

Merge #559 and #563 first, and then add a new flag to the API.

Current suggestion: variorum_get_energy_json(char** s) will be updated to variorum_get_energy_json(char** s, bool prev_delta). Setting prev_delta to true will return the energy accumulated since the previous call to variorum_get_energy_json from the application/tool's context. Setting it to false will return the energy accumulated since the first call to variorum_get_energy_json.
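To make the two modes concrete, here is a minimal sketch of the bookkeeping they imply. It is not Variorum code: the `energy_tracker` struct, the `energy_since` helper, and the microjoule unit are all hypothetical names and assumptions chosen for illustration, and the sketch assumes the raw reading only ever increases.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-caller state; not part of Variorum. */
typedef struct {
    bool initialized;   /* has a reading been recorded yet? */
    uint64_t first_uj;  /* reading taken at the first call */
    uint64_t prev_uj;   /* reading taken at the most recent call */
} energy_tracker;

/*
 * Returns the energy (assumed to be in microjoules) accumulated since the
 * previous call when prev_delta is true, or since the first call when it
 * is false. The very first call only records a baseline and returns 0.
 */
uint64_t energy_since(energy_tracker *t, uint64_t now_uj, bool prev_delta)
{
    if (!t->initialized) {
        t->initialized = true;
        t->first_uj = now_uj;
        t->prev_uj = now_uj;
        return 0;
    }
    uint64_t base = prev_delta ? t->prev_uj : t->first_uj;
    t->prev_uj = now_uj; /* every call updates the "previous" reading */
    return now_uj - base;
}
```

In this sketch each caller (for example, each nested Caliper region) would keep its own `energy_tracker`, so that one caller's "previous call" is not disturbed by another's.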

@dbo: Creating this issue to track our discussion.
@masterleinad: Tagging you so you are aware of this upcoming change. I might need your help with testing on Intel GPUs, as we don't have access to them at our end.
@tjeter @rountree Keeping you in the loop with this discussion as it is relevant to some of your research.

I will work on an initial WIP PR as soon as I can, hoping to get this merged in by end of August. Happy to take any feedback and suggestions on this.

rountree commented 3 months ago

@tpatki How does prev_delta=false handle counter rollover? Are we guaranteed to have a thread that's sampling in the background often enough to detect that?

tpatki commented 3 months ago

Hi @rountree That's a great question, and it will vary by the underlying architecture. See details below.

(I am hoping these notes will also help @dbo understand the challenges at our end and why supporting this will take some time.)

On IBM systems, there are no MSRs/counters, so there is no issue of rollover. The hardware does not report energy directly, so we are already sampling instantaneous power using the OPAL file system interface, currently every 250ms. On a related note, my recent experiments with Caliper tell me we should sample faster than that; IBM recommends 100ms and up. I will create an issue and update this sampling rate soon.
On GPUs (NVIDIA, AMD and Intel), energy values are reported directly from the underlying APIs (NVML, RSMI, APMI). See here as an example. These report energy values since the GPU driver was last loaded, typically. GPU vendors do not expose the counters directly to us, so we don't have to deal with overflows ourselves and can rely on their APIs. Currently, Variorum v0.8 does not support GPU energy reporting -- mostly due to lack of time/resources at our end. We have some WIP PRs on this, see #559 and #563.
On AMD CPUs (Milan and up), AMD provides ESMI library and the amd_energy kernel module (they also support msr-safe 👍 ). Here too, we do not interact with MSRs directly and rely on AMD's open source ESMI APIs for the processor, e.g. esmi_socket_energy_get, which are part of the port for Variorum that AMD contributed.
That leaves us with our most complex scenario on Intel CPUs. Here, we are directly using low-level MSR interfaces. We already have the infrastructure for MSR_PKG_ENERGY_STATUS and MSR_DRAM_ENERGY_STATUS in Variorum that we utilize to print_energy (and the same for JSON APIs). This is where I believe we will need to add a new thread, like we do with IBM, and sample often enough, say every 50ms. I could be wrong though, and we may be able to do this without adding explicit sampling in a separate thread. This support will be the greatest lift at our end but should be do-able.
I haven't looked at ARM as we haven't added support for print/get energy APIs there yet.
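For the IBM case above, where only instantaneous power is sampled, energy has to be derived by integrating the samples over time. The following is a rough sketch of that idea under stated assumptions, not the actual Variorum IBM port: the function name is hypothetical, the samples are assumed to be evenly spaced, and a simple left-rectangle sum is used.

```c
#include <stddef.h>

/*
 * Approximates energy in joules from n evenly spaced power samples
 * (in watts), each assumed to hold for interval_s seconds.
 * Hypothetical helper for illustration only.
 */
double energy_from_power_samples(const double *watts, size_t n,
                                 double interval_s)
{
    double joules = 0.0;
    for (size_t i = 0; i < n; i++) {
        joules += watts[i] * interval_s; /* power x time = energy */
    }
    return joules;
}
```

With a 250ms interval, four samples of 100, 100, 200 and 200 W add up to 150 J. A shorter interval tracks power changes between samples more closely, which is the motivation for sampling faster.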

Let me know if I answered that in enough detail, happy to have a meeting next week to discuss.

tpatki commented 3 months ago

@slabasan @rountree Checking in if you had more questions or feedback here. If not, I can take a stab at a PR so we can test this out in Caliper (and also add better support for our Kokkos users).

rountree commented 3 months ago

@tpatki

IBM samples instantaneous power, no rollover.
GPUs we use vendor API to get energy, but this doesn't isolate us from rollovers. We can tell the user that we're just passing along whatever value the device gives us, but if we have to do better for Intel, we might as well make that a general solution.
AMD same thing.

So yes, I'd prefer to have the general case be sampling occasionally unless the vendor documentation we have makes rollover a once-per-decade thing. But I'm not implementing this, so it's just a preference.

tpatki commented 3 months ago

Thanks @rountree.

Given that we have limited resources for Variorum at the moment, at least for the first cut at this, I am going to lean toward telling the user that we are passing along data that the vendor libraries are providing us (ESMI, RSMI, NVML, etc) and trusting that these vendor APIs take care of rollovers.

On some architectures (e.g. all GPUs), the low-level registers are not accessible at all, and we have no choice but to trust that APIs such as nvmlDeviceGetTotalEnergyConsumption will do the right thing -- which I believe they do. GPU APIs report energy values based on when the driver was last loaded -- see here, so they would have to take care of any rollovers (although we've not explicitly tested this).

Intel CPUs are the only exception to this situation, where we read directly from the MSR_PKG_ENERGY_STATUS or MSR_DRAM_ENERGY_STATUS registers in our code. These hold 32 bits of energy data and are updated every millisecond, resulting in a wraparound every few minutes.
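As a back-of-the-envelope check on the wraparound period, the sketch below estimates how long a 32-bit energy counter lasts at a given average power. The function name is hypothetical, and it assumes the energy unit is 2^-ESU joules per count, with the ESU exponent taken from MSR_RAPL_POWER_UNIT; the actual exponent varies by processor and should be read from the hardware rather than assumed.

```c
/*
 * Estimated seconds until a 32-bit energy counter wraps, assuming each
 * count is worth 2^-esu_exp joules and power stays at avg_watts.
 * Hypothetical helper; valid for esu_exp < 64 and avg_watts > 0.
 */
double rapl_wrap_seconds(unsigned esu_exp, double avg_watts)
{
    double joules_per_count = 1.0 / (double)(1ULL << esu_exp);
    double range_joules = 4294967296.0 * joules_per_count; /* 2^32 counts */
    return range_joules / avg_watts;
}
```

For example, with an exponent of 16 (about 15.3 microjoules per count) the counter spans 65536 J, so a package averaging 100 W would wrap after roughly 655 seconds, or about 11 minutes. Higher power draw shortens this proportionally.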

Looking at our port, I realized that we are already taking care of wraparounds for these registers in the Intel port when we calculate deltas, as we need to do this for reporting power on these systems too. Take a look here.
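For readers without the linked code at hand, a common way to handle a single wraparound when computing a delta between two 32-bit counter readings looks like the sketch below. This is a generic technique and an assumption on my part, not a copy of what the Variorum Intel port does; the function name is hypothetical.

```c
#include <stdint.h>

/*
 * Delta in raw counts between two reads of a 32-bit energy counter.
 * Only the low 32 bits of each MSR value are used. Unsigned 32-bit
 * subtraction is modulo 2^32, so the result is correct as long as the
 * counter wrapped at most once between the two reads.
 */
uint64_t rapl_counter_delta(uint64_t prev_raw, uint64_t curr_raw)
{
    uint32_t prev = (uint32_t)(prev_raw & 0xFFFFFFFFu);
    uint32_t curr = (uint32_t)(curr_raw & 0xFFFFFFFFu);
    return (uint32_t)(curr - prev);
}
```

The "at most once" condition is the important caveat: if two reads are further apart than one wraparound period, the lost wraps cannot be recovered from the counter alone, which is where periodic sampling would come in.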

My understanding is that we will be reporting the correct values for energy with the current Intel port if we choose to do deltas (no sampling will be needed if I am understanding the code correctly, but I haven't refreshed my memory on this port enough yet). I will have to test this explicitly when I start working on this PR. I believe @slabasan has tested these wraparounds before; she may be able to comment as well.

TLDR: Let's try to get a first cut at this while trusting the vendor APIs (and our Intel port). Let's document this well and explain to the users this decision. And let's leave an issue open to test for rollovers on each architecture, so we can fix these if we run into them or if any users run into them.

rountree commented 3 months ago

@tpatki Sounds good.

LLNL / variorum

Support a choice between delta from the first call versus previous call in `variorum_get_energy_json` #575