energymon / energymon

A portable interface for energy monitoring utilities
Apache License 2.0

No RAPL zones found! #53

Closed edisonchan closed 1 month ago

edisonchan commented 1 month ago

On a new AMD CPU I got this output:

    energymon_init_rapl: No RAPL zones found!
    energymon:finit: No such device
    source: Intel RAPL
    exclusive: false
    interval (usec): 1000
    precision (uJ): 0
    reading (uJ): 0

But turbostat reports:

    turbostat
    turbostat version 2024.05.10 - Len Brown lenb@kernel.org
    Kernel command line: BOOT_IMAGE=/boot/vmlinuz-6.9.9-060909-generic root=UUID=99b2cc4b-d814-457a-be83-92e966ce048d ro text amd_pstate=active
    CPUID(0): AuthenticAMD 0x10 CPUID levels
    CPUID(1): family:model:stepping 0x1a:44:0 (26:68:0) microcode 0x0
    CPUID(0x80000000): max_extended_levels: 0x80000028
    CPUID(1): SSE3 MONITOR - - - TSC MSR - HT -
    CPUID(6): APERF, No-TURBO, No-DTS, No-PTM, No-HWP, No-HWPnotify, No-HWPwindow, No-HWPepp, No-HWPpkg, No-EPB
    CPUID(7): No-SGX No-Hybrid
    cpu0: cpufreq driver: amd-pstate-epp
    cpu0: cpufreq governor: powersave
    /dev/cpu_dma_latency: 2000000000 usec (default)
    current_driver: acpi_idle
    current_governor: menu
    current_governor_ro: menu
    cpu0: POLL: CPUIDLE CORE POLL IDLE
    cpu0: C1: ACPI FFH MWAIT 0x0
    cpu0: C2: ACPI IOPORT 0x414
    cpu0: C3: ACPI IOPORT 0x415
    RAPL: 234 sec. Joule Counter Range, at 280 Watts
    cpu0: MSR_RAPL_PWR_UNIT: 0x000a1000 (1.000000 Watts, 0.000015 Joules, 0.000977 sec.)
    Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IPC IRQ POLL C1 C2 C3 POLL% C1% C2% C3% CorWatt PkgWatt

How can I fix it?

connorimes commented 1 month ago

The energymon-rapl implementation is designed for Intel processors, but may work with some AMD processors. This energymon backend parses the sysfs filesystem at /sys/class/powercap/ looking for intel-rapl nodes.
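To check by hand what the backend would see, the sysfs scan described above can be approximated in a few lines. This is a minimal sketch, not energymon's actual code: the function name `find_rapl_zones` is hypothetical, and it assumes the usual powercap layout in which each zone directory (for example `intel-rapl:0`) contains a `name` file.

```python
import os

def find_rapl_zones(root="/sys/class/powercap"):
    """Return (zone_dir, zone_name) pairs for intel-rapl powercap zones."""
    zones = []
    if not os.path.isdir(root):
        return zones  # powercap is not available at all
    for entry in sorted(os.listdir(root)):
        # Zones are named like "intel-rapl:0" or "intel-rapl:0:0";
        # the bare "intel-rapl" entry is the control type, not a zone.
        if not entry.startswith("intel-rapl:"):
            continue
        name_file = os.path.join(root, entry, "name")
        try:
            with open(name_file) as f:
                zones.append((entry, f.read().strip()))
        except OSError:
            continue  # unreadable zone; skip it
    return zones

if __name__ == "__main__":
    found = find_rapl_zones()
    if not found:
        print("No RAPL zones found")
    for zone, name in found:
        print(zone, name)
```

An empty result here corresponds to the "No RAPL zones found!" message: no powercap driver has registered any intel-rapl zone on the system.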

First, are you running with sudo/root privileges? Unfortunately this is needed to read powercap energy values.

If so and it's still not working, I'll point you to a different tool that's specifically meant for working with powercap (full disclosure: I'm also the developer). If you're using a Debian-based Linux like Ubuntu, just:

sudo apt install powercap-utils

Otherwise you can compile/install from its repository: https://github.com/powercap/powercap

Then run:

sudo powercap-info

and please share the output here.

edisonchan commented 1 month ago

I ran sudo chmod u+s /usr/local/bin/energymon-* before running energymon-info.

I also tried powercap-info earlier, but it printed nothing when I ran it.

    sudo modprobe intel_rapl_msr
    sudo rapl-info
    Zone does not exist
    Considerations for common errors:

I want to measure the energy used per bit of memory transfer, so I need a library like energymon; turbostat cannot be used in this case.

connorimes commented 1 month ago

If powercap-info (not just rapl-info) doesn't show any output, then your system hasn't loaded any powercap drivers from which you can read RAPL energy counters. This usually means that either your CPU doesn't support RAPL or the Linux kernel version on your system isn't detecting the RAPL support (it looks like you're using a recent kernel, but you also said your CPU is new).

If you're confident your CPU supports RAPL, you can try to read the MSRs directly. The energymon-msr backend does this, but I caution that (1) it's only for the "package" domain, and (2) MSR configurations can vary across CPUs and the msr energymon backend only implements the default configuration.

edisonchan commented 1 month ago

I got an error with energymon-msr (git master):

    edison@u24:~/Downloads/energymon/build/msr$ sudo ./energymon-info
    /dev/cpu/0/msr: Input/output error
    energymon:finit: Input/output error
    source: X86 MSR
    exclusive: false
    interval (usec): 1000
    precision (uJ): 0
    reading (uJ): 0

I think that may be caused by the MSR bits? (left: turbostat.c; right: energymon-msr.c) [image]

connorimes commented 1 month ago

energymon-info only uses the MSR_RAPL_POWER_UNIT and MSR_PKG_ENERGY_STATUS registers, and IIRC you'll get an I/O error like you saw if either MSR doesn't exist. Like I said, that code was written for Intel CPUs and only for the "package" domain. If AMD implements RAPL in different MSRs, the energymon code would need to be modified. If you can determine what turbostat is doing, you can make those changes yourself (I have no way to test them).

FYI you can test reading MSRs directly from the terminal using rdmsr (from the msr-tools Debian package I think).
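The same check can be scripted. The sketch below mimics what rdmsr does on Linux: it opens `/dev/cpu/<n>/msr` (provided by the `msr` kernel module, root required) and reads 8 bytes at the file offset equal to the register address. The function name `read_msr` and the `dev_root` parameter are hypothetical and exist only for illustration; a read of a register that does not exist is expected to fail with an I/O error like the one reported above.

```python
import os
import struct

def read_msr(reg, cpu=0, dev_root="/dev/cpu"):
    """Read one 64-bit MSR through the Linux msr character device."""
    path = os.path.join(dev_root, str(cpu), "msr")
    fd = os.open(path, os.O_RDONLY)
    try:
        # The register address is used as the file offset; a read returns 8 bytes.
        data = os.pread(fd, 8, reg)
    finally:
        os.close(fd)
    if len(data) != 8:
        raise OSError("short read from " + path)
    return struct.unpack("<Q", data)[0]

if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        reg = int(sys.argv[1], 16)  # register address in hex, e.g. 0x606
        try:
            print(hex(read_msr(reg)))
        except OSError as err:
            print("read failed:", err)
```

Run it as root with a register address as the only argument, after loading the module with `sudo modprobe msr`.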

connorimes commented 1 month ago

I'd be curious if/how you end up getting any RAPL results from your AMD CPU, but AFAICT this isn't an issue in energymon. Cheers.

edisonchan commented 4 weeks ago

> I'd be curious if/how you end up getting any RAPL results from your AMD CPU, but AFAICT this isn't an issue in energymon. Cheers.

I can now read the processor's energy status in C code without any other libraries, using the simplest rdmsr-style method (reference: https://github.com/deater/uarch-configure/tree/master/rapl-read). Because the processor's RAPL MSR is 64-bit, I don't need additional code or a separate thread to accumulate across wraparounds.

I'm not very familiar with energymon's MSR implementation (is MSR Intel only?) and can't find a way to modify it.

connorimes commented 4 weeks ago

It looks like that repository has support for (at least some) AMD CPUs. I can see that they specify different MSRs for those chips. Whereas energymon-msr expects Intel CPUs and reads from MSRs 0x606 (for units) and 0x611 (for package energy), it looks like AMD requires MSRs 0xc0010299 and 0xc001029b, respectively. The registers might also use different encodings, so even if you read them you'll need to know how to make sense of the bits. I haven't looked myself.
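As an illustration of what "making sense of the bits" involves, here is a minimal sketch that decodes the units value turbostat printed earlier in this thread (0x000a1000). It assumes the field layout Intel documents for MSR_RAPL_POWER_UNIT (power in bits 3:0, energy in bits 12:8, time in bits 19:16, each meaning 1/2^n); whether AMD uses the same layout is an assumption here, although the decoded numbers do match what turbostat reported for this CPU. The function name `decode_rapl_units` is hypothetical.

```python
def decode_rapl_units(raw):
    """Split a RAPL units register value into (watts, joules, seconds) per count."""
    power_bits = raw & 0xF            # bits 3:0 (assumed Intel layout)
    energy_bits = (raw >> 8) & 0x1F   # bits 12:8
    time_bits = (raw >> 16) & 0xF     # bits 19:16
    # Each field n encodes a unit of 1 / 2**n.
    return (1.0 / (1 << power_bits),
            1.0 / (1 << energy_bits),
            1.0 / (1 << time_bits))

watts, joules, seconds = decode_rapl_units(0x000A1000)
# Prints: 1.000000 Watts, 0.000015 Joules, 0.000977 sec.
print("%.6f Watts, %.6f Joules, %.6f sec." % (watts, joules, seconds))
```

A raw energy counter reading is then converted to Joules by multiplying it by the second element of the result.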

Also, while I expect the registers on both architectures to be 64 bit, (1) not all the bits are necessarily used for the energy value and (2) the values have some unit of measurement that is not likely to be a standard unit (e.g., Joule or microJoule), hence the need for the power units register. The energy register values can still overflow, but depending on how long your experiments run, it may not be necessary for you to detect this in code, as the likelihood of it happening in the middle of your experiments might be exceptionally small, especially if your system isn't drawing a lot of power. In contrast, energymon-msr is designed to support long-running experiments on servers that can draw significantly more power, where the energy counter registers would overflow frequently enough that it was a problem that needed to be addressed.
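To put rough numbers on the overflow question, the sketch below estimates how long a counter lasts and computes a wrap-safe difference between two raw readings. It assumes the energy value occupies the low 32 bits of the register, as Intel documents for its energy status MSRs; that width is an assumption for AMD, though with a unit of 2^-16 J it reproduces the "234 sec. Joule Counter Range, at 280 Watts" line from the turbostat output in this thread. The function names are hypothetical.

```python
COUNTER_BITS = 32  # assumed width of the energy field, not confirmed for AMD

def seconds_until_wrap(joules_per_count, watts, bits=COUNTER_BITS):
    """Time for the counter to cover its full range at a constant power draw."""
    return (1 << bits) * joules_per_count / watts

def energy_delta_joules(prev_raw, curr_raw, joules_per_count, bits=COUNTER_BITS):
    """Energy between two readings, correct across at most one wraparound."""
    mask = (1 << bits) - 1
    counts = ((curr_raw & mask) - (prev_raw & mask)) & mask  # modular subtraction
    return counts * joules_per_count

unit = 1.0 / (1 << 16)  # 2**-16 J per count, from the units register above
print(round(seconds_until_wrap(unit, 280.0)))        # 234
print(energy_delta_joules(0xFFFFFFF0, 0x10, unit))   # 32 counts across a wrap
```

The modular subtraction only helps if readings are taken more often than the wrap interval; if two or more wraps can occur between samples, the lost energy cannot be recovered from the raw values alone.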