elastic / beats

:tropical_fish: Beats - Lightweight shippers for Elasticsearch & Logstash
https://www.elastic.co/products/beats

[metricbeat] Enhancement: expanded CPU monitoring #11848

Open fearful-symmetry opened 5 years ago

fearful-symmetry commented 5 years ago

First of all, I noticed that nothing monitors CPU thermal throttles, which are available under /sys/devices/system/cpu/cpu%d/thermal_throttle. These counters measure temperature-instigated package/core throttles and could be very useful for bare-metal deployments.
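As a rough illustration of what collecting these counters could look like, here is a minimal Go sketch. It assumes the directory contains the files `core_throttle_count` and `package_throttle_count`, as exposed by Linux on Intel hardware. The type and function names are hypothetical and are not taken from Metricbeat.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// ThrottleCounts holds the cumulative throttle counters for one CPU.
// Hypothetical type, for illustration only.
type ThrottleCounts struct {
	Core    uint64
	Package uint64
}

// parseCounter parses the content of a sysfs file that holds a single
// unsigned decimal integer followed by a newline.
func parseCounter(raw string) (uint64, error) {
	return strconv.ParseUint(strings.TrimSpace(raw), 10, 64)
}

// readCounter reads one counter file and parses it.
func readCounter(path string) (uint64, error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return parseCounter(string(raw))
}

// readThrottleCounts reads both counters for one CPU below root.
// The file names are an assumption about the sysfs layout.
func readThrottleCounts(root string, cpu int) (ThrottleCounts, error) {
	dir := filepath.Join(root, fmt.Sprintf("cpu%d", cpu), "thermal_throttle")
	core, err := readCounter(filepath.Join(dir, "core_throttle_count"))
	if err != nil {
		return ThrottleCounts{}, err
	}
	pkg, err := readCounter(filepath.Join(dir, "package_throttle_count"))
	if err != nil {
		return ThrottleCounts{}, err
	}
	return ThrottleCounts{Core: core, Package: pkg}, nil
}

func main() {
	counts, err := readThrottleCounts("/sys/devices/system/cpu", 0)
	if err != nil {
		fmt.Println("thermal throttle counters unavailable:", err)
		return
	}
	fmt.Printf("cpu0 core=%d package=%d\n", counts.Core, counts.Package)
}
```

The counters are cumulative, so a collector would most likely report the difference between two samples rather than the raw values.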

My other ideas are a tad more...far out there. I utilized both of these in some way or another at my last job, so they're not new for me.

New-ish Intel CPUs have a feature called RAPL (Running Average Power Limit) that can be used to gather power usage across 4 domains: the package, core, uncore (stuff that's not the core, integrated graphics, etc.), and DRAM. This is how the Intel Power Gadget works. We can use this to gather relatively detailed and granular power usage for bare-metal hosts.
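To make the RAPL idea more concrete, here is a hedged Go sketch of the arithmetic only. It assumes the raw values have already been read from the MSRs that Intel documents for this purpose: `MSR_RAPL_POWER_UNIT` (0x606) for the units and a per-domain energy status register such as `MSR_PKG_ENERGY_STATUS` (0x611). On Linux that typically means /dev/cpu/N/msr with the msr module loaded and root privileges. The function names are hypothetical and the numbers in `main` are example values, not measurements.

```go
package main

import "fmt"

// energyUnitJoules decodes the energy status unit (ESU) from a raw
// MSR_RAPL_POWER_UNIT value. Bits 12:8 hold ESU, and one tick of an
// energy status counter equals 1/2^ESU joules.
func energyUnitJoules(powerUnitMSR uint64) float64 {
	esu := (powerUnitMSR >> 8) & 0x1f
	return 1.0 / float64(uint64(1)<<esu)
}

// averageWatts converts two raw energy status readings into average
// power over elapsedSeconds. The counter is 32 bits wide and wraps, so
// the subtraction is done in uint32, which absorbs a single wraparound.
func averageWatts(prev, curr uint32, unitJoules, elapsedSeconds float64) float64 {
	ticks := curr - prev
	return float64(ticks) * unitJoules / elapsedSeconds
}

func main() {
	// Example values only: this power-unit word encodes ESU = 14.
	unit := energyUnitJoules(0x000A0E03)
	watts := averageWatts(1000, 1000+163840, unit, 1.0)
	fmt.Printf("unit=%g J, power=%.2f W\n", unit, watts)
}
```

The sampling interval has to be short enough that the counter wraps at most once between reads, otherwise the computed power will be too low.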

I'd also like to bring up the possibility of reading from other MSRs that could be useful for the sake of monitoring. MCA is the most obvious and the most interesting, as it provides info on low-level CPU hardware errors. It (obviously) overlaps with mcelog, and I'm not sure if Filebeat or something else could already be integrating with that. There's also an IA32_THERM_STATUS MSR, which can be used to set and get the status of thermal thresholds.
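For the thermal status MSR mentioned above (IA32_THERM_STATUS, address 0x19C in Intel's documentation), a decoder for a few of its fields might look like the Go sketch below. It covers only a subset of the register and assumes the raw 64-bit value was already read elsewhere. The struct and function names are hypothetical, and the value in `main` is a made-up example.

```go
package main

import "fmt"

// ThermStatus is a decoded subset of IA32_THERM_STATUS.
// Hypothetical type, for illustration only.
type ThermStatus struct {
	Throttling   bool  // bit 0: thermal status, active right now
	ThrottledLog bool  // bit 1: sticky log of a past thermal event
	Readout      uint8 // bits 22:16: degrees C below TjMax
	ReadingValid bool  // bit 31: Readout is valid
}

// decodeThermStatus extracts the fields above from a raw MSR value.
func decodeThermStatus(raw uint64) ThermStatus {
	return ThermStatus{
		Throttling:   raw&(1<<0) != 0,
		ThrottledLog: raw&(1<<1) != 0,
		Readout:      uint8((raw >> 16) & 0x7f),
		ReadingValid: raw&(1<<31) != 0,
	}
}

func main() {
	// Made-up example value, not read from real hardware.
	s := decodeThermStatus(0x88280002)
	fmt.Printf("%+v\n", s)
}
```

The digital readout is relative: an absolute temperature needs TjMax, which Intel documents in a separate register (`MSR_TEMPERATURE_TARGET`), so this sketch does not attempt that conversion.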

exekias commented 5 years ago

This sounds useful. I understand that thermal throttle counters make sense in the existing system cpu metricset, and they sound like a quick win.

I'm wondering if the Filebeat system module is capturing these events (from syslog/dmesg).

fearful-symmetry commented 5 years ago

> I understand that thermal throttle counters make sense in the existing system cpu metricset

Yep, that was my thinking as well.

> I'm wondering if the Filebeat system module is capturing these events (from syslog/dmesg).

You mean mcelog? I would assume so.

exekias commented 5 years ago

Tbh, I don't know enough about mcelog :innocent:. I was referring to dmesg thermal events, like:

[196583.288213] CPU2: Core temperature above threshold, cpu clock throttled (total events = 487797)

In any case, it's good to have both events (from Filebeat) and metrics (from Metricbeat).
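To show how a dmesg line like the one above could be turned into structured fields, here is a small Go sketch. The regular expression is written against that sample line. Accepting "Package" as well as "Core" is an assumption about how the kernel words the package-level message, and the type and function names are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// throttleLine matches the kernel thermal throttle message quoted in this
// thread. The "Package" alternative is an assumption.
var throttleLine = regexp.MustCompile(
	`CPU(\d+): (Core|Package) temperature above threshold, cpu clock throttled \(total events = (\d+)\)`)

// ThrottleEvent is a hypothetical structured form of one log line.
type ThrottleEvent struct {
	CPU         int
	Scope       string
	TotalEvents uint64
}

// parseThrottleLine returns the parsed event and true when the line
// matches, or a zero event and false otherwise.
func parseThrottleLine(line string) (ThrottleEvent, bool) {
	m := throttleLine.FindStringSubmatch(line)
	if m == nil {
		return ThrottleEvent{}, false
	}
	cpu, err := strconv.Atoi(m[1])
	if err != nil {
		return ThrottleEvent{}, false
	}
	total, err := strconv.ParseUint(m[3], 10, 64)
	if err != nil {
		return ThrottleEvent{}, false
	}
	return ThrottleEvent{CPU: cpu, Scope: m[2], TotalEvents: total}, true
}

func main() {
	line := "[196583.288213] CPU2: Core temperature above threshold, cpu clock throttled (total events = 487797)"
	if ev, ok := parseThrottleLine(line); ok {
		fmt.Printf("%+v\n", ev)
	}
}
```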

fearful-symmetry commented 5 years ago

Ahhh, yah, I would imagine a lot of that ends up in dmesg.

RayKishev commented 4 years ago

@fearful-symmetry I have been looking into this for a few days now. Is there any way to get CPU temp information in Kibana?

fearful-symmetry commented 4 years ago

@RayKishev I don't think we have anything for temp data now, but there might be something buried in the system metricset somewhere.

RayKishev commented 4 years ago

@fearful-symmetry Thank you for your reply. I think I know what might be a solution. I have created a script that writes CPU temp metrics and logs to a log file under /var/log. The script will run as a cron job. Now, how can we create indexes for the temp logs I have generated?

jsoriano commented 4 years ago

CPU temperature monitoring has been requested on Discuss at least a couple of times.

elasticmachine commented 2 years ago

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

svanschooten commented 1 year ago

Any updates on this issue?