Daniel15 / prometheus-net.SystemMetrics

Export system metrics (CPU usage, disk usage, etc) to Prometheus from your .NET app
MIT License

node_cpu_seconds_total, CPU core at full usage, when it's not? #9

Open No0Vad opened 4 months ago

No0Vad commented 4 months ago

I've noticed my Grafana graph started showing one of the CPU cores at 100% (in this case CPU 0), but when I log in to the server the core is not at 100% usage; it was around 1% when I checked. The server has been up for around 293 days and is running Windows Server 2016 with SQL. It is a VM with 3 sockets and 2 cores per socket, giving 6 cores in total.

If I use this expression in Prometheus:

node_cpu_seconds_total{instance=~"sql",mode="idle"}

This is returned

node_cpu_seconds_total{cpu="0", instance="sql", job="SERVER", mode="idle"} = 4220362
node_cpu_seconds_total{cpu="1", instance="sql", job="SERVER", mode="idle"} = 307062.125
node_cpu_seconds_total{cpu="2", instance="sql", job="SERVER", mode="idle"} = 486540.15625
node_cpu_seconds_total{cpu="3", instance="sql", job="SERVER", mode="idle"} = 205526.96875
node_cpu_seconds_total{cpu="4", instance="sql", job="SERVER", mode="idle"} = 4148659.5
node_cpu_seconds_total{cpu="5", instance="sql", job="SERVER", mode="idle"} = 3947150.5

Now if I transform it into a usage percentage from 0-100:

(1 - rate(node_cpu_seconds_total{instance=~"sql",mode="idle"}[1m])) * 100

This is returned; CPU 0 shows 100% when it actually isn't.

{cpu="0", instance="sql", job="SERVER", mode="idle"} = 100
{cpu="1", instance="sql", job="SERVER", mode="idle"} = 0.23148148148146586
{cpu="2", instance="sql", job="SERVER", mode="idle"} = 1.4467592592592449
{cpu="3", instance="sql", job="SERVER", mode="idle"} = 0.4340277777777679
{cpu="4", instance="sql", job="SERVER", mode="idle"} = 0.9259259259259078
{cpu="5", instance="sql", job="SERVER", mode="idle"} = 3.703703703703687

And without inverting it into a usage percentage, i.e. how idle each core is:

rate(node_cpu_seconds_total{instance=~"sql",mode="idle"}[1m])

Returns this

{cpu="0", instance="sql", job="SERVER", mode="idle"} = 0
{cpu="1", instance="sql", job="SERVER", mode="idle"} = 0.9959490740740742
{cpu="2", instance="sql", job="SERVER", mode="idle"} = 0.9756944444444445
{cpu="3", instance="sql", job="SERVER", mode="idle"} = 0.9942129629629631
{cpu="4", instance="sql", job="SERVER", mode="idle"} = 0.9953703703703705
{cpu="5", instance="sql", job="SERVER", mode="idle"} = 0.9814814814814816

Just using the numbers from the interval without any transform:

node_cpu_seconds_total{instance=~"sql",mode="idle"}[1m]

This is returned

node_cpu_seconds_total{cpu="0", instance="sql", job="SERVER", mode="idle"}
4220362 @1708597198.39
4220362 @1708597207.39
4220362 @1708597216.39
4220362 @1708597225.39
4220362 @1708597234.39
4220362 @1708597243.39
4220362 @1708597252.39

node_cpu_seconds_total{cpu="1", instance="sql", job="SERVER", mode="idle"}
307483.71875 @1708597198.39
307492.65625 @1708597207.39
307501.625 @1708597216.39
307510.625 @1708597225.39
307519.59375 @1708597234.39
307528.5625 @1708597243.39
307537.5625 @1708597252.39

node_cpu_seconds_total{cpu="2", instance="sql", job="SERVER", mode="idle"}
486960.46875 @1708597198.39
486969.375 @1708597207.39
486978.03125 @1708597216.39
486986.59375 @1708597225.39
486995.28125 @1708597234.39
487004.03125 @1708597243.39
487013 @1708597252.39

node_cpu_seconds_total{cpu="3", instance="sql", job="SERVER", mode="idle"}
205948.046875 @1708597198.39
205956.984375 @1708597207.39
205965.9375 @1708597216.39
205974.921875 @1708597225.39
205983.875 @1708597234.39
205992.84375 @1708597243.39
206001.8125 @1708597252.39

node_cpu_seconds_total{cpu="4", instance="sql", job="SERVER", mode="idle"}
4149079.75 @1708597198.39
4149088.5 @1708597207.39
4149097.75 @1708597216.39
4149106.5 @1708597225.39
4149115.75 @1708597234.39
4149124.75 @1708597243.39
4149133.5 @1708597252.39

node_cpu_seconds_total{cpu="5", instance="sql", job="SERVER", mode="idle"}
3947558.5 @1708597198.39
3947566.5 @1708597207.39
3947575.75 @1708597216.39
3947584.5 @1708597225.39
3947593.75 @1708597234.39
3947602.5 @1708597243.39
3947611 @1708597252.39
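
A naive recomputation from the raw samples above shows the same picture (illustrative Python only; Prometheus' rate() additionally extrapolates to the edges of the window, so its numbers differ slightly from this simple delta/delta):

```python
# First and last raw samples for cpu 0 and cpu 1, taken from the range above.
cpu0 = [(1708597198.39, 4220362.0), (1708597252.39, 4220362.0)]
cpu1 = [(1708597198.39, 307483.71875), (1708597252.39, 307537.5625)]

def naive_rate(samples):
    """Per-second increase between the first and last sample in the window."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

print(naive_rate(cpu0))              # 0.0 -> (1 - 0) * 100 = 100% "usage" for cpu 0
print((1 - naive_rate(cpu1)) * 100)  # ~0.29% usage for cpu 1
```

Because cpu 0's samples are frozen at 4220362, any rate over the window is exactly zero idle, which the inverted expression renders as 100% busy.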

I'm not sure what's causing this; restarting the app that runs SystemMetrics fixes it, but the issue above returns after a while.

I've noticed the seconds for CPU 0 are all 4220362; they've stopped increasing. My guess is that when CPU 4 also reaches 4220362 (71283 seconds away, or 19h 48m 03s) the same thing will happen: it will report 100% usage when it's not.

I will add a comment if that is the case.

Daniel15 commented 4 months ago

Interesting. Thanks for the bug report. I'm only using this on Linux servers in production which is likely why I haven't noticed this.

I've noticed the seconds on CPU 0 all are 4220362

Once converted to milliseconds, this looks suspiciously close to the maximum value for a 32-bit unsigned integer. I wonder if there's something wrong with the way it's reading the counters. The code isn't very complex: https://github.com/Daniel15/prometheus-net.SystemMetrics/blob/master/src/Prometheus.SystemMetrics/Collectors/WindowsCpuUsageCollector.cs. I can check what other software does.
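
For what it's worth, the arithmetic behind that hunch (illustrative only; whether the underlying counter really is a 32-bit millisecond value is an assumption, not something confirmed here):

```python
# The stuck idle value, expressed in milliseconds, compared to the uint32 maximum.
stuck_seconds = 4220362
ms = stuck_seconds * 1000      # 4_220_362_000 ms
print(ms, 2**32)               # 4220362000 vs 4294967296
print(100 * ms / 2**32)        # ~98.3% of the uint32 maximum
```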

Perhaps the counters are not designed for long-term use, and we need to re-create the counters once they're above a certain number. I'm not sure how well Prometheus handles counters cycling back to 0 though.

This is probably not doable on a production system, but if you have a staging/test system you could try leaving perfmon open and tracking the same counters for the same period of time, and see if it hits the same issue. The counters being used are the highlighted ones here: [screenshot]

No0Vad commented 4 months ago

I have not restarted the app since I noticed this again, so I ran perfmon with the counters you mentioned for almost two minutes. It looks alright there; there is little load on the server, so idle is high.

[screenshot]

During this time, the app that runs SystemMetrics was still reporting 4220362 for CPU 0.

# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="system"} 9631.453125
node_cpu_seconds_total{cpu="0",mode="irq"} 123.70313262939453
node_cpu_seconds_total{cpu="0",mode="idle"} 4220362
node_cpu_seconds_total{cpu="0",mode="user"} 50854.734375

I'm unsure what will happen in Grafana if the value is reset. Prometheus says this about the rate expression:

"Calculates per-second increase of a range of vector (for counters)"

I would guess it would be treated as 0% by my expression at first, because of the negative number (I have a minimum of 0 set, too), before it recovers by itself. If so, I'd rather have that result than 100% until a reboot. šŸ˜…

In 9h 17m 26s I'll know if the same thing happens to CPU 4, getting stuck at 4220362 and showing 100% in Grafana.
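
For reference, Prometheus' rate() treats any decrease between adjacent samples as a counter reset and counts up from zero again. A minimal model of that reset handling (illustrative Python, not the actual Prometheus code):

```python
def increase(samples):
    """Sum of increases across samples, treating any drop as a reset to zero."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur  # reset: count from 0 up to cur
    return total

# A stuck counter that resets when the app restarts, as seen in this issue:
print(increase([4220362, 4220362, 90443, 90452]))  # 90452.0
```

Note that the whole post-reset value (90443) is counted as new increase, which over a short window can produce a very large per-second rate for one evaluation.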

Daniel15 commented 4 months ago

I think the counter starts when the app starts reading it, so is it possible to leave perfmon open for the same period of time? Open it now, leave it minimized, and see if it's still working 24 hours later?

I have an idea for a workaround: it could create new counters every hour and store the values reached by the previous counters. Then for each metric it could return the sum of the old counters plus the value of the new counter. This should keep the values going up rather than resetting to 0, and as long as Prometheus uses 64-bit ints (which I think it does), it'll be fine.
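
That rollover workaround could be sketched roughly like this (hypothetical Python pseudocode for the idea; the real collector is C# and the names here are made up):

```python
class RollingCounter:
    """Keeps the reported value monotonic while the underlying counter
    is periodically re-created (or wraps back to a small value)."""

    def __init__(self):
        self.base = 0.0   # sum of final values of retired counters
        self.last = 0.0   # last raw reading of the current counter

    def observe(self, raw):
        if raw < self.last:       # counter was re-created: bank the old total
            self.base += self.last
        self.last = raw
        return self.base + raw    # never less than the previous reported value

c = RollingCounter()
c.observe(100.0); c.observe(250.0)   # normal growth
print(c.observe(5.0))                # counter re-created -> 255.0, not 5.0
```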

I'll look into doing that soon.

No0Vad commented 4 months ago

I'll let perfmon run for a bit longer.

Prometheus can handle int64; the metric node_network_receive_bytes_total is an int64, for example.
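
As an aside, Prometheus sample values are actually stored as float64 rather than true int64, but a float64 still represents integers exactly up to 2**53, far beyond the range involved here (a quick illustrative check):

```python
# float64 represents every integer exactly up to 2**53 (~9.0e15).
assert float(2**53) == 2**53
assert float(2**53 + 1) == float(2**53)   # first integer that loses exactness
print("summed counters stay exact well past the ~4.2e9 values seen here")
```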

No0Vad commented 4 months ago

I let perfmon run overnight; the graph still looks like my first image.

CPU 4 has now passed CPU 0's stuck value of 4220362 seconds and is still increasing for the moment. When executing my expression again, there are negative values from time to time:

(1 - rate(node_cpu_seconds_total{instance=~"sql",mode="idle"}[1m])) * 100

First run

{cpu="0", instance="sql", job="SERVER", mode="idle"} = 100
{cpu="1", instance="sql", job="SERVER", mode="idle"} = 0.2083333333333437
{cpu="2", instance="sql", job="SERVER", mode="idle"} = 0.6944444444444531
{cpu="3", instance="sql", job="SERVER", mode="idle"} = 0.4166666666666763
{cpu="4", instance="sql", job="SERVER", mode="idle"} = 1.1102230246251565e-14
{cpu="5", instance="sql", job="SERVER", mode="idle"} = 6.666666666666677

Second run

{cpu="0", instance="sql", job="SERVER", mode="idle"} = 100
{cpu="1", instance="sql", job="SERVER", mode="idle"} = 0.23148148148146586
{cpu="2", instance="sql", job="SERVER", mode="idle"} = 1.5046296296296169
{cpu="3", instance="sql", job="SERVER", mode="idle"} = 0.4629629629629539
{cpu="4", instance="sql", job="SERVER", mode="idle"} = -2.220446049250313e-14
{cpu="5", instance="sql", job="SERVER", mode="idle"} = -0.4629629629629761

node_cpu_seconds_total{instance=~"sql",mode="idle"}[1m]

node_cpu_seconds_total{cpu="0", instance="sql", job="SERVER", mode="idle"}
4220362 @1708675237.39
4220362 @1708675246.39
4220362 @1708675255.39
4220362 @1708675264.39
4220362 @1708675273.39
4220362 @1708675282.39
4220362 @1708675291.39

node_cpu_seconds_total{cpu="1", instance="sql", job="SERVER", mode="idle"}
385135.84375 @1708675237.39
385144.84375 @1708675246.39
385153.8125 @1708675255.39
385162.78125 @1708675264.39
385171.75 @1708675273.39
385180.71875 @1708675282.39
385189.6875 @1708675291.39

node_cpu_seconds_total{cpu="2", instance="sql", job="SERVER", mode="idle"}
564315.4375 @1708675237.39
564324.375 @1708675246.39
564333.3125 @1708675255.39
564342.25 @1708675264.39
564351.125 @1708675273.39
564360.125 @1708675282.39
564369 @1708675291.39

node_cpu_seconds_total{cpu="3", instance="sql", job="SERVER", mode="idle"}
283615.125 @1708675237.39
283624.09375 @1708675246.39
283633.03125 @1708675255.39
283641.9375 @1708675264.39
283650.875 @1708675273.39
283659.84375 @1708675282.39
283668.78125 @1708675291.39

node_cpu_seconds_total{cpu="4", instance="sql", job="SERVER", mode="idle"}
4226748 @1708675237.39
4226757 @1708675246.39
4226766 @1708675255.39
4226775 @1708675264.39
4226784.5 @1708675273.39
4226793 @1708675282.39
4226802 @1708675291.39

node_cpu_seconds_total{cpu="5", instance="sql", job="SERVER", mode="idle"}
4024321.5 @1708675237.39
4024330.25 @1708675246.39
4024339 @1708675255.39
4024348.25 @1708675264.39
4024357 @1708675273.39
4024366 @1708675282.39
4024374.75 @1708675291.39

I decided to restart the SystemMetrics app, and the expressions now return this:

(1 - rate(node_cpu_seconds_total{instance=~"sql",mode="idle"}[1m])) * 100

{cpu="0", instance="sql", job="SERVER", mode="idle"}    -167420.38483796298
{cpu="1", instance="sql", job="SERVER", mode="idle"}    0.9837962962962798
{cpu="2", instance="sql", job="SERVER", mode="idle"}    3.009259259259245
{cpu="3", instance="sql", job="SERVER", mode="idle"}    1.27314814814814
{cpu="4", instance="sql", job="SERVER", mode="idle"}    2.777777777777768
{cpu="5", instance="sql", job="SERVER", mode="idle"}    1.3888888888888729

node_cpu_seconds_total{instance=~"sql",mode="idle"}[1m]

node_cpu_seconds_total{cpu="0", instance="sql", job="SERVER", mode="idle"}
4220362 @1708675417.39
4220362 @1708675426.39
4220362 @1708675435.39
4220362 @1708675444.39
90443.1640625 @1708675453.39
90452.03125 @1708675462.39
90461.0078125 @1708675471.39

node_cpu_seconds_total{cpu="1", instance="sql", job="SERVER", mode="idle"}
385315.21875 @1708675417.39
385324.1875 @1708675426.39
385333.03125 @1708675435.39
385341.96875 @1708675444.39
385350.90625 @1708675453.39
385359.71875 @1708675462.39
385368.6875 @1708675471.39

node_cpu_seconds_total{cpu="2", instance="sql", job="SERVER", mode="idle"}
564494 @1708675417.39
564502.75 @1708675426.39
564511.5 @1708675435.39
564520.25 @1708675444.39
564528.8125 @1708675453.39
564537.6875 @1708675462.39
564546.375 @1708675471.39

node_cpu_seconds_total{cpu="3", instance="sql", job="SERVER", mode="idle"}
283793.8125 @1708675417.39
283802.78125 @1708675426.39
283811.625 @1708675435.39
283820.5 @1708675444.39
283829.4375 @1708675453.39
283838.25 @1708675462.39
283847.125 @1708675471.39

node_cpu_seconds_total{cpu="4", instance="sql", job="SERVER", mode="idle"}
4226927.5 @1708675417.39
4226936.5 @1708675426.39
4226945.5 @1708675435.39
4226953.5 @1708675444.39
4226962.5 @1708675453.39
4226971.5 @1708675462.39
4226980 @1708675471.39

node_cpu_seconds_total{cpu="5", instance="sql", job="SERVER", mode="idle"}
4024496.75 @1708675417.39
4024505.5 @1708675426.39
4024514.5 @1708675435.39
4024523.25 @1708675444.39
4024532 @1708675453.39
4024540.75 @1708675462.39
4024550 @1708675471.39

Note how my expression returns -167420 for CPU 0; my guess about what would happen was correct. In my graph it looks like 0%. [screenshot]

My expression still sometimes returns a negative value; I'm unsure if that's expected:

(1 - rate(node_cpu_seconds_total{instance=~"sql",mode="idle"}[1m])) * 100

{cpu="0", instance="sql", job="SERVER", mode="idle"} = 0.3038194444444309
{cpu="1", instance="sql", job="SERVER", mode="idle"} = 0.28935185185183787
{cpu="2", instance="sql", job="SERVER", mode="idle"} = 1.3888888888888729
{cpu="3", instance="sql", job="SERVER", mode="idle"} = 0.7523148148148029
{cpu="4", instance="sql", job="SERVER", mode="idle"} = -2.220446049250313e-14
{cpu="5", instance="sql", job="SERVER", mode="idle"} = 1.8518518518518379
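
The tiny negative values like -2.2e-14 look like float64 round-off rather than real negative usage: when the measured idle rate comes out as 1.0 plus one ulp, (1 - rate) * 100 lands just below zero (illustrative Python):

```python
# 1.0000000000000002 is 1.0 + 2**-52, the smallest float64 above 1.0.
rate = 1.0000000000000002
value = (1 - rate) * 100
print(value)   # -2.220446049250313e-14, matching the cpu="4" value above
```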

node_cpu_seconds_total{instance=~"sql",mode="idle"}[1m]

node_cpu_seconds_total{cpu="0", instance="sql", job="SERVER", mode="idle"}
90899.359375 @1708675912.39
90908.34375 @1708675921.39
90917.2890625 @1708675930.39
90926.265625 @1708675939.39
90935.2421875 @1708675948.39
90944.21875 @1708675957.39
90953.1953125 @1708675966.39

node_cpu_seconds_total{cpu="1", instance="sql", job="SERVER", mode="idle"}
385807.875 @1708675912.39
385816.84375 @1708675921.39
385825.78125 @1708675930.39
385834.75 @1708675939.39
385843.75 @1708675948.39
385852.71875 @1708675957.39
385861.71875 @1708675966.39

node_cpu_seconds_total{cpu="2", instance="sql", job="SERVER", mode="idle"}
564975.75 @1708675912.39
564984.6875 @1708675921.39
564993.1875 @1708675930.39
565002.1875 @1708675939.39
565011.0625 @1708675948.39
565020.0625 @1708675957.39
565029 @1708675966.39

node_cpu_seconds_total{cpu="3", instance="sql", job="SERVER", mode="idle"}
284284.78125 @1708675912.39
284293.6875 @1708675921.39
284302.625 @1708675930.39
284311.5625 @1708675939.39
284320.5 @1708675948.39
284329.4375 @1708675957.39
284338.375 @1708675966.39

node_cpu_seconds_total{cpu="4", instance="sql", job="SERVER", mode="idle"}
4227419.5 @1708675912.39
4227428.5 @1708675921.39
4227437.5 @1708675930.39
4227446.5 @1708675939.39
4227455.5 @1708675948.39
4227464.5 @1708675957.39
4227473.5 @1708675966.39

node_cpu_seconds_total{cpu="5", instance="sql", job="SERVER", mode="idle"}
4024977.75 @1708675912.39
4024986.75 @1708675921.39
4024995 @1708675930.39
4025003.75 @1708675939.39
4025012.75 @1708675948.39
4025022 @1708675957.39
4025030.75 @1708675966.39

Daniel15 commented 3 months ago

I haven't forgotten about this. I've just been busy with my day job. I'll try to take a look later this week.

No0Vad commented 3 months ago

No worries.

I'm currently testing using a double instead of a float on this line: https://github.com/Daniel15/prometheus-net.SystemMetrics/blob/master/src/Prometheus.SystemMetrics/Collectors/WindowsCpuUsageCollector.cs#L90

So far it works fine, 2 days and counting.
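
For background on why float vs. double could matter: once a single-precision accumulator grows large enough, small additions fall below its precision and are silently lost. This is only one hypothesis for the stall (and the later comment shows the problem persisted even with double, so precision alone may not be the cause). An illustrative Python sketch, rounding through 32-bit floats with struct:

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python float through IEEE 754 single precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

big = float(2**25)               # 33_554_432.0; float32 spacing here is 4.0
print(to_f32(big + 1.0) == big)  # True: adding 1.0 is lost entirely in float32
```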

No0Vad commented 3 months ago

So far it works fine, 2 days and counting.

It looked good for a while, but the same thing happened today; stuck at 100 again. šŸ˜…

And take your time with this one; your work and life take priority!