Netflix / vector

Vector is an on-host performance monitoring framework which exposes hand picked high resolution metrics to every engineer’s browser.
http://getvector.io/
Apache License 2.0

take units metadata into account on utilization calculations #82

Open hassanbabaie opened 9 years ago

hassanbabaie commented 9 years ago

When using Vector to report on Linux machines I'm seeing reasonable CPU Utilization Stats.

However, when running against a Windows Server running PCP Glider, the CPU widgets (CPU Utilization and Per-CPU Utilization) are off the charts, running into the thousands of percent (e.g. cpu0 at 3150%). Other stats, such as Disk IO, look right.

I've checked using the PCP Charts tools on the local Windows Server to see whether PCP is reporting bad numbers to Vector, but it seems to be reading the utilization correctly.

FYI, I'm running on VMware ESXi 5 and the host is an MS Windows Server 2003 R2 machine with SP2.

Update: I'm running v1.0.1 on Apache with the latest Vector distribution tarball on Bintray.

natoscott commented 9 years ago

The problem will be that the Windows kernel is exporting these metrics in units of microseconds (PM_TIME_USEC from pmapi.h - "microsec" in the "units" field of the webapi response), whereas the Linux kernel exports them in units of milliseconds. You can check this via "pminfo -d kernel.all.cpu.idle", for example, on each platform; the descriptor it prints includes the units.
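To illustrate why a units mismatch of this kind produces readings a thousand times too large, here is a minimal sketch. It is not Vector's actual calculation: `utilizationPercent`, the one-second interval, and the counter delta are all made up for illustration, with the delta chosen so that the result matches the 3150% figure reported above.

```javascript
// Hypothetical helper (not from Vector): percent utilization from the
// change in a CPU-time counter between two samples. It assumes the
// counter delta is expressed in milliseconds.
function utilizationPercent(counterDeltaMs, intervalMs) {
  return (counterDeltaMs / intervalMs) * 100;
}

const intervalMs = 1000;        // two samples taken one second apart
const busyMicroseconds = 31500; // made-up counter delta, in microseconds

// Misreading the microsecond value as milliseconds inflates the result 1000x.
const wrong = utilizationPercent(busyMicroseconds, intervalMs);

// Converting microseconds to milliseconds first gives a plausible figure.
const right = utilizationPercent(busyMicroseconds / 1000, intervalMs);

console.log("unscaled:", wrong, "% scaled:", right, "%");
```

The unscaled figure lands in the thousands of percent, while the scaled one is a few percent, which is consistent with the symptom seen only on Windows.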

I think the fix may involve Vector doing scaling based on the "units" field returned in the JSON pmwebd responses?
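A rough sketch of what such scaling could look like follows. Everything here is an assumption rather than existing Vector code: the `toMilliseconds` helper is hypothetical, and the unit strings other than "microsec" (which is quoted above) are guesses that should be checked against real pmwebd responses.

```javascript
// Conversion factors to milliseconds, keyed by the "units" string.
// ASSUMPTION: only "microsec" is confirmed by this thread; the other
// keys are guesses at how pmwebd spells the remaining time units.
const MS_PER_UNIT = new Map([
  ["nanosec", 1e-6],
  ["microsec", 1e-3],
  ["millisec", 1],
  ["sec", 1000],
]);

// Hypothetical helper: normalize a time-valued metric to milliseconds
// so the existing utilization math can stay unchanged.
function toMilliseconds(value, units) {
  const factor = MS_PER_UNIT.get(units);
  if (factor === undefined) {
    // Fail loudly rather than silently charting a mis-scaled value.
    throw new Error("unsupported time units: " + units);
  }
  return value * factor;
}

console.log(toMilliseconds(31500, "microsec")); // Windows-style value
console.log(toMilliseconds(250, "millisec"));   // Linux-style value
```

Normalizing at the point where the response is parsed would keep the widget code platform-agnostic; throwing on an unknown unit is one possible design choice, and falling back to the raw value with a warning would be another.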

hassanbabaie commented 9 years ago

Hi Nathan, thanks for the quick reply... Now before I write anymore I should point out I'm a rubbish coder!

I had a look at what you said and yes, I see that on the Windows Servers the kernel.all.cpu.idle metric is being reported back differently compared to, say, kernel.all.cpu.sys or kernel.all.cpu.user on the same Windows system. That said, when I look at the code for the widget, e.g. cpuUtilizationMetric.datamodel.js, I see it pulling .sys and .user and applying the related multiplier along with the .ncpu metric, but not the .idle one.

I'll keep looking and will have a go at hacking a fix as you mentioned, that is unless someone works out the update before me.

Thanks again

Hass

spiermar commented 9 years ago

Never took that into account since we don't use Windows at all. @natoscott's suggestion seems to be the right way of doing that. I'll keep the issue open, but I'm not sure when I'll have the time to look into it. Accepting pull requests though! :-)