Instrumental / instrumentald

Instrumental System and Service Daemon
MIT License
14 stars 3 forks source link

Updated system metric names to be prefixed with 'system.'. #20

Closed jason-o-matic closed 8 years ago

jason-o-matic commented 8 years ago

This overrides the "measurement" name for the system type metrics, and adds a new tag to the output template. This allows us to retain what was the old measurement name in it's old place by setting it explicitly using the new tag.

So for system metrics we have:

system.<host>.<explicitly_set_tag_thats_that_same_as_measurement>.<tags>.<field>

or:

system.localhost.cpu.cpu-totals.usage_user

For non system metrics:

<measurement>.<host>.<tag_that_isnt_set_so_disappears>.<tags>.<field>

or:

mysql.db-001.connections
jqr commented 8 years ago

👍

janxious commented 8 years ago

I think the expanded metric names are a good idea, but also 👍 does what it says on the box

janxious commented 8 years ago

Not sure why we're getting these kinds of things, but seeing weird metrics strings:

gauge system.hel_local.disk.hfs.-.total 499055067136 1469037390
gauge system.hel_local.disk.hfs.-.used 289301020672 1469037390
gauge system.hel_local.disk.hfs.-.used_percent 58.0002255952456 1469037390
gauge system.hel_local.disk.hfs.-.free 209491902464 1469037390
gauge system.hel_local.disk.hfs.-.inodes_free 51145484 1469037390
gauge system.hel_local.disk.hfs.-.inodes_total 121839614 1469037390
gauge system.hel_local.disk.hfs.-.inodes_used 70694130 1469037390
janxious commented 8 years ago

For cpu-total stats, woudl it make sense to drop the cpu-total part from the metric names? The metrics look like system.hel_local.cpu.cpu-total.<thing>

mediocretes commented 8 years ago

Re: cpu-total, I think we should drop it if possible, unless we're planning to deliver per-cpu stats, which seems like too much.

janxious commented 8 years ago

Not sure why we're getting these kinds of things, but seeing weird metrics strings

Figured this one out. It is trying to tag the partition with the data /, which w don't accept in our metric names.

janxious commented 8 years ago

As a PR later, I think it would reasonable to pull in configuration to disable system metrics entirely, and also to do some limiting for the network interfaces we report and disk mounts. My MBP, for example, reports like 10 NICs, even though I've only ever used two of them, and only one (the wifi) in my day-to-day life.

jqr commented 8 years ago

I realize it's outside the scope of this PR, but per-cpu stats would be awesome.

mediocretes commented 8 years ago

Well, if we're planning to go to per cpu stats, maybe we should leave cpu-total in, for namespacing purposes, even if it seems cumbersome now. I do not have a strong opinion on to which way to go on this one.

jason-o-matic commented 8 years ago

I definitely want per CPU stats at some point. Here's what total and per-cpu look like right now:

gauge charlie_test7.cpu.cpu-total.usage_user 4.8487878030492375 1469468425
gauge charlie_test7.cpu.cpu7.usage_user 0.4 1469468425

This means that if you enable both total and per-cpu then ts_sum(charlie_test7.cpu.*.usage_user) won't give you what you might expect because it would double count. You could use regexp negative matching to get what you want but that seems hard.

If we removed cpu-total from the total metrics then charlie_test7.cpu.? would be totals and charlie_test7.cpu.?.? would be per-cpu. I could also see renaming cpu-total to total-cpu or something so that it was easier to discriminate using .cpu*. and .total*.

What do you guys think?

tonydewan commented 8 years ago

Does it makes sense to drop cpu_total altogether since you can get it with the query language?

jason-o-matic commented 8 years ago

I've merged this since "it does what it says on the box" as Joel aptly put it, and it completes the todo.

I do think we should do something with cpu-total though, and am happy to discuss it further.

Does it makes sense to drop cpu_total altogether since you can get it with the query language?

This might depend on how per-cpu stats are reported. I haven't tested, but if each CPU is at 100% it might total to 800%, whereas cpu-total may max out at 100%.

Even if it is computable, however, I think I lean towards keeping cpu-total because it's probably more generally useful than per-cpu. Most of the time I've been satisfied with cpu-total and there have only been a few times where I wanted to know if one process was CPU limited. I'm guessing a little here, maybe once we try per-cpu I'll thinks it's way better all of the time and we can get rid of total.

janxious commented 8 years ago

I would suggest changing cpu-total -> total.

Then the matcher is easy ts_sum(charlie_test7.cpu.cpu?.usage_user). Also it makes a lot more sense to me?

jason-o-matic commented 8 years ago

I would suggest changing cpu-total -> total.

Then the matcher is easy ts_sum(charlie_test7.cpu.cpu?.usage_user). Also it makes a lot more sense to me?

I like that.

jason-o-matic commented 8 years ago

Before we made the call I wanted to make sure it was an easy change, and turns out it was, if we want to do it: https://github.com/Instrumental/instrumentald/commit/82576772df71