DRuggeri / nut_exporter

Network UPS Tools Prometheus Exporter

How to best handle the ups.status metric #5

Closed. DRuggeri closed this issue 3 years ago

DRuggeri commented 3 years ago

Continuing the conversation from #2, where @sshaikh said:

So... it turns out that "status" isn't a status, but a set of flags (at least for the USB driver - I'd assume it's the case elsewhere too):

https://github.com/networkupstools/nut/blob/2b4a105038723da0f93859029b665f44e6dc860b/drivers/usbhid-ups.c#L182

And the Nut clients know this:

https://github.com/networkupstools/nut/blob/master/clients/upsmon.c

So, yeah. Not sure what the best approach is. Technically the Go NUT client should handle these as separate flags too, but it might just be easier to set the status based on the string somehow.

I'll send this upstream to see what they think.

Wow... thanks for the research. That's far more complicated than I had anticipated.

I'm not sure what the most idiomatic way to handle this in prometheus would be. I could envision this being handled a few ways:

Example 1:

network_ups_tools_ups_status{OL="true", TRIM="true"} 1

I'm concerned this may lead to 'metric explosion' due to shifts in cardinality as statuses change... especially since there can be an arbitrary number of combined statuses.

Example 2:

network_ups_tools_ups_status{OL="true"} 1
network_ups_tools_ups_status{TRIM="true"} 1

This feels quite a bit better... and is the way I'm leaning at the moment.

The issue to examine: how should we handle changing values? Idiomatically, Prometheus would have us generate these metrics on the fly and just report on what's set. But... if status changes from OL TRIM to OB, what happens to the metric network_ups_tools_ups_status{OL="true"}? I think it will be marked stale, and a query for network_ups_tools_ups_status{OL="true"} > 0 will immediately return nothing. This needs to be verified, because if queries after the subsequent scrape were instead to keep returning all three series (OL="true", TRIM="true", and OB="true"), we'd have to wait for the stale metric to be dropped from the results (5 minutes). That's a long time for a UPS...
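For illustration, this is the two-scrape sequence in question, written in exposition format using the Example 2 labeling (the label scheme is only a sketch at this point):

# Scrape at time t0, while ups.status is "OL TRIM"
network_ups_tools_ups_status{OL="true"} 1
network_ups_tools_ups_status{TRIM="true"} 1

# Scrape at time t1, after ups.status changes to "OB"
# (the OL and TRIM series are simply no longer exposed)
network_ups_tools_ups_status{OB="true"} 1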

DRuggeri commented 3 years ago

Through some quick experimentation....

A metric set in a previous scrape is only marked stale, and continues being reported, if the target has not been scraped again. If the target is scraped again and that metric is no longer set, the query evaluates as you would expect.

To that end, I've modified my initial Example 2 above and landed on the following implementation:

If status = OL TRIM, then the exporter returns:

# HELP network_ups_tools_ups_status Value of the ups.status variable from Network UPS Tools
# TYPE network_ups_tools_ups_status gauge
network_ups_tools_ups_status{flag="OL"} 1
network_ups_tools_ups_status{flag="TRIM"} 1

I like this implementation MUCH better than trying to anticipate what statuses a driver may set and then manually coaxing them to an integer.

This makes it trivial to create an alert for network_ups_tools_ups_status{flag="OL"} != 1.

On the downside, my own monitoring solution will need some work. I had built alerting saying "if there is any change in status, set a warning alert" which is now much more difficult to implement since I cannot simply use the changes() function across two scrapes.

sshaikh commented 3 years ago

Great analysis. I think the root cause here is that there's no consistency between USB drivers so the most generic way to handle this will also be the most sparse. Your solution is the most elegant, as long as the reasonable assumption that space is the delimiter holds - let's hope that flag order doesn't matter ;).

This makes it trivial to create an alert for network_ups_tools_ups_status{flag="OL"} != 1.

Does this work if there is no metric with a flag="OL" label? Otherwise this is nice as you can also search on label patterns.

Other software just makes the decision to map a new status, as not all combinations exist or make sense (e.g. OL and OB will never happen at the same time, I think). They then just include another integer status code for OL TRIM, which can be included in alerts.

That leads us to another option - for the user of nut_exporter to provide a string->int mapping file that works best for them:

OL,0
OL TRIM,0
DEFAULT,100

And I suppose the original value can be put in as the label:

network_ups_tools_ups_status{status="OL TRIM"} 0
DRuggeri commented 3 years ago

Great analysis. I think the root cause here is that there's no consistency between USB drivers so the most generic way to handle this will also be the most sparse. Your solution is the most elegant, as long as the reasonable assumption that space is the delimiter holds - let's hope that flag order doesn't matter ;).

Haha - no doubt. The inconsistency is the issue. The NUT project makes a big point of the fact that each UPS differs in functionality - surely for this reason.

You are correct in stating that order doesn't matter since we're just splitting the string up and setting the flags that DO appear. It's funny you mention the assumption that space will be the delimiter... I had considered that maybe that could change some day, too (I've not observed a UPS with more than a single status flag set, so I'm still kinda in 'surprised' mode), but we'll cross that bridge if we ever get to it 😬

This makes it trivial to create an alert for network_ups_tools_ups_status{flag="OL"} != 1.

Does this work if there is no metric with a flag="OL" label? Otherwise this is nice as you can also search on label patterns.

I believe so, because at the time of evaluation by the Prometheus alerting rules, an empty data set being returned would not equal 1. I thought I had an example of this already being used in my alerting configurations, but couldn't find it. I'll have to test to be sure.

Other software just makes the decision to map a new status, as not all combinations exist or make sense (e.g. OL and OB will never happen at the same time, I think). They then just include another integer status code for OL TRIM, which can be included in alerts.

That leads us to another option - for the user of nut_exporter to provide a string->int mapping file that works best for them:

OL,0
OL TRIM,0
DEFAULT,100

Aye - that'd be a possible solution as well, but it feels like it would be asking the user to perform a lot more configuration than needed to have a working solution.

What I like most about this solution is that it allows the user to say something along the lines of, "Just tell me if I go on battery" or "Just tell me if the UPS is not 'nominal'". The combination of possible labels does complicate things a bit, but I am testing a few scenarios now to see if we could just rely on using the changes function to detect any/all status changes (which may be more noisy than most users want).
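To make the difficulty concrete, a change-detection expression along the lines being tested might look like this (the 10m window is arbitrary); the catch is that changes() only counts value changes on series that keep being exposed, so a flag that merely appears or disappears between scrapes does not register as a change:

changes(network_ups_tools_ups_status[10m]) > 0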

And I suppose the original value can be put in as the label:

network_ups_tools_ups_status{status="OL TRIM"} 0

Yes, I had considered this as well, but I had a few reservations with the idea:

This new functionality has been released in v2.0.0

DRuggeri commented 3 years ago

After testing, I found my assumption was incorrect. It's better to alert using the absent function if you just want to know that the OL status is not set. I've updated the README to include this info.
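For example, something along these lines as the alert expression (a sketch; the alerting rule that wraps it, with its duration and severity, is up to the user):

# fires when no series with flag="OL" is currently being exposed
absent(network_ups_tools_ups_status{flag="OL"})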

sshaikh commented 3 years ago

Hm, yes, that's the problem with these kinds of "transient", non-continuous metrics. How about guaranteeing a network_ups_tools_ups_status{flag="OL"} series, i.e. setting it to 0 if absent? The easy way would be to treat OL as a special case (perhaps alongside other interesting flags), but if you don't want to hardcode, you could keep a list of "seen" flags so that once a flag is published, it always would be.
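For illustration, if OL were guaranteed, a scrape taken while on battery might look like this (the 0-valued OL line is the hypothetical part; OB is just an example of another flag that happens to be set):

network_ups_tools_ups_status{flag="OL"} 0
network_ups_tools_ups_status{flag="OB"} 1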

DRuggeri commented 3 years ago

Yeah... that's a slippery slope that puts us back in the realm of guessing/knowing what the driver will expose and hard-coding a constant export of those metrics. I think the Prometheus absent function is the best bet for detecting when a flag is not observed during the scrape.

sshaikh commented 3 years ago

It just makes it a little difficult to track multiple UPSes, as absent would not raise an alert if one out of your ten monitored UPSes goes down - unless you also specify the names of the UPSes (job, instance, whatever) expected in the alerts.

It also makes it difficult to track when NUT/nut_exporter goes down, as a lack of a scrape would look the same as an absent OL - again, unless you count all the statuses present per UPS.
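To sketch that last point (PromQL only; the aggregation by instance assumes one UPS per scrape target, which won't hold if a single exporter serves several UPSes):

# nothing exposed at all: likely NUT or nut_exporter itself is down
absent(network_ups_tools_ups_status)

# number of status flags currently exposed per scrape target
count by (instance) (network_ups_tools_ups_status)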

I only have two UPSes, so I'll probably go the named route for now.

EDIT: Just in case it wasn't clear, this is far preferable to failing on OL TRIM!

DRuggeri commented 3 years ago

Indeed - the more I test/experiment, the more I'm realizing that having a constant 0 for known statuses would be beneficial. I'll experiment some more and make the set of 'always exported' statuses a user-definable parameter with a default of common statuses.

sshaikh commented 3 years ago

Some reassurance about splitting on space:

https://alioth-lists.debian.net/pipermail/nut-upsuser/2020-December/012222.html

DRuggeri commented 3 years ago

I've just released v2.1.0 with the new nut.statuses parameter. I think we should be all set at this point!

sshaikh commented 3 years ago

LGTM. Thanks for the attention!