Open mpilman opened 5 years ago
@mpilman Did you mean `*Metrics` TraceEvents, the metrics reported in `status json`, or both? Is this instead better served by publishing guidance on what to monitor rather than publishing documentation of the metrics and leaving how to translate that to monitoring to each user?
I think at the time there was some discussion about documenting the `*Metrics` trace events. Some of the fields from these events have derivatives that show up in status json, but many of them don't.
It may or may not make sense to actively monitor all of the metrics from these trace event fields, but they do tend to be pretty valuable in filling out the picture of what's happening in a cluster when it starts reporting problems. I'd recommend that anybody operating a cluster be familiar with what these metrics events offer, and they could then choose which ones are most relevant to their situation to monitor more closely. Given that, documentation of these metrics would probably be pretty helpful.
There are also some ideas elaborated on in other issues about making more of these metrics consumable without parsing trace events, which may make it easier to frame it like you suggest, @alexmiller-apple.
Sorry for my late reply - I am finally back from my vacation 😄
A.J. already said pretty much everything there is to say. Basically, if we have production issues (or if we make infrastructure changes that we believe will help performance or cost), we might be interested in some specific metrics that we don't really monitor (some recent examples are cache hit rate for disk and `io_submit` latencies).
Currently the way we do this is by looking into the code and figuring out manually what the metrics actually report or whether we can find something useful. This has three major problems:
I hadn't seen this issue before I did this, but I thought a full list of all trace event metrics with descriptions might be useful, so I've put together a wiki page that lists all trace event metrics, an example value for each, and an empty field for a description. The way I see this working is that developers/operators will add descriptions when they do not find them on the wiki page, sort of like a read-through cache.
I created this list by scraping the logs I had on a longish-running cluster, so it may not be complete. I'll have to grep the code to see if I missed any.
The wiki page can be found here: https://github.com/apple/foundationdb/wiki/FoundationDB-Trace-Metrics-Definitions
Update: It's definitely not complete; I'll have to get a larger set of logs and re-run the script I used to create the wiki page. It might also be useful to order the trace events by frequency rather than alphabetically. I'll have a think.
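For anyone who wants to reproduce or extend the list, the scraping approach described above could look roughly like the sketch below. This is not the script actually used for the wiki page. It assumes trace logs in the XML-style format, one `<Event ... />` element per line with a `Type` attribute, and the function names (`collect_metrics_fields`, `to_markdown`) are made up for illustration. It gathers every field seen on events whose type ends in `Metrics`, keeps the first value seen as the example, and orders event types by frequency.

```python
import re
from collections import Counter

# Assumption: each trace event is a single XML-style line such as
#   <Event Severity="10" Time="1.0" Type="ProcessMetrics" CPUSeconds="0.5" />
# Adjust the parsing if your logs use a different format (e.g. JSON).
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def collect_metrics_fields(lines):
    """Return ({event_type: {field: example_value}}, Counter of event types)."""
    fields = {}
    counts = Counter()
    for line in lines:
        if not line.lstrip().startswith("<Event"):
            continue  # skip headers, footers, and blank lines
        attrs = dict(ATTR_RE.findall(line))
        event_type = attrs.pop("Type", "")
        if not event_type.endswith("Metrics"):
            continue  # only the *Metrics events are of interest here
        counts[event_type] += 1
        seen = fields.setdefault(event_type, {})
        for name, value in attrs.items():
            seen.setdefault(name, value)  # keep the first value as the example
    return fields, counts


def to_markdown(fields, counts):
    """Render one table per event type, most frequent event type first."""
    out = []
    for event_type, _ in counts.most_common():
        out.append(f"### {event_type}")
        out.append("| Field | Example | Description |")
        out.append("| --- | --- | --- |")
        for name, value in sorted(fields[event_type].items()):
            out.append(f"| {name} | {value} | |")  # description left empty
    return "\n".join(out)
```

To use it, pass an open trace file (or several chained together) to `collect_metrics_fields` and print the result of `to_markdown`. Fields that never appear in the sampled logs will still be missing, so grepping the code remains necessary for completeness.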
I think we need a documentation page for all the metrics that we emit. We expect people to monitor them, but it is sometimes very hard to figure out what they actually mean.