conradludgate / measured

A better metrics crate

Add `Exemplar` support #4

Open · clux opened 4 months ago

clux commented 4 months ago

Allow users to chain exemplar metadata onto metrics so we can get nice metric-to-trace links in Grafana.

Existing Solutions

I have tried this in prometheus_client, and their solution was pretty easy to work with.

Requirements

OpenMetrics format. Since we have a text encoder, this can be pretty easy to retrofit in its naive form. prometheus_client just slaps the OpenMetrics content-type header on its text format whenever exemplars are used, and always makes sure the output stays within the OpenMetrics text spec (as far as I can tell, that works).
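For concreteness, an exemplar in the OpenMetrics text format is just a `# {labels} value [timestamp]` suffix on a counter `_total` or histogram bucket line, served under the `application/openmetrics-text; version=1.0.0; charset=utf-8` content type. Roughly like this (metric name and trace ids are made up for illustration):

```
# TYPE foo histogram
foo_bucket{le="0.1"} 8 # {} 0.054
foo_bucket{le="1"} 11 # {trace_id="KOO5S4vxi0o"} 0.67
foo_bucket{le="10"} 17 # {trace_id="oHg5SJYRHA0"} 9.8 1520879607.789
foo_bucket{le="+Inf"} 17
foo_count 17
foo_sum 324789.3
# EOF
```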

The format has certain verbosity disadvantages (extra mandatory comments, as far as I can tell), so ideally this should all be done via the protobuf encoding down the road, but that's a can of worms in its own right. Maybe the text format is a nice stepping stone anyway?

Ideas

I suspect the generics used by prometheus_client might be a bit overkill? If we traitify the Exemplar type, then we could in theory just have Histogram::observe_with_exemplar and Counter::observe_with_exemplar take an extra impl Exemplar argument without it affecting the type name or the number of generic parameters.
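Something like this is the shape I have in mind (just a sketch; `Exemplar::visit_labels`, `TraceId`, and the stand-in `Histogram` are made-up names, not measured's actual types):

```rust
/// Hypothetical trait: anything that can yield a small exemplar label set,
/// e.g. a trace id pulled from the current tracing span.
pub trait Exemplar {
    /// Visit each exemplar label as a (name, value) pair.
    fn visit_labels(&self, f: &mut dyn FnMut(&str, &str));
}

/// The common case: attach the current trace id.
pub struct TraceId<'a>(pub &'a str);

impl Exemplar for TraceId<'_> {
    fn visit_labels(&self, f: &mut dyn FnMut(&str, &str)) {
        f("trace_id", self.0);
    }
}

/// Stand-in histogram type, just to show the method shape.
pub struct Histogram;

impl Histogram {
    pub fn observe(&self, value: f64) {
        // normal observation path
        let _ = value;
    }

    /// Extra `impl Exemplar` argument; no extra generic parameter on the type.
    pub fn observe_with_exemplar(&self, value: f64, exemplar: impl Exemplar) {
        self.observe(value);
        exemplar.visit_labels(&mut |name, val| {
            // stash (name, val) alongside whichever bucket the value landed in
            let _ = (name, val);
        });
    }
}
```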

We only need to store the last-seen exemplar instance (at most one per unique label set), so it's almost like chaining a trace-id-valued gauge onto a counter instance. Maybe an extra Option<RawExemplar>-typed field on the raw CounterState/HistogramState types (as this costs nothing when unused).
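On the storage side, roughly this (again made-up names, not measured's internals):

```rust
use std::sync::atomic::AtomicU64;

/// Made-up stand-in for the stored exemplar data: its label pairs and the
/// value it was recorded against (a timestamp could be added later).
pub struct RawExemplar {
    pub labels: Vec<(String, String)>,
    pub value: f64,
}

/// Sketch of per-label-set counter state with the exemplar bolted on.
/// Only the most recently seen exemplar is kept, like a gauge of trace ids.
pub struct CounterState {
    pub count: AtomicU64,
    /// `None` until an exemplar has been recorded for this label set.
    /// In practice this field needs some interior mutability (a small lock
    /// or atomic swap) so it can be updated through a shared reference.
    pub exemplar: Option<RawExemplar>,
}
```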

Sorry, lots of words and links. Happy to try to help with this. I tried before in upstream tikv/prometheus (2 years ago), but that crate is massive and complicated.

conradludgate commented 4 months ago

I think I will do a similar setup to prometheus-client and use the LabelSet functionality to encode the values. I'm trying to think about performance: if the lock on the exemplars is already taken, I will not write any value and just skip it (given it's locked, the currently written exemplar is already considered fresh).
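Roughly the hot-path behaviour I have in mind (a sketch with placeholder types, using `std::sync::Mutex::try_lock`; the real implementation may use a different lock):

```rust
use std::sync::Mutex;

/// Placeholder for whatever encoded form the exemplar ends up in.
pub struct ExemplarSlot {
    pub labels: Vec<(String, String)>,
    pub value: f64,
}

/// Best-effort exemplar recording: if another thread already holds the lock,
/// the stored exemplar is considered fresh enough, so skip instead of blocking.
pub fn record_exemplar(
    slot: &Mutex<Option<ExemplarSlot>>,
    labels: Vec<(String, String)>,
    value: f64,
) {
    if let Ok(mut guard) = slot.try_lock() {
        *guard = Some(ExemplarSlot { labels, value });
    }
    // contended: drop this exemplar; the metric value itself is still recorded
}
```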

conradludgate commented 4 months ago

prometheus_client doesn't support the additional timestamp feature of exemplars. Is that something that is commonly used or safe to ignore?

clux commented 4 months ago

> prometheus_client doesn't support the additional timestamp feature of exemplars. Is that something that is commonly used or safe to ignore?

I assume it's unimportant to include. It gets correlated with the timestamp of the scrape cycle, judging by the presentation in Grafana, so it will be at most 15s to 1m off (and the trace capture should have more accurate timing information anyway).

conradludgate commented 4 months ago

VictoriaMetrics rejected the request to add exemplars (https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1169), citing that Prometheus/Grafana seem to have very spotty support for exemplars. Unsure how I feel about that.

clux commented 4 months ago

Hm, yeah, I don't fully agree with the Grafana issues in their reasoning. It certainly works for more than histogram_quantile. The noisy-graphs point partly feels like a panel UX choice about whether or not to include data points in cluttered visualisations? Both histogram and heatmap type diagrams already benefit from reducing the number of data points to make the graphs less noisy.

I can test it some more with regular counters if that's helpful; the VM comment is pre-Grafana 11 at the very least.

> Prometheus still has only experimental support for exemplars after more than three years since they were introduced. It stores exemplars in memory, so they are lost after Prometheus restart. This doesn't look like production-ready feature.

Prometheus does seem to conflate experimental features and features with breaking semantic changes on the same page, and exemplars are not marked as experimental on the Prometheus features page, but the exemplar query API is. Not sure what to make of this. I didn't know they got wiped on restart; need to verify this. (EDIT: yes, this happens on 2.52 at least.)

Aside: the Prometheus ecosystem does seem to move extremely slowly in general because it is so distributed and usage patterns have calcified (it took ~a year just for keep_firing_for to propagate down into the operator + Thanos, for instance, and it broke all the linters).