Open clux opened 4 months ago
I think I will do a similar setup to prometheus-client. Use the LabelSet functionality to encode the values. I'm trying to think about performance. If the lock on the exemplars is taken, I will not write any value and just skip it (given it's locked, the currently written exemplar is already considered fresh).
prometheus_client doesn't support the additional timestamp feature of exemplars. Is that something that is commonly used or safe to ignore?
prometheus_client doesn't support the additional timestamp feature of exemplars. Is that something that is commonly used or safe to ignore?
i assume it's unimportant to include. it gets correlated with the date of the scrape cycle judging by presentation on grafana, so it will be at most 15s to 1m off (and the trace capture should have more accurate information).
VictoriaMetrics rejected the request to add exemplars https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1169 - quoting that prometheus/grafana seems to have very spotty support for exemplars. Unsure how I think about that.
Hm, yeah, I don't fully agree with the grafana issues in their reasoning. it certainly works for more than histogram_quantile
. Noisy graphs partially feels like a panel UX choice as to whether or not to include datapoints in cluttered visualisations? Both histograms / heatmap type diagrams already benefit from reducing the number of data points to make these graphs less noisy.
I can test it some more with regular counters if that's helpful, the VM comment is pre-grafana 11 at the very least.
Prometheus still has only experimental support for exemplars after more than three years since they were introduced. It stores exemplars in memory, so they are lost after Prometheus restart. This doesn't look like production-ready feature.
Prometheus does seem to conflate experimental features and features with breaking semantic changes together in the same page, and exemplars does not specify they are experimental on prometheus features, but the exemplar query api does. Not sure what to make of this. Didn't know they got wiped on restart. Need to verify this. (EDIT: yes, happens on 2.52 at least).
Aside: the prometheus ecosystem does seem to move extremely slowly in general because the ecosystem is so distributed and usage patterns have calcified (took ~a year just for keep_firing_for
got propagated down into the operator + thanos for instance, and it broke all the linters).
Allow users chaining exemplar metadata onto metrics so we can get nice metric-to-trace links in grafana.
Existing Solutions
Have tried this in
prometheus_client
and their solution was pretty easy;Histogram
withHistogramWithExemplars
:: e.g. https://github.com/kube-rs/controller-rs/blob/6609fcf4db14eabf39d920ad999ad1e4d2578f8a/src/metrics.rs#L50TraceId
Counter::inc_by
/HistogramWithExemplar::observe
datasources::exemplarTraceIdDestinations
Requirements
OpenMetrics format. Since we have a text-encoder, in the naive form it can be pretty easy to retrofit. In the same way as
prometheus_client
just slaps the openmetrics header value on its text format whenever they use exemplars and just always make sure they stay within the openmetrics text spec (afaikt - it works).The format has certain verbosity disadvantages (extra mandatory comments afaikt), so ideally this should all be done via the protobuf encoding down the road, but that's a can of worms in its own right. But it's maybe the text format is a nice stepping stone anyway?
Ideas
I suspect the generics used by
prometheus_client
might be a bit overkill? If we traitify theExemplar
type then we could in theory just haveHistogram::observe_with_exemplar
andCounter::observe_with_exemplar
taking an extraimpl Exemplar
arg without it affecting the type name / generic number.We only store need to store the last seen example-instance (at most one per unique label set) so it's almost like chaining a trace-id valued gauge onto a counter instance. Maybe an extra
Option<RawExemplar>
type field on the rawCounterState/HistogramState
types (as this is zero-byte when unused).Sorry, lots of words and links. Happy to try to help with this. I tried before in upstream tikv/prometheus (2 years ago), but that crate is massive and complicated.