flant / shell-operator

Shell-operator is a tool for running event-driven scripts in a Kubernetes cluster
https://flant.github.io/shell-operator/
Apache License 2.0

Lost metrics/wrong labels when same metric name has different labels #616

Closed redbaron closed 6 days ago

redbaron commented 1 month ago

Expected behavior (what you expected to happen):

Each metric produced by a hook should have exactly the same labels as reported by the script.

Actual behavior (what actually happened):

It looks like shell-operator expects all metrics with the same name to have a common set of labels. When the label set differs between metric instances, wrong labels are reported and some metrics can be omitted entirely.

Steps to reproduce:

#!/usr/bin/env bash

if [[ ${1:-} == "--config" ]] ; then
  cat <<EOF
configVersion: v1
onStartup: 1
EOF
else
  jq -nc '{group:"g",name: "m1", action:"set", value: 1, labels: {a: "A"}}' >>"$METRICS_PATH"
  jq -nc '{group:"g",name: "m1", action:"set", value: 2, labels: {c: "C"}}' >>"$METRICS_PATH"
  jq -nc '{group:"g",name: "m1", action:"set", value: 3, labels: {a: "A", b:"B"}}' >>"$METRICS_PATH"
  jq -nc '{group:"g",name: "m2", action:"set", value: 1, labels: {a: "A1"}}' >>"$METRICS_PATH"
  jq -nc '{group:"g",name: "m2", action:"set", value: 2, labels: {c: "C2"}}' >>"$METRICS_PATH"
  jq -nc '{group:"g",name: "m2", action:"set", value: 3, labels: {a: "A3", b:"B3"}}' >>"$METRICS_PATH"
fi

produces:

# HELP m1 m1
# TYPE m1 gauge
m1{a="",hook="metrics.sh"} 2
m1{a="A",hook="metrics.sh"} 3
# HELP m2 m2
# TYPE m2 gauge
m2{a="",hook="metrics.sh"} 2
m2{a="A1",hook="metrics.sh"} 1
m2{a="A3",hook="metrics.sh"} 3

Environment:

nabokihms commented 2 weeks ago

The problem is confirmed. I reproduced it with the following test:

import (
    "bytes"
    "strings"
    "testing"

    . "github.com/onsi/gomega"
    "github.com/prometheus/client_golang/prometheus"
    promtest "github.com/prometheus/client_golang/prometheus/testutil"

    log "github.com/sirupsen/logrus" // assumed: the vault logs via logrus
)

// The test lives in the vault package, so NewGroupedVault needs no import.
func Test_DifferentLabels(t *testing.T) {
    g := NewWithT(t)

    buf := &bytes.Buffer{}
    log.SetOutput(buf)

    v := NewGroupedVault()
    v.Registerer = prometheus.DefaultRegisterer

    /*
        https://github.com/flant/shell-operator/issues/616

        jq -nc '{group:"g",name: "m1", action:"set", value: 1, labels: {a: "A"}}' >>"$METRICS_PATH"
        jq -nc '{group:"g",name: "m1", action:"set", value: 2, labels: {c: "C"}}' >>"$METRICS_PATH"
        jq -nc '{group:"g",name: "m1", action:"set", value: 3, labels: {a: "A", b:"B"}}' >>"$METRICS_PATH"
        jq -nc '{group:"g",name: "m2", action:"set", value: 1, labels: {a: "A1"}}' >>"$METRICS_PATH"
        jq -nc '{group:"g",name: "m2", action:"set", value: 2, labels: {c: "C2"}}' >>"$METRICS_PATH"
        jq -nc '{group:"g",name: "m2", action:"set", value: 3, labels: {a: "A3", b:"B3"}}' >>"$METRICS_PATH"
    */

    v.GaugeSet("g", "m1", 1.0, map[string]string{"a": "A"})
    v.GaugeSet("g", "m1", 2.0, map[string]string{"c": "C"})
    v.GaugeSet("g", "m1", 3.0, map[string]string{"a": "A", "b": "B"})
    v.GaugeSet("g", "m2", 1.0, map[string]string{"a": "A1"})
    v.GaugeSet("g", "m2", 2.0, map[string]string{"c": "C2"})
    v.GaugeSet("g", "m2", 3.0, map[string]string{"a": "A3", "b": "B3"})

    g.Expect(buf.String()).ShouldNot(ContainSubstring("error"), "error occurred in log: %s", buf.String())

    expect := `
# HELP m1 m1
# TYPE m1 gauge
m1{a=""} 2
m1{a="A"} 3
# HELP m2 m2
# TYPE m2 gauge
m2{a=""} 2
m2{a="A1"} 1
m2{a="A3"} 3
`
    err := promtest.GatherAndCompare(prometheus.DefaultGatherer, strings.NewReader(expect), "m1", "m2")
    g.Expect(err).ShouldNot(HaveOccurred())
}
nabokihms commented 2 weeks ago

So, it happens because of how shell-operator creates new gauge collectors. It uses a single method both for getting and for creating them: https://github.com/flant/shell-operator/blob/c07eb4ccd58cfe0cf647c2f073a32f474182e53f/pkg/metric_storage/vault/vault.go#L51-L66

labelNames are always taken from the first operation's labels. It seems that, before applying an operation, shell-operator has to compute the full set of label names for the metric.
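
To make the failure mode concrete, here is a minimal sketch using plain client_golang, not the vault code itself. It assumes, based on the output above, that the vault drops label names the registered vec doesn't know and fills missing known names with empty strings:

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	reg := prometheus.NewRegistry()

	// The vec's label names are fixed at creation time from the FIRST
	// operation's labels ({a: "A"}); every later operation must use
	// exactly this set of names.
	gv := prometheus.NewGaugeVec(
		prometheus.GaugeOpts{Name: "m1", Help: "m1"},
		[]string{"a"},
	)
	reg.MustRegister(gv)

	gv.With(prometheus.Labels{"a": "A"}).Set(1) // op 1: {a: "A"}, stored correctly

	// op 2 carried {c: "C"}: "c" is unknown to the vec, so it is lost,
	// and "a" defaults to "" -> the sample becomes m1{a=""} 2.
	gv.With(prometheus.Labels{"a": ""}).Set(2)

	// op 3 carried {a: "A", b: "B"}: "b" is lost, and the remaining
	// {a: "A"} collides with op 1's series, overwriting value 1 with 3.
	gv.With(prometheus.Labels{"a": "A"}).Set(3)

	mfs, _ := reg.Gather()
	for _, mf := range mfs {
		fmt.Println(mf.String())
	}
}

Running this leaves exactly the two surviving series from the issue's output, m1{a=""} 2 and m1{a="A"} 3.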

miklezzzz commented 1 week ago

I did some research. Our documentation states: "The metric name is used as-is, so several hooks can export the same metric name. It is the responsibility of hooks' developers to maintain consistent label cardinality." According to that, providing a set of metrics such as

jq -nc '{group:"g",name: "m1", action:"set", value: 1, labels: {a: "A"}}' >>"$METRICS_PATH"
jq -nc '{group:"g",name: "m1", action:"set", value: 2, labels: {c: "C"}}' >>"$METRICS_PATH"
jq -nc '{group:"g",name: "m1", action:"set", value: 3, labels: {a: "A", b:"B"}}' >>"$METRICS_PATH"

is rather incorrect. Am I interpreting the documentation correctly? We could update metrics operations so that they transform the aforementioned metrics into something like this on the user's behalf:

jq -nc '{group:"g",name: "m1", action:"set", value: 1, labels: {a: "A", b: "", c: ""}}' >>"$METRICS_PATH"
jq -nc '{group:"g",name: "m1", action:"set", value: 2, labels: {a: "", b:"", c: "C"}}' >>"$METRICS_PATH"
jq -nc '{group:"g",name: "m1", action:"set", value: 3, labels: {a: "A", b:"B", c: ""}}' >>"$METRICS_PATH"

but metrics operations are processed on a per-hook basis, so if two or more hooks export the same metric with different label sets, we can't aggregate the labels across all hooks.
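
For illustration, a rough sketch of that per-hook transformation; metricOp and padLabels are hypothetical names, not shell-operator's actual types:

package main

import "fmt"

// metricOp is a minimal stand-in for shell-operator's metric operation;
// only the Labels field matters for this sketch.
type metricOp struct {
	Name   string
	Labels map[string]string
}

// padLabels collects the union of label names across one hook's
// operations, then pads each operation with "" for the names it lacks,
// so every operation carries an identical label set.
func padLabels(ops []metricOp) {
	names := map[string]struct{}{}
	for _, op := range ops {
		for k := range op.Labels {
			names[k] = struct{}{}
		}
	}
	for _, op := range ops {
		for k := range names {
			if _, ok := op.Labels[k]; !ok {
				op.Labels[k] = ""
			}
		}
	}
}

func main() {
	ops := []metricOp{
		{Name: "m1", Labels: map[string]string{"a": "A"}},
		{Name: "m1", Labels: map[string]string{"c": "C"}},
		{Name: "m1", Labels: map[string]string{"a": "A", "b": "B"}},
	}
	padLabels(ops)
	for _, op := range ops {
		fmt.Println(op.Name, op.Labels) // each op now has labels a, b, c
	}
}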

redbaron commented 1 week ago

The metric name is used as-is, so several hooks can export the same metric name. It is responsibility of hooks‘ developer to maintain consistent label cardinality.

There is no inherent requirement in Prometheus for a given metric name to have the same label set, so where is this constraint coming from?
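
For reference, the text exposition format itself accepts samples of one metric family with differing label sets, so the output the hook intended would scrape fine:

# HELP m1 m1
# TYPE m1 gauge
m1{a="A"} 1
m1{c="C"} 2
m1{a="A",b="B"} 3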

I guess if it is documented to be like that, then it is more of a feature request than a bug.

miklezzzz commented 1 week ago

I found the following guidance about metric labels:

"Client libraries MUST NOT allow users to have different label names for the same metric for Gauge/Counter/Summary/Histogram or any other Collector offered by the library."

"Metrics from custom collectors should almost always have consistent label names. As there are still rare but valid use cases where this is not the case, client libraries should not verify this."

Both are from https://prometheus.io/docs/instrumenting/writing_clientlibs/#labels, but I'm not sure how to interpret them correctly.

miklezzzz commented 1 week ago

@redbaron what do you think about dealing with the problem this way? https://github.com/flant/shell-operator/blob/521f98cc7615806e202c2ac69e65edcc7a2d2eec/pkg/metric_storage/vault/vault_test.go#L106-L139

would it suit you?

redbaron commented 1 week ago

As long as label values are not lost like before, it is an improvement.