Closed: redbaron closed this issue 6 days ago
The problem is confirmed. I reproduced it with the following test:
```go
// Assumed imports for this test (gomega, client_golang and its testutil package):
import (
	"bytes"
	"log"
	"strings"
	"testing"

	. "github.com/onsi/gomega"
	"github.com/prometheus/client_golang/prometheus"
	promtest "github.com/prometheus/client_golang/prometheus/testutil"
)

func Test_DifferentLabels(t *testing.T) {
	g := NewWithT(t)

	buf := &bytes.Buffer{}
	log.SetOutput(buf)

	v := NewGroupedVault()
	v.Registerer = prometheus.DefaultRegisterer

	/*
		https://github.com/flant/shell-operator/issues/616
		jq -nc '{group:"g",name: "m1", action:"set", value: 1, labels: {a: "A"}}' >>"$METRICS_PATH"
		jq -nc '{group:"g",name: "m1", action:"set", value: 2, labels: {c: "C"}}' >>"$METRICS_PATH"
		jq -nc '{group:"g",name: "m1", action:"set", value: 3, labels: {a: "A", b:"B"}}' >>"$METRICS_PATH"
		jq -nc '{group:"g",name: "m2", action:"set", value: 1, labels: {a: "A1"}}' >>"$METRICS_PATH"
		jq -nc '{group:"g",name: "m2", action:"set", value: 2, labels: {c: "C2"}}' >>"$METRICS_PATH"
		jq -nc '{group:"g",name: "m2", action:"set", value: 3, labels: {a: "A3", b:"B3"}}' >>"$METRICS_PATH"
	*/
	v.GaugeSet("g", "m1", 1.0, map[string]string{"a": "A"})
	v.GaugeSet("g", "m1", 2.0, map[string]string{"c": "C"})
	v.GaugeSet("g", "m1", 3.0, map[string]string{"a": "A", "b": "B"})
	v.GaugeSet("g", "m2", 1.0, map[string]string{"a": "A1"})
	v.GaugeSet("g", "m2", 2.0, map[string]string{"c": "C2"})
	v.GaugeSet("g", "m2", 3.0, map[string]string{"a": "A3", "b": "B3"})

	g.Expect(buf.String()).ShouldNot(ContainSubstring("error"), "error occurred in log: %s", buf.String())

	expect := `
# HELP m1 m1
# TYPE m1 gauge
m1{a=""} 2
m1{a="A"} 3
# HELP m2 m2
# TYPE m2 gauge
m2{a=""} 2
m2{a="A1"} 1
m2{a="A3"} 3
`
	err := promtest.GatherAndCompare(prometheus.DefaultGatherer, strings.NewReader(expect), "m1", "m2")
	g.Expect(err).ShouldNot(HaveOccurred())
}
```
So, it happens because of how shell-operator creates new gauge collectors. It uses a single method for both getting and creating a collector: https://github.com/flant/shell-operator/blob/c07eb4ccd58cfe0cf647c2f073a32f474182e53f/pkg/metric_storage/vault/vault.go#L51-L66
The `labelNames` are always taken from the first operation. It seems that before applying an operation, shell-operator has to calculate the full set of label names for the metric.
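The failure mode can be sketched without any dependencies. The `vault` and `GaugeSet` below are simplified stand-ins, not the actual shell-operator implementation: the first operation for a metric name pins its label names, and later operations are projected onto that set, so unknown labels are silently dropped and missing ones collapse to `""`.

```go
package main

import (
	"fmt"
	"sort"
)

// vault is a hypothetical, simplified model of the grouped vault:
// the first operation for a metric name fixes the label names used
// for every later operation on that metric.
type vault struct {
	labelNames map[string][]string // metric name -> label names from the first op
	series     map[string]float64  // `name{k="v",...}` -> value
}

func newVault() *vault {
	return &vault{labelNames: map[string][]string{}, series: map[string]float64{}}
}

func (v *vault) GaugeSet(name string, value float64, labels map[string]string) {
	names, ok := v.labelNames[name]
	if !ok {
		// First operation wins: its label names become the metric's schema.
		for k := range labels {
			names = append(names, k)
		}
		sort.Strings(names)
		v.labelNames[name] = names
	}
	// Later operations are projected onto the pinned names; unknown
	// labels are silently dropped, missing ones become "".
	key := name + "{"
	for i, n := range names {
		if i > 0 {
			key += ","
		}
		key += fmt.Sprintf("%s=%q", n, labels[n])
	}
	key += "}"
	v.series[key] = value
}

func main() {
	v := newVault()
	v.GaugeSet("m1", 1, map[string]string{"a": "A"})
	v.GaugeSet("m1", 2, map[string]string{"c": "C"})           // "c" is dropped; stored under a=""
	v.GaugeSet("m1", 3, map[string]string{"a": "A", "b": "B"}) // "b" is dropped; overwrites value 1
	keys := make([]string, 0, len(v.series))
	for k := range v.series {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Println(k, v.series[k])
	}
	// Output:
	// m1{a=""} 2
	// m1{a="A"} 3
}
```

This reproduces exactly the two series the test above expects for `m1`: the value 1 is lost and the `b` and `c` label values never appear.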
I did some research. Our documentation states: "The metric name is used as-is, so several hooks can export the same metric name. It is responsibility of hooks' developer to maintain consistent label cardinality." So providing a set of metrics such as the following
```shell
jq -nc '{group:"g",name: "m1", action:"set", value: 1, labels: {a: "A"}}' >>"$METRICS_PATH"
jq -nc '{group:"g",name: "m1", action:"set", value: 2, labels: {c: "C"}}' >>"$METRICS_PATH"
jq -nc '{group:"g",name: "m1", action:"set", value: 3, labels: {a: "A", b:"B"}}' >>"$METRICS_PATH"
```
is rather incorrect. Am I interpreting the documentation correctly? We could update metric operations so that they transform the aforementioned metrics into something like this on the user's behalf:
```shell
jq -nc '{group:"g",name: "m1", action:"set", value: 1, labels: {a: "A", b: "", c: ""}}' >>"$METRICS_PATH"
jq -nc '{group:"g",name: "m1", action:"set", value: 2, labels: {a: "", b:"", c: "C"}}' >>"$METRICS_PATH"
jq -nc '{group:"g",name: "m1", action:"set", value: 3, labels: {a: "A", b:"B", c: ""}}' >>"$METRICS_PATH"
```
However, metric operations are processed on a per-hook basis, so if two or more hooks report the same metric with different label sets, we cannot aggregate all the labels across all hooks.
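Within a single batch of operations the padding itself is straightforward. Here is a dependency-free sketch with a hypothetical `padLabels` helper (not an existing shell-operator API): compute the union of label names per metric name over the batch, then fill missing labels with empty strings.

```go
package main

import (
	"fmt"
	"sort"
)

// op is a simplified metric operation: a metric name, a value, and labels.
type op struct {
	Name   string
	Value  float64
	Labels map[string]string
}

// padLabels is a hypothetical helper: for each metric name it computes the
// union of label names across the batch and fills missing labels with "".
func padLabels(ops []op) []op {
	union := map[string]map[string]bool{}
	for _, o := range ops {
		if union[o.Name] == nil {
			union[o.Name] = map[string]bool{}
		}
		for k := range o.Labels {
			union[o.Name][k] = true
		}
	}
	out := make([]op, 0, len(ops))
	for _, o := range ops {
		padded := map[string]string{}
		for k := range union[o.Name] {
			padded[k] = o.Labels[k] // missing keys default to ""
		}
		out = append(out, op{Name: o.Name, Value: o.Value, Labels: padded})
	}
	return out
}

func main() {
	ops := padLabels([]op{
		{Name: "m1", Value: 1, Labels: map[string]string{"a": "A"}},
		{Name: "m1", Value: 2, Labels: map[string]string{"c": "C"}},
	})
	for _, o := range ops {
		keys := make([]string, 0, len(o.Labels))
		for k := range o.Labels {
			keys = append(keys, k)
		}
		sort.Strings(keys)
		// Both ops now carry the full label set {a, c}.
		fmt.Println(o.Name, keys, o.Labels)
	}
}
```

The limitation described above remains: this only sees the operations of one hook's batch, so labels reported by a different hook for the same metric name are invisible to it.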
> The metric name is used as-is, so several hooks can export the same metric name. It is responsibility of hooks' developer to maintain consistent label cardinality.
There is no inherent requirement in Prometheus to have the same label set for a given metric name, so where is this constraint coming from? I guess if it is documented to be like that, then it is more a feature request than a bug.
I found the following guidance about metric labels at https://prometheus.io/docs/instrumenting/writing_clientlibs/#labels, but I'm not sure how to interpret it correctly:

> Client libraries MUST NOT allow users to have different label names for the same metric for Gauge/Counter/Summary/Histogram or any other Collector offered by the library.
> Metrics from custom collectors should almost always have consistent label names. As there are still rare but valid use cases where this is not the case, client libraries should not verify this.
@redbaron, what do you think about dealing with the problem this way? https://github.com/flant/shell-operator/blob/521f98cc7615806e202c2ac69e65edcc7a2d2eec/pkg/metric_storage/vault/vault_test.go#L106-L139
Would it suit you?
As long as label values are not lost as before, it is an improvement.
Expected behavior (what you expected to happen):
Each metric produced by a hook should have exactly the same labels as reported by the script.
Actual behavior (what actually happened):
Looks like shell-operator expects all metrics with the same name to have a common set of labels. When the label set differs between metric instances, wrong labels are reported and some metrics can be omitted entirely.