So, we are defining alert rules and passing them through the grafana-agent relation. Juju does its magic for us and automatically includes topology labels for the model, the application, the unit, and the charm name.
However, that last label is problematic. An example rule:
```yaml
record: node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate
expr: sum by (cluster, namespace, pod, container) (irate(container_cpu_usage_seconds_total{image!="",job="kubelet",juju_application="microk8s",juju_charm="microk8s",juju_model="default",juju_model_uuid="bac712ad-17ce-4b14-8ed5-2d7c6c5eff37",metrics_path="/metrics/cadvisor"}[5m])) * on (cluster, namespace, pod) group_left (node) topk by (cluster, namespace, pod) (1, max by (cluster, namespace, pod, node) (kube_pod_info{juju_application="microk8s",juju_charm="microk8s",juju_model="default",juju_model_uuid="bac712ad-17ce-4b14-8ed5-2d7c6c5eff37",node!=""}))
labels:
  juju_application: microk8s
  juju_charm: microk8s
  juju_model: default
  juju_model_uuid: bac712ad-17ce-4b14-8ed5-2d7c6c5eff37
```
This rule includes the juju_charm label in its matchers, so it will fail if no such label is present on the metrics. The problem is, that is exactly what happens in grafana-agent: https://github.com/canonical/grafana-agent-operator/blob/main/src/machine_charm.py#L418-L424
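To see why this empties the result: a PromQL equality matcher such as `juju_charm="microk8s"` only selects series that actually carry that label with that value (a series without the label behaves as if the label were the empty string). A toy simulation of that matching rule:

```python
# Toy model of PromQL equality matching: a label that is absent from a series
# behaves as the empty string, so a juju_charm="microk8s" matcher can never
# select series that were remote-written without a juju_charm label.
def matches(series_labels, matchers):
    return all(series_labels.get(key, "") == value for key, value in matchers.items())

# A series that lacks the juju_charm label, as described above:
series = {"job": "kubelet", "juju_application": "microk8s", "juju_model": "default"}

matches(series, {"juju_application": "microk8s"})                            # True
matches(series, {"juju_application": "microk8s", "juju_charm": "microk8s"})  # False
```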
However, I am not in favor of changing PrometheusRemoteWriteProvider in that way (see the workaround commit linked below), because it might break other valid scenarios.
My understanding is that the problem arises because the grafana-agent unit has no idea about the charm name of its principal unit.
So, two possible solutions:
1. Like above, do not include the juju_charm label on alerts (somehow).
2. Pass the charm name via relation data from the principal unit to the grafana-agent unit, and use it to annotate the alert rules (if available).
I am more inclined towards option 2, even though it would need some minor refactoring on both the provider and the requirer side. I also imagine that if that relation data is missing, we can keep the current behaviour (so perhaps we can skip bumping the library API).
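Option 2 could look roughly like the following. This is only a sketch: the `charm_name` relation-data key and the `annotate_rule` helper are hypothetical names, not the shipped library API; the point is that `juju_charm` is added only when the principal actually published its charm name, so older requirers keep the current behaviour.

```python
# Sketch of option 2 (hypothetical key/function names, not the real library API):
# the principal unit publishes its charm name over the relation, and the
# provider stamps juju_charm onto alert rules only when that datum is present.
def annotate_rule(rule, topology, relation_data):
    labels = dict(rule.get("labels", {}))
    labels["juju_model"] = topology["model"]
    labels["juju_model_uuid"] = topology["model_uuid"]
    labels["juju_application"] = topology["application"]
    charm_name = relation_data.get("charm_name")  # absent from older requirers
    if charm_name:
        labels["juju_charm"] = charm_name  # only when the principal told us
    return dict(rule, labels=labels)
```

If `charm_name` is missing from the relation data, the rule simply carries no juju_charm label, which matches the fallback behaviour proposed above.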
Bug Description
Indeed, running the query expression manually on Prometheus returns an empty result, but it works if I manually remove the juju_charm="microk8s" label. A hacky solution I tried, which worked, is this commit: https://github.com/neoaggelos/grafana-agent-k8s-operator/commit/7364e4080cec3db17c16e10433d6b0f1e82d5d85
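That manual edit can also be scripted. A regex-based sketch (deliberately naive; a real fix should parse the PromQL rather than pattern-match it):

```python
import re

# Sketch: strip the juju_charm matcher from a query expression, mirroring the
# manual edit described above. Naive on purpose: it assumes juju_charm is
# followed by another matcher, as in the example rule; a robust implementation
# would use a PromQL parser instead of a regex.
def drop_juju_charm(expr):
    return re.sub(r'juju_charm="[^"]*",', "", expr)

query = 'kube_pod_info{juju_application="microk8s",juju_charm="microk8s",node!=""}'
drop_juju_charm(query)  # 'kube_pod_info{juju_application="microk8s",node!=""}'
```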
To Reproduce
See description above.
Environment
juju deploy grafana-agent --channel edge
Relevant log output
No response
Additional context
No response