Closed: lathiat closed this issue 7 months ago
Hi @lathiat,
Agreed that this isn't ideal. Let me try to address these points one by one:
As for the telemetry being tagged with the grafana-agent topology: that is indeed intended and, I think, the lesser of two evils if you consider use cases where two principals (or just two charms, really) are deployed to the same machine. Unless a charm natively integrates over the `cos_agent` interface and says "I want a node-exporter scrape job where all the labels are set to my topology", it seems fair that the labels default to the agent's topology, as that's the only way to get a deterministic deployment in these cases. The same goes for logs.
As for the hostname being overridden: yes, we should definitely fix that. It should be tagged with the hostname of the VM, not the topology. Having the topology populate the hostname makes total sense in the k8s world, where machine identities aren't really a useful thing, but for VMs the hostname is a vital part of the fingerprint.
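As a sketch of what "tag with the VM hostname" could look like in the agent's scrape configuration (the structure follows Grafana Agent's static-mode config; the label name and target values here are illustrative, not the charm's actual generated output):

```yaml
# Illustrative grafana-agent fragment: keep the machine's real hostname
# on node-exporter metrics instead of overwriting it with Juju topology.
metrics:
  configs:
    - name: integrations
      scrape_configs:
        - job_name: node
          static_configs:
            - targets: ["localhost:9100"]
              labels:
                hostname: my-vm-hostname  # assumed: the VM's hostname, not topology
```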
Let's use this ticket as a bug report where we address part 2. Does that make sense?
Best, Simon
OK, after re-reading the different issues a few times, it seems we have three separate cases to reason about here.
Case 1: the same principal unit (e.g. ceph-mon/0) is related to two different COS subordinates, e.g. prometheus-scrape-config and grafana-agent. In that case, both subordinates may create labels or names based on the common principal unit name (ceph-mon/0), so they would sometimes overwrite each other's rules (https://github.com/canonical/prometheus-k8s-operator/issues/551).
At least, I thought that is what this was, except it seems prometheus-scrape-config-ceph was really a principal charm in the COS model, and not a subordinate on ceph-mon. I'm not sure how the principal unit was being passed through that relation. However, the point mostly still applies: the same named item was configured by two different "subordinates" (not really subordinates in this particular case, but they might be in others).
```
ceph-mon/0*
grafana-agent/0
prometheus-scrape-config/0
```
Case 2: the same machine has two different principal units installed, both of which are related to the same grafana-agent subordinate application. In this case we have two different principal unit names and two different subordinate unit names (e.g. principals ceph-osd/2 and ubuntu/0, subordinates grafana-agent/3 and grafana-agent/8), all on the same machine and with the same actual installation of grafana-agent.
```
Unit                Workload  Agent  Machine  Public address  Ports  Message
ceph-osd/2*         active    idle   3        172.16.0.50            Unit is ready (1 OSD)
  grafana-agent/3*  active    idle            172.16.0.50
ubuntu/0*           active    idle   3        172.16.0.50
  grafana-agent/8   active    idle            172.16.0.50
```
Case 3: the same subordinate application (e.g. grafana-agent) is related to multiple principal units on different machines (https://github.com/canonical/grafana-agent-operator/issues/17).
It's unclear to me why it was getting confused in this case, since the two principal units were on different machines, except possibly the note that it was "difficult to get the principal unit", which I address below.
```
kafka/0*            active  idle  0  redacted_subnet.4
  grafana-agent/10  active  idle     redacted_subnet.4
zookeeper/0         active  idle  3  redacted_subnet.7
  grafana-agent/6   active  idle     redacted_subnet.7
```
Identifying rules and metrics based on the subordinate name does help with Case 1, since both subordinate charms would otherwise generate metrics or rules with the same name. However, it means we have lost the ability to filter based on the principal name. Having to work with grafana-agent/N, and not being able to reason about principals like ceph-mon/0, ceph-osd/0, etc., is a serious blocker in my view; I don't see it being a usable observability system that way. I'd like to keep this bug about that specific issue; the hostname part is related but more minor.
https://github.com/canonical/grafana-agent-operator/pull/47 claimed that it was difficult to determine the principal unit from the charm code; however, all of the existing LMA charms have long been doing this. JUJU_PRINCIPAL is generally passed into hooks, though I would note that the filebeat charm, at least, seemed to have to cache this as it might not always be available. I don't immediately see details of when it isn't: https://github.com/canonical/layer-beats-base/pull/26/files
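A minimal sketch of that caching approach: remember the principal the first time a hook sees `JUJU_PRINCIPAL`, and fall back to the cached value when the variable is absent. The cache path and function name here are made up for illustration; they are not from the filebeat layer.

```python
import json
import os

CACHE_PATH = "/var/lib/juju/principal-cache.json"  # hypothetical location

def get_principal_unit(cache_path=CACHE_PATH):
    """Return the principal unit name, caching it because JUJU_PRINCIPAL
    is reportedly not present in every hook invocation."""
    principal = os.environ.get("JUJU_PRINCIPAL")
    if principal:
        # Refresh the cache whenever the variable is available.
        with open(cache_path, "w") as f:
            json.dump({"principal": principal}, f)
        return principal
    # Variable absent: fall back to the last cached value, if any.
    try:
        with open(cache_path) as f:
            return json.load(f).get("principal")
    except FileNotFoundError:
        return None
```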
It seems to me the real solution would be to label and name the metrics based on both the principal and the subordinate, so that we can still filter metrics by principal, while the unique identifiers for rules etc. would still be unique by also listing the subordinate. In the case of one machine with two principals, we may duplicate some collections if the data is collected twice under the two principal names, but the cardinality expansion should, I think, be limited to the number of principal units, and we usually only have one, maybe two. It's very rare to have more than two principal applications on the same machine. Assuming that we d
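For instance, a node-exporter scrape job carrying both identities might look like this (the `juju_subordinate*` label names are purely illustrative; grafana-agent does not emit them today):

```yaml
# Hypothetical labels carrying both the principal and the subordinate
# identity: rules stay unique, principal-based filtering still works.
scrape_configs:
  - job_name: node-exporter
    static_configs:
      - targets: ["localhost:9100"]
        labels:
          juju_application: ceph-osd              # principal application
          juju_unit: ceph-osd/2                   # principal unit
          juju_subordinate: grafana-agent         # keeps rule names unique
          juju_subordinate_unit: grafana-agent/3
```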
It wouldn't surprise me if I have missed something here, and I am pretty green with Prometheus and Loki, so please let me know what I have missed; it was a bit much to get my head around. But I think the key points above stand.
If you read the above comment in an e-mail, note that I made a couple of minor edits shortly after posting, so it would be best to read the latest version.
So I would like to discuss logs and metrics separately.
Logs will support having both labels once #46 is completed. Grafana-agent will send a "standard" set of logs with its own labels, and the charm can request that specific log files be labeled with its topology. Additionally, logs originating from snaps will get the labels of the charm that declared them.
The metrics story is a bit different. We have decided to label any metrics generated by grafana-agent itself (node-exporter) with grafana-agent's topology, while metrics we scrape from the application get the application's topology. So any dashboards or rules provided should work just fine with the application labels.
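Under that scheme, queries would split along the same line. A sketch (metric names are standard node-exporter/Prometheus ones; the label values are assumed for this deployment):

```promql
# Host metrics generated by the agent carry grafana-agent's topology:
node_cpu_seconds_total{juju_application="grafana-agent"}

# Metrics scraped from the workload over cos_agent carry its topology:
up{juju_application="zookeeper"}
```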
Hi @lathiat,
To elaborate on @dstathis's response above, it would be handy if you could post a minimal ceph-osd bundle (e.g. `juju export-bundle`, trimmed down to the bare necessities). In the meanwhile, I think I was able to reproduce the issues with the following deployment.
```yaml
# lxd model
series: jammy
saas:
  loki:
    url: microk8s:admin/pebnote.loki
  prom:
    url: microk8s:admin/pebnote.prom
applications:
  ga:
    charm: grafana-agent
    channel: edge
    revision: 52
  ub:
    charm: ubuntu
    channel: edge
    revision: 24
    num_units: 1
    to:
    - "0"
  ubu:
    charm: ubuntu
    channel: edge
    revision: 24
    num_units: 1
    to:
    - "0"
  zk:
    charm: zookeeper
    channel: 3/edge
    revision: 125
    num_units: 1
    to:
    - "1"
    trust: true
machines:
  "0":
    constraints: arch=amd64
  "1":
    constraints: arch=amd64
relations:
- - ga:juju-info
  - ub:juju-info
- - ga:juju-info
  - ubu:juju-info
- - ga:logging-consumer
  - loki:logging
- - ga:send-remote-write
  - prom:receive-remote-write
- - ga:cos-agent
  - zk:cos-agent
```
```yaml
# microk8s model
bundle: kubernetes
saas:
  remote-a62e4e5eeec84aa78034f543c0218901: {}
applications:
  loki:
    charm: loki-k8s
    channel: edge
    revision: 121
    resources:
      loki-image: 91
    scale: 1
    trust: true
  prom:
    charm: prometheus-k8s
    channel: edge
    revision: 170
    resources:
      prometheus-image: 139
    scale: 1
    trust: true
relations:
- - loki:logging
  - remote-a62e4e5eeec84aa78034f543c0218901:logging-consumer
- - prom:receive-remote-write
  - remote-a62e4e5eeec84aa78034f543c0218901:send-remote-write
--- # overlay.yaml
applications:
  loki:
    offers:
      loki:
        endpoints:
        - logging
        acl:
          admin: admin
  prom:
    offers:
      prom:
        endpoints:
        - receive-remote-write
        acl:
          admin: admin
```
```mermaid
graph LR
  subgraph lxd
    ub --- ga
    ubu --- ga
    zk --- ga
  end
  subgraph microk8s
    prom
    loki
  end
  ga --- prom
  ga --- loki
```
```mermaid
graph TD
  subgraph machine-0
    ub/0
    ubu/0
    subgraph subord1[subordinates]
      ga/0
      ga/1
    end
    ub/0 --- ga/0
    ubu/0 --- ga/1
  end
  subgraph machine-1
    zk/0
    subgraph subord2[subordinates]
      ga/2
    end
    zk/0 --- ga/2
  end
```
Just wanted to mention that scenario 2 above is unsupported only for now. We plan on supporting it in the future.
Hi @lathiat, I went ahead and merged the associated PR to keep things well scoped. I know you're testing these changes so please open follow up issues for anything odd you encounter!
@sed-i This issue needs to be reopened: rev 88 on latest/edge has exactly the same issue as described here.
Enhancement Proposal
When deploying cos-lite edge and relating it against a Ceph deployment, both the Loki logs and host metrics (e.g. CPU/Disk/etc) are labelled according to the grafana-agent subordinate.
Instead of ceph-mon/0, ceph-osd/{0,1,2} and ceph-rgw/{0,1,2} all appear in loki tagged as juju_application=grafana-agent, juju_unit=grafana-agent/{0,1,2,3,4,5,6,7}. I cannot filter for ceph-osd applications. Similarly under a grafana dashboard such as "System Resources" the hostname is {MODEL_NAME}-{MODEL_UUID}_grafana-agent_grafana-agent/7.
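Concretely, the LogQL selector a user would naturally reach for matches nothing, while everything sits under the agent's labels (selectors illustrative for this deployment):

```logql
{juju_application="ceph-osd"}         # what I want to query: no results
{juju_application="grafana-agent"}    # where all the logs actually land
```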
This is not really helpful: as a user of the system I need to be able to easily drill down or select hosts based on the application, like ceph-osd, and having to translate from ceph-osd/N to a sea of grafana-agent/N is not really practical.
Under the LMA stack, they would be tagged with the principal charm name instead, which is much more useful.
It seems this change was made very recently in #47 to fix another issue, #17. However, it is now very difficult to actually use the dashboards.
In some cases it may be possible to resolve this by adding additional labels, such as the principal application/unit, though that won't help much for the "Hostname" side of things. So I guess some more thought is needed to balance this usability concern against the requirements of the original issue.