canonical / grafana-agent-operator

https://charmhub.io/grafana-agent
Apache License 2.0

All metrics/logs labelled as subordinate 'grafana-agent/N', cannot filter or see the principal charm name/unit number #60

Closed lathiat closed 7 months ago

lathiat commented 7 months ago

Enhancement Proposal

When deploying cos-lite edge and relating it against a Ceph deployment, both the Loki logs and host metrics (e.g. CPU/Disk/etc) are labelled according to the grafana-agent subordinate.

Instead of appearing as ceph-mon/0, ceph-osd/{0,1,2} and ceph-rgw/{0,1,2}, everything appears in Loki tagged as juju_application=grafana-agent, juju_unit=grafana-agent/{0,1,2,3,4,5,6,7}, so I cannot filter for the ceph-osd application. Similarly, under a Grafana dashboard such as "System Resources", the hostname is {MODEL_NAME}-{MODEL_UUID}_grafana-agent_grafana-agent/7.

This is not really helpful: as a user of the system I need to be able to easily drill down or select hosts by application (e.g. ceph-osd), and having to translate from ceph-osd/N to a sea of grafana-agent/N is not practical.

Under the LMA stack, they would be tagged with the principal charm name instead, which is much more useful.
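For illustration, a rough sketch of the two label shapes (the label keys are the juju topology labels already visible above; the pairing of ceph-osd/0 with grafana-agent/3 is just an example, and other topology keys such as juju_model are omitted):

# What everything currently gets tagged with (subordinate topology):
current_labels:
  juju_application: grafana-agent
  juju_unit: grafana-agent/3

# What would actually let me filter by application (principal topology, LMA-style):
desired_labels:
  juju_application: ceph-osd
  juju_unit: ceph-osd/0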

It seems this change was made very recently in #47 to fix another issue, #17. However, it now makes the dashboards very difficult to actually use.

In some cases, it may be possible to resolve this by adding additional labels such as the principal application/unit; however, that won't help much with the "Hostname" side of things. So some more thought is needed to balance this usability concern against the requirements of the original issue.

(two screenshots attached)

simskij commented 7 months ago

Hi @lathiat,

I agree that this isn't ideal. Let me try to address this one by one:

Let's use this ticket as a bug report where we address part 2. Does that make sense?

Best, Simon

lathiat commented 7 months ago

OK, after re-reading through the different issues a few times, it seems we have three separate cases to reason about here.

Case 1: 1 Machine with 1 Principal, 2 Subordinates

This is when the same principal unit (e.g. ceph-mon/0) is related to two different COS subordinates, e.g. prometheus-scrape-config and grafana-agent. In that case, both subordinates may create labels or names based on the common principal unit name (ceph-mon/0), so they can sometimes overwrite each other's rules (https://github.com/canonical/prometheus-k8s-operator/issues/551).

At least, I thought that is what this was, except it seems prometheus-scrape-config-ceph was really a principal charm in the COS model and not a subordinate on ceph-mon. I'm not sure how the principal unit was being passed through that relation. However, the point mostly still applies: an identically named item was configured by two different "subordinates" (not actually subordinates in this particular case, but they might be in others). A hypothetical sketch of this collision follows the status snippet below.

ceph-mon/0*
  grafana-agent/0
  prometheus-scrape-config/0
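
To make the Case 1 collision concrete, a purely hypothetical sketch (the group name is invented for illustration and is not taken from either charm; the point is only that two writers derive the same identifier from the shared principal ceph-mon/0):

# Hypothetical rule file contributed via grafana-agent/0, keyed on the principal:
groups:
  - name: ceph-mon_0_alerts   # identifier derived from ceph-mon/0
    rules: []                 # rule bodies omitted
---
# Hypothetical rule file contributed via prometheus-scrape-config/0, keyed on the same principal:
groups:
  - name: ceph-mon_0_alerts   # same identifier, so one set of rules clobbers the other
    rules: []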

Case 2: 1 Machine with 2 Principals, 1 Subordinate instantiated twice (e.g. ceph-osd/2 with grafana-agent/3 and ubuntu/0 with grafana-agent/8).

When the same machine has 2 different principal units installed, both of which are related to the same grafana-agent subordinate.

In this case we have two different principal unit names and two different subordinate unit names (e.g. principals ceph-osd/2 and ubuntu/0, subordinates grafana-agent/3 and grafana-agent/8), all on the same machine and with the same actual installation of grafana-agent.

Unit                Workload  Agent  Machine  Public address  Ports  Message
ceph-osd/2*         active    idle   3        172.16.0.50            Unit is ready (1 OSD)
  grafana-agent/3*  active    idle            172.16.0.50
ubuntu/0*           active    idle   3        172.16.0.50
  grafana-agent/8   active    idle            172.16.0.50

Case 3: 2 Machines each with 1 different principal unit, both related to the same subordinate application (e.g. kafka/0 and zookeeper/0 both related to grafana-agent).

When the same subordinate (e.g. grafana-agent) is related to multiple principal units, on different machines (https://github.com/canonical/grafana-agent-operator/issues/17)

It's unclear to me why it was getting confused in this case, since the two principal units were on different machines. Except possibly the note that it was "difficult to get the principal unit", which I address below.

kafka/0*                      active    idle   0        redacted_subnet.4
  grafana-agent/10            active    idle            redacted_subnet.4
zookeeper/0                   active    idle   3        redacted_subnet.7
  grafana-agent/6             active    idle            redacted_subnet.7

Analysis

Identifying rules and metrics by the subordinate name does help with Case 1, since it seems both subordinate charms would otherwise generate metrics or rules with the same name. However, it means we have now lost the ability to filter on the principal name. Having to work with grafana-agent/N, without being able to reason about principals like ceph-mon/0 or ceph-osd/0, is a serious blocker in my view; I don't see it being a usable observability system that way. I'd like to keep this bug about that specific issue; the hostname part is related but more minor.

https://github.com/canonical/grafana-agent-operator/pull/47 claimed that it was difficult to determine the principal unit from the charm code; however, all of the existing LMA charms have long been doing this. JUJU_PRINCIPAL is generally passed into hooks, though I would note that the filebeat charm at least seemed to need to cache it, as it might not always be available? I don't immediately see details of when it isn't: https://github.com/canonical/layer-beats-base/pull/26/files

It seems to me the real solution would be to label and name the metrics based on both the principal and the subordinate (sketched below), so that we can still filter metrics by the principal, while the identifiers for rules etc. remain unique because the subordinate is also listed. In the Case 2 situation (one machine with two principals), we may duplicate some collections if the data is collected twice under the two principal names, but I think the cardinality expansion should be limited to the number of principal units, and we usually have only one, maybe two. It's very rare to have more than two principal applications on the same machine. Assuming that we d
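
As a sketch only (the juju_subordinate_* label names are invented here to illustrate the idea and are not existing COS labels), a single node-exporter series for the Case 2 machine might carry:

# Host metrics collected by grafana-agent/3 on the machine hosting ceph-osd/2:
labels:
  juju_application: ceph-osd                    # principal: usable for filtering and dashboards
  juju_unit: ceph-osd/2
  juju_subordinate_application: grafana-agent   # invented label: keeps rule/metric identifiers unique
  juju_subordinate_unit: grafana-agent/3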

It wouldn't surprise me if I have missed something here, and I am pretty green with Prometheus and Loki, so please let me know what I have missed; it took a bit to get my head around. But I think those are the key points.

lathiat commented 7 months ago

If you read the above comment in an e-mail, I made a couple minor edits shortly after posting, so it would be best to read the latest version.

dstathis commented 7 months ago

So I would like to discuss logs and metrics separately.

Logs

Logs will support having both labels once #46 is completed. Grafana-agent will send a "standard" set of logs with its own labels, and a charm can request that specific log files be labeled with that charm's topology. Additionally, logs originating from snaps will get the labels of the charm that declared them.

Metrics

The metrics story is a bit different. We have decided to label any metrics generated by grafana-agent itself (node-exporter) with grafana-agent's topology, while metrics we scrape from the application get the application's topology. So any dashboards or rules provided should work just fine with the application labels.
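
Roughly, and with illustrative unit numbers only, that means label sets like:

# Host metrics generated by grafana-agent itself (node-exporter):
node_exporter_series:
  juju_application: grafana-agent
  juju_unit: grafana-agent/3

# Metrics scraped from a workload over cos-agent (e.g. zookeeper):
scraped_workload_series:
  juju_application: zookeeper
  juju_unit: zookeeper/0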

sed-i commented 7 months ago

Hi @lathiat,

Reproduction

To elaborate on @dstathis's response above, it would be handy if you could post a minimal ceph-osd bundle (e.g. juju export-bundle, trimmed down to the bare necessities). In the meantime, I think I was able to reproduce the issues in the following deployment.

# lxd model
series: jammy
saas:
  loki:
    url: microk8s:admin/pebnote.loki
  prom:
    url: microk8s:admin/pebnote.prom
applications:
  ga:
    charm: grafana-agent
    channel: edge
    revision: 52
  ub:
    charm: ubuntu
    channel: edge
    revision: 24
    num_units: 1
    to:
    - "0"
  ubu:
    charm: ubuntu
    channel: edge
    revision: 24
    num_units: 1
    to:
    - "0"
  zk:
    charm: zookeeper
    channel: 3/edge
    revision: 125
    num_units: 1
    to:
    - "1"
    trust: true
machines:
  "0":
    constraints: arch=amd64
  "1":
    constraints: arch=amd64
relations:
- - ga:juju-info
  - ub:juju-info
- - ga:juju-info
  - ubu:juju-info
- - ga:logging-consumer
  - loki:logging
- - ga:send-remote-write
  - prom:receive-remote-write
- - ga:cos-agent
  - zk:cos-agent
# microk8s model
bundle: kubernetes
saas:
  remote-a62e4e5eeec84aa78034f543c0218901: {}
applications:
  loki:
    charm: loki-k8s
    channel: edge
    revision: 121
    resources:
      loki-image: 91
    scale: 1
    trust: true
  prom:
    charm: prometheus-k8s
    channel: edge
    revision: 170
    resources:
      prometheus-image: 139
    scale: 1
    trust: true
relations:
- - loki:logging
  - remote-a62e4e5eeec84aa78034f543c0218901:logging-consumer
- - prom:receive-remote-write
  - remote-a62e4e5eeec84aa78034f543c0218901:send-remote-write
--- # overlay.yaml
applications:
  loki:
    offers:
      loki:
        endpoints:
        - logging
        acl:
          admin: admin
  prom:
    offers:
      prom:
        endpoints:
        - receive-remote-write
        acl:
          admin: admin

Relation view

graph LR
subgraph lxd
ub --- ga
ubu --- ga
zk --- ga
end

subgraph microk8s
prom
loki
end

ga --- prom
ga --- loki

Machine view

graph TD
subgraph machine-0
ub/0
ubu/0

subgraph subord1[subordinates]
ga/0
ga/1
end

ub/0 --- ga/0
ubu/0 --- ga/1
end

subgraph machine-1
zk/0

subgraph subord2[subordinates]
ga/2
end

zk/0 --- ga/2
end

dstathis commented 7 months ago

Just wanted to mention that scenario 2 above is unsupported only for now. We plan on supporting it in the future.

sed-i commented 7 months ago

Hi @lathiat, I went ahead and merged the associated PR to keep things well scoped. I know you're testing these changes, so please open follow-up issues for anything odd you encounter!

err404r commented 5 months ago

@sed-i This issue needs to be reopened: rev 88 on latest/edge has exactly the same problem as described here.