canonical / grafana-agent-operator

https://charmhub.io/grafana-agent
Apache License 2.0

All metrics/logs labelled as subordinate 'grafana-agent/N', cannot filter or see the principal charm name/unit number #60

Closed lathiat closed 7 months ago

lathiat commented 7 months ago

Enhancement Proposal

When deploying cos-lite edge and relating it against a Ceph deployment, both the Loki logs and host metrics (e.g. CPU/Disk/etc) are labelled according to the grafana-agent subordinate.

Instead of appearing as ceph-mon/0, ceph-osd/{0,1,2} and ceph-rgw/{0,1,2}, everything appears in Loki tagged as juju_application=grafana-agent, juju_unit=grafana-agent/{0,1,2,3,4,5,6,7}, so I cannot filter for the ceph-osd application. Similarly, under a Grafana dashboard such as "System Resources", the hostname is {MODEL_NAME}-{MODEL_UUID}_grafana-agent_grafana-agent/7.

This is not really helpful: as a user of the system I need to be able to easily drill down or select hosts by application (e.g. ceph-osd), and having to translate from ceph-osd/N to a sea of grafana-agent/N is not practical.

Under the LMA stack, they would be tagged with the principal charm name instead, which is much more useful.
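For illustration, a rough sketch of the two label shapes (the label keys are the juju topology labels already visible above; the pairing of ceph-osd/0 with grafana-agent/3 is just an example, and other topology keys such as juju_model are omitted):

# What everything currently gets tagged with (subordinate topology):
current_labels:
  juju_application: grafana-agent
  juju_unit: grafana-agent/3

# What would actually let me filter by application (principal topology, LMA-style):
desired_labels:
  juju_application: ceph-osd
  juju_unit: ceph-osd/0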

It seems this change was made very recently in #47 to fix another issue, #17. However, it now makes the dashboards very difficult to actually use.

In some cases, it may be possible to resolve this by adding additional labels such as the principal application/unit; however, that won't help much with the "Hostname" side of things. So some more thought is needed to balance this usability concern against the requirements of the original issue.

(two screenshots attached)

simskij commented 7 months ago

Hi @lathiat,

I agree that this isn't ideal. Let me try to address this one by one:

Let's use this ticket as a bug report where we address part 2. Does that make sense?

Best, Simon

lathiat commented 7 months ago

OK, after re-reading through the different issues a few times, it seems we have three separate cases to reason about here.

Case 1: 1 Machine with 1 Principal, 2 Subordinates

This is when the same principal unit (e.g. ceph-mon/0) is related to two different COS subordinates, e.g. prometheus-scrape-config and grafana-agent. In that case, both subordinates may create labels or names based on the common principal unit name (ceph-mon/0), so they can sometimes overwrite each other's rules (https://github.com/canonical/prometheus-k8s-operator/issues/551).

At least, I thought that is what this was, except it seems prometheus-scrape-config-ceph was really a principal charm in the COS model and not a subordinate on ceph-mon. I'm not sure how the principal unit was being passed through that relation. However, the point mostly still applies: an identically named item was configured by two different "subordinates" (not actually subordinates in this particular case, but they might be in others). A hypothetical sketch of this collision follows the status snippet below.

ceph-mon/0*
  grafana-agent/0
  prometheus-scrape-config/0
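
To make the Case 1 collision concrete, a purely hypothetical sketch (the group name is invented for illustration and is not taken from either charm; the point is only that two writers derive the same identifier from the shared principal ceph-mon/0):

# Hypothetical rule file contributed via grafana-agent/0, keyed on the principal:
groups:
  - name: ceph-mon_0_alerts   # identifier derived from ceph-mon/0
    rules: []                 # rule bodies omitted
---
# Hypothetical rule file contributed via prometheus-scrape-config/0, keyed on the same principal:
groups:
  - name: ceph-mon_0_alerts   # same identifier, so one set of rules clobbers the other
    rules: []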

Case 2: 1 Machine with 2 Principals, 1 Subordinate instantiated twice (e.g. ceph-osd/2 with grafana-agent/3 and ubuntu/0 with grafana-agent/8).

When the same machine has 2 different principal units installed, both of which are related to the same grafana-agent subordinate.

In this case we have two different principal unit names and two different subordinate unit names (e.g. principals ceph-osd/2 and ubuntu/0, subordinates grafana-agent/3 and grafana-agent/8), all on the same machine and with the same actual installation of grafana-agent.

Unit                Workload  Agent  Machine  Public address  Ports  Message
ceph-osd/2*         active    idle   3        172.16.0.50            Unit is ready (1 OSD)
  grafana-agent/3*  active    idle            172.16.0.50
ubuntu/0*           active    idle   3        172.16.0.50
  grafana-agent/8   active    idle            172.16.0.50

Case 3: 2 Machines each with 1 different principal unit, both related to the same subordinate application (e.g. kafka/0 and zookeeper/0 both related to grafana-agent).

When the same subordinate (e.g. grafana-agent) is related to multiple principal units, on different machines (https://github.com/canonical/grafana-agent-operator/issues/17)

It's unclear to me why it was getting confused in this case, since the two principal units were on different machines. Except possibly the note that it was "difficult to get the principal unit", which I address below.

kafka/0*                      active    idle   0        redacted_subnet.4
  grafana-agent/10            active    idle            redacted_subnet.4
zookeeper/0                   active    idle   3        redacted_subnet.7
  grafana-agent/6             active    idle            redacted_subnet.7

Analysis

Identifying rules and metrics by the subordinate name does help with Case 1, since it seems both subordinate charms would otherwise generate metrics or rules with the same name. However, it means we have now lost the ability to filter on the principal name. Having to work with grafana-agent/N, without being able to reason about principals like ceph-mon/0 or ceph-osd/0, is a serious blocker in my view; I don't see it being a usable observability system that way. I'd like to keep this bug about that specific issue; the hostname part is related but more minor.

https://github.com/canonical/grafana-agent-operator/pull/47 claimed that it was difficult to determine the principal unit from the charm code; however, all of the existing LMA charms have long been doing this. JUJU_PRINCIPAL is generally passed into hooks, though I would note that the filebeat charm at least seemed to need to cache it, as it might not always be available? I don't immediately see details of when it isn't: https://github.com/canonical/layer-beats-base/pull/26/files

It seems to me the real solution would be to label and name the metrics based on both the principal and the subordinate (sketched below), so that we can still filter metrics by the principal, while the identifiers for rules etc. remain unique because the subordinate is also listed. In the Case 2 situation (one machine with two principals), we may duplicate some collections if the data is collected twice under the two principal names, but I think the cardinality expansion should be limited to the number of principal units, and we usually have only one, maybe two. It's very rare to have more than two principal applications on the same machine. Assuming that we d
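
As a sketch only (the juju_subordinate_* label names are invented here to illustrate the idea and are not existing COS labels), a single node-exporter series for the Case 2 machine might carry:

# Host metrics collected by grafana-agent/3 on the machine hosting ceph-osd/2:
labels:
  juju_application: ceph-osd                    # principal: usable for filtering and dashboards
  juju_unit: ceph-osd/2
  juju_subordinate_application: grafana-agent   # invented label: keeps rule/metric identifiers unique
  juju_subordinate_unit: grafana-agent/3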

It wouldn't surprise me if I have missed something here, and I am pretty green with Prometheus and Loki, so please let me know what I have missed; it took a bit to get my head around. But I think those are the key points.

lathiat commented 7 months ago

If you read the above comment in an e-mail, I made a couple minor edits shortly after posting, so it would be best to read the latest version.

dstathis commented 7 months ago

So I would like to discuss logs and metrics separately.

Logs

Logs will support having both labels once #46 is completed. Grafana-agent will send a "standard" set of logs with its own labels, and a charm can request that specific log files be labeled with that charm's topology. Additionally, logs originating from snaps will get the labels of the charm that declared them.

Metrics

The metrics story is a bit different. We have decided to label any metrics generated by grafana-agent itself (node-exporter) with grafana-agent's topology, while metrics we scrape from the application get the application's topology. So any dashboards or rules provided should work just fine with the application labels.
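
Roughly, and with illustrative unit numbers only, that means label sets like:

# Host metrics generated by grafana-agent itself (node-exporter):
node_exporter_series:
  juju_application: grafana-agent
  juju_unit: grafana-agent/3

# Metrics scraped from a workload over cos-agent (e.g. zookeeper):
scraped_workload_series:
  juju_application: zookeeper
  juju_unit: zookeeper/0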

sed-i commented 7 months ago

Hi @lathiat,

Reproduction

To elaborate on @dstathis's response above, it would be handy if you could post a minimal ceph-osd bundle (e.g. juju export-bundle, trimmed down to the bare necessities). In the meantime, I think I was able to reproduce the issues in the following deployment.

# lxd model
series: jammy
saas:
  loki:
    url: microk8s:admin/pebnote.loki
  prom:
    url: microk8s:admin/pebnote.prom
applications:
  ga:
    charm: grafana-agent
    channel: edge
    revision: 52
  ub:
    charm: ubuntu
    channel: edge
    revision: 24
    num_units: 1
    to:
    - "0"
  ubu:
    charm: ubuntu
    channel: edge
    revision: 24
    num_units: 1
    to:
    - "0"
  zk:
    charm: zookeeper
    channel: 3/edge
    revision: 125
    num_units: 1
    to:
    - "1"
    trust: true
machines:
  "0":
    constraints: arch=amd64
  "1":
    constraints: arch=amd64
relations:
- - ga:juju-info
  - ub:juju-info
- - ga:juju-info
  - ubu:juju-info
- - ga:logging-consumer
  - loki:logging
- - ga:send-remote-write
  - prom:receive-remote-write
- - ga:cos-agent
  - zk:cos-agent
# microk8s model
bundle: kubernetes
saas:
  remote-a62e4e5eeec84aa78034f543c0218901: {}
applications:
  loki:
    charm: loki-k8s
    channel: edge
    revision: 121
    resources:
      loki-image: 91
    scale: 1
    trust: true
  prom:
    charm: prometheus-k8s
    channel: edge
    revision: 170
    resources:
      prometheus-image: 139
    scale: 1
    trust: true
relations:
- - loki:logging
  - remote-a62e4e5eeec84aa78034f543c0218901:logging-consumer
- - prom:receive-remote-write
  - remote-a62e4e5eeec84aa78034f543c0218901:send-remote-write
--- # overlay.yaml
applications:
  loki:
    offers:
      loki:
        endpoints:
        - logging
        acl:
          admin: admin
  prom:
    offers:
      prom:
        endpoints:
        - receive-remote-write
        acl:
          admin: admin

Relation view

graph LR
subgraph lxd
ub --- ga
ubu --- ga
zk --- ga
end

subgraph microk8s
prom
loki
end

ga --- prom
ga --- loki

Machine view

graph TD
subgraph machine-0
ub/0
ubu/0

subgraph subord1[subordinates]
ga/0
ga/1
end

ub/0 --- ga/0
ubu/0 --- ga/1
end

subgraph machine-1
zk/0

subgraph subord2[subordinates]
ga/2
end

zk/0 --- ga/2
end

dstathis commented 7 months ago

Just wanted to mention that scenario 2 above is unsupported only for now. We plan on supporting it in the future.

sed-i commented 7 months ago

Hi @lathiat, I went ahead and merged the associated PR to keep things well scoped. I know you're testing these changes, so please open follow-up issues for anything odd you encounter!

err404r commented 5 months ago

@sed-i This issue needs to be reopened: rev 88 on latest/edge has exactly the same problem as described here.