canonical / grafana-agent-k8s-operator

This charmed operator automates the operational procedures of running Grafana Agent, an open-source telemetry collector.
https://charmhub.io/grafana-agent-k8s
Apache License 2.0

Grafana agent fails with hook failed: "send-remote-write-relation-joined" #182

Closed Sponge-Bas closed 8 months ago

Sponge-Bas commented 1 year ago

Bug Description

In SQA test run https://solutions.qa.canonical.com/v2/testruns/d6241418-2e99-4c93-95c0-51aad671b834, the grafana-agent unit fails with:

2023-04-25 15:31:21 DEBUG unit.zookeeper-agent/1.juju-log server.go:316 send-remote-write:17: Emitting custom event <PrometheusRemoteWriteEndpointsChangedEvent via GrafanaAgentMachineCharm/PrometheusRemoteWriteConsumer[send-remote-write]/on/endpoints_changed[68]>.
2023-04-25 15:31:21 ERROR unit.zookeeper-agent/1.juju-log server.go:316 send-remote-write:17: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-zookeeper-agent-1/charm/lib/charms/operator_libs_linux/v1/snap.py", line 309, in _snap_daemons
    return subprocess.run(_cmd, universal_newlines=True, check=True, capture_output=True)
  File "/usr/lib/python3.10/subprocess.py", line 524, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['snap', 'restart', 'grafana-agent']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-zookeeper-agent-1/charm/./src/charm.py", line 286, in restart
    self.snap.restart()
  File "/var/lib/juju/agents/unit-zookeeper-agent-1/charm/lib/charms/operator_libs_linux/v1/snap.py", line 424, in restart
    self._snap_daemons(args, services)
  File "/var/lib/juju/agents/unit-zookeeper-agent-1/charm/lib/charms/operator_libs_linux/v1/snap.py", line 311, in _snap_daemons
    raise SnapError("Could not {} for snap [{}]: {}".format(_cmd, self._name, e.stderr))
charms.operator_libs_linux.v1.snap.SnapError: Could not ['snap', 'restart', 'grafana-agent'] for snap [grafana-agent]: error: cannot perform the following tasks:
- Run service command "restart" for services ["grafana-agent"] of snap "grafana-agent" (systemctl command [start snap.grafana-agent.grafana-agent.service] failed with exit status 1: Job for snap.grafana-agent.grafana-agent.service failed because the control process exited with error code.
See "systemctl status snap.grafana-agent.grafana-agent.service" and "journalctl -xeu snap.grafana-agent.grafana-agent.service" for details.
)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-zookeeper-agent-1/charm/./src/charm.py", line 482, in <module>
    main(GrafanaAgentMachineCharm)
  File "/var/lib/juju/agents/unit-zookeeper-agent-1/charm/venv/ops/main.py", line 441, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-zookeeper-agent-1/charm/venv/ops/main.py", line 149, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-zookeeper-agent-1/charm/venv/ops/framework.py", line 354, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-zookeeper-agent-1/charm/venv/ops/framework.py", line 830, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-zookeeper-agent-1/charm/venv/ops/framework.py", line 919, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-zookeeper-agent-1/charm/lib/charms/prometheus_k8s/v0/prometheus_remote_write.py", line 673, in _handle_endpoints_changed
    self.on.endpoints_changed.emit(relation_id=event.relation.id)
  File "/var/lib/juju/agents/unit-zookeeper-agent-1/charm/venv/ops/framework.py", line 354, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-zookeeper-agent-1/charm/venv/ops/framework.py", line 830, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-zookeeper-agent-1/charm/venv/ops/framework.py", line 919, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-zookeeper-agent-1/charm/src/grafana_agent.py", line 333, in on_remote_write_changed
    self._update_config()
  File "/var/lib/juju/agents/unit-zookeeper-agent-1/charm/src/grafana_agent.py", line 403, in _update_config
    self.restart()
  File "/var/lib/juju/agents/unit-zookeeper-agent-1/charm/./src/charm.py", line 288, in restart
    raise GrafanaAgentServiceError("Failed to restart grafana-agent") from e
GrafanaAgentServiceError: Failed to restart grafana-agent
2023-04-25 15:31:21 ERROR juju.worker.uniter.operation runhook.go:153 hook "send-remote-write-relation-joined" (via hook dispatching script: dispatch) failed: exit status 1
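The traceback only says that `snap restart grafana-agent` exited non-zero and points to `journalctl` for details, so the actual cause is not visible in the Juju log. Below is a minimal sketch, not the charm's actual code, of how a restart helper could attach the tail of the service journal to the raised error. The names `restart_with_diagnostics`, `RestartError`, and the injectable `run` parameter are hypothetical; the unit name follows the `snap.grafana-agent.grafana-agent.service` pattern shown in the log above.

```python
import subprocess
from typing import Callable, List

# A runner takes a command list and returns a CompletedProcess (hypothetical type alias).
Runner = Callable[[List[str]], subprocess.CompletedProcess]


def _run(cmd: List[str]) -> subprocess.CompletedProcess:
    """Run a command, raising CalledProcessError on a non-zero exit status."""
    return subprocess.run(cmd, universal_newlines=True, check=True, capture_output=True)


class RestartError(Exception):
    """Hypothetical error type carrying the restart failure plus journal output."""


def restart_with_diagnostics(
    service: str = "grafana-agent", run: Runner = _run, journal_lines: int = 50
) -> None:
    """Restart a snap service; on failure, include recent journal lines in the error."""
    unit = f"snap.{service}.{service}.service"
    try:
        run(["snap", "restart", service])
    except subprocess.CalledProcessError as e:
        try:
            # Fetch the last few journal entries for the failed systemd unit.
            journal = run(
                ["journalctl", "--no-pager", "-n", str(journal_lines), "-u", unit]
            ).stdout
        except (subprocess.CalledProcessError, OSError) as journal_error:
            journal = f"(could not read journal: {journal_error})"
        raise RestartError(
            f"Failed to restart {service}: {e.stderr}\n--- journal ---\n{journal}"
        ) from e
```

With something like this in place, the hook would still fail, but the unit log would show why the service's control process exited, which is the information missing from this report.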

To Reproduce

Deploy the kafka bundle and relate it to cos on microk8s

Environment

Model       Controller        Cloud/Region        Version  SLA          Timestamp
controller  foundations-maas  maas_cloud/default  2.9.42   unsupported  15:31:46Z

Machine  State    Address         Inst id  Series  AZ     Message
0        started  10.246.164.122  juju1-7  focal   zone1  Deployed
1        started  10.246.167.8    juju1-8  focal   zone2  Deployed
2        started  10.246.166.182  juju2-9  focal   zone3  Deployed

Model  Controller        Cloud/Region        Version  SLA          Timestamp
kafka  foundations-maas  maas_cloud/default  2.9.42   unsupported  15:31:47Z

SAAS          Status  Store         URL
alertmanager  active  popocatepetl  admin/cos.alertmanager
grafana       active  popocatepetl  admin/cos.grafana
loki          active  popocatepetl  admin/cos.loki
prometheus    active  popocatepetl  admin/cos.prometheus

App                        Version  Status       Scale  Charm                      Channel           Rev  Exposed  Message
kafka                               blocked          3  kafka                      latest/edge       114  no       missing required zookeeper relation
kafka-agent                         maintenance      2  grafana-agent              latest/edge         8  no       Installing grafana-agent snap
ntp                        4.2      active           2  ntp                        latest/candidate   50  no       chrony: Ready
tls-certificates-operator           active           1  tls-certificates-operator  latest/edge        23  no       
zookeeper                           active           3  zookeeper                  latest/edge        98  no       
zookeeper-agent                     error            3  grafana-agent              latest/edge         8  no       hook failed: "send-remote-write-relation-joined"

Unit                          Workload     Agent      Machine  Public address  Ports    Message
kafka/0                       maintenance  executing  0        10.246.166.210           (install) installing charm software
kafka/1                       blocked      executing  1        10.246.167.131           missing required zookeeper relation
  kafka-agent/0*              maintenance  executing           10.246.167.131           (install) Installing grafana-agent snap
  ntp/0*                      active       executing           10.246.167.131  123/udp  (install) chrony: Ready
kafka/2*                      blocked      executing  2        10.246.165.79            missing required zookeeper relation
  kafka-agent/1               maintenance  executing           10.246.165.79            (install) Installing grafana-agent snap
  ntp/1                       active       executing           10.246.165.79   123/udp  (install) chrony: Ready
tls-certificates-operator/0*  active       idle       3        10.246.166.101           
zookeeper/0                   active       idle       4        10.246.167.32            
  zookeeper-agent/1           error        idle                10.246.167.32            hook failed: "send-remote-write-relation-joined"
zookeeper/1*                  active       executing  5        10.246.167.17            
  zookeeper-agent/0*          active       idle                10.246.167.17            
zookeeper/2                   active       idle       6        10.246.165.50            
  zookeeper-agent/2           active       idle                10.246.165.50            

Machine  State    Address         Inst id             Series  AZ     Message
0        started  10.246.166.210  vault2-5            jammy   zone1  Deployed
1        started  10.246.167.131  grafana2-3          jammy   zone3  Deployed
2        started  10.246.165.79   landscapeha-23-2-2  jammy   zone2  Deployed
3        started  10.246.166.101  vault1-7            jammy   zone1  Deployed
4        started  10.246.167.32   grafana1-3          jammy   zone3  Deployed
5        started  10.246.167.17   landscapeha-23-1-2  jammy   zone2  Deployed
6        started  10.246.165.50   microk8s1-4         jammy   zone1  Deployed

Model           Controller        Cloud/Region              Version  SLA          Timestamp
metallb-system  foundations-maas  microk8s_cloud/localhost  2.9.42   unsupported  15:31:48Z

App                 Version                         Status  Scale  Charm               Channel  Rev  Address         Exposed  Message
metallb-controller  res:metallb-controller-imag...  active      1  metallb-controller  stable    41  10.152.183.173  no       
metallb-speaker     res:metallb-speaker-image@6...  active      3  metallb-speaker     stable    36  10.152.183.144  no       

Unit                   Workload  Agent  Address         Ports     Message
metallb-controller/0*  active    idle   10.1.168.196    7472/TCP  
metallb-speaker/0*     active    idle   10.246.165.172  7472/TCP  
metallb-speaker/1      active    idle   10.246.165.74   7472/TCP  
metallb-speaker/2      active    idle   10.246.167.42   7472/TCP  

Model     Controller        Cloud/Region        Version  SLA          Timestamp
microk8s  foundations-maas  maas_cloud/default  2.9.42   unsupported  15:31:48Z

App       Version  Status  Scale  Charm     Channel  Rev  Exposed  Message
microk8s           active      3  microk8s  stable    35  yes      

Unit         Workload  Agent  Machine  Public address  Ports                     Message
microk8s/0*  active    idle   0        10.246.165.74   80/tcp,443/tcp,16443/tcp  
microk8s/1   active    idle   1        10.246.165.172  80/tcp,443/tcp,16443/tcp  
microk8s/2   active    idle   2        10.246.167.42   80/tcp,443/tcp,16443/tcp  

Machine  State    Address         Inst id      Series  AZ     Message
0        started  10.246.165.74   microk8s1-1  jammy   zone1  Deployed
1        started  10.246.165.172  microk8s1-2  jammy   zone2  Deployed
2        started  10.246.167.42   microk8s1-3  jammy   zone3  Deployed

Model  Controller    Cloud/Region              Version  SLA          Timestamp
cos    popocatepetl  microk8s_cloud/localhost  2.9.42   unsupported  15:31:49Z

App           Version  Status  Scale  Charm             Channel  Rev  Address         Exposed  Message
alertmanager  0.23.0   active      1  alertmanager-k8s  stable    47  10.152.183.150  no       
catalogue              active      1  catalogue-k8s     stable    13  10.152.183.94   no       
grafana       9.2.1    active      1  grafana-k8s       stable    64  10.152.183.192  no       
loki          2.4.1    active      1  loki-k8s          stable    60  10.152.183.41   no       
prometheus    2.33.5   active      1  prometheus-k8s    stable   103  10.152.183.184  no       
traefik       2.9.6    active      1  traefik-k8s       stable   110  10.246.167.226  no       

Unit             Workload  Agent      Address      Ports  Message
alertmanager/0*  active    idle       10.1.107.13         
catalogue/0*     active    idle       10.1.166.7          
grafana/0*       active    executing  10.1.166.13         
loki/0*          active    idle       10.1.107.14         
prometheus/0*    active    executing  10.1.166.12         
traefik/0*       active    idle       10.1.107.12         

Offer         Application   Charm             Rev  Connected  Endpoint              Interface                Role
alertmanager  alertmanager  alertmanager-k8s  47   0/0        karma-dashboard       karma_dashboard          provider
grafana       grafana       grafana-k8s       64   2/2        grafana-dashboard     grafana_dashboard        requirer
loki          loki          loki-k8s          60   2/2        logging               loki_push_api            provider
prometheus    prometheus    prometheus-k8s    103  2/2        metrics-endpoint      prometheus_scrape        requirer
                                                              receive-remote-write  prometheus_remote_write  provider

Relevant log output

See Bug Description. Crashdumps and other configs can be found [here](https://oil-jenkins.canonical.com/artifacts/d6241418-2e99-4c93-95c0-51aad671b834/index.html).

Additional context

No response

lucabello commented 8 months ago

This bug was found with revision 8 of the machine charm; latest/stable currently holds revision 20.

I couldn't reproduce the bug, so I think it's been fixed :) Closing the issue, but feel free to post again or open a new one if this happens again!