canonical / grafana-k8s-operator

https://charmhub.io/grafana-k8s
Apache License 2.0

Charm goes into error state on removal #138

Closed: sed-i closed this issue 1 year ago

sed-i commented 2 years ago

Bug Description

When I destroy the COS Lite model, grafana goes into error state:

App           Version  Status   Scale  Charm             Channel  Rev  Address         Exposed  Message
grafana       8.2.6    error        1  grafana-k8s       edge      45  10.152.183.227  no       hook failed: "grafana-dashboard-relation-broken"

Unit        Workload  Agent  Address      Ports  Message
grafana/0*  error     idle   10.1.13.151         hook failed: "grafana-dashboard-relation-broken" for alertmanager:grafana-dashboard

To Reproduce

  1. Deploy the COS Lite bundle (without traefik, for now).
  2. juju destroy-model --destroy-storage.

Environment

N/A.

Relevant log output

Traceback (most recent call last):
  File "./src/charm.py", line 1153, in <module>
    main(GrafanaCharm, use_juju_for_storage=True)
  File "/var/lib/juju/agents/unit-grafana-0/charm/venv/ops/main.py", line 438, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-grafana-0/charm/venv/ops/main.py", line 150, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-grafana-0/charm/venv/ops/framework.py", line 355, in emit
    framework._emit(event)  # noqa
  File "/var/lib/juju/agents/unit-grafana-0/charm/venv/ops/framework.py", line 856, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-grafana-0/charm/venv/ops/framework.py", line 931, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-grafana-0/charm/lib/charms/grafana_k8s/v0/grafana_dashboard.py", line 1311, in _on_grafana_dashboard_relation_broken
    self._remove_all_dashboards_for_relation(event.relation)
  File "/var/lib/juju/agents/unit-grafana-0/charm/lib/charms/grafana_k8s/v0/grafana_dashboard.py", line 1456, in _remove_all_dashboards_for_relation
    if self._get_stored_dashboards(relation.id):
  File "/var/lib/juju/agents/unit-grafana-0/charm/lib/charms/grafana_k8s/v0/grafana_dashboard.py", line 1489, in _get_stored_dashboards
    return self.get_peer_data("dashboards").get(str(relation_id), {})
  File "/var/lib/juju/agents/unit-grafana-0/charm/lib/charms/grafana_k8s/v0/grafana_dashboard.py", line 1504, in get_peer_data
    data = self._charm.peers.data[self._charm.app].get(key, "")  # type: ignore[attr-defined]
AttributeError: 'NoneType' object has no attribute 'data'
unit-grafana-0: 10:29:11.019 ERROR juju.worker.uniter.operation hook "grafana-source-relation-departed" (via hook dispatching script: dispatch) failed: exit status 1
unit-grafana-0: 10:29:10.013 DEBUG unit.grafana/0.juju-log grafana-source:12: Operator Framework 1.5.3+1.g26626e4 up and running.
unit-grafana-0: 10:29:10.156 DEBUG unit.grafana/0.juju-log grafana-source:12: Re-emitting <GrafanaDashboardsChanged via GrafanaCharm/GrafanaDashboardConsumer[grafana-dashboard]/on/dashboards_changed[237]>.
unit-grafana-0: 10:29:10.169 DEBUG unit.grafana/0.juju-log grafana-source:12: Pebble API is not ready; ConnectionError: [Errno 2] No such file or directory
unit-grafana-0: 10:29:10.182 INFO unit.grafana/0.juju-log grafana-source:12: Initializing dashboard provisioning path
unit-grafana-0: 10:29:10.196 WARNING unit.grafana/0.juju-log grafana-source:12: Could not push default dashboard configuration. Pebble shutting down?
unit-grafana-0: 10:29:10.209 DEBUG unit.grafana/0.juju-log grafana-source:12: Pebble API is not ready; ConnectionError: [Errno 2] No such file or directory
unit-grafana-0: 10:29:10.221 DEBUG unit.grafana/0.juju-log grafana-source:12: Cannot connect to Pebble yet, deferring event
unit-grafana-0: 10:29:10.233 DEBUG unit.grafana/0.juju-log grafana-source:12: Deferring <GrafanaDashboardsChanged via GrafanaCharm/GrafanaDashboardConsumer[grafana-dashboard]/on/dashboards_changed[237]>.
unit-grafana-0: 10:29:10.260 DEBUG unit.grafana/0.juju-log grafana-source:12: Re-emitting <GrafanaDashboardsChanged via GrafanaCharm/GrafanaDashboardConsumer[grafana-dashboard]/on/dashboards_changed[241]>.
unit-grafana-0: 10:29:10.273 DEBUG unit.grafana/0.juju-log grafana-source:12: Pebble API is not ready; ConnectionError: [Errno 2] No such file or directory
unit-grafana-0: 10:29:10.285 INFO unit.grafana/0.juju-log grafana-source:12: Initializing dashboard provisioning path
unit-grafana-0: 10:29:10.299 WARNING unit.grafana/0.juju-log grafana-source:12: Could not push default dashboard configuration. Pebble shutting down?
unit-grafana-0: 10:29:10.311 DEBUG unit.grafana/0.juju-log grafana-source:12: Pebble API is not ready; ConnectionError: [Errno 2] No such file or directory
unit-grafana-0: 10:29:10.323 DEBUG unit.grafana/0.juju-log grafana-source:12: Cannot connect to Pebble yet, deferring event
unit-grafana-0: 10:29:10.335 DEBUG unit.grafana/0.juju-log grafana-source:12: Deferring <GrafanaDashboardsChanged via GrafanaCharm/GrafanaDashboardConsumer[grafana-dashboard]/on/dashboards_changed[241]>.
unit-grafana-0: 10:29:10.359 DEBUG unit.grafana/0.juju-log grafana-source:12: Re-emitting <GrafanaDashboardsChanged via GrafanaCharm/GrafanaDashboardConsumer[grafana-dashboard]/on/dashboards_changed[245]>.

Additional context

No response

rbarry82 commented 2 years ago

@sed-i I can't reproduce this, and the log doesn't make any sense. Was this done with --force? The error is actually in grafana-source as well, but it somehow looks like there is no peer relation at all, and the peer data bag should absolutely not be getting cleaned up while the charm still exists.

If you can reproduce this, can you capture the events and/or the whole log? This doesn't provide enough information other than a guess that the peer relation was somehow broken/departed before the other events fired.
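
For reference, the failing call chain in the traceback boils down to something like the sketch below. It is a minimal reconstruction rather than the actual library code: the peers property name comes straight from the traceback, and the assumption is that it resolves via self.model.get_relation(...), which returns None once the peer relation is gone.

import json

class GrafanaDashboardConsumerSketch:
    # Hypothetical stand-in for the consumer object in the traceback, not the
    # real GrafanaDashboardConsumer implementation.
    def __init__(self, charm):
        self._charm = charm

    def get_peer_data(self, key):
        # Assumed: self._charm.peers is self.model.get_relation("<peer endpoint>"),
        # which is None while the peer relation is being torn down. That makes the
        # lookup below effectively None.data[...] -> AttributeError, as in the log.
        rel = self._charm.peers
        data = rel.data[self._charm.app].get(key, "")
        return json.loads(data) if data else {}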

sed-i commented 2 years ago

Can't reproduce on 2cpu7gb + juju 2.9.29. Closing.

sed-i commented 1 year ago

This happened again with grafana rev. 53 on 4cpu-8gb.

  1. Deploy the cos bundle
  2. juju remove-application --destroy-storage grafana

unit-grafana-0: 18:46:31 DEBUG unit.grafana/0.juju-log grafana-source:9: Emitting Juju event grafana_source_relation_departed.
unit-grafana-0: 18:46:31 DEBUG unit.grafana/0.juju-log grafana-source:9: Removing all data for relation: 9
unit-grafana-0: 18:46:31 ERROR unit.grafana/0.juju-log grafana-source:9: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "./src/charm.py", line 1163, in <module>
    main(GrafanaCharm, use_juju_for_storage=True)
  File "/var/lib/juju/agents/unit-grafana-0/charm/venv/ops/main.py", line 438, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-grafana-0/charm/venv/ops/main.py", line 150, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-grafana-0/charm/venv/ops/framework.py", line 355, in emit
    framework._emit(event)  # noqa
  File "/var/lib/juju/agents/unit-grafana-0/charm/venv/ops/framework.py", line 856, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-grafana-0/charm/venv/ops/framework.py", line 931, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-grafana-0/charm/lib/charms/grafana_k8s/v0/grafana_source.py", line 608, in _on_grafana_source_relation_departed
    removed_source = self._remove_source_from_datastore(event)
  File "/var/lib/juju/agents/unit-grafana-0/charm/lib/charms/grafana_k8s/v0/grafana_source.py", line 623, in _remove_source_from_datastore
    stored_sources = self.get_peer_data("sources")
  File "/var/lib/juju/agents/unit-grafana-0/charm/lib/charms/grafana_k8s/v0/grafana_source.py", line 722, in get_peer_data
    data = self._charm.peers.data[self._charm.app].get(key, "")  # type: ignore[attr-defined]
AttributeError: 'NoneType' object has no attribute 'data'
unit-grafana-0: 18:46:32 ERROR juju.worker.uniter.operation hook "grafana-source-relation-departed" (via hook dispatching script: dispatch) failed: exit status 1

rbarry82 commented 1 year ago

This is exactly the same as the last one. This traceback is insufficient, and it still looks like a Juju bug; the peer relation should never be gone. I still can't reproduce this. If you can, please capture the state of the model/application (including all relations, peer relations included) and submit it.
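
To isolate the charm-side failure from whatever Juju is doing to the peer relation, something along these lines should trip the same AttributeError under the ops test harness. MyCharm, the replicas peer endpoint, and the handler body are hypothetical stand-ins rather than the real grafana-k8s code, and it assumes an ops release that ships Harness.remove_relation:

import json
from ops.charm import CharmBase
from ops.testing import Harness

META = """
name: my-charm
peers:
  replicas:
    interface: my_replicas
requires:
  dashboard:
    interface: grafana_dashboard
"""

class MyCharm(CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        self.framework.observe(self.on.dashboard_relation_broken, self._on_broken)

    @property
    def peers(self):
        # None whenever the peer relation does not exist (never added, or torn down).
        return self.model.get_relation("replicas")

    def _on_broken(self, event):
        # Mirrors the library's get_peer_data(): assumes peers is never None.
        raw = self.peers.data[self.app].get("dashboards", "")
        self._dashboards = json.loads(raw) if raw else {}

harness = Harness(MyCharm, meta=META)
harness.begin()
rel_id = harness.add_relation("dashboard", "prometheus")
# No peer relation was added, so tearing down the dashboard relation fires
# relation-broken and the handler blows up exactly like the traceback above.
try:
    harness.remove_relation(rel_id)
except AttributeError as err:
    print(f"reproduced: {err}")  # 'NoneType' object has no attribute 'data'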

sed-i commented 1 year ago

So juju status is stuck on 0/1 for all apps

Model  Controller  Cloud/Region        Version  SLA          Timestamp       Notes
m8     chdv2934    microk8s/localhost  2.9.34   unsupported  13:16:25-05:00  attempt 13 to destroy model failed (will retry):  model not empty, found 5 applications (model not empty)

App         Version  Status      Scale  Charm           Channel  Rev  Address         Exposed  Message
catalogue            active        0/1  catalogue-k8s   edge       4  10.152.183.88   no       
grafana     9.2.1    terminated    0/1  grafana-k8s                0  10.152.183.62   no       unit stopped by the cloud
loki                 unknown       0/1  loki-k8s        edge      47  10.152.183.212  no       
prometheus           unknown       0/1  prometheus-k8s             0  10.152.183.19   no       
traefik              unknown       0/1  traefik-k8s     edge      93  192.168.1.10    no       

Unit       Workload  Agent  Address     Ports  Message
grafana/0  unknown   lost   10.1.55.13         agent lost, see 'juju show-status-log grafana/0'

Relation provider             Requirer                     Interface           Type     Message
catalogue:catalogue           grafana:catalogue            catalogue           regular  
grafana:metrics-endpoint      prometheus:metrics-endpoint  prometheus_scrape   regular  
loki:grafana-source           grafana:grafana-source       grafana_datasource  regular  
prometheus:grafana-dashboard  grafana:grafana-dashboard    grafana_dashboard   regular  
prometheus:grafana-source     grafana:grafana-source       grafana_datasource  regular  
traefik:traefik-route         grafana:ingress              traefik_route       regular  

but there's nothing left:

$ k get pods -n m8
NAME                             READY   STATUS    RESTARTS   AGE
modeloperator-695c98c5f8-t22ps   1/1     Running   0          59m

When I forcefully remove grafana, it all unlocks and clears out.

rbarry82 commented 1 year ago

Ok. But can you please collect the requested information? This shows a missing peer relation at the bottom, but can you get the raw model data and the app data for Grafana?

sed-i commented 1 year ago

Collected show-application, show-unit and status, before and after running destroy-model --destroy-storage. status.zip

rbarry82 commented 1 year ago

Before:

{
    "grafana": {                                             
      "charm": "local:focal/grafana-k8s-0",
      "series": "kubernetes",                                
      "os": "kubernetes",                                    
      "charm-origin": "local",                               
      "charm-name": "grafana-k8s",
      "charm-rev": 0,                                        
      "scale": 1,                                            
      "provider-id": "49331306-9bd9-422a-9792-3c0203e449ba",
      "address": "10.152.183.127",
      "exposed": false,                                      
      "application-status": {                                
        "current": "active",                                 
        "since": "23 Nov 2022 13:26:33-05:00"
      },                                                     
      "relations": {                                         
        "catalogue": [                                       
          "catalogue"                                        
        ],                                                   
        "grafana": [  # <---- HERE                                       
          "grafana"                                          
        ],                                                   
        "grafana-dashboard": [                               
          "alertmanager",                                    
          "loki",                                            
          "prometheus"                                       
        ],                                                   
        "grafana-source": [                                  
          "alertmanager",                                    
          "loki",                                            
          "prometheus"                                       
        ],                                                   
        "ingress": [                                         
          "traefik"                                          
        ],                                                   
        "metrics-endpoint": [                                
          "prometheus"                                       
        ]                                                    
      },                                                     
      # ...
}

After:

    "grafana": {
      "charm": "local:focal/grafana-k8s-0",
      "series": "kubernetes",
      "os": "kubernetes",
      "charm-origin": "local",
      "charm-name": "grafana-k8s",
      "charm-rev": 0,
      "scale": 1,
      "provider-id": "49331306-9bd9-422a-9792-3c0203e449ba",
      "address": "10.152.183.127",
      "exposed": false,
      "life": "dying",
      "application-status": {
        "current": "error",
        "message": "hook failed: \"grafana-dashboard-relation-broken\"",
        "since": "23 Nov 2022 16:39:29-05:00"
      },
      "relations": {
        "catalogue": [
          "catalogue"
        ],
        "grafana-dashboard": [
          "alertmanager",
          "loki",
          "prometheus"
        ],
        "grafana-source": [
          "alertmanager",
          "loki",
          "prometheus"
        ],
        "ingress": [
          "traefik"
        ],
        "metrics-endpoint": [
          "prometheus"
        ]
      },
      ...

There is, in fact, no peer relation at all. This should never happen, and it is not a Grafana bug. Let's take this to Launchpad/Juju: either the contract changed or there is a bug there, because "the charm departed the peer relation before everything else" did not happen for the 9 months prior to this, there have been no changelog entries about it, and it violates fundamental assumptions.
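
In the meantime, a guard in the library's get_peer_data would keep the charm out of error state during teardown, at the cost of papering over whatever Juju is doing to the peer relation. A rough sketch against the grafana_source traceback (the real method may differ):

import json

class GrafanaSourceConsumerSketch:
    # Hypothetical stand-in for the object in the grafana_source traceback,
    # not the real library code.
    def __init__(self, charm):
        self._charm = charm

    def get_peer_data(self, key):
        # Defensive variant: a missing peer relation means "nothing stored" rather
        # than an AttributeError, so relation-departed/broken hooks survive teardown.
        rel = self._charm.peers
        if rel is None:
            return {}
        raw = rel.data[self._charm.app].get(key, "")
        return json.loads(raw) if raw else {}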

sed-i commented 1 year ago

Posted here: https://bugs.launchpad.net/juju/+bug/1998282