canonical / traefik-k8s-operator

This charmed operator automates the operational procedures of running Traefik, an open-source application proxy.
https://charmhub.io/traefik-k8s
Apache License 2.0

Permission denied error when providing ingress requirements in IPU #81

PietroPasotti opened this issue 2 years ago (status: Open)

PietroPasotti commented 2 years ago

### Bug Description

A `ModelError` is raised when prometheus tries to access `self.ingress.relation.data`: `self.relation` simply returns `self.relations[0]`, and, apparently, `relations` contains not one but two relations, the first of which is a ghost (possibly a Juju bug?).

The second relation is the one we want.
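
For illustration, here is a minimal, self-contained sketch of that failure mode (the class and names are hypothetical, simplified from what the library appears to do, not its exact code):

```python
# Sketch: the requirer assumes the first relation on the endpoint is the
# live one. Names here are illustrative, not the library's actual API.

class IngressRequirerSketch:
    def __init__(self, relations):
        # In the real library this list comes from the charm's model,
        # roughly self.charm.model.relations["ingress"].
        self.relations = relations

    @property
    def relation(self):
        # "self.relation simply returns self.relations[0]": if Juju still
        # lists a half-removed relation first, this returns the ghost.
        return self.relations[0] if self.relations else None


ghost = "relation-0 -> removed traefik-k8s app (dead databag)"
live = "relation-1 -> trfk (the one we want)"
print(IngressRequirerSketch([ghost, live]).relation)  # picks the ghost
```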

### To Reproduce

```
juju deploy prometheus-k8s --channel beta
juju deploy traefik-k8s --channel edge
juju relate prometheus-k8s:ingress traefik-k8s
```

```
juju remove-application traefik-k8s
juju deploy traefik-k8s trfk --channel edge
juju relate prometheus-k8s:ingress trfk
```

### Environment

edge

### Relevant log output

File "/var/lib/juju/agents/unit-prometheus-k8s-0/charm/lib/charms/traefik_k8s/v1/ingress_per_unit.py", line 70
3, in _handle_relation                                                                                          
    self._publish_auto_data(typing.cast(Relation, event.relation))                                              
  File "/var/lib/juju/agents/unit-prometheus-k8s-0/charm/lib/charms/traefik_k8s/v1/ingress_per_unit.py", line 71
1, in _publish_auto_data                                                                                        
    self.provide_ingress_requirements(host=self._host, port=self._port)                                         
  File "/var/lib/juju/agents/unit-prometheus-k8s-0/charm/lib/charms/traefik_k8s/v1/ingress_per_unit.py", line 75
1, in provide_ingress_requirements                                                                              
    self.relation.data[self.unit].update(data)                                                                  
  File "/usr/lib/python3.8/_collections_abc.py", line 832, in update
    self[key] = other[key]
  File "/var/lib/juju/agents/unit-prometheus-k8s-0/charm/venv/ops/model.py", line 938, in __setitem__
    self._backend.relation_set(self.relation.id, key, value, self._is_app)
  File "/var/lib/juju/agents/unit-prometheus-k8s-0/charm/venv/ops/model.py", line 2137, in relation_set
    return self._run(*args)
  File "/var/lib/juju/agents/unit-prometheus-k8s-0/charm/venv/ops/model.py", line 2036, in _run
    raise ModelError(e.stderr)
ops.model.ModelError: b'ERROR cannot read relation settings: permission denied\n'


### Additional context

Unclear whether this is a Juju bug or not.
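
For what it's worth, one possible charm-side mitigation would be to stop writing blindly to `relations[0]`. A sketch, assuming a hypothetical helper (`publish_to_live_relation` is not part of the library) and that Juju relation ids increase monotonically:

```python
# Hypothetical helper, not the library's actual fix: try each ingress
# relation, newest first, and publish to the first databag Juju still lets
# us write. Writes to the "ghost" relation fail with the permission-denied
# ModelError seen in the traceback above.

from ops.model import ModelError

def publish_to_live_relation(relations, unit, data):
    for rel in sorted(relations, key=lambda r: r.id, reverse=True):
        try:
            rel.data[unit].update(data)  # the same call that blows up above
            return rel
        except ModelError:
            continue  # stale relation left behind by the removed application
    return None
```

This would only paper over the symptom, of course: if a listed-but-dead relation really is a Juju inconsistency, the real fix belongs in Juju (or in the library's relation selection).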
rbarry82 commented 2 years ago

I haven't actually tried reproducing these steps, but my spidey sense says the first application hasn't finished removing by the time the second is related.

PietroPasotti commented 2 years ago

> I haven't actually tried reproducing these steps, but my spidey sense says the first application hasn't finished removing by the time the second is related.

Mmmh, that would be possible. However, as far as I can tell, the relation was gone (`juju status` does not show it, and the whole application was also "gone" in the same sense). I do realize that Juju might think otherwise, though... My common sense suggests that if accessing the data gives an error (the data is gone), then accessing the relation should too. That would be an inconsistency in Juju: it should unlist the relation before it gets rid of the data (or do both "simultaneously").

rbarry82 commented 2 years ago

I agree that it could be a Juju bug. Last time we encountered something similar (applications which were "stuck" and could not be removed in some scenarios), I went through a lot of trace logging in Juju, dumped the database in a bad state, etc. Ultimately, it's Mongo, and there aren't any cascading deletes or strict referential integrity. Juju refcounts inside Mongo documents to know when it's safe to remove an object, and it's spread across a couple of documents.

The trace logging in Juju is... a lot, and I'm not a Juju developer, so determining exactly which loggers I needed to enable was a bit of trial and error, and it's been a couple of months. That particular exception says to me "that relation data still exists, but your application isn't marked as part of that relation, so go away". That could be down to the async nature of the way things are handled: Juju uses its own transaction queue for Mongo to provide assurances around data integrity, so "remove this relation data as part of cleanup" may have been queued up as part of an operation where the tombstone was set to dying, while the relation itself can't be removed until these events are sent. It doesn't help that, as far as I can remember, relation-broken provides no contract around whether the data should still exist; the wording "as if this relation never existed" implies it should not.

Either way, it would not be the first "ghost/zombie" we've seen in Juju if something went wrong there.
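
If the reading above is right and relation-broken makes no promises about the databag, a requirer probably shouldn't assume the data is still accessible in its relation-broken handler. A defensive sketch (hypothetical charm, assuming an `ingress` endpoint declared in metadata; not the library's code):

```python
from ops.charm import CharmBase, RelationBrokenEvent
from ops.model import ModelError

class RequirerCharmSketch(CharmBase):
    """Hypothetical requirer charm that tolerates unreadable relation data."""

    def __init__(self, *args):
        super().__init__(*args)
        self.framework.observe(
            self.on.ingress_relation_broken, self._on_ingress_broken
        )

    def _on_ingress_broken(self, event: RelationBrokenEvent):
        # If "as if this relation never existed" holds, the databag may
        # already be inaccessible here, so guard every access.
        try:
            event.relation.data[self.unit].clear()
        except ModelError:
            pass  # Juju already revoked access; nothing left to clean up
```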