Closed: @Osama-Kassem closed this issue 5 months ago.
I am observing a similar issue on GitHub runners.
openfga-k8s enters an error state with the message `openfga-k8s/0 [idle] error: crash loop backoff: back-off 2m40s restarting failed` when postgres takes a long time to spin up. After postgres turns active, openfga gets stuck waiting to connect to its container, failing the `update-status` hook:
```
unit-openfga-k8s-0: 16:31:46 ERROR unit.openfga-k8s/0.juju-log openfga:14: openfga is not running
<...>
unit-openfga-k8s-0: 16:42:49 ERROR unit.openfga-k8s/0.juju-log Cannot connect to container openfga
unit-openfga-k8s-0: 16:42:49 ERROR unit.openfga-k8s/0.juju-log Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-openfga-k8s-0/charm/./src/charm.py", line 625, in <module>
    main(OpenFGAOperatorCharm)
  File "/var/lib/juju/agents/unit-openfga-k8s-0/charm/venv/ops/main.py", line 456, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-openfga-k8s-0/charm/venv/ops/main.py", line 144, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-openfga-k8s-0/charm/venv/ops/framework.py", line 351, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-openfga-k8s-0/charm/venv/ops/framework.py", line 853, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-openfga-k8s-0/charm/venv/ops/framework.py", line 943, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-openfga-k8s-0/charm/./src/charm.py", line 211, in _on_update_status
    self._ready()
  File "/var/lib/juju/agents/unit-openfga-k8s-0/charm/./src/charm.py", line 448, in _ready
    if self._migration_is_needed():
  File "/var/lib/juju/agents/unit-openfga-k8s-0/charm/./src/charm.py", line 424, in _migration_is_needed
    return getattr(self._state, key, None) != self.openfga.get_version()
  File "/var/lib/juju/agents/unit-openfga-k8s-0/charm/src/openfga.py", line 39, in get_version
    _, stderr = self._run_cmd(cmd)
  File "/var/lib/juju/agents/unit-openfga-k8s-0/charm/src/openfga.py", line 84, in _run_cmd
    process = self.container.exec(cmd, stdin=input_, environment=environment, timeout=timeout)
  File "/var/lib/juju/agents/unit-openfga-k8s-0/charm/venv/ops/model.py", line 2719, in exec
    return self._pebble.exec(
  File "/var/lib/juju/agents/unit-openfga-k8s-0/charm/venv/ops/pebble.py", line 2600, in exec
    resp = self._request('POST', '/v1/exec', body=body)
  File "/var/lib/juju/agents/unit-openfga-k8s-0/charm/venv/ops/pebble.py", line 1754, in _request
    response = self._request_raw(method, path, query, headers, data)
  File "/var/lib/juju/agents/unit-openfga-k8s-0/charm/venv/ops/pebble.py", line 1803, in _request_raw
    raise ConnectionError(
ops.pebble.ConnectionError: Could not connect to Pebble: socket not found at '/charm/containers/openfga/pebble.socket' (container restarted?)
unit-openfga-k8s-0: 16:42:49 ERROR juju.worker.uniter.operation hook "update-status" (via hook dispatching script: dispatch) failed: exit status 1
```
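The traceback dies inside the `update-status` hook because the charm calls `container.exec()` while the Pebble socket is gone. A common defensive pattern in ops charms is to check `Container.can_connect()` first and set a waiting status instead of crashing. A minimal runnable sketch of that pattern follows; `can_connect()` and `exec()` are real `ops.model.Container` methods, but the `StubContainer` and status classes below are stand-ins so the sketch runs on its own, not the charm's actual code:

```python
# Sketch: guard Pebble calls in update-status so a restarting container
# yields a maintenance status instead of an uncaught
# ops.pebble.ConnectionError. Stubs stand in for the real ops objects.

class MaintenanceStatus:
    def __init__(self, message: str):
        self.message = message

class ActiveStatus:
    def __init__(self, message: str = ""):
        self.message = message

class StubContainer:
    """Minimal stand-in for ops.model.Container."""
    def __init__(self, up: bool):
        self._up = up

    def can_connect(self) -> bool:
        # The real ops method probes the Pebble socket; here we report a flag.
        return self._up

    def exec(self, cmd):
        if not self._up:
            raise ConnectionError("socket not found")
        return "openfga version output"

def on_update_status(container):
    # Bail out early instead of letting container.exec() raise.
    if not container.can_connect():
        return MaintenanceStatus("waiting for Pebble in the openfga container")
    container.exec(["openfga", "version"])
    return ActiveStatus()
```

In the real charm the guard would sit at the top of `_on_update_status` (or `_ready`), before `_migration_is_needed()` triggers the `exec` call shown in the traceback.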
This was only observed on GitHub runners (see https://github.com/canonical/identity-platform-admin-ui-operator/actions/runs/8051508521/job/21991817378), and wasn't reproducible on an EC2 t2.medium instance. This might be a race condition, as waiting for openfga and postgres to become active before adding other relations fixed the issue.
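The "wait for active before relating" workaround can be sketched as a generic polling helper. This is only an illustration of the timing fix, not project code; the `get_status` callable in the usage line is hypothetical (e.g. something wrapping `juju status --format=json`):

```python
import time

def wait_until(predicate, timeout: float = 600.0, interval: float = 5.0) -> bool:
    """Poll `predicate` until it returns True or `timeout` seconds elapse.

    Returns True if the predicate succeeded, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False
```

Usage in a deployment script would look something like `wait_until(lambda: get_status("openfga-k8s") == "active" and get_status("postgresql-k8s") == "active")` before issuing the extra `juju relate` calls.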
@natalian98 What channel are you deploying the charm from? The original issue was using the charm from `latest`, whereas if you are running `1.0` you might be facing a separate issue.
@kian99 I'm deploying the `latest/edge` version.
Thank you for reporting your feedback!
The internal ticket has been created: https://warthogs.atlassian.net/browse/IAM-706.
This message was autogenerated
I tried reproducing it again: the charm no longer gets into an error state when deploying from `latest/edge`, but I'm observing charm slowness on juju agent version 3.1.0.
In that case, when postgres takes a long time waiting for its primary endpoint to be ready, openfga gets blocked with the message `Please run schema-upgrade action`. It eventually becomes active without intervention (see logs).
This agent version might be broken, as the deployment is smooth on 3.1.7.
To reproduce:

```
juju deploy openfga-k8s --channel edge
juju deploy postgresql-k8s --channel 14/stable
juju relate openfga-k8s postgresql-k8s
```
This should be fixed by https://github.com/canonical/openfga-operator/pull/42
When deploying the charm using Terraform, I noticed that if I deploy and relate the charm in the same `terraform apply` command, the charm gets permanently stuck in an error state. After some investigation (thanks @kian99), it seems that the charm's relation-changed hook fires and tries to call the OpenFGA app, which may or may not have started yet. If OpenFGA is deployed and juju is given enough time before the relation is added, the error does not happen.
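One way a charm can tolerate this ordering is to defer the relation event until the workload container is reachable. `event.defer()` and `Container.can_connect()` are real ops APIs, but the sketch below uses stubs so it runs standalone; it is a pattern illustration, not the openfga-operator's actual handler:

```python
# Sketch: defer relation-changed until the workload is reachable, so the
# hook is retried on a later dispatch instead of crashing the charm.

class StubEvent:
    """Stand-in for an ops relation event."""
    def __init__(self):
        self.deferred = False

    def defer(self):
        # The real ops framework re-queues the event for a later dispatch.
        self.deferred = True

class StubContainer:
    """Stand-in for ops.model.Container."""
    def __init__(self, up: bool):
        self._up = up

    def can_connect(self) -> bool:
        return self._up

def on_relation_changed(event, container) -> bool:
    if not container.can_connect():
        # OpenFGA may not be started yet; retry later instead of
        # calling the app and ending up in an error state.
        event.defer()
        return False
    # ...safe to talk to the OpenFGA app here...
    return True
```

With this guard, deploying and relating in a single `terraform apply` would just delay configuration until a later hook run rather than leaving the unit in error.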