canonical / openfga-operator

Charmed OpenFGA
https://charmhub.io/openfga-k8s
Apache License 2.0

Race condition when relating quickly after deploying #25

Closed Osama-Kassem closed 5 months ago

Osama-Kassem commented 1 year ago

When deploying the charm using Terraform, I noticed that if I deploy and relate the charm in the same terraform apply, the charm gets permanently stuck in an error state.

After some investigation (thanks @kian99) it seems that the charm's relation-changed hook fires and tries to call the OpenFGA app, which may or may not have started yet.

If OpenFGA is deployed and Juju is given enough time before the relation is added, the error does not happen.
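A common way to avoid this kind of race in ops-based charms is to check workload readiness at the top of the handler and defer the event when the container is not yet reachable. The sketch below is hypothetical: FakeContainer and FakeEvent are stand-ins for the real ops Container and event objects, though can_connect() and defer() mirror the actual ops API.

```python
# Hedged sketch: defer the relation-changed event until the workload is up.
# FakeContainer/FakeEvent are stand-ins for the real ops objects; a real
# charm would use self.unit.get_container("openfga") and the event object
# passed to the observer.

def handle_relation_changed(container, event):
    """Defer if the workload container is not reachable yet."""
    if not container.can_connect():
        event.defer()  # Juju will re-emit this event later
        return "deferred"
    # ...safe to call the OpenFGA app here...
    return "handled"


class FakeContainer:
    def __init__(self, up: bool):
        self._up = up

    def can_connect(self) -> bool:
        return self._up


class FakeEvent:
    def __init__(self):
        self.deferred = False

    def defer(self):
        self.deferred = True
```

With this guard, a relation-changed fired before OpenFGA has started is retried later instead of crashing the hook.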

natalian98 commented 7 months ago

I am observing a similar issue on GitHub runners. Openfga-k8s gets into an error state with the message "openfga-k8s/0 [idle] error: crash loop backoff: back-off 2m40s restarting failed" when postgres takes a long time to spin up. After postgres turns active, openfga gets stuck waiting to connect to its container, failing the "update-status" hook:

unit-openfga-k8s-0: 16:31:46 ERROR unit.openfga-k8s/0.juju-log openfga:14: openfga is not running
<...>
unit-openfga-k8s-0: 16:42:49 ERROR unit.openfga-k8s/0.juju-log Cannot connect to container openfga
unit-openfga-k8s-0: 16:42:49 ERROR unit.openfga-k8s/0.juju-log Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-openfga-k8s-0/charm/./src/charm.py", line 625, in <module>
    main(OpenFGAOperatorCharm)
  File "/var/lib/juju/agents/unit-openfga-k8s-0/charm/venv/ops/main.py", line 456, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-openfga-k8s-0/charm/venv/ops/main.py", line 144, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-openfga-k8s-0/charm/venv/ops/framework.py", line 351, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-openfga-k8s-0/charm/venv/ops/framework.py", line 853, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-openfga-k8s-0/charm/venv/ops/framework.py", line 943, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-openfga-k8s-0/charm/./src/charm.py", line 211, in _on_update_status
    self._ready()
  File "/var/lib/juju/agents/unit-openfga-k8s-0/charm/./src/charm.py", line 448, in _ready
    if self._migration_is_needed():
  File "/var/lib/juju/agents/unit-openfga-k8s-0/charm/./src/charm.py", line 424, in _migration_is_needed
    return getattr(self._state, key, None) != self.openfga.get_version()
  File "/var/lib/juju/agents/unit-openfga-k8s-0/charm/src/openfga.py", line 39, in get_version
    _, stderr = self._run_cmd(cmd)
  File "/var/lib/juju/agents/unit-openfga-k8s-0/charm/src/openfga.py", line 84, in _run_cmd
    process = self.container.exec(cmd, stdin=input_, environment=environment, timeout=timeout)
  File "/var/lib/juju/agents/unit-openfga-k8s-0/charm/venv/ops/model.py", line 2719, in exec
    return self._pebble.exec(
  File "/var/lib/juju/agents/unit-openfga-k8s-0/charm/venv/ops/pebble.py", line 2600, in exec
    resp = self._request('POST', '/v1/exec', body=body)
  File "/var/lib/juju/agents/unit-openfga-k8s-0/charm/venv/ops/pebble.py", line 1754, in _request
    response = self._request_raw(method, path, query, headers, data)
  File "/var/lib/juju/agents/unit-openfga-k8s-0/charm/venv/ops/pebble.py", line 1803, in _request_raw
    raise ConnectionError(
ops.pebble.ConnectionError: Could not connect to Pebble: socket not found at '/charm/containers/openfga/pebble.socket' (container restarted?)
unit-openfga-k8s-0: 16:42:49 ERROR juju.worker.uniter.operation hook "update-status" (via hook dispatching script: dispatch) failed: exit status 1

This was only observed on GitHub runners (see https://github.com/canonical/identity-platform-admin-ui-operator/actions/runs/8051508521/job/21991817378), but wasn't reproducible on an EC2 t2.medium instance. This might be a race condition, as waiting for openfga and postgres to become active before adding other relations fixed the above issue.
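One way to avoid the crash in the traceback above is to treat a Pebble connection failure during the version check as "not ready yet" rather than letting the exception escape update-status. This is a minimal hypothetical sketch, not the charm's actual code: PebbleConnectionError stands in for ops.pebble.ConnectionError, and get_version for the charm's OpenFGA version lookup.

```python
# Hedged sketch: swallow the Pebble connection error during update-status
# and report "waiting" instead of letting the hook crash.

class PebbleConnectionError(Exception):
    """Stand-in for ops.pebble.ConnectionError."""


def check_version(get_version):
    """Return (status, version); status is "waiting" if Pebble is unreachable."""
    try:
        return ("active", get_version())
    except PebbleConnectionError:
        # The container (or its Pebble socket) is not up yet; retry on the
        # next update-status instead of raising.
        return ("waiting", None)
```

The same guard could sit around any container.exec() call so that a restarted container results in a waiting status rather than a failed hook.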

kian99 commented 7 months ago

@natalian98 What channel are you deploying the charm from? The original issue used the charm from latest; if you are running 1.0 you might be facing a separate issue.

natalian98 commented 7 months ago

@kian99 I'm deploying the latest/edge version

syncronize-issues-to-jira[bot] commented 7 months ago

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/IAM-706.

This message was autogenerated

natalian98 commented 5 months ago

I tried reproducing it again - the charm no longer gets into an error state when deploying from latest/edge, but I'm observing charm slowness on juju agent version 3.1.0. In that case, when postgres takes a long time awaiting the primary endpoint to become ready, openfga gets blocked with the message Please run schema-upgrade action. It eventually becomes active without intervention (see logs). This agent version might be broken, as the deployment is smooth on 3.1.7.

To reproduce:

juju deploy openfga-k8s --channel edge
juju deploy postgresql-k8s --channel 14/stable
juju relate openfga-k8s postgresql-k8s
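The workaround mentioned earlier (waiting for postgres to become active before relating) can be scripted with a small polling helper. This is a generic sketch: get_status is a stand-in for however you read the application status, e.g. by parsing juju status output.

```python
import time

def wait_for_status(get_status, want="active", timeout=600, interval=5):
    """Poll get_status() until it returns `want`.

    Returns True on success, False if the timeout elapses first.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() <= deadline:
        if get_status() == want:
            return True
        time.sleep(interval)
    return False
```

Calling this between the two deploys and the relate step would serialize the deployment the same way the manual workaround does.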

nsklikas commented 5 months ago

This should be fixed by https://github.com/canonical/openfga-operator/pull/42