canonical / cos-configuration-k8s-operator

https://charmhub.io/cos-configuration-k8s
Apache License 2.0

possible race condition in the charm #84

Open nobuto-m opened 7 months ago

nobuto-m commented 7 months ago

Bug Description

There appears to be a race condition during deployment that can leave the charm in an error state.

unit-cos-configuration-ceph-0: 13:06:39 INFO juju.worker.uniter found queued "install" hook
unit-cos-configuration-ceph-0: 13:06:40 DEBUG unit.cos-configuration-ceph/0.juju-log ops 2.4.1 up and running.
unit-cos-configuration-ceph-0: 13:06:40 INFO unit.cos-configuration-ceph/0.juju-log Running legacy hooks/install.
unit-cos-configuration-ceph-0: 13:06:40 DEBUG unit.cos-configuration-ceph/0.juju-log ops 2.4.1 up and running.
unit-cos-configuration-ceph-0: 13:06:40 DEBUG unit.cos-configuration-ceph/0.juju-log Charm called itself via hooks/install.
unit-cos-configuration-ceph-0: 13:06:40 DEBUG unit.cos-configuration-ceph/0.juju-log Legacy hooks/install exited with status 0.
unit-cos-configuration-ceph-0: 13:06:41 ERROR unit.cos-configuration-ceph/0.juju-log Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-cos-configuration-ceph-0/charm/venv/ops/model.py", line 2693, in _run
    result = subprocess.run(args, **kwargs)  # type: ignore
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '('/var/lib/juju/tools/unit-cos-configuration-ceph-0/storage-get', '-s', 'content-from-git/1', 'location', '--format=json')' returned non-zero exit status 2.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./src/charm.py", line 473, in <module>
    main(COSConfigCharm, use_juju_for_storage=True)
  File "/var/lib/juju/agents/unit-cos-configuration-ceph-0/charm/venv/ops/main.py", line 429, in main
    charm = charm_class(framework)
  File "./src/charm.py", line 84, in __init__
    self._git_sync_mount_point = self.model.storages["content-from-git"][0].location
  File "/var/lib/juju/agents/unit-cos-configuration-ceph-0/charm/venv/ops/model.py", line 1792, in location
    raw = self._backend.storage_get(self.full_id, "location")
  File "/var/lib/juju/agents/unit-cos-configuration-ceph-0/charm/venv/ops/model.py", line 2919, in storage_get
    out = self._run('storage-get', '-s', storage_name_id, attribute,
  File "/var/lib/juju/agents/unit-cos-configuration-ceph-0/charm/venv/ops/model.py", line 2695, in _run
    raise ModelError(e.stderr)
ops.model.ModelError: ERROR invalid value "content-from-git/1" for option -s: getting filesystem attachment info: filesystem attachment "1" on "unit cos-configuration-ceph/0" not provisioned

unit-cos-configuration-ceph-0: 13:06:41 ERROR juju.worker.uniter.operation hook "install" (via hook dispatching script: dispatch) failed: exit status 1

To Reproduce

  1. juju deploy cos-lite --trust --channel latest/edge
  2. juju deploy cos-configuration-k8s cos-configuration-ceph --config ...

Environment

prometheus-scrape-config-k8s latest/edge 47

Relevant log output

$ juju show-status-log cos-configuration-ceph/0 --days 1
Time                   Type       Status       Message
01 Mar 2024 13:06:00Z  juju-unit  allocating   
01 Mar 2024 13:06:00Z  workload   waiting      installing agent
01 Mar 2024 13:06:34Z  workload   waiting      agent initialising
01 Mar 2024 13:06:39Z  juju-unit  executing    running install hook
01 Mar 2024 13:06:41Z  juju-unit  error        hook failed: "install"
01 Mar 2024 13:06:46Z  workload   maintenance  installing charm software
01 Mar 2024 13:06:46Z  juju-unit  executing    running install hook
01 Mar 2024 13:06:50Z  juju-unit  executing    running prometheus-config-relation-created hook
01 Mar 2024 13:06:51Z  juju-unit  executing    running replicas-relation-created hook
01 Mar 2024 13:06:51Z  juju-unit  executing    running grafana-dashboards-relation-created hook
01 Mar 2024 13:06:52Z  juju-unit  executing    running leader-elected hook
01 Mar 2024 13:07:27Z  juju-unit  executing    running git-sync-pebble-ready hook
01 Mar 2024 13:07:29Z  juju-unit  executing    running content-from-git-storage-attached hook
01 Mar 2024 13:07:30Z  juju-unit  executing    running config-changed hook
01 Mar 2024 13:07:32Z  juju-unit  executing    running start hook
01 Mar 2024 13:07:34Z  juju-unit  executing    running grafana-dashboards-relation-joined hook for grafana/0
01 Mar 2024 13:07:36Z  juju-unit  executing    running grafana-dashboards-relation-changed hook for grafana/0
01 Mar 2024 13:07:37Z  juju-unit  executing    running replicas-relation-changed hook
01 Mar 2024 13:07:38Z  juju-unit  idle         
01 Mar 2024 13:07:49Z  juju-unit  executing    running prometheus-config-relation-joined hook for prometheus/0
01 Mar 2024 13:07:51Z  juju-unit  executing    running prometheus-config-relation-changed hook for prometheus/0
01 Mar 2024 13:08:05Z  juju-unit  idle         
01 Mar 2024 14:11:38Z  workload   active

Additional context

No response

simskij commented 4 months ago

Do you have any storage enabled on your k8s? This bug makes it sound like you possibly don't.

nobuto-m commented 4 months ago

It was based on the cos-lite tutorial so hostpath storage was used.

As I stated, it looked like a race condition.

theoctober19th commented 1 month ago

Hi guys,

I've been affected by this issue as well when running integration tests that use a bundle that includes cos-configuration-k8s. The error I receive is very similar:

Traceback (most recent call last):
  File "./src/charm.py", line 473, in <module>
    main(COSConfigCharm, use_juju_for_storage=True)
  File "/var/lib/juju/agents/unit-cos-configuration-0/charm/venv/ops/main.py", line 429, in main
    charm = charm_class(framework)
  File "./src/charm.py", line 84, in __init__
    self._git_sync_mount_point = self.model.storages["content-from-git"][0].location
  File "/var/lib/juju/agents/unit-cos-configuration-0/charm/venv/ops/model.py", line 1792, in location
    raw = self._backend.storage_get(self.full_id, "location")
  File "/var/lib/juju/agents/unit-cos-configuration-0/charm/venv/ops/model.py", line 2919, in storage_get
    out = self._run('storage-get', '-s', storage_name_id, attribute,
  File "/var/lib/juju/agents/unit-cos-configuration-0/charm/venv/ops/model.py", line 2695, in _run
    raise ModelError(e.stderr)
ops.model.ModelError: ERROR invalid value "content-from-git/1" for option -s: getting filesystem attachment info: filesystem attachment "1" on "unit cos-configuration/0" not provisioned

After a little research and some help from the Observability team's Matrix channel, I think the problem occurs because the install hook fires before the storage has been attached. In that case, the charm cannot access the storage and raises an exception.

I also found a similar issue in vault-k8s, which they fixed by wrapping the storage access in try/except logic.
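
A minimal sketch of what such a guard could look like. This is not the actual vault-k8s fix, nor a patch for this charm. `ModelError` here is a local stand-in for `ops.model.ModelError` so the snippet runs without ops installed, and `FakeStorage` and `get_mount_point` are hypothetical names used only for illustration.

```python
from typing import Optional


class ModelError(Exception):
    """Local stand-in for ops.model.ModelError (assumption: used so the sketch runs without ops)."""


class FakeStorage:
    """Hypothetical stub for a storage instance whose attachment may not be provisioned yet."""

    def __init__(self, path: Optional[str]):
        self._path = path

    @property
    def location(self) -> str:
        # Mimic the failure in the traceback: reading the location fails
        # while the filesystem attachment is not provisioned.
        if self._path is None:
            raise ModelError('filesystem attachment "1" not provisioned')
        return self._path


def get_mount_point(storages: dict, name: str) -> Optional[str]:
    """Return the storage location, or None if the storage is not available yet."""
    try:
        return str(storages[name][0].location)
    except (ModelError, IndexError, KeyError):
        # Storage not provisioned, not attached, or not defined: let the caller decide.
        return None


# Before the storage is attached (e.g. during the install hook):
print(get_mount_point({"content-from-git": [FakeStorage(None)]}, "content-from-git"))
# After the storage-attached hook has run:
print(get_mount_point({"content-from-git": [FakeStorage("/git")]}, "content-from-git"))
```

In a charm, the lookup would presumably move out of `__init__` (or be guarded there), and hook handlers would return early or set a waiting status while the result is `None`, picking the work up again once the `content-from-git-storage-attached` hook has fired.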