ModelError: ERROR invalid value [...] filesystem attachment [...] not provisioned

gruyaume commented 1 week ago

Bug Description

We get an ops.model.ModelError error in integration tests once in a while. Please look at the run example for more details:

https://github.com/canonical/vault-k8s-operator/actions/runs/9669432978/job/26676238946?pr=406

To Reproduce

Run integration tests a couple of time

Environment

MicroK8s: v1.29.5
Juju: 3.4.3

Relevant log output

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-vault-b-1/charm/./src/charm.py", line 1588, in <module>
    main(VaultCharm)
  File "/var/lib/juju/agents/unit-vault-b-1/charm/venv/ops/main.py", line 548, in main
    manager.run()
  File "/var/lib/juju/agents/unit-vault-b-1/charm/venv/ops/main.py", line 527, in run
    self._emit()
  File "/var/lib/juju/agents/unit-vault-b-1/charm/venv/ops/main.py", line 518, in _emit
    ops.charm._evaluate_status(self.charm)
  File "/var/lib/juju/agents/unit-vault-b-1/charm/venv/ops/charm.py", line 1255, in _evaluate_status
    charm.on.collect_unit_status.emit()
  File "/var/lib/juju/agents/unit-vault-b-1/charm/venv/ops/framework.py", line 348, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-vault-b-1/charm/venv/ops/framework.py", line 860, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-vault-b-1/charm/venv/ops/framework.py", line 950, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-vault-b-1/charm/./src/charm.py", line 331, in _on_collect_status
    if not self.tls.tls_file_available_in_charm(File.CA):
  File "/var/lib/juju/agents/unit-vault-b-1/charm/lib/charms/vault_k8s/v0/vault_tls.py", line 351, in tls_file_available_in_charm
    file_path = self.get_tls_file_path_in_charm(file)
  File "/var/lib/juju/agents/unit-vault-b-1/charm/lib/charms/vault_k8s/v0/vault_tls.py", line 339, in get_tls_file_path_in_charm
    storage_location = cert_storage.location
  File "/var/lib/juju/agents/unit-vault-b-1/charm/venv/ops/model.py", line 2097, in location
    raw = self._backend.storage_get(self.full_id, 'location')
  File "/var/lib/juju/agents/unit-vault-b-1/charm/venv/ops/model.py", line 3367, in storage_get
    out = self._run(
  File "/var/lib/juju/agents/unit-vault-b-1/charm/venv/ops/model.py", line 3141, in _run
    raise ModelError(e.stderr) from e
ops.model.ModelError: ERROR invalid value "certs/29" for option -s: getting filesystem attachment info: filesystem attachment "29" on "unit vault-b/1" not provisioned

unit-vault-b-1: 21:27:51 ERROR juju.worker.uniter.operation hook "vault-autounseal-requires-relation-created" (via hook dispatching script: dispatch) failed: exit status 1
controller-0: 21:27:52 ERROR juju.worker.caasapplicationprovisioner.runner exited "vault-b": getting OCI image resources: unable to fetch OCI image resources for vault-b: application "vault-b" dying or dead

Additional context

No response

DanielArndt commented 1 week ago

This issue has similarities (possibly related) to the error I get when I try to run the integration tests locally, although my error occurs in a different location. They both seem to have this:

controller-0: 21:24:58 ERROR juju.worker.caasunitprovisioner stopping application worker for minio: Operation cannot be fulfilled on statefulsets.apps "minio": StorageError: invalid object, Code: 4, Key: /registry/statefulsets/test-integration-g22n/minio, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: eb7ddc36-28f2-4d03-a4b4-46d715222fbf, UID in object meta:

DanielArndt commented 1 week ago

@gruyaume Have you observed this happening elsewhere? I only saw this happen on my PR, and I'm a little anxious to merge it unless we've observed it elsewhere.

gruyaume commented 1 week ago

Yes I've seen similar issues on main. For example:

https://github.com/canonical/vault-k8s-operator/actions/runs/9519849182/job/26244305237

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-tls-certificates-requirer-0/charm/./src/charm.py", line 280, in <module>
    main(TLSRequirerCharm)
  File "/var/lib/juju/agents/unit-tls-certificates-requirer-0/charm/venv/ops/main.py", line 544, in main
    manager.run()
  File "/var/lib/juju/agents/unit-tls-certificates-requirer-0/charm/venv/ops/main.py", line 520, in run
    self._emit()
  File "/var/lib/juju/agents/unit-tls-certificates-requirer-0/charm/venv/ops/main.py", line 509, in _emit
    _emit_charm_event(self.charm, self.dispatcher.event_name)
  File "/var/lib/juju/agents/unit-tls-certificates-requirer-0/charm/venv/ops/main.py", line 143, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-tls-certificates-requirer-0/charm/venv/ops/framework.py", line 352, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-tls-certificates-requirer-0/charm/venv/ops/framework.py", line 851, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-tls-certificates-requirer-0/charm/venv/ops/framework.py", line 941, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-tls-certificates-requirer-0/charm/./src/charm.py", line 244, in _on_get_certificate_action
    if self._certificate_is_stored:
  File "/var/lib/juju/agents/unit-tls-certificates-requirer-0/charm/./src/charm.py", line [221](https://github.com/canonical/vault-k8s-operator/actions/runs/9519849182/job/26244305237#step:9:222), in _certificate_is_stored
    return self._secret_exists(label=self._get_certificate_secret_label())
  File "/var/lib/juju/agents/unit-tls-certificates-requirer-0/charm/./src/charm.py", line 233, in _secret_exists
    self.model.get_secret(label=label)
  File "/var/lib/juju/agents/unit-tls-certificates-requirer-0/charm/venv/ops/model.py", line 280, in get_secret
    content = self._backend.secret_get(id=id, label=label)
  File "/var/lib/juju/agents/unit-tls-certificates-requirer-0/charm/venv/ops/model.py", line 3370, in secret_get
    result = self._run('secret-get', *args, return_output=True, use_json=True)
  File "/var/lib/juju/agents/unit-tls-certificates-requirer-0/charm/venv/ops/model.py", line 3024, in _run
    raise ModelError(e.stderr) from e
ops.model.ModelError: ERROR cannot access "cpm800ljtp7c7b4q4lig-1"

It does not seem to be the identical same thing but again we get a ModelError during get_secret

DanielArndt commented 1 week ago

awesome, thanks. more samples helps.

gruyaume commented 1 week ago

Ales Stimec also reported this error, which again, looks quite similar but is not exactly the same

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-vault-0/charm/./src/charm.py", line 1575, in <module>
    main(VaultCharm)
  File "/var/lib/juju/agents/unit-vault-0/charm/venv/ops/main.py", line 548, in main
    manager.run()
  File "/var/lib/juju/agents/unit-vault-0/charm/venv/ops/main.py", line 527, in run
    self._emit()
  File "/var/lib/juju/agents/unit-vault-0/charm/venv/ops/main.py", line 516, in _emit
    _emit_charm_event(self.charm, self.dispatcher.event_name)
  File "/var/lib/juju/agents/unit-vault-0/charm/venv/ops/main.py", line 147, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-vault-0/charm/venv/ops/framework.py", line 348, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-vault-0/charm/venv/ops/framework.py", line 860, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-vault-0/charm/venv/ops/framework.py", line 950, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-vault-0/charm/./src/charm.py", line 408, in _configure
    self._configure_pki_secrets_engine()
  File "/var/lib/juju/agents/unit-vault-0/charm/./src/charm.py", line 485, in _configure_pki_secrets_engine
    vault = self._get_active_vault_client()
  File "/var/lib/juju/agents/unit-vault-0/charm/./src/charm.py", line 1377, in _get_active_vault_client
    role_id, secret_id = self._get_approle_auth_secret()
  File "/var/lib/juju/agents/unit-vault-0/charm/./src/charm.py", line 1252, in _get_approle_auth_secret
    juju_secret = self.model.get_secret(label=VAULT_CHARM_APPROLE_SECRET_LABEL)
  File "/var/lib/juju/agents/unit-vault-0/charm/venv/ops/model.py", line 285, in get_secret
    content = self._backend.secret_get(id=id, label=label)
  File "/var/lib/juju/agents/unit-vault-0/charm/venv/ops/model.py", line 3504, in secret_get
    result = self._run('secret-get', *args, return_output=True, use_json=True)
  File "/var/lib/juju/agents/unit-vault-0/charm/venv/ops/model.py", line 3141, in _run
    raise ModelError(e.stderr) from e
ops.model.ModelError: ERROR cannot ensure service account "unit-vault-0": Internal error occurred: resource quota evaluation timed out

I think the lesson is that we should catch ModelError during get_secret calls

DanielArndt commented 1 week ago

I think the lesson is that we should catch ModelError during get_secret calls

That might avoid crashing in these cases, but it would be really great to understand the cause of this.

The first case, for example, isn't caused by get_secret, it is on a call to get storage...

Something very similar to the first case was also reported in COS: https://github.com/canonical/cos-configuration-k8s-operator/issues/84

This may be an issue at the Juju level.

DanielArndt commented 1 week ago

Ales Stimec also reported this error, which again, looks quite similar but is not exactly the same

Where was this reported? What was the context? Was it still in integration tests?

gruyaume commented 1 week ago

Ales Stimec also reported this error, which again, looks quite similar but is not exactly the same

Where was this reported? What was the context? Was it still in integration tests?

In a private message on Mattermost with the following pastebin:

https://pastebin.canonical.com/p/fYtvHQ93xj/

As far as I know, this was not during an integration test

alesstimec commented 1 week ago

@DanielArndt @gruyaume this was in a live model on our prodstack. I believe there might have been an ongoing ceph outage at the time, so the entire controller was a bit sluggish to respond. To resolve the issue i had to redeploy vault once controller was back on track.

DanielArndt commented 2 days ago

So, I think there are 3 separate issues here, despite all 3 presenting similar symptoms.

The first case is this one:

ops.model.ModelError: ERROR invalid value "certs/29" for option -s: getting filesystem attachment info: filesystem attachment "29" on "unit vault-b/1" not provisioned

This is caused by an attempt to update the status of the charm before the storage has been attached. I verified this in the logs, and this is happening moments before the storage is attached through the normal charm lifecycle.

The fix should be pretty straightforward: catch the error, and return as if the file had not been found. This will set the status to "Waiting for CA certificate to be accessible in the charm", which seems appropriate.

The second case is

ops.model.ModelError: ERROR cannot access "cpm800ljtp7c7b4q4lig-1"

As the error suggests, it appears that charm doesn't have access to the secret. I need to do a bit more investigation because I wasn't able to figure out how this happens by looking at the charm and integration tests. The secret appears to be solely owned by the caller, so a call to secret.grant() shouldn't be necessary. I'm also not sure if this is the same error that is normally returned when asking for a secret we don't have access to... seems odd that it would crash the charm, so I'll need to do a bit more digging here.

I'm not quite positive how to move forward here, but I'll do some more digging.

The last case is the one reported by @alesstimec

ops.model.ModelError: ERROR cannot ensure service account "unit-vault-0": Internal error occurred: resource quota evaluation timed out

This one seems pretty straight forward in terms of how it happens, but a little more complicated for a fix. This error seems to come from k8s, and it looks like a resource (etcd? ceph?) is just taking too long to respond. This aligns with the ceph outage mentioned at the time.

We can catch the error here, but there are some implications. If we can't retrieve something because of an intermittent error, what do we set the status to? what if the calls are asymmetrical (we can retrieve in configure, but not in collect status, or vice-versa). Even more disruptively, what do we do when we're trying to store a secret and this happens? Since we don't use defers, we lose the context in which we were attempting to add/update this secret. I think this topic deserves a bigger discussion.

For now, I'll move forward with this and catch the error; I think we may be able to remove some dependence on secrets. For example, we store the CSR for PKI in 3 separate places -- vault, the relation data, and juju secrets. In the other cases, it should be fairly straight-forward to write the code such that subsequent calls will update the secret as expected, although there might be some inconsistency in the in-between time.

DanielArndt commented 2 days ago

I've added two new issues to cover the other cases:

414
415

I'm going to hide the comments here, and keep the contents relevant to the initial report. Investigation on the other two will continue in their respective issues.

DanielArndt commented 2 days ago

This is caused by an attempt to update the status of the charm before the storage has been attached. I verified this in the logs, and this is happening moments before the storage is attached through the normal charm lifecycle.

The fix should be pretty straightforward: catch the error, and return as if the file had not been found. This will set the status to "Waiting for CA certificate to be accessible in the charm", which seems appropriate.

canonical / vault-k8s-operator