Closed by gruyaume 2 days ago.
This issue is similar (and possibly related) to the error I get when I try to run the integration tests locally, although my error occurs in a different location. Both seem to include this:
controller-0: 21:24:58 ERROR juju.worker.caasunitprovisioner stopping application worker for minio: Operation cannot be fulfilled on statefulsets.apps "minio": StorageError: invalid object, Code: 4, Key: /registry/statefulsets/test-integration-g22n/minio, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: eb7ddc36-28f2-4d03-a4b4-46d715222fbf, UID in object meta:
@gruyaume Have you observed this happening elsewhere? I only saw this happen on my PR, and I'm a little anxious about merging it unless we've observed it elsewhere.
Yes, I've seen similar issues on `main`. For example:
Traceback (most recent call last):
File "/var/lib/juju/agents/unit-tls-certificates-requirer-0/charm/./src/charm.py", line 280, in <module>
main(TLSRequirerCharm)
File "/var/lib/juju/agents/unit-tls-certificates-requirer-0/charm/venv/ops/main.py", line 544, in main
manager.run()
File "/var/lib/juju/agents/unit-tls-certificates-requirer-0/charm/venv/ops/main.py", line 520, in run
self._emit()
File "/var/lib/juju/agents/unit-tls-certificates-requirer-0/charm/venv/ops/main.py", line 509, in _emit
_emit_charm_event(self.charm, self.dispatcher.event_name)
File "/var/lib/juju/agents/unit-tls-certificates-requirer-0/charm/venv/ops/main.py", line 143, in _emit_charm_event
event_to_emit.emit(*args, **kwargs)
File "/var/lib/juju/agents/unit-tls-certificates-requirer-0/charm/venv/ops/framework.py", line 352, in emit
framework._emit(event)
File "/var/lib/juju/agents/unit-tls-certificates-requirer-0/charm/venv/ops/framework.py", line 851, in _emit
self._reemit(event_path)
File "/var/lib/juju/agents/unit-tls-certificates-requirer-0/charm/venv/ops/framework.py", line 941, in _reemit
custom_handler(event)
File "/var/lib/juju/agents/unit-tls-certificates-requirer-0/charm/./src/charm.py", line 244, in _on_get_certificate_action
if self._certificate_is_stored:
File "/var/lib/juju/agents/unit-tls-certificates-requirer-0/charm/./src/charm.py", line [221](https://github.com/canonical/vault-k8s-operator/actions/runs/9519849182/job/26244305237#step:9:222), in _certificate_is_stored
return self._secret_exists(label=self._get_certificate_secret_label())
File "/var/lib/juju/agents/unit-tls-certificates-requirer-0/charm/./src/charm.py", line 233, in _secret_exists
self.model.get_secret(label=label)
File "/var/lib/juju/agents/unit-tls-certificates-requirer-0/charm/venv/ops/model.py", line 280, in get_secret
content = self._backend.secret_get(id=id, label=label)
File "/var/lib/juju/agents/unit-tls-certificates-requirer-0/charm/venv/ops/model.py", line 3370, in secret_get
result = self._run('secret-get', *args, return_output=True, use_json=True)
File "/var/lib/juju/agents/unit-tls-certificates-requirer-0/charm/venv/ops/model.py", line 3024, in _run
raise ModelError(e.stderr) from e
ops.model.ModelError: ERROR cannot access "cpm800ljtp7c7b4q4lig-1"
It does not seem to be exactly the same thing, but again we get a ModelError during a get_secret call.
Awesome, thanks. More samples help.
Ales Stimec also reported this error, which, again, looks quite similar but is not exactly the same:
Traceback (most recent call last):
File "/var/lib/juju/agents/unit-vault-0/charm/./src/charm.py", line 1575, in <module>
main(VaultCharm)
File "/var/lib/juju/agents/unit-vault-0/charm/venv/ops/main.py", line 548, in main
manager.run()
File "/var/lib/juju/agents/unit-vault-0/charm/venv/ops/main.py", line 527, in run
self._emit()
File "/var/lib/juju/agents/unit-vault-0/charm/venv/ops/main.py", line 516, in _emit
_emit_charm_event(self.charm, self.dispatcher.event_name)
File "/var/lib/juju/agents/unit-vault-0/charm/venv/ops/main.py", line 147, in _emit_charm_event
event_to_emit.emit(*args, **kwargs)
File "/var/lib/juju/agents/unit-vault-0/charm/venv/ops/framework.py", line 348, in emit
framework._emit(event)
File "/var/lib/juju/agents/unit-vault-0/charm/venv/ops/framework.py", line 860, in _emit
self._reemit(event_path)
File "/var/lib/juju/agents/unit-vault-0/charm/venv/ops/framework.py", line 950, in _reemit
custom_handler(event)
File "/var/lib/juju/agents/unit-vault-0/charm/./src/charm.py", line 408, in _configure
self._configure_pki_secrets_engine()
File "/var/lib/juju/agents/unit-vault-0/charm/./src/charm.py", line 485, in _configure_pki_secrets_engine
vault = self._get_active_vault_client()
File "/var/lib/juju/agents/unit-vault-0/charm/./src/charm.py", line 1377, in _get_active_vault_client
role_id, secret_id = self._get_approle_auth_secret()
File "/var/lib/juju/agents/unit-vault-0/charm/./src/charm.py", line 1252, in _get_approle_auth_secret
juju_secret = self.model.get_secret(label=VAULT_CHARM_APPROLE_SECRET_LABEL)
File "/var/lib/juju/agents/unit-vault-0/charm/venv/ops/model.py", line 285, in get_secret
content = self._backend.secret_get(id=id, label=label)
File "/var/lib/juju/agents/unit-vault-0/charm/venv/ops/model.py", line 3504, in secret_get
result = self._run('secret-get', *args, return_output=True, use_json=True)
File "/var/lib/juju/agents/unit-vault-0/charm/venv/ops/model.py", line 3141, in _run
raise ModelError(e.stderr) from e
ops.model.ModelError: ERROR cannot ensure service account "unit-vault-0": Internal error occurred: resource quota evaluation timed out
I think the lesson is that we should catch ModelError during get_secret calls
> I think the lesson is that we should catch ModelError during get_secret calls
That might avoid crashing in these cases, but it would be really great to understand the cause of this.
The first case, for example, isn't caused by get_secret; it happens on a call to get storage...
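For the cases that do come from get_secret, a minimal sketch of what catching the error could look like, modelled on the `_secret_exists` helper in the first traceback (the fallback behaviour here is an assumption, not the charm's actual fix):

```python
from ops.model import ModelError, SecretNotFoundError

# Sketch of a charm method; would live on the charm class.
def _secret_exists(self, label: str) -> bool:
    """Return True if a secret with this label is currently accessible."""
    try:
        self.model.get_secret(label=label)
    except SecretNotFoundError:
        # The secret genuinely doesn't exist (yet).
        return False
    except ModelError:
        # Transient or access error, e.g. 'ERROR cannot access "cpm800ljtp7c7b4q4lig-1"'.
        # Returning False avoids crashing the hook; a later event can retry.
        return False
    return True
```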
Something very similar to the first case was also reported in COS: https://github.com/canonical/cos-configuration-k8s-operator/issues/84
This may be an issue at the Juju level.
> Ales Stimec also reported this error, which, again, looks quite similar but is not exactly the same
Where was this reported? What was the context? Was it still in integration tests?
> Where was this reported? What was the context? Was it still in integration tests?
In a private message on Mattermost with the following pastebin:
As far as I know, this was not during an integration test
@DanielArndt @gruyaume This was in a live model on our prodstack. I believe there might have been an ongoing Ceph outage at the time, so the entire controller was a bit sluggish to respond. To resolve the issue I had to redeploy vault once the controller was back on track.
So, I think there are 3 separate issues here, despite all 3 presenting similar symptoms.
The first case is this one:
ops.model.ModelError: ERROR invalid value "certs/29" for option -s: getting filesystem attachment info: filesystem attachment "29" on "unit vault-b/1" not provisioned
This is caused by an attempt to update the status of the charm before the storage has been attached. I verified this in the logs, and this is happening moments before the storage is attached through the normal charm lifecycle.
The fix should be pretty straightforward: catch the error, and return as if the file had not been found. This will set the status to "Waiting for CA certificate to be accessible in the charm", which seems appropriate.
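A minimal sketch of that fix, assuming the CA file lives under the `certs` storage mount (the helper name and the file name are illustrative, not the actual charm code):

```python
from ops.model import ModelError

# Sketch of a charm method; would live on the charm class.
def _ca_certificate_is_available(self) -> bool:
    """Return True only if the CA file can actually be read from storage."""
    try:
        # Looking up the storage location runs storage-get under the hood,
        # which raises ModelError while the filesystem attachment is still
        # being provisioned.
        certs_storages = self.model.storages["certs"]
        if not certs_storages:
            return False
        ca_path = certs_storages[0].location / "ca.pem"  # illustrative file name
    except ModelError:
        # Behave exactly as if the file had not been found yet.
        return False
    return ca_path.exists()
```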
The second case is
ops.model.ModelError: ERROR cannot access "cpm800ljtp7c7b4q4lig-1"
As the error suggests, it appears that the charm doesn't have access to the secret. I need to do a bit more investigation because I wasn't able to figure out how this happens by looking at the charm and integration tests. The secret appears to be solely owned by the caller, so a call to `secret.grant()` shouldn't be necessary. I'm also not sure if this is the same error that is normally returned when asking for a secret we don't have access to... it seems odd that it would crash the charm, so I'll need to do a bit more digging here.
I'm not quite positive how to move forward here, but I'll do some more digging.
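For context, that reasoning relies on how owner access works for Juju secrets: a secret the charm creates with `add_secret` can be read back by its owner via the label alone, with no grant involved. A minimal sketch (label and content are illustrative):

```python
certificate_pem = "-----BEGIN CERTIFICATE-----..."  # placeholder content

# Owner side: create the secret; no grant is needed for the owner itself.
secret = self.unit.add_secret({"certificate": certificate_pem}, label="certificate")

# Later, on the same owning unit: look the secret up by label and read it back.
content = self.model.get_secret(label="certificate").get_content()
assert content["certificate"] == certificate_pem
```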
The last case is the one reported by @alesstimec
ops.model.ModelError: ERROR cannot ensure service account "unit-vault-0": Internal error occurred: resource quota evaluation timed out
This one seems pretty straightforward in terms of how it happens, but a little more complicated to fix. The error seems to come from Kubernetes, and it looks like a resource (etcd? Ceph?) is just taking too long to respond. This aligns with the Ceph outage mentioned above.
We can catch the error here, but there are some implications. If we can't retrieve something because of an intermittent error, what do we set the status to? What if the calls are asymmetrical (we can retrieve in configure but not in collect-status, or vice versa)? Even more disruptively, what do we do when we're trying to store a secret and this happens? Since we don't use defers, we lose the context in which we were attempting to add or update the secret. I think this topic deserves a bigger discussion.
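To make the status question concrete, one possible shape (not a decision, just a sketch using the collect-status pattern with a hypothetical `_approle_auth_secret_is_available` check) would be:

```python
import ops
from ops.model import ModelError

# Sketch of a collect-status handler; observed via
# self.framework.observe(self.on.collect_unit_status, self._on_collect_unit_status)
def _on_collect_unit_status(self, event: ops.CollectStatusEvent) -> None:
    try:
        ready = self._approle_auth_secret_is_available()  # hypothetical check
    except ModelError:
        # The backend couldn't answer right now; report that instead of crashing.
        event.add_status(ops.MaintenanceStatus("Waiting for the Juju secrets backend"))
        return
    if not ready:
        event.add_status(ops.WaitingStatus("Waiting for approle auth secret"))
        return
    event.add_status(ops.ActiveStatus())
```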
For now, I'll move forward with this and catch the error; I think we may be able to remove some of the dependence on secrets. For example, we store the CSR for PKI in three separate places: Vault, the relation data, and Juju secrets. In the other cases, it should be fairly straightforward to write the code so that subsequent calls update the secret as expected, although there might be some inconsistency in the interim.
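As an illustration of the "subsequent calls will update the secret as expected" idea, here is a sketch of an idempotent write that tolerates a transient ModelError (the helper name and the decision to simply log and retry on the next event are assumptions; `VAULT_CHARM_APPROLE_SECRET_LABEL` is the label constant from the charm):

```python
import logging

from ops.model import ModelError, SecretNotFoundError

logger = logging.getLogger(__name__)

# Sketch of a charm method; would live on the charm class.
def _sync_approle_auth_secret(self, role_id: str, secret_id: str) -> None:
    """Create or update the approle secret; safe to call from every hook."""
    content = {"role-id": role_id, "secret-id": secret_id}
    try:
        secret = self.model.get_secret(label=VAULT_CHARM_APPROLE_SECRET_LABEL)
        if secret.get_content(refresh=True) != content:
            secret.set_content(content)
    except SecretNotFoundError:
        self.app.add_secret(content, label=VAULT_CHARM_APPROLE_SECRET_LABEL)
    except ModelError:
        # e.g. "resource quota evaluation timed out": don't crash the hook;
        # the same code runs again on the next event and repairs the secret.
        logger.warning("Transient error while syncing the approle secret; will retry")
```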
I've added two new issues to cover the other cases:
I'm going to hide the comments here, and keep the contents relevant to the initial report. Investigation on the other two will continue in their respective issues.
> This is caused by an attempt to update the status of the charm before the storage has been attached. I verified this in the logs, and this is happening moments before the storage is attached through the normal charm lifecycle.
> The fix should be pretty straightforward: catch the error, and return as if the file had not been found. This will set the status to "Waiting for CA certificate to be accessible in the charm", which seems appropriate.
Bug Description
We get an `ops.model.ModelError` error in integration tests once in a while. Please look at the run example for more details.
To Reproduce
Run the integration tests a couple of times.
Environment
Relevant log output
Additional context
No response