Closed sanchezfdezjavier closed 1 year ago
@sanchezfdezjavier was this when removing the application or destroying a model?
There's no Python traceback in the log, which makes me suspect it's the same Juju bug (we haven't seen it on 3.x in a while, but the late 2.9 series saw it semi-frequently when churning models) where juju.worker.caasunitterminationworker sends a SIGTERM while a hook is running. The Operator Framework doesn't have a signal handler, so it immediately shuts down and cleans up, returns a non-zero exit code to the hook runner, and Juju believes that the hook failed.
There is/was a Launchpad issue about this, but I can never find it when I want it. @jameinel does this ring a bell? @benhoyt?
Not that OF necessarily needs to, but in our KubernetesComputeResourcesPatch we set one up for a limited window of time, which lets Juju happily continue on even if changing the PodSpec (on the first unit, usually) hits the scheduling window on the k8s reconciler loop and the Juju dispatch loop. I can't think of many meaningful scenarios where executing charm code would receive a SIGTERM|SIGKILL|SIGHUP and actually want to exit 1, but I'm sure there are some.
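For reference, here is a minimal sketch of that limited-window idea. This is not the actual KubernetesComputeResourcesPatch code; the helper name and the patching call are illustrative:

```python
import signal
from contextlib import contextmanager


@contextmanager
def sigterm_grace_window():
    """Ignore SIGTERM for a bounded window so a hook interrupted by
    juju.worker.caasunitterminationworker exits cleanly instead of handing
    a non-zero exit code back to the hook runner."""
    previous = signal.signal(signal.SIGTERM, lambda signum, frame: None)
    try:
        yield
    finally:
        signal.signal(signal.SIGTERM, previous)


# Hypothetical usage inside a charm event handler:
# with sigterm_grace_window():
#     apply_compute_resources_patch()  # the slow, interruptible work
```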
@rbarry82 I believe you're thinking of https://bugs.launchpad.net/juju/+bug/1951415 that Tom is currently working on.
That's the one. Thanks @benhoyt! I'll remember to look here so I don't get lost in similar Launchpad issues.
In our charms, whether by being one of the earliest OF charming teams or by accident, we're pretty good about guarding against exceptions from Pebble while it's dying, so containeragent going away early manifests as "pure" hook failures. Is there a reason you can think of why signals would make it to OF, though? That is -- is there a reason you can think of why OF shouldn't set up a signal handler which exits gracefully?
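To make the question concrete, a graceful handler could be as small as the sketch below. This is not something ops does today, and exiting 0 is an assumption about what "gracefully" would mean here:

```python
import signal
import sys


def _exit_gracefully(signum, frame):
    # Assumption: treat termination by Juju as a clean shutdown rather than
    # letting the dispatch process die mid-hook with a non-zero exit code.
    sys.exit(0)


# Installed by the framework before it starts emitting events:
signal.signal(signal.SIGTERM, _exit_gracefully)
```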
No, it didn't happen when removing or destroying a model. You can see the failed run here. Let me know how I can help.
This is the error log from the CI:
------------------------------ live log teardown -------------------------------
INFO pytest_operator.plugin:plugin.py:761 Model status:

Model                  Controller       Cloud/Region        Version  SLA          Timestamp
test-integration-vzo3  github-pr-7ad0d  microk8s/localhost  2.9.38   unsupported  14:14:09Z

App                        Version  Status   Scale  Charm                      Channel  Rev  Address         Exposed  Message
prometheus-configurer-k8s           waiting      1  prometheus-configurer-k8s             0  10.152.183.130  no       installing agent
prometheus-k8s                      waiting      1  prometheus-k8s             stable    79  10.152.183.47   no       installing agent

Unit                          Workload     Agent      Address      Ports  Message
prometheus-configurer-k8s/0*  maintenance  executing  10.1.233.13         Configuring pebble layer for prometheus-configurer
prometheus-k8s/0*             error        idle       10.1.233.11         hook failed: "update-status"

INFO pytest_operator.plugin:plugin.py:767 Juju error logs:
unit-prometheus-k8s-0: 14:05:51 ERROR juju.worker.uniter.operation hook "update-status" (via hook dispatching script: dispatch) failed: signal: terminated

INFO pytest_operator.plugin:plugin.py:783 juju-crashdump finished [0]
INFO pytest_operator.plugin:plugin.py:854 Resetting model test-integration-vzo3...
INFO pytest_operator.plugin:plugin.py:843 Destroying applications prometheus-k8s
INFO pytest_operator.plugin:plugin.py:843 Destroying applications prometheus-configurer-k8s
INFO pytest_operator.plugin:plugin.py:859 Not waiting on reset to complete.
INFO pytest_operator.plugin:plugin.py:832 Forgetting main...

=================================== FAILURES ===================================
_ TestPrometheusConfigurerOperatorCharm.test_given_prometheus_configurer_charm_in_blocked_status_when_prometheus_relation_created_then_charm_goes_to_active_status _
Traceback (most recent call last):
  File "/home/runner/work/prometheus-configurer-k8s-operator/prometheus-configurer-k8s-operator/tests/integration/test_integration.py", line 67, in test_given_prometheus_configurer_charm_in_blocked_status_when_prometheus_relation_created_then_charm_goes_to_active_status
    await ops_test.model.wait_for_idle(
  File "/home/runner/work/prometheus-configurer-k8s-operator/prometheus-configurer-k8s-operator/.tox/integration/lib/python3.8/site-packages/juju/model.py", line 2707, in wait_for_idle
    _raise_for_status(errors, "error")
  File "/home/runner/work/prometheus-configurer-k8s-operator/prometheus-configurer-k8s-operator/.tox/integration/lib/python3.8/site-packages/juju/model.py", line 2646, in _raise_for_status
    raise error_type("{}{} in {}: {}".format(
juju.errors.JujuUnitError: Unit in error: prometheus-k8s/0
The key log messages are here:
unit-prometheus-k8s-0: 09:42:14 INFO juju.worker.caasunitterminationworker terminating due to SIGTERM
unit-prometheus-k8s-0: 09:42:14 ERROR juju.worker.uniter.operation hook "update-status" (via hook dispatching script: dispatch) failed: signal: terminated
unit-prometheus-k8s-0: 09:42:14 INFO juju.worker.uniter awaiting error resolution for "update-status" hook
Also, note that the model had literally just come up:

INFO pytest_operator.plugin:plugin.py:761 Model status:

Model                  Controller       Cloud/Region        Version  SLA          Timestamp
test-integration-vzo3  github-pr-7ad0d  microk8s/localhost  2.9.38   unsupported  14:14:09Z

App                        Version  Status   Scale  Charm                      Channel  Rev  Address         Exposed  Message
prometheus-configurer-k8s           waiting      1  prometheus-configurer-k8s             0  10.152.183.130  no       installing agent
prometheus-k8s                      waiting      1  prometheus-k8s             stable    79  10.152.183.47   no       installing agent

Unit                          Workload     Agent      Address      Ports  Message
prometheus-configurer-k8s/0*  maintenance  executing  10.1.233.13         Configuring pebble layer for prometheus-configurer
prometheus-k8s/0*             error        idle       10.1.233.11         hook failed: "update-status"
There is not even a containeragent running on the charm yet, and no charm code is involved. This is strictly something happening with Juju (on deployment rather than teardown, it seems).
Good point, thanks for clarifying @rbarry82. I guess I can close it, as this is not a prometheus problem, right?
It's not, really, no. Hopefully we pointed you in the right direction though.
Bug Description
The hook update-status fails intermittently after deployment. We are experiencing this issue in our CI pipeline in a project that is not frequently active; after a while, a change triggered the CI and prometheus-k8s failed to deploy successfully. Although we've had successful runs, we're not able to reproduce the bug consistently. Here's a sample failed run.
To Reproduce
Deploy prometheus-k8s and wait for it to fail. The sketch below is roughly what our integration test does.
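A minimal pytest-operator sketch of what that amounts to (the channel, trust flag, timeout, and test name are assumptions, not the exact test from our repo):

```python
import pytest


@pytest.mark.abort_on_fail
async def test_deploy_prometheus_k8s(ops_test):
    """Deploy prometheus-k8s and wait for it to settle."""
    await ops_test.model.deploy("prometheus-k8s", channel="stable", trust=True)
    # The intermittent failure shows up here: wait_for_idle raises
    # JujuUnitError when the unit goes into error after "update-status"
    # is killed by SIGTERM.
    await ops_test.model.wait_for_idle(
        apps=["prometheus-k8s"], status="active", timeout=1000
    )
```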
Environment
We've experienced the issue using the latest/stable and 2.9.38-ubuntu-amd64 in two different environments, including one set up with the charmed-kubernetes/actions-operator@main action.
Relevant log output
Additional context
No response