canonical / prometheus-k8s-operator

https://charmhub.io/prometheus-k8s
Apache License 2.0

Hook `update-status` fails #426

Closed sanchezfdezjavier closed 1 year ago

sanchezfdezjavier commented 1 year ago

Bug Description

The hook `update-status` fails intermittently after deployment. We are hitting this in the CI pipeline of a project that is not very active: after a period of inactivity, a change triggered the CI and prometheus-k8s failed to deploy successfully. We have had successful runs as well, so we are not able to reproduce the bug consistently.

Here's a sample failed run

To Reproduce

Deploy prometheus-k8s and wait for it to fail

juju deploy prometheus-k8s --trust
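
For context, the integration test that surfaces this does roughly the following with pytest-operator (a sketch only; the app names and wait parameters are illustrative and may differ from the project's actual test code):

```python
# Sketch of how the failure surfaces in CI with pytest-operator.
# Names and parameters are illustrative, not the project's actual test code.
import pytest


@pytest.mark.abort_on_fail
async def test_deploy_prometheus(ops_test):
    await ops_test.model.deploy("prometheus-k8s", channel="stable", trust=True)
    # wait_for_idle raises juju.errors.JujuUnitError as soon as prometheus-k8s/0
    # goes to error state with 'hook failed: "update-status"'.
    await ops_test.model.wait_for_idle(
        apps=["prometheus-k8s"], status="active", timeout=1000
    )
```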

Environment

We've experienced the issue using latest/stable and 2.9.38-ubuntu-amd64 in two different environments.

Relevant log output

controller-0: 09:41:15 INFO juju.worker.logforwarder config change - log forwarding not enabled
controller-0: 09:41:16 INFO juju.worker.logger logger worker started
controller-0: 09:41:16 INFO juju.worker.pruner.action status history config: max age: 336h0m0s, max collection size 5120M for test-integration-nltj (5b32b060-5900-4355-8d9c-394ec108cf2e)
controller-0: 09:41:16 INFO juju.worker.pruner.statushistory status history config: max age: 336h0m0s, max collection size 5120M for test-integration-nltj (5b32b060-5900-4355-8d9c-394ec108cf2e)
controller-0: 09:41:17 INFO juju.worker.logforwarder config change - log forwarding not enabled
model-5b32b060-5900-4355-8d9c-394ec108cf2e: 09:41:19 INFO juju.worker.caasupgrader abort check blocked until version event received
model-5b32b060-5900-4355-8d9c-394ec108cf2e: 09:41:19 INFO juju.worker.caasupgrader unblocking abort check
model-5b32b060-5900-4355-8d9c-394ec108cf2e: 09:41:19 INFO juju.worker.muxhttpserver starting http server on [::]:17071
model-5b32b060-5900-4355-8d9c-394ec108cf2e: 09:41:19 INFO juju.worker.caasadmission ensuring model k8s webhook configurations
controller-0: 09:41:21 INFO juju.worker.caasapplicationprovisioner.runner start "prometheus-k8s"
controller-0: 09:41:15 INFO juju.worker.apicaller [5b32b0] "controller-0" successfully connected to "localhost:17070"
controller-0: 09:41:30 INFO juju.worker.caasapplicationprovisioner.runner start "prometheus-configurer-k8s"
unit-prometheus-k8s-0: 09:41:38 INFO juju.cmd running containerAgent [2.9.38 6d211be0d72d6f4d625c61c7c4ddb4e9325226c8 gc go1.19.4]
unit-prometheus-k8s-0: 09:41:38 INFO juju.cmd.containeragent.unit start "unit"
unit-prometheus-k8s-0: 09:41:38 INFO juju.worker.upgradesteps upgrade steps for 2.9.38 have already been run.
unit-prometheus-k8s-0: 09:41:38 INFO juju.worker.probehttpserver starting http server on [::]:65301
unit-prometheus-k8s-0: 09:41:38 INFO juju.api cannot resolve "controller-service.controller-gcloud-jan-18-us-east1.svc.cluster.local": lookup controller-service.controller-gcloud-jan-18-us-east1.svc.cluster.local: operation was canceled
unit-prometheus-k8s-0: 09:41:38 INFO juju.api connection established to "wss://10.12.4.248:17070/model/5b32b060-5900-4355-8d9c-394ec108cf2e/api"
unit-prometheus-k8s-0: 09:41:38 INFO juju.worker.apicaller [5b32b0] "unit-prometheus-k8s-0" successfully connected to "10.12.4.248:17070"
unit-prometheus-k8s-0: 09:41:38 INFO juju.api connection established to "wss://controller-service.controller-gcloud-jan-18-us-east1.svc.cluster.local:17070/model/5b32b060-5900-4355-8d9c-394ec108cf2e/api"
unit-prometheus-k8s-0: 09:41:38 INFO juju.worker.apicaller [5b32b0] "unit-prometheus-k8s-0" successfully connected to "controller-service.controller-gcloud-jan-18-us-east1.svc.cluster.local:17070"
unit-prometheus-k8s-0: 09:41:38 INFO juju.worker.migrationminion migration phase is now: NONE
unit-prometheus-k8s-0: 09:41:38 INFO juju.worker.logger logger worker started
unit-prometheus-k8s-0: 09:41:38 WARNING juju.worker.proxyupdater unable to set snap core settings [proxy.http= proxy.https= proxy.store=]: exec: "snap": executable file not found in $PATH, output: ""
unit-prometheus-k8s-0: 09:41:38 INFO juju.worker.caasupgrader abort check blocked until version event received
unit-prometheus-k8s-0: 09:41:38 INFO juju.worker.caasupgrader unblocking abort check
unit-prometheus-k8s-0: 09:41:38 INFO juju.worker.leadership prometheus-k8s/0 promoted to leadership of prometheus-k8s
unit-prometheus-k8s-0: 09:41:38 INFO juju.agent.tools ensure jujuc symlinks in /var/lib/juju/tools/unit-prometheus-k8s-0
unit-prometheus-k8s-0: 09:41:38 INFO juju.worker.uniter unit "prometheus-k8s/0" started
unit-prometheus-k8s-0: 09:41:38 INFO juju.worker.uniter resuming charm install
unit-prometheus-k8s-0: 09:41:38 INFO juju.worker.uniter.charm downloading ch:amd64/focal/prometheus-k8s-79 from API server
unit-prometheus-k8s-0: 09:41:38 INFO juju.downloader downloading from ch:amd64/focal/prometheus-k8s-79
unit-prometheus-k8s-0: 09:41:39 INFO juju.downloader download complete ("ch:amd64/focal/prometheus-k8s-79")
unit-prometheus-k8s-0: 09:41:39 INFO juju.downloader download verified ("ch:amd64/focal/prometheus-k8s-79")
unit-prometheus-k8s-0: 09:41:43 INFO juju.worker.uniter hooks are retried true
unit-prometheus-k8s-0: 09:41:43 INFO juju.worker.uniter found queued "install" hook
unit-prometheus-k8s-0: 09:41:45 INFO unit.prometheus-k8s/0.juju-log Running legacy hooks/install.
unit-prometheus-k8s-0: 09:41:46 INFO unit.prometheus-k8s/0.juju-log Kubernetes service 'prometheus-k8s' patched successfully
unit-prometheus-k8s-0: 09:41:47 INFO juju.worker.uniter.operation ran "install" hook (via hook dispatching script: dispatch)
unit-prometheus-k8s-0: 09:41:48 INFO juju.worker.uniter.operation ran "prometheus-peers-relation-created" hook (via hook dispatching script: dispatch)
unit-prometheus-configurer-k8s-0: 09:41:48 INFO juju.cmd running containerAgent [2.9.38 6d211be0d72d6f4d625c61c7c4ddb4e9325226c8 gc go1.19.4]
unit-prometheus-configurer-k8s-0: 09:41:48 INFO juju.cmd.containeragent.unit start "unit"
unit-prometheus-configurer-k8s-0: 09:41:48 INFO juju.worker.upgradesteps upgrade steps for 2.9.38 have already been run.
unit-prometheus-configurer-k8s-0: 09:41:48 INFO juju.worker.probehttpserver starting http server on [::]:65301
unit-prometheus-configurer-k8s-0: 09:41:48 INFO juju.api cannot resolve "controller-service.controller-gcloud-jan-18-us-east1.svc.cluster.local": lookup controller-service.controller-gcloud-jan-18-us-east1.svc.cluster.local: operation was canceled
unit-prometheus-configurer-k8s-0: 09:41:48 INFO juju.api connection established to "wss://34.23.118.147:17070/model/5b32b060-5900-4355-8d9c-394ec108cf2e/api"
unit-prometheus-configurer-k8s-0: 09:41:48 INFO juju.worker.apicaller [5b32b0] "unit-prometheus-configurer-k8s-0" successfully connected to "34.23.118.147:17070"
unit-prometheus-configurer-k8s-0: 09:41:48 INFO juju.worker.migrationminion migration phase is now: NONE
unit-prometheus-configurer-k8s-0: 09:41:48 INFO juju.worker.logger logger worker started
unit-prometheus-configurer-k8s-0: 09:41:48 WARNING juju.worker.proxyupdater unable to set snap core settings [proxy.http= proxy.https= proxy.store=]: exec: "snap": executable file not found in $PATH, output: ""
unit-prometheus-configurer-k8s-0: 09:41:48 INFO juju.worker.caasupgrader abort check blocked until version event received
unit-prometheus-configurer-k8s-0: 09:41:48 INFO juju.worker.caasupgrader unblocking abort check
unit-prometheus-configurer-k8s-0: 09:41:48 INFO juju.worker.leadership prometheus-configurer-k8s/0 promoted to leadership of prometheus-configurer-k8s
unit-prometheus-configurer-k8s-0: 09:41:48 INFO juju.agent.tools ensure jujuc symlinks in /var/lib/juju/tools/unit-prometheus-configurer-k8s-0
unit-prometheus-configurer-k8s-0: 09:41:49 INFO juju.worker.uniter unit "prometheus-configurer-k8s/0" started
unit-prometheus-configurer-k8s-0: 09:41:49 INFO juju.worker.uniter resuming charm install
unit-prometheus-configurer-k8s-0: 09:41:49 INFO juju.worker.uniter.charm downloading local:focal/prometheus-configurer-k8s-0 from API server
unit-prometheus-configurer-k8s-0: 09:41:49 INFO juju.downloader downloading from local:focal/prometheus-configurer-k8s-0
unit-prometheus-k8s-0: 09:41:49 INFO juju.worker.uniter found queued "leader-elected" hook
unit-prometheus-configurer-k8s-0: 09:41:49 INFO juju.downloader download complete ("local:focal/prometheus-configurer-k8s-0")
unit-prometheus-configurer-k8s-0: 09:41:49 INFO juju.downloader download verified ("local:focal/prometheus-configurer-k8s-0")
unit-prometheus-k8s-0: 09:41:50 INFO juju.worker.uniter.operation ran "leader-elected" hook (via hook dispatching script: dispatch)
unit-prometheus-configurer-k8s-0: 09:41:52 INFO juju.worker.uniter hooks are retried true
unit-prometheus-configurer-k8s-0: 09:41:53 INFO juju.worker.uniter found queued "install" hook
unit-prometheus-k8s-0: 09:41:53 INFO juju.worker.uniter.operation ran "database-storage-attached" hook (via hook dispatching script: dispatch)
unit-prometheus-configurer-k8s-0: 09:41:53 INFO unit.prometheus-configurer-k8s/0.juju-log Running legacy hooks/install.
unit-prometheus-k8s-0: 09:41:54 INFO unit.prometheus-k8s/0.juju-log Kubernetes resources for app 'prometheus-k8s', container 'prometheus' patched successfully: ResourceRequirements(limits={}, requests={'cpu': '0.25', 'memory': '200Mi'})
unit-prometheus-configurer-k8s-0: 09:41:54 INFO unit.prometheus-configurer-k8s/0.juju-log Kubernetes service 'prometheus-configurer-k8s' patched successfully
unit-prometheus-k8s-0: 09:41:54 INFO unit.prometheus-k8s/0.juju-log reqs=ResourceRequirements(limits={}, requests={'cpu': '0.25', 'memory': '200Mi'}), templated=ResourceRequirements(limits=None, requests={'cpu': '250m', 'memory': '200Mi'}), actual=ResourceRequirements(limits=None, requests=None)
unit-prometheus-configurer-k8s-0: 09:41:55 INFO juju.worker.uniter.operation ran "install" hook (via hook dispatching script: dispatch)
unit-prometheus-configurer-k8s-0: 09:41:55 INFO juju.worker.uniter found queued "leader-elected" hook
unit-prometheus-k8s-0: 09:41:55 INFO juju.worker.uniter.operation ran "config-changed" hook (via hook dispatching script: dispatch)
unit-prometheus-k8s-0: 09:41:55 INFO juju.worker.uniter found queued "start" hook
unit-prometheus-k8s-0: 09:41:55 INFO unit.prometheus-k8s/0.juju-log Running legacy hooks/start.
unit-prometheus-configurer-k8s-0: 09:41:56 INFO juju.worker.uniter.operation ran "leader-elected" hook (via hook dispatching script: dispatch)
unit-prometheus-k8s-0: 09:41:57 INFO juju.worker.uniter.operation ran "start" hook (via hook dispatching script: dispatch)
unit-prometheus-configurer-k8s-0: 09:41:57 INFO unit.prometheus-configurer-k8s/0.juju-log Restarted container dummy-http-server
unit-prometheus-configurer-k8s-0: 09:41:57 INFO juju.worker.uniter.operation ran "dummy-http-server-pebble-ready" hook (via hook dispatching script: dispatch)
unit-prometheus-configurer-k8s-0: 09:41:58 INFO unit.prometheus-configurer-k8s/0.juju-log Starting alert rules watchdog.
unit-prometheus-configurer-k8s-0: 09:41:58 INFO unit.prometheus-configurer-k8s/0.juju-log Started alert rules watchdog process with PID 205.
unit-prometheus-configurer-k8s-0: 09:41:58 INFO juju.worker.uniter.operation ran "prometheus-configurer-pebble-ready" hook (via hook dispatching script: dispatch)
unit-prometheus-k8s-0: 09:41:59 INFO unit.prometheus-k8s/0.juju-log reqs=ResourceRequirements(limits={}, requests={'cpu': '0.25', 'memory': '200Mi'}), templated=ResourceRequirements(limits=None, requests={'cpu': '250m', 'memory': '200Mi'}), actual=ResourceRequirements(limits=None, requests=None)
unit-prometheus-configurer-k8s-0: 09:41:59 INFO unit.prometheus-configurer-k8s/0.juju-log Starting alert rules watchdog.
unit-prometheus-configurer-k8s-0: 09:41:59 INFO unit.prometheus-configurer-k8s/0.juju-log Started alert rules watchdog process with PID 274.
unit-prometheus-k8s-0: 09:41:59 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-prometheus-configurer-k8s-0: 09:42:00 INFO juju.worker.uniter.operation ran "rules-storage-attached" hook (via hook dispatching script: dispatch)
unit-prometheus-configurer-k8s-0: 09:42:00 INFO unit.prometheus-configurer-k8s/0.juju-log Starting alert rules watchdog.
unit-prometheus-configurer-k8s-0: 09:42:00 INFO unit.prometheus-configurer-k8s/0.juju-log Started alert rules watchdog process with PID 362.
unit-prometheus-configurer-k8s-0: 09:42:01 INFO juju.worker.uniter.operation ran "config-changed" hook (via hook dispatching script: dispatch)
unit-prometheus-configurer-k8s-0: 09:42:01 INFO juju.worker.uniter found queued "start" hook
unit-prometheus-k8s-0: 09:42:01 INFO unit.prometheus-k8s/0.juju-log reqs=ResourceRequirements(limits={}, requests={'cpu': '0.25', 'memory': '200Mi'}), templated=ResourceRequirements(limits=None, requests={'cpu': '250m', 'memory': '200Mi'}), actual=ResourceRequirements(limits=None, requests=None)
unit-prometheus-configurer-k8s-0: 09:42:01 INFO unit.prometheus-configurer-k8s/0.juju-log Running legacy hooks/start.
unit-prometheus-k8s-0: 09:42:02 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-prometheus-configurer-k8s-0: 09:42:02 INFO unit.prometheus-configurer-k8s/0.juju-log Starting alert rules watchdog.
unit-prometheus-configurer-k8s-0: 09:42:02 INFO unit.prometheus-configurer-k8s/0.juju-log Started alert rules watchdog process with PID 454.
unit-prometheus-configurer-k8s-0: 09:42:02 INFO unit.prometheus-configurer-k8s/0.juju-log Starting alert rules watchdog.
unit-prometheus-configurer-k8s-0: 09:42:02 INFO unit.prometheus-configurer-k8s/0.juju-log Started alert rules watchdog process with PID 493.
unit-prometheus-configurer-k8s-0: 09:42:02 INFO juju.worker.uniter.operation ran "start" hook (via hook dispatching script: dispatch)
unit-prometheus-configurer-k8s-0: 09:42:03 INFO unit.prometheus-configurer-k8s/0.juju-log Starting alert rules watchdog.
unit-prometheus-configurer-k8s-0: 09:42:03 INFO unit.prometheus-configurer-k8s/0.juju-log Started alert rules watchdog process with PID 543.
unit-prometheus-configurer-k8s-0: 09:42:03 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-prometheus-k8s-0: 09:42:03 INFO unit.prometheus-k8s/0.juju-log reqs=ResourceRequirements(limits={}, requests={'cpu': '0.25', 'memory': '200Mi'}), templated=ResourceRequirements(limits=None, requests={'cpu': '250m', 'memory': '200Mi'}), actual=ResourceRequirements(limits=None, requests=None)
unit-prometheus-k8s-0: 09:42:04 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-prometheus-configurer-k8s-0: 09:42:04 INFO unit.prometheus-configurer-k8s/0.juju-log Starting alert rules watchdog.
unit-prometheus-configurer-k8s-0: 09:42:04 INFO unit.prometheus-configurer-k8s/0.juju-log Started alert rules watchdog process with PID 614.
unit-prometheus-configurer-k8s-0: 09:42:05 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-prometheus-k8s-0: 09:42:06 INFO unit.prometheus-k8s/0.juju-log reqs=ResourceRequirements(limits={}, requests={'cpu': '0.25', 'memory': '200Mi'}), templated=ResourceRequirements(limits=None, requests={'cpu': '250m', 'memory': '200Mi'}), actual=ResourceRequirements(limits=None, requests=None)
unit-prometheus-k8s-0: 09:42:06 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-prometheus-configurer-k8s-0: 09:42:06 INFO unit.prometheus-configurer-k8s/0.juju-log Starting alert rules watchdog.
unit-prometheus-configurer-k8s-0: 09:42:06 INFO unit.prometheus-configurer-k8s/0.juju-log Started alert rules watchdog process with PID 686.
unit-prometheus-configurer-k8s-0: 09:42:06 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-prometheus-k8s-0: 09:42:08 INFO unit.prometheus-k8s/0.juju-log reqs=ResourceRequirements(limits={}, requests={'cpu': '0.25', 'memory': '200Mi'}), templated=ResourceRequirements(limits=None, requests={'cpu': '250m', 'memory': '200Mi'}), actual=ResourceRequirements(limits=None, requests=None)
unit-prometheus-k8s-0: 09:42:08 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-prometheus-configurer-k8s-0: 09:42:08 INFO unit.prometheus-configurer-k8s/0.juju-log Starting alert rules watchdog.
unit-prometheus-configurer-k8s-0: 09:42:08 INFO unit.prometheus-configurer-k8s/0.juju-log Started alert rules watchdog process with PID 759.
unit-prometheus-configurer-k8s-0: 09:42:09 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-prometheus-k8s-0: 09:42:10 INFO unit.prometheus-k8s/0.juju-log reqs=ResourceRequirements(limits={}, requests={'cpu': '0.25', 'memory': '200Mi'}), templated=ResourceRequirements(limits=None, requests={'cpu': '250m', 'memory': '200Mi'}), actual=ResourceRequirements(limits=None, requests=None)
unit-prometheus-k8s-0: 09:42:10 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-prometheus-configurer-k8s-0: 09:42:11 INFO unit.prometheus-configurer-k8s/0.juju-log Starting alert rules watchdog.
unit-prometheus-configurer-k8s-0: 09:42:11 INFO unit.prometheus-configurer-k8s/0.juju-log Started alert rules watchdog process with PID 835.
unit-prometheus-configurer-k8s-0: 09:42:11 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-prometheus-k8s-0: 09:42:12 INFO unit.prometheus-k8s/0.juju-log reqs=ResourceRequirements(limits={}, requests={'cpu': '0.25', 'memory': '200Mi'}), templated=ResourceRequirements(limits=None, requests={'cpu': '250m', 'memory': '200Mi'}), actual=ResourceRequirements(limits=None, requests=None)
unit-prometheus-k8s-0: 09:42:13 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-prometheus-configurer-k8s-0: 09:42:13 INFO unit.prometheus-configurer-k8s/0.juju-log Starting alert rules watchdog.
unit-prometheus-configurer-k8s-0: 09:42:13 INFO unit.prometheus-configurer-k8s/0.juju-log Started alert rules watchdog process with PID 909.
unit-prometheus-configurer-k8s-0: 09:42:13 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-prometheus-k8s-0: 09:42:14 INFO juju.worker.caasunitterminationworker terminating due to SIGTERM
unit-prometheus-k8s-0: 09:42:14 ERROR juju.worker.uniter.operation hook "update-status" (via hook dispatching script: dispatch) failed: signal: terminated
unit-prometheus-k8s-0: 09:42:14 INFO juju.worker.uniter awaiting error resolution for "update-status" hook
unit-prometheus-configurer-k8s-0: 09:42:15 INFO unit.prometheus-configurer-k8s/0.juju-log Starting alert rules watchdog.
unit-prometheus-configurer-k8s-0: 09:42:15 INFO unit.prometheus-configurer-k8s/0.juju-log Started alert rules watchdog process with PID 986.
unit-prometheus-k8s-0: 09:42:16 INFO juju.worker.uniter awaiting error resolution for "update-status" hook
unit-prometheus-configurer-k8s-0: 09:42:16 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-prometheus-configurer-k8s-0: 09:42:17 INFO unit.prometheus-configurer-k8s/0.juju-log Starting alert rules watchdog.
unit-prometheus-configurer-k8s-0: 09:42:17 INFO unit.prometheus-configurer-k8s/0.juju-log Started alert rules watchdog process with PID 1053.
unit-prometheus-configurer-k8s-0: 09:42:17 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-prometheus-k8s-0: 09:42:18 INFO juju.worker.uniter awaiting error resolution for "update-status" hook
unit-prometheus-configurer-k8s-0: 09:42:19 INFO unit.prometheus-configurer-k8s/0.juju-log Starting alert rules watchdog.
unit-prometheus-configurer-k8s-0: 09:42:19 INFO unit.prometheus-configurer-k8s/0.juju-log Started alert rules watchdog process with PID 1122.
unit-prometheus-configurer-k8s-0: 09:42:19 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-prometheus-k8s-0: 09:42:19 INFO juju.worker.uniter awaiting error resolution for "update-status" hook
unit-prometheus-k8s-0: 09:42:19 INFO juju.worker.uniter awaiting error resolution for "update-status" hook
unit-prometheus-configurer-k8s-0: 09:42:20 INFO unit.prometheus-configurer-k8s/0.juju-log prometheus:1: Starting alert rules watchdog.
unit-prometheus-configurer-k8s-0: 09:42:20 INFO unit.prometheus-configurer-k8s/0.juju-log prometheus:1: Started alert rules watchdog process with PID 1193.
unit-prometheus-configurer-k8s-0: 09:42:21 INFO unit.prometheus-configurer-k8s/0.juju-log prometheus:1: Restarted container prometheus-configurer
unit-prometheus-configurer-k8s-0: 09:42:22 INFO juju.worker.uniter.operation ran "prometheus-relation-created" hook (via hook dispatching script: dispatch)
unit-prometheus-configurer-k8s-0: 09:42:22 WARNING juju.worker.uniter.operation we should run a leader-deposed hook here, but we can't yet
unit-prometheus-configurer-k8s-0: 09:42:22 INFO juju.worker.caasunitterminationworker terminating due to SIGTERM
unit-prometheus-configurer-k8s-0: 09:42:23 ERROR juju.worker.uniter.operation hook "rules-storage-detaching" (via hook dispatching script: dispatch) failed: signal: terminated
unit-prometheus-configurer-k8s-0: 09:42:23 INFO juju.worker.uniter awaiting error resolution for "storage-detaching" hook
unit-prometheus-configurer-k8s-0: 09:42:24 INFO juju.worker.uniter awaiting error resolution for "storage-detaching" hook
unit-prometheus-configurer-k8s-0: 09:42:26 INFO juju.worker.uniter awaiting error resolution for "storage-detaching" hook
controller-0: 09:43:26 INFO juju.worker.caasapplicationprovisioner.runner stopped "prometheus-configurer-k8s", err: application "prometheus-configurer-k8s" not found
controller-0: 09:43:26 ERROR juju.worker.caasapplicationprovisioner.runner exited "prometheus-configurer-k8s": application "prometheus-configurer-k8s" not found
controller-0: 09:43:26 INFO juju.worker.caasapplicationprovisioner.runner restarting "prometheus-configurer-k8s" in 3s
controller-0: 09:43:29 INFO juju.worker.caasapplicationprovisioner.runner start "prometheus-configurer-k8s"
controller-0: 09:43:29 INFO juju.worker.caasapplicationprovisioner.runner stopped "prometheus-configurer-k8s", err: <nil>

Additional context

No response

rbarry82 commented 1 year ago

@sanchezfdezjavier was this when removing the application or destroying a model?

There's no Python traceback in the log, which makes me suspect it's the same Juju bug (we haven't seen it on 3.x in a while, but late 2.9 series saw it semi-frequently when churning models) where juju.worker.caasunitterminationworker sends a SIGTERM while a hook is running. Operator Framework doesn't have a signal handler, so it immediately shuts down and cleans up, returns a non-zero exit code to the hook runner, and Juju believes that it's failed.

There is/was a Launchpad issue about this, but I can never find it when I want it. @jameinel does this ring a bell? @benhoyt?

Not that OF necessarily needs to, but in our KubernetesComputeResourcesPatch we set up a signal handler for a limited window of time, which allows Juju to happily continue even if changing the PodSpec (usually on the first unit) hits the scheduling window of the k8s reconciler loop and the Juju dispatch loop.
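
Roughly, that kind of time-boxed guard looks like the following (a minimal sketch, not the actual KubernetesComputeResourcesPatch code; the function name and the exit-0 behaviour are assumptions):

```python
# Hypothetical sketch: tolerate SIGTERM during a known-racy operation so the
# termination signal does not surface as a non-zero hook exit.
import logging
import signal
from contextlib import contextmanager

logger = logging.getLogger(__name__)


@contextmanager
def tolerate_sigterm():
    """Temporarily treat SIGTERM as 'Juju is recycling the pod', not a charm error."""

    def _handler(signum, frame):
        logger.info("Received SIGTERM while patching resources; exiting cleanly.")
        raise SystemExit(0)  # exit 0 so the hook is not marked as failed

    previous = signal.signal(signal.SIGTERM, _handler)
    try:
        yield
    finally:
        signal.signal(signal.SIGTERM, previous)
```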

I can't think of many meaningful scenarios where executing charm code would receive a SIGTERM|SIGKILL|SIGHUP and actually want to exit 1, but I'm sure there are some.

benhoyt commented 1 year ago

@rbarry82 I believe you're thinking of https://bugs.launchpad.net/juju/+bug/1951415 that Tom is currently working on.

rbarry82 commented 1 year ago

That's the one. Thanks @benhoyt! I'll remember to look here so I don't get lost in similar Launchpad issues.

In our charms, whether because we were one of the earliest OF charming teams or by accident, we're pretty good about guarding against exceptions from Pebble while it's dying, so containeragent going away early manifests as "pure" hook failures. Is there a reason you can think of why signals would make it to OF, though? That is, is there a reason you can think of why OF shouldn't set up a signal handler that exits gracefully?
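
For illustration, a charm-level version of that idea could look like the sketch below. This is hypothetical, not an Operator Framework feature; the charm class and handler names are made up, and exiting 0 on SIGTERM has its own trade-offs:

```python
#!/usr/bin/env python3
# Hypothetical charm entry point that exits cleanly on SIGTERM instead of letting
# the signal surface as 'hook failed: signal: terminated'. Not actual ops behaviour.
import signal
import sys

from ops.charm import CharmBase
from ops.main import main


class PrometheusK8sCharm(CharmBase):
    """Event observers elided; only the signal handling is relevant here."""


def _exit_gracefully(signum, frame):
    # Exit 0 so Juju records the hook as having completed rather than failed.
    sys.exit(0)


if __name__ == "__main__":
    signal.signal(signal.SIGTERM, _exit_gracefully)
    main(PrometheusK8sCharm)
```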

sanchezfdezjavier commented 1 year ago

> @sanchezfdezjavier was this when removing the application or destroying a model?

No, it didn't happen when removing the application or destroying a model. You can see the failed run here.

Let me know how I can help.

This is the error log from the CI:

------------------------------ live log teardown -------------------------------
INFO     pytest_operator.plugin:plugin.py:761 Model status:

Model                  Controller       Cloud/Region        Version  SLA          Timestamp
test-integration-vzo3  github-pr-7ad0d  microk8s/localhost  2.9.38   unsupported  14:14:09Z

App                        Version  Status   Scale  Charm                      Channel  Rev  Address         Exposed  Message
prometheus-configurer-k8s           waiting      1  prometheus-configurer-k8s             0  10.152.183.130  no       installing agent
prometheus-k8s                      waiting      1  prometheus-k8s             stable    79  10.152.183.47   no       installing agent

Unit                          Workload     Agent      Address      Ports  Message
prometheus-configurer-k8s/0*  maintenance  executing  10.1.233.13         Configuring pebble layer for prometheus-configurer
prometheus-k8s/0*             error        idle       10.1.233.11         hook failed: "update-status"

INFO     pytest_operator.plugin:plugin.py:767 Juju error logs:

unit-prometheus-k8s-0: 14:05:51 ERROR juju.worker.uniter.operation hook "update-status" (via hook dispatching script: dispatch) failed: signal: terminated

INFO     pytest_operator.plugin:plugin.py:783 juju-crashdump finished [0]
INFO     pytest_operator.plugin:plugin.py:854 Resetting model test-integration-vzo3...
INFO     pytest_operator.plugin:plugin.py:843    Destroying applications prometheus-k8s
INFO     pytest_operator.plugin:plugin.py:843    Destroying applications prometheus-configurer-k8s
INFO     pytest_operator.plugin:plugin.py:859 Not waiting on reset to complete.
INFO     pytest_operator.plugin:plugin.py:832 Forgetting main...

=================================== FAILURES ===================================
_ TestPrometheusConfigurerOperatorCharm.test_given_prometheus_configurer_charm_in_blocked_status_when_prometheus_relation_created_then_charm_goes_to_active_status _
Traceback (most recent call last):
  File "/home/runner/work/prometheus-configurer-k8s-operator/prometheus-configurer-k8s-operator/tests/integration/test_integration.py", line 67, in test_given_prometheus_configurer_charm_in_blocked_status_when_prometheus_relation_created_then_charm_goes_to_active_status
    await ops_test.model.wait_for_idle(
  File "/home/runner/work/prometheus-configurer-k8s-operator/prometheus-configurer-k8s-operator/.tox/integration/lib/python3.8/site-packages/juju/model.py", line 2707, in wait_for_idle
    _raise_for_status(errors, "error")
  File "/home/runner/work/prometheus-configurer-k8s-operator/prometheus-configurer-k8s-operator/.tox/integration/lib/python3.8/site-packages/juju/model.py", line 2646, in _raise_for_status
    raise error_type("{}{} in {}: {}".format(
juju.errors.JujuUnitError: Unit in error: prometheus-k8s/0

rbarry82 commented 1 year ago

The key log messages are here:

unit-prometheus-k8s-0: 09:42:14 INFO juju.worker.caasunitterminationworker terminating due to SIGTERM
unit-prometheus-k8s-0: 09:42:14 ERROR juju.worker.uniter.operation hook "update-status" (via hook dispatching script: dispatch) failed: signal: terminated
unit-prometheus-k8s-0: 09:42:14 INFO juju.worker.uniter awaiting error resolution for "update-status" hook

Also, that the model had literally just come up:

INFO     pytest_operator.plugin:plugin.py:761 Model status:

Model                  Controller       Cloud/Region        Version  SLA          Timestamp
test-integration-vzo3  github-pr-7ad0d  microk8s/localhost  2.9.38   unsupported  14:14:09Z

App                        Version  Status   Scale  Charm                      Channel  Rev  Address         Exposed  Message
prometheus-configurer-k8s           waiting      1  prometheus-configurer-k8s             0  10.152.183.130  no       installing agent
prometheus-k8s                      waiting      1  prometheus-k8s             stable    79  10.152.183.47   no       installing agent

Unit                          Workload     Agent      Address      Ports  Message
prometheus-configurer-k8s/0*  maintenance  executing  10.1.233.13         Configuring pebble layer for prometheus-configurer
prometheus-k8s/0*             error        idle       10.1.233.11         hook failed: "update-status"

There is not even a containeragent running on the charm yet, and no charm code is involved. This is strictly something happening within Juju (on deployment rather than teardown, it seems).

sanchezfdezjavier commented 1 year ago

Good point, thanks for clarifying, @rbarry82. I guess I can close this, as it's not a Prometheus problem, right?

rbarry82 commented 1 year ago

It's not, really, no. Hopefully we pointed you in the right direction though.