canonical / operator-libs-linux

Linux helper libraries for the operator framework
Apache License 2.0
11 stars 39 forks source link

When a snap fails to restart for some reason, e.g. "Start request repeated too quickly", log messages do not reflect the reason #136

Open sed-i opened 3 days ago

sed-i commented 3 days ago

Problem 1: set fails because charm is restarted to quickly

I need to call snap.set immediately after start/ensure, like so:

    if snap.present:
        if snap.revision != revision:
            cmd = ["snap", "refresh", "grafana-agent", f'--revision="{revision}"']
            if classic:
                cmd.append("--classic")
            subprocess.run(cmd)
            snap.start(enable=True)
    else:
        snap.ensure(state=snap_lib.SnapState.Present, revision=revision, classic=classic)

    if config:
        snap.set(config)

    snap.hold()

At this point the call to snap.set fails with:

Nov 14 19:57:03 juju-c56ce2-18 systemd[1]: snap.grafana-agent.grafana-agent.service: Scheduled restart job, restart counter is at 5.
Nov 14 19:57:03 juju-c56ce2-18 systemd[1]: Stopped Service for snap application grafana-agent.grafana-agent.
Nov 14 19:57:03 juju-c56ce2-18 systemd[1]: snap.grafana-agent.grafana-agent.service: Start request repeated too quickly.
Nov 14 19:57:03 juju-c56ce2-18 systemd[1]: snap.grafana-agent.grafana-agent.service: Failed with result 'exit-code'.
Nov 14 19:57:03 juju-c56ce2-18 systemd[1]: Failed to start Service for snap application grafana-agent.grafana-agent.
Nov 14 19:57:09 juju-c56ce2-18 systemd[1]: snap.grafana-agent.grafana-agent.service: Start request repeated too quickly.
Nov 14 19:57:09 juju-c56ce2-18 systemd[1]: snap.grafana-agent.grafana-agent.service: Failed with result 'exit-code'.
Nov 14 19:57:09 juju-c56ce2-18 systemd[1]: Failed to start Service for snap application grafana-agent.grafana-agent.

Problem 2a: the log message doesn't show stderr

In the traceback I see messages such as

charms.operator_libs_linux.v2.snap.SnapError: Snap: 'grafana-agent'; command ['snap', 'set', 'grafana-agent', 'reporting-enabled=1'] failed with output = ''

because SnapError doesn't print e.stderr. Perhaps it would be handy to also include e.stderr here:

https://github.com/canonical/operator-libs-linux/blob/a1aaa35135220d564f7a63854388b38752de09c4/lib/charms/operator_libs_linux/v2/snap.py#L280-L283

Problem 2b: this error doesn't go to stderr anyway, so pull from journalctl or systemctl?

I manually added stderr to the log but discovered there's nothing there, all the errors are logged to journal. Should the lib pull from the journal when raising SnapError?

sed-i commented 3 days ago

Note: the rapid restart is due to an error I discovered with snap logs, due to a build error: https://github.com/canonical/grafana-agent-snap/pull/80.

So there is no problem with snap.set, the only problem is that the reason for restart failure is buried.