balena-os / meta-balena

A collection of Yocto layers used to build balenaOS images
https://www.balena.io/os
966 stars 115 forks source link

Watchdog should not assume timeout when the system time is adjusted #2221

Open pdcastro opened 3 years ago

pdcastro commented 3 years ago

Investigating a support thread, I suspect (not sure) that a watchdog restarted balenaEngine because it thought that the engine timed out, when in fact it did not time out, just the system time jumped because chronyd adjusted it.

On device boot, initially at 12:00:47 UTC timestamp, check how balenaEngine "begins to start":

Jun 16 12:00:47 61146d8 balenad[1021]: time="2021-06-16T12:00:47.057292036Z" level=info msg="Starting up"
Jun 16 12:00:47 61146d8 balenad[1021]: time="2021-06-16T12:00:47.069265193Z" level=warning msg="Running experimental build"
Jun 16 12:00:47 61146d8 balenad[1021]: time="2021-06-16T12:00:47.174972641Z" level=info msg="libcontainerd: started new balena-engine-containerd process" pid=1065

Then a few seconds later, the system time jumps from 12:00 to 12:43:

Jun 16 12:00:55 61146d8 openvpn[1015]: Wed Jun 16 12:00:55 2021 VERIFY EKU OK
Jun 16 12:00:55 61146d8 openvpn[1015]: Wed Jun 16 12:00:55 2021 VERIFY OK: depth=0, C=US, ST=WA, O=balena.io, OU=balenaCloud, CN=vpn.balena-cloud.com
Jun 16 12:43:23 61146d8 chronyd[852]: 2021-06-16T12:00:57Z Selected source 10.250.255.253
Jun 16 12:43:23 61146d8 chronyd[852]: 2021-06-16T12:00:57Z System clock wrong by 2546.206649 seconds
Jun 16 12:43:23 61146d8 chronyd[852]: 2021-06-16T12:43:23Z System clock was stepped by 2546.206649 seconds

Then a minute later, "boom":

Jun 16 12:44:37 61146d8 systemd[1]: balena.service: start operation timed out. Terminating.
Jun 16 12:44:37 61146d8 balenad[1021]: time="2021-06-16T12:44:37.923441909Z" level=info msg="Processing signal 'terminated'"
Jun 16 12:44:45 61146d8 balenad[1021]: time="2021-06-16T12:44:45.459025211Z" level=info msg="Loading containers: done."
Jun 16 12:44:48 61146d8 balenad[1021]: time="2021-06-16T12:44:48.538097566Z" level=info msg="Docker daemon" commit=074a481789174b4b6fd2d706086e8ffceb72e924 graphdriv>
Jun 16 12:44:48 61146d8 balenad[1021]: time="2021-06-16T12:44:48.548452862Z" level=info msg="Daemon has completed initialization"
Jun 16 12:44:48 61146d8 balenad[1021]: time="2021-06-16T12:44:48.920202012Z" level=info msg="API listen on /var/run/balena-engine.sock"
Jun 16 12:44:48 61146d8 balenad[1021]: time="2021-06-16T12:44:48.924905563Z" level=info msg="API listen on /run/balena-engine.sock"
Jun 16 12:44:48 61146d8 balenad[1021]: time="2021-06-16T12:44:48.931987805Z" level=info msg="API listen on /var/run/balena.sock"
Jun 16 12:45:03 61146d8 balenad[1021]: time="2021-06-16T12:45:03.941284956Z" level=error msg="Force shutdown daemon"
Jun 16 12:45:03 61146d8 balenad[1021]: time="2021-06-16T12:45:03.948576102Z" level=info msg="Daemon shutdown complete"
Jun 16 12:45:03 61146d8 balenad[1021]: time="2021-06-16T12:45:03.949143818Z" level=info msg="stopping healthcheck following graceful shutdown" module=libcontainerd
Jun 16 12:45:03 61146d8 balenad[1021]: time="2021-06-16T12:45:03.980403442Z" level=info msg="stopping event stream following graceful shutdown" error="context cancel>
Jun 16 12:45:04 61146d8 balenad[1021]: time="2021-06-16T12:45:03.984323415Z" level=info msg="stopping event stream following graceful shutdown" error="context cancel>
Jun 16 12:45:05 61146d8 systemd[1]: balena.service: Failed with result 'timeout'.
Jun 16 12:45:05 61146d8 systemd[1]: Failed to start Balena Application Container Engine.
Jun 16 12:45:05 61146d8 systemd[1]: balena.service: Found left-over process 1025 (exe) in control group while starting unit. Ignoring.
Jun 16 12:45:05 61146d8 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jun 16 12:45:05 61146d8 systemd[1]: balena.service: Found left-over process 1432 (balena-engine-c) in control group while starting unit. Ignoring.
Jun 16 12:45:05 61146d8 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jun 16 12:45:05 61146d8 systemd[1]: balena.service: Found left-over process 1462 (balena-engine-c) in control group while starting unit. Ignoring.
Jun 16 12:45:05 61146d8 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jun 16 12:45:05 61146d8 systemd[1]: balena.service: Found left-over process 1484 (balena-engine-c) in control group while starting unit. Ignoring.
Jun 16 12:45:05 61146d8 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jun 16 12:45:05 61146d8 systemd[1]: balena.service: Found left-over process 1594 (balena-engine-c) in control group while starting unit. Ignoring.
Jun 16 12:45:05 61146d8 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.

The device was a BeagleBone Black running balenaOS 2.73.1+rev2 (prod), supervisor 12.4.6.

jellyfish-bot commented 3 years ago

[pdcastro] This issue has attached support thread https://jel.ly.fish/d530e774-0d99-4a90-9c6a-d8646495246b

pdcastro commented 3 years ago

Diagnostic logs: 61146d8974f041c8ee712886f0521118_diagnostics_2021.06.17_18.02.39+0000.txt