balena-os / meta-balena

A collection of Yocto layers used to build balenaOS images
https://www.balena.io/os

persistent logging does strange boot logging (missed boot logs between reboots) #1919

Open floion opened 4 years ago

floion commented 4 years ago

Reproduced on balenaOS 2.51.1+rev1 on rpi4

Steps to reproduce:

1 - In Device Configuration, activate the variable RESIN_SUPERVISOR_PERSISTENT_LOGGING and set its state to Enabled. This will reboot the device.

2 - When the board comes back online after step 1, wait for the supervisor to start and then reboot the device again using the dashboard actions.

3 - When the board comes back online after step 2, running journalctl --list-boots should show 2 boot entries being logged:

-1 ea3e73a92f08427fbfe24cde398b0cbf Fri 2018-12-21 13:29:02 UTC—Fri 2018-12-21 13:30:31 UTC
 0 f3309c07fcc8481595b0e6a803529043 Fri 2018-12-21 13:30:50 UTC—Fri 2018-12-21 13:31:09 UTC

but in my case it only showed:

0 6cef38676b3a4a15a39488bc02304fe5 Tue 2020-05-26 11:27:09 UTC—Mon 2020-06-08 08:02:38 UTC

If I reboot the board once again, it shows the 2 boots being logged. If I then reboot it again, it shows only 1 boot. Rebooting again shows 2 boots, and another reboot showed 3 boots. So the behaviour is really random.
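
For context, enabling persistent logging generally means journald stores its journal under /var/log/journal instead of the volatile /run/log/journal. A quick way to check the storage mode and the recorded boots after each reboot (a rough sketch; exact paths on balenaOS may differ):

# If this directory exists and contains journal files, journald is persisting logs
ls /var/log/journal/

# With persistence enabled, this list should grow by one entry per reboot
journalctl --list-boots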

vipulgupta2048 commented 4 years ago

Thanks @floion for opening the issue. Following it.

roman-mazur commented 4 years ago

Once this is closed, we should revert the OS test suite PR that disables failure reports for this: https://github.com/balena-os/leviathan/pull/338

markcorbinuk commented 4 years ago

Looks like this is going to be a problem for boards without an RTC :-(

I've been testing on an RPi3:

1) Start with a clean SD card
2) Boot the system
3) Enable persistent logging via the dashboard (which triggers a reboot)

If you check the journal at this point everything looks fine - the journal files are present in /var/log/journal/... and the journal is clean (i.e. no reboot messages).
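
For instance, a rough way to confirm the journal is clean at this stage (assuming the default journalctl output format) is to count the reboot markers in the merged output:

# A clean journal immediately after enabling persistent logging should report 0
journalctl --no-pager | grep -c -- '-- Reboot --'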

4) Trigger a reboot via the dashboard

If you check the journal now you will see lots of reboot tags similar to the following:

-- Reboot --
May 11 11:35:49 localhost sh[844]: [INFO] Filesystem check on /dev/disk/by-state/resin-data...
-- Reboot --
May 11 11:35:49 localhost systemd-tmpfiles[845]: [/etc/tmpfiles.d/openvpn.conf:1] Line references path below legacy directory /var/run/, updating />
May 11 11:35:49 localhost sh[846]: [INFO] Filesystem check on /dev/disk/by-state/resin-data...
-- Reboot --
May 11 11:35:49 localhost systemd-tmpfiles[847]: [/etc/tmpfiles.d/openvpn.conf:1] Line references path below legacy directory /var/run/, updating />
May 11 11:35:49 localhost sh[844]: e2fsck 1.44.5 (15-Dec-2018)
-- Reboot --
May 11 11:35:49 localhost sh[846]: e2fsck 1.44.5 (15-Dec-2018)
-- Reboot --
May 11 11:35:49 localhost os-config[843]: `deviceApiKey` already generated
May 11 11:35:49 localhost sh[844]: resin-data: clean, 3423/3807232 files, 555529/30454272 blocks
May 11 11:35:49 localhost sh[844]: [INFO] Expand ext4 filesystem on /dev/disk/by-state/resin-data...
May 11 11:35:49 localhost sh[844]: resize2fs 1.44.5 (15-Dec-2018)

What is happening here is that journalctl is merging all the logs together so that they can be displayed in chronological order. Unfortunately, due to the lack of an RTC, every boot produces log messages that start with a date of 11th May 2020, then pick up a date of 10th June 2020, and only finally switch to the correct date/time via NTP. When the messages are all merged together we end up with a duplicate set of 'May and June' messages for every boot.

Each boot has a unique BOOT_ID, which is what journalctl uses to determine whether the system has rebooted. When the messages from one or more boots are merged together, the BOOT_ID can change in rapid succession, causing journalctl to register a reboot. You can see this in the journal output above, where there are log messages for systemd-tmpfiles (PIDs 845 and 847) and e2fsck (PIDs 844 and 846) from the 2 boot sequences.
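
One way to see this interleaving directly (a sketch using standard journald metadata fields) is to dump the boot ID and PID alongside each message in the affected time window and watch _BOOT_ID flip between the two boots even though the entries are sorted by their (duplicate) timestamps:

# The time window matches the excerpt above; adjust as needed
journalctl --since "2020-05-11 11:35:00" --until "2020-05-11 11:36:00" \
    -o verbose | grep -E '_BOOT_ID=|_PID=|MESSAGE='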

The individual journal files in /var/log/journal/... look to be correct; the problem seems to occur when combining multiple files that contain duplicate timestamps.
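
To rule out the individual files, each one can be inspected in isolation (assuming a standard systemd journalctl; the machine-id directory name differs per device):

# Shows entries from this single file only, with no merging of other journal files
journalctl --file /var/log/journal/<machine-id>/system.journal --no-pager | head -n 20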

I found an open issue in the systemd repository from 2015 that has 'finding a smart way to deal with this' listed as a feature request: https://github.com/systemd/systemd/issues/662

I suspect that the multiple reboot records are messing with the ability of journalctl to list-boots correctly.

vipulgupta2048 commented 4 years ago

Hey @markcorbinuk any update on this? thanks!

markcorbinuk commented 4 years ago

Hello @vipulgupta2048 I'm testing a potential fix for this at the moment. See #2008

acostach commented 4 years ago

Issue present on 2.56.0+rev2 for PI3 32bit

acostach commented 4 years ago

Issue present on 2020.07.2 Pi4

markcorbinuk commented 3 years ago

Closing this issue as the fake hwclock package (#2008, #2014) should address this problem.
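
For readers unfamiliar with the approach: a fake hwclock periodically saves the current time to disk and restores it early on the next boot, so a board without an RTC no longer starts every boot at the same stale date. The sketch below is purely illustrative; the file path, script, and service wiring are assumptions, not the actual implementation merged in #2008/#2014.

#!/bin/sh
# Illustrative fake-hwclock sketch (not the meta-balena implementation).
# "save" would run periodically and at shutdown; "restore" early during boot.
# Assumes GNU date for the -d option and the @epoch form.

TIMESTAMP_FILE=/var/lib/fake-hwclock/timestamp   # hypothetical location

case "$1" in
    save)
        mkdir -p "$(dirname "$TIMESTAMP_FILE")"
        date -u '+%Y-%m-%d %H:%M:%S' > "$TIMESTAMP_FILE"
        ;;
    restore)
        # Only step the clock forward so an already-correct clock is never
        # moved backwards; NTP takes over once the network is up.
        if [ -f "$TIMESTAMP_FILE" ]; then
            saved=$(date -u -d "$(cat "$TIMESTAMP_FILE")" '+%s')
            now=$(date -u '+%s')
            [ "$saved" -gt "$now" ] && date -u -s "@$saved"
        fi
        ;;
esac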

vipulgupta2048 commented 3 years ago

@markcorbinuk :tada: :tada: :tada: Any idea which image/release I should watch for this to land in?

vipulgupta2048 commented 3 years ago

Persistent logging tests activated with the above change still show flaky behavior in tests done on 1, 2, and 3. Mark will test more on this front and get back to us about it.

markcorbinuk commented 3 years ago

It does seem that there is another issue here that is not related to the time sequencing of log messages. I booted a system with persistent logging enabled and it reported an incorrect number of boot cycles. The following message was observed in the output of dmesg:

systemd-journald[915]: File /var/log/journal/a86e747cc49d4b2a852324f1d8b23929/system.journal corrupted or uncleanly shut down, renaming and replacing.
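
When this happens, the journal's own consistency check can help distinguish genuine file corruption from the time-sequencing problem discussed above:

# Verifies the internal consistency of every journal file and reports
# corrupted or unclean files individually
journalctl --verify
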
vipulgupta2048 commented 3 years ago

@markcorbinuk Reaching back, I was reviewing the persistent logging test. It is still disabled. How's it looking over here?

markcorbinuk commented 3 years ago

@vipulgupta2048 I think that we can re-enable the test now that https://github.com/balena-os/meta-balena/pull/2114 has been merged.

vipulgupta2048 commented 3 years ago

@markcorbinuk I ran a couple of runs and it failed once. I will make the PR nonetheless; the rig might be the best testing ground.