coreos / fedora-coreos-tracker

Issue tracker for Fedora CoreOS
https://fedoraproject.org/coreos/
263 stars 60 forks source link

hanging `journalctl --list-boots` stops boot #1641

Open dustymabe opened 9 months ago

dustymabe commented 9 months ago

This was originally reported in https://discussion.fedoraproject.org/t/fcos-39-fails-booting-due-journalctl-list-boots-never-return/97006/6

The coreos-ignition-write-issues.service service hangs on boot and prevents the boot from progressing further. The [call]() that is getting stuck is journalctl --list-boots.

For some reason after the transition to F39 there is some corner case where journalctl --list-boots can just run indefinitely.

I had a Rpi4 that experienced this on the 38.20231027.3.2 -> 39.20231101.3.0 transition.

Booting the old 38.20231027.3.2 entry I was able to get the system up and running again. Inspecting the system showed:

[core@pibackup ~]$ rpm -q systemd
systemd-253.12-1.fc38.aarch64

[core@pibackup ~]$ time journalctl --list-boots
IDX BOOT ID                          FIRST ENTRY                 LAST ENTRY                 
-24 df49a358474449c1bb20e3761543d1ca Thu 2023-05-11 00:00:00 UTC Sat 2023-07-22 17:56:26 UTC
-23 017a791eae7b422492d6f8dae896032e Thu 2023-05-11 00:00:01 UTC Thu 2023-05-11 00:08:31 UTC
-22 5abbc0f2677548c1b8b5c68cdddd601d Thu 2023-05-11 00:00:01 UTC Thu 2023-05-11 00:46:26 UTC
-21 57002a420b4a45f68ab546e641f52284 Thu 2023-05-11 00:00:01 UTC Sun 2023-07-23 03:27:15 UTC
-20 9ac2b1967ca44bd48708b5bbcdc2d4bb Thu 2023-05-11 00:00:01 UTC Sun 2023-07-23 03:49:06 UTC
-19 674dd707fd064aa6a1d588710a923014 Thu 2023-05-11 00:00:01 UTC Sat 2023-07-29 11:05:40 UTC
-18 be52536bcc114003a8aa1437e7dce1da Thu 2023-05-11 00:00:01 UTC Sat 2023-08-12 11:03:31 UTC
-17 4ea39bca40ee4680b07a6ddd3a29af5c Thu 2023-05-11 00:00:01 UTC Sat 2023-08-26 11:04:55 UTC
-16 d362b25669ac4a7ea0a4e354929f1632 Thu 2023-05-11 00:00:01 UTC Mon 2023-08-28 00:02:45 UTC
-15 8769bd807f974112ab0b5e49ce23c04f Thu 2023-05-11 00:00:01 UTC Mon 2023-08-28 02:10:14 UTC
-14 65ddbc1db5a04c099116d3b4afae42fa Thu 2023-05-11 00:00:01 UTC Tue 2023-08-29 05:48:18 UTC
-13 00ab9ff86ff6472f84e055a100411165 Thu 2023-05-11 00:00:01 UTC Tue 2023-08-29 17:56:41 UTC
-12 8571251afa6749e0b5eb45c87baa2d9d Thu 2023-05-11 00:00:01 UTC Sat 2023-09-09 11:05:07 UTC
-11 68869bef4560454087516bfd679bd663 Mon 2023-07-17 00:00:01 UTC Sat 2023-09-09 11:21:19 UTC
-10 8012963ae8264e809248dbfa01e4fc83 Mon 2023-07-17 00:00:01 UTC Sat 2023-09-23 11:00:24 UTC
 -9 2896ecd773e84c3c8c7d3b9d721531ab Mon 2023-07-17 00:00:01 UTC Sat 2023-10-07 11:03:45 UTC
 -8 28c0be761de8431d999631a176dcec26 Fri 2023-09-08 00:00:01 UTC Sat 2023-10-14 11:00:24 UTC
 -7 d6b26f4b3e9e487e947b34000273fdb7 Fri 2023-09-08 00:00:01 UTC Fri 2023-10-20 11:20:33 UTC
 -6 3dc9125c81e94adba4b64394f37e539b Fri 2023-09-08 00:00:01 UTC Sat 2023-10-21 11:01:44 UTC
 -5 a1882054dfe74b4c9a9b558c929e42ff Mon 2023-09-18 00:00:01 UTC Sat 2023-11-04 11:03:06 UTC
 -4 c88914fc389b49e5b52a605a1b665923 Mon 2023-09-18 00:00:01 UTC Sat 2023-11-11 11:04:21 UTC
 -3 abb00cbc2b8b42f6a55d2ae3829d2949 Wed 2023-10-25 00:00:01 UTC Sat 2023-11-25 11:04:19 UTC
 -2 09b927cf472040a9ae8ecd564a9c36e5 Sat 2023-11-25 11:04:19 UTC Sun 2023-11-26 00:00:00 UTC
 -1 0fa0118dbfe84b49904f58807328878f Sun 2023-11-26 00:29:47 UTC Sun 2023-12-31 20:40:50 UTC
  0 e1afeb99e3aa49f1bb16515633d319c9 Wed 2023-10-25 00:00:00 UTC Sun 2023-12-31 20:59:53 UTC

real    0m0.103s
user    0m0.027s
sys     0m0.074s

[core@pibackup ~]$ du -sh /var/log/journal/
677M    /var/log/journal/

[core@pibackup ~]$ ls /var/log/journal/*/ | wc -l
90

I added a TimeoutStartSec override as suggested in the discussion forum post

[core@pibackup:~$ cat /etc/systemd/system/coreos-ignition-write-issues.service.d/override.conf
[Service]
TimeoutStartSec=120

and was able to then upgrade to F39.

Once the system is up I then see:

core@pibackup:~$ rpm -q systemd
systemd-254.5-2.fc39.aarch64

core@pibackup:~$ du -sh /var/log/journal/

706M    /var/log/journal/
core@pibackup:~$ ls /var/log/journal/c01e7581164c4a188164c602616cdeba/ | wc -l
94

core@pibackup:~$ time journalctl --list-boots
^C
real    115m45.212s
user    41m1.402s
sys     72m43.511s

I had to CTRL-C out of the process after almost 2 hours. Before the upgrade the process finished in less than a second.

As suggested in the discussion forum we should probably find a better way to achieve our goal rather than running journalctl --list-boots on startup anyway, but there is clearly IMO a bug in systemd here that was introduced with systemd-253.12-1.fc38 ⟶ 254.5-2.fc39.

jlebon commented 9 months ago

We should probably add a QEMU test that sets e.g. -rtc=1970-01-01 to get coverage on these types of issues. (I don't think we need to try to get things to work perfectly in that mode, but at least not completely breaking boot.)

perosb commented 8 months ago

Any workarounds for this when you cannot ssh into rpi4? Several rpi4 now claims "system is boosting up" so cannot ssh into them.

dustymabe commented 8 months ago

Try to catch the kernel command line and add systemd.mask=coreos-ignition-write-issues.service to it. I think that should let you get the system up and then you can apply the workaround as suggested in the description here.

I've engaged @keszybz on this issue and he is looking into it so I'm hoping we'll have a fix soon.

Alternatively we should probably look at not running journalctl --list-boots on boot and using some other mechanism for getting this information.

dustymabe commented 6 months ago

Proposed fix to the infinite loop problem in https://github.com/systemd/systemd/pull/31975

Though, note: that PR doesn't address:

Alternatively we should probably look at not running journalctl --list-boots on boot and using some other mechanism for getting this information.

dustymabe commented 6 months ago

Proposed fix to the infinite loop problem in systemd/systemd#31975

Merged in https://github.com/systemd/systemd/commit/1e8c0c671e3076db811804343b3b8d744bcf27ac

dustymabe commented 1 month ago

Fixed in systemd v256 and newer which is in F41 now.

dustymabe commented 4 weeks ago

The fix for this went into next stream release 41.20240916.1.0. Please try out the new release and report issues.