Closed someplaceguy closed 1 year ago
I'm sorry, but I just realized that there's something misleading about my logs.
I don't think the timestamps in my logs actually correspond to when the events happened, because the log messages from boot actually get written to the log files when rsyslog
starts, which is when the messages actually get timestamped, i.e. all the messages from boot get a similar timestamp because that's when rsyslog
starts and dumps whatever is in /dev/kmsg
to the log files.
So when I said "almost immediately after", this may have been quite misleading. I've added a couple of notes to the original description to reflect that.
When the failure happens, it doesn't allow me to log in, as the messages suggest I should. Instead, it says Cannot open access to console, the root account is locked. See sulogin(8) man page for more details. I had never experienced such issues regarding logging in as root.
Currently the root account is not shared by default with systemd stage 1. You can set boot.initrd.systemd.emergencyAccess
to change this. The valid values are a hashed password, true for unauthenticated access, or false for no access (default false). I personally use boot.initrd.systemd.emergencyAccess = config.user.users.root.hashedPassword;
but this assumes you're using hashedPassword
on your regular root account.
Anyway, the problem is clearly that right after systemd-sysroot-fstab-check
(which is initrd-parse-etc.service
), everything starts getting unmounted. I don't even see a hint of initrd-nixos-activation.service
running first, which is a huge red flag. It looks like it's trying to systemctl isolate
something without the proper dependencies existing. Now, initrd-parse-etc.service
does have a final effect of starting initrd-cleanup.service
, which does do systemctl --no-block isolate initrd-switch-root.target
, but initrd-cleanup.service
is still ordered After=initrd.target
, so it should still let everything else happen first, which is not happening here. I do not think it's initrd-cleanup.service
that's the culprit.
I mean, it's immediate. It mounts /sysroot
, then it immediately (correctly) does initrd-parse-etc.service
(the systemd-sysroot-fstab-check
stuff), but then it immediately goes to isolate for switch-root, which is very hard to explain.
I think we need to see more of your configuration. There's clearly something unusual that we're not seeing here.
I personally use boot.initrd.systemd.emergencyAccess = config.user.users.root.hashedPassword; but this assumes you're using hashedPassword on your regular root account.
I will try this, thanks!
@ElvishJerricco Regarding your diagnosis, did you check the full log which I had appended at the bottom of the initial description of this issue?
The events you mentioned seem to be happening like you expect they should:
023-07-23T20:15:03.277858+00:00 wizylap daemon.info systemd-sysroot-fstab-check[680]: /sysroot should be mounted in the initrd, will request daemon-reload.
2023-07-23T20:15:03.277862+00:00 wizylap daemon.info systemd[1]: Reloading requested from client PID 680 (unit initrd-parse-etc.service)...
2023-07-23T20:15:03.277865+00:00 wizylap daemon.info systemd[1]: Reloading...
2023-07-23T20:15:03.277869+00:00 wizylap daemon.info systemd[1]: Reloading finished in 128 ms.
2023-07-23T20:15:03.277874+00:00 wizylap daemon.info systemd-sysroot-fstab-check[680]: Requesting initrd-fs.target/start/replace...
2023-07-23T20:15:03.277879+00:00 wizylap daemon.info systemd-sysroot-fstab-check[680]: Requesting swap.target/start/replace...
2023-07-23T20:15:03.277884+00:00 wizylap daemon.info systemd[1]: initrd-parse-etc.service: Deactivated successfully.
Here we just saw systemd-sysroot-fstab-check
/ initrd-parse-etc.service
running.
Immediately afterwards:
2023-07-23T20:15:03.277888+00:00 wizylap daemon.info systemd[1]: Finished Mountpoints Configured in the Real Root.
2023-07-23T20:15:03.277891+00:00 wizylap daemon.info systemd[1]: Reached target Initrd File Systems.
2023-07-23T20:15:03.277895+00:00 wizylap daemon.info systemd[1]: Starting NixOS Activation...
2023-07-23T20:15:03.277900+00:00 wizylap daemon.info initrd-nixos-activation-start[690]: booting system configuration /nix/store/6gzh1m84v0y05ckp5yjmq4xdcliwph4x-nixos-system
-wizylap-23.05.git.c5683679fda4
2023-07-23T20:15:03.277905+00:00 wizylap daemon.info initrd-nixos-activation-start[690]: running activation script...
2023-07-23T20:15:03.277909+00:00 wizylap daemon.info initrd-nixos-activation-start[721]: setting up /etc...
2023-07-23T20:15:03.277914+00:00 wizylap daemon.info systemd[1]: initrd-nixos-activation.service: Deactivated successfully.
2023-07-23T20:15:03.277918+00:00 wizylap daemon.info systemd[1]: Finished NixOS Activation.
Here we saw initrd-nixos-activation
script / service running. Immediately afterwards:
2023-07-23T20:15:03.277921+00:00 wizylap daemon.info systemd[1]: Reached target Initrd Default Target.
2023-07-23T20:15:03.277926+00:00 wizylap daemon.info systemd[1]: Starting Cleaning Up and Shutting Down Daemons...
Here we saw initrd.target
being reached and only afterwards did we see initrd-cleanup.service
starting.
So as far as I can tell there is nothing wrong with any of the issues you mentioned?
I have noticed that after all this has happened, /sysroot
gets unmounted, then initrd-switch-root.service
tries to start but fails because the os-release
file is missing inside /sysroot
.
I have also noticed that in those boots where I don't experience this issue, /sysroot
does not get unmounted before initrd-switch-root.service
starts.
But the other mountpoints such as /sysroot/{dev,run,...}
always do get unmounted just before initrd-switch-root.service
starts, whether I experience this issue or not.
I don't know what is causing /sysroot
to get unmounted, though. I have enabled the rd.systemd.log_level=debug
kernel parameter so when I experience this issue again, I will provide more verbose logs.
Ok, so I think I may have diagnosed this issue, although I could be wrong because I'm a bit out of my depth here.
You said that initrd-cleanup.service
does systemctl --no-block isolate initrd-switch-root.target
, which I believe to be true according to the contents of my initrd.
The systemctl isolate
command does the following according to systemctl's man page:
isolate UNIT
Start the unit specified on the command line and its dependencies and stop all others, unless they have IgnoreOnIsolate=yes
Now I have two questions:
systemctl isolate
stop mount units as well?sysroot.mount
unit contain the IgnoreOnIsolate=yes
config option?The answer to the second question seems to be no, because only journald-related units have IgnoreOnIsolate=yes
.
My answer to the first question seems to be "I think so?", because my initrd has no file that appears to be explicitly unmounting /sysroot/run
, nor does it have any unit which mentions sysroot-run.mount
, and yet /sysroot/run
is getting unmounted just after initrd-cleanup.service
starts.
In summary, initrd-cleanup.service
runs the systemctl isolate
command, which in turn seems to be stopping the sysroot.mount
unit (but only sometimes, for some reason), because nothing seems to prevent it from doing so?
In summary,
initrd-cleanup.service
runs thesystemctl isolate
command, which in turn seems to be stopping thesysroot.mount
unit (but only sometimes, for some reason), because nothing seems to prevent it from doing so?
Actually, there may be something that should be preventing it from being unmounted.
So, the full command is systemctl --no-block isolate initrd-switch-root.target
and initrd-switch-root.target
seems to have a Want=initrd-root-fs.target
as a dependency.
The initrd-root-fs.target
unit does not seem to depend on sysroot.mount
unit explicitly, but the systemd.special
man page says the following:
initrd-root-fs.target
systemd-fstab-generator(3) automatically adds dependencies of type Before= to the sysroot.mount unit, which is generated from the kernel command line's root=
setting (or equivalent).
So now there are 2 additional questions:
systemctl isolate
command consider a Before=
option to be a dependency that shouldn't be stopped?systemd-fstab-generator
correctly generating this dependency?Regarding the second question, my kernel command line does not have a root=
parameter, so perhaps this is the issue?
Note that the full log that I provided for this issue is a bit confusing because there are many events that happen twice (such as the activation script running and switching root):
So I hope you don't get confused because of this.
If you only want to look at events that happened up until the failure, search for the Started Emergency Shell
string and ignore any lines after that.
I think I may have found another clue that systemd-fstab-generator
may not be adding the proper dependency between sysroot.mount
and initrd-root-fs.target
, which likely explains why the systemctl isolate
command is stopping sysroot.mount
just before switching the root.
I haven't been able to reproduce this issue with rd.systemd.log_level=debug
yet, but when I boot with this parameter, I see the following messages in my logs by systemd-fstab-generator
:
2023-07-24T01:16:24.597166+00:00 wizylap daemon.debug systemd-fstab-generator[116]: Could not find a root= entry on the kernel command line.
2023-07-24T01:16:24.597170+00:00 wizylap daemon.debug systemd-fstab-generator[116]: Parsing /etc/fstab...
2023-07-24T01:16:24.597176+00:00 wizylap daemon.debug systemd-fstab-generator[116]: Parsing /nix/store/s43k33w1p9dqnb030w5yrwv48pjvr49x-initrd-fstab...
2023-07-24T01:16:24.597180+00:00 wizylap daemon.debug systemd-fstab-generator[116]: Failed to read symlink target for /, using as-is: No such file or directory
2023-07-24T01:16:24.597184+00:00 wizylap daemon.debug systemd-fstab-generator[116]: Found entry what=rpool where=/ type=zfs makefs=no growfs=no pcrfs=no noauto=no nofail=no
Notice how it detects my root filesystem correctly but apparently it detects it with the /
mountpoint and it doesn't find any filesystem as being mounted in /sysroot
(hence why there's only a line with where=/
but no lines with where=/sysroot
).
This is very likely to mean that the root filesystem was found in the parse_fstab()
function, not in the add_sysroot_mount()
function of fstab-generator.c
.
In fact, that's what the log indicates by the message Parsing /nix/store/...initrd-fstab
. This file contains the following:
rpool / zfs x-initrd.mount,relatime 0 0
If my system had a root=
kernel command line parameter, then add_sysroot_mount()
would have printed where=/sysroot
.
So I wonder if the above file should have contained /sysroot
as the mountpoint instead of /
?
You're missing the fact that initrd-switch-root.service
has Wants=initrd-fs.target
, which means all stage 1 FSes, obviously including /sysroot
, should be pulled in and not stopped by isolate
.
So I wonder if the above file should have contained /sysroot as the mountpoint instead of /?
This is added implicitly by nature of being parsed from the file in the SYSTEMD_SYSROOT_FSTAB
environment variable.
With some effort, I've been able to capture some debug data that could help diagnose this issue as well as #244737, which I think are related.
Specifically, with your help, I was able to enter the rescue shell when the initrd-switch-root.service
failed to start. My goal was to get the set of all dependencies between all the units that were known to systemd at that point in time.
Unfortunately, this initrd didn't have systemd-analyze
available, but with some trickery I was able to mount my root filesystem and create a special-purpose chroot that contained all the necessary files in the necessary places.
I have attached a tarball with all this information. Here's the description of the files:
deps.{dot,pdf,png,svg}
- a graphviz dot file (and its conversion to various formats) containing all the dependencies and ordering requirements between all the unitsdeps2.{dot,pdf,png,svg}
- the same as above, but I filtered out the ordering dependencies to make the required dependencies easier to see.deps-colors.txt
- a text file containing the color legend, i.e. what each color means in the above filesplot.{pdf,svg}
- an SVG file (and its PDF conversion) containing a visualization of when units were started and for how long they ran.The deps.dot
file was generated by systemd-analyze dot
and the plot.svg
file was generated by systemd-analyze plot
. The man page describing the latter command contains a few caveats that could lead to misleading conclusions if not taken into account.
I haven't done almost any analysis yet but from my quick look I noticed something which seems suspicious to me, which is better seen in the deps2.{png,pdf}
files: initrd-switch-root.target
has a Wants=initrd-switch-root.service
, but initrd-switch-root.service
also has a Wants=initrd-switch-root.target
.
I'm not sure if this is a bug or not, especially because the upstream systemd
package also contains this cyclic dependency, but I wouldn't be surprised if it were indeed a bug and that it could cause strange issues (e.g. race conditions) in some circumstances...
EDIT: please ignore the tmp-*.mount
units, as these mounts were manually done by me in the emergency shell as part of setting up a proper chroot to make systemd-analyze
work.
Also, I don't think the sysroot.mount
unit should have been there, because when I entered the emergency shell, /sysroot
was unmounted. However, one of the first things I did was mount -t zfs rpool /sysroot
so that I could access the systemd-analyze
binary, so I think that's why sysroot.mount
was recreated...
I'm not sure this has been completely fixed. I'm running 23.11 and I consistently get at every boot the error stated in the first message. Pressing enter to continue actually makes the OS to successfully boot.
@turlando: It looks like the initrd is failing to mount the /nix
filesystem. What does the fileSystems
section of your configuration.nix
look like?
@turlando To be clear, the error message you're seeing is a generic error that applies to any critical failure in initrd, so you're experience a separate issue entirely.
Here's the relevant part of the log:
gen 18 19:37:27 localhost systemd[1]: Mounting /sysroot...
gen 18 19:37:27 localhost systemd[1]: Starting Rollback system root dataset to empty state...
gen 18 19:37:27 localhost systemd[1]: Mounted /sysroot.
gen 18 19:37:27 localhost systemd[1]: Reached target Initrd Root File System.
gen 18 19:37:27 localhost systemd[1]: Mounting /sysroot/nix...
gen 18 19:37:27 localhost systemd[1]: Starting Mountpoints Configured in the Real Root...
gen 18 19:37:27 localhost systemd-sysroot-fstab-check[374]: /sysroot should be mounted in the initrd, will request daemon-reload.
gen 18 19:37:27 localhost systemd[1]: ephemeral-root.service: Deactivated successfully.
gen 18 19:37:27 localhost systemd[1]: Finished Rollback system root dataset to empty state.
gen 18 19:37:27 localhost systemd[1]: Reloading requested from client PID 374 ('systemd-fstab-g') (unit initrd-parse-etc.service)...
gen 18 19:37:27 localhost systemd[1]: Reloading...
gen 18 19:37:27 localhost mount[373]: mount: /sysroot/nix: mount point does not exist.
gen 18 19:37:27 localhost mount[373]: dmesg(1) may have more information after failed mount system call.
When mount
goes to mount /sysroot/nix
, it's finding that the mountpoint doesn't exist. It looks to me like your ephemeral-root.service
is racing with sysroot-nix.mount
. So systemd creates the mountpoint like normal, but before mount
actually occurs, the rollback is reverting sysroot
to an empty directory which is getting rid of the nix
subdirectory.
You should order ephemeral-root.service
before sysroot.mount
. You can either do this by adding x-systemd.after=ephemeral-root.service
to the file system options for /
, or you can put before = ["sysroot.mount"];
on ephemeral-root
.
Thank you @ElvishJerricco. after = ["zfs-import.target"]
in ephemeral-root.service
was not enough. I'm wondering if wantedBy = ["initrd.target"]
is redundant at this point. Thanks again and sorry for the false alarm.
Describe the bug
I just experienced a second boot failure due to the same issue I reported in https://github.com/NixOS/nixpkgs/issues/244737#issuecomment-1646652958.
However, now I have more details about this failure.
Basically, the issue is that when I have the systemd in initrd option enabled, sometimes the boot process stops due to a failure to "start Switch Root":
This never happens unless I enable the systemd in initrd option, and it has happened both times when the only thing I did was enable that option (nothing else changed).
There are quite a few strange things about this failure, some of which I don't know whether they are separate bugs or not:
Cannot open access to console, the root account is locked. See sulogin(8) man page for more details
. I had never experienced such issues regarding logging in as root.Enter
then the boot process actually appears to proceed normally, which is not what I expected./run/wrappers
always fails to mount during boot when the systemd in initrd option is enabled.After I fixed the issue I was having in #244892 I was able to get more log messages. Here's a new screenshot from another boot failure with more verbose logging:
As you can see, it's complaining about
/sysroot
not appearing to be an OS tree, due to theos-release
file being missing.You may also notice that
/sysroot
was unmounted prior to the start of theSwitch Root
unit.At first glance, you'd think this could be the cause of the problem, but when looking at the logs which I've included below, this is what actually appears to happen:
/sysroot
appears to be unmounted sucessfully/sysroot
gets mounted again.Perhaps this suggests a race condition between mounting
/sysroot
and starting theswitch root
unit?NOTE: actually, this may actually not be happening "almost immediately after", please read the first comment in this issue.
/sysroot
tree but it fails due to/sysroot/dev
and/sysroot/run
being busy. However, I'm not sure if this unmount attempt is simply due to theswitch root
failure or whether this is actually what is supposed to happen.NOTE: again, this may not be happening almost immediately after, see first comment.
You can click here to see all the log messages for this boot that mention `sysroot`.
**Please note that the timestamps are misleading:** ``` 2023-07-23T20:15:03.277842+00:00 wizylap daemon.info systemd[1]: Mounting /sysroot... 2023-07-23T20:15:03.277845+00:00 wizylap daemon.info systemd[1]: Mounted /sysroot. 2023-07-23T20:15:03.277858+00:00 wizylap daemon.info systemd-sysroot-fstab-check[680]: /sysroot should be mounted in the initrd, will request daemon-reload. 2023-07-23T20:15:03.277874+00:00 wizylap daemon.info systemd-sysroot-fstab-check[680]: Requesting initrd-fs.target/start/replace... 2023-07-23T20:15:03.277879+00:00 wizylap daemon.info systemd-sysroot-fstab-check[680]: Requesting swap.target/start/replace... 2023-07-23T20:15:03.277979+00:00 wizylap daemon.info systemd[1]: Unmounting /sysroot/dev/pts... 2023-07-23T20:15:03.277983+00:00 wizylap daemon.info systemd[1]: Unmounting /sysroot/dev/shm... 2023-07-23T20:15:03.277987+00:00 wizylap daemon.info systemd[1]: Unmounting /sysroot/nix/store... 2023-07-23T20:15:03.277991+00:00 wizylap daemon.info systemd[1]: Unmounting /sysroot/proc... 2023-07-23T20:15:03.277995+00:00 wizylap daemon.info systemd[1]: Unmounting /sysroot/run/keys... 2023-07-23T20:15:03.277999+00:00 wizylap daemon.info systemd[1]: Unmounting /sysroot/run/wrappers... 2023-07-23T20:15:03.278003+00:00 wizylap daemon.info systemd[1]: Unmounting /sysroot/sys... 2023-07-23T20:15:03.278048+00:00 wizylap daemon.info systemd[1]: sysroot-dev-pts.mount: Deactivated successfully. 2023-07-23T20:15:03.278052+00:00 wizylap daemon.info systemd[1]: Unmounted /sysroot/dev/pts. 2023-07-23T20:15:03.278055+00:00 wizylap daemon.info systemd[1]: sysroot-dev-shm.mount: Deactivated successfully. 2023-07-23T20:15:03.278060+00:00 wizylap daemon.info systemd[1]: Unmounted /sysroot/dev/shm. 2023-07-23T20:15:03.278064+00:00 wizylap daemon.info systemd[1]: sysroot-nix-store.mount: Deactivated successfully. 2023-07-23T20:15:03.278068+00:00 wizylap daemon.info systemd[1]: Unmounted /sysroot/nix/store. 2023-07-23T20:15:03.278072+00:00 wizylap daemon.info systemd[1]: sysroot-proc.mount: Deactivated successfully. 2023-07-23T20:15:03.278075+00:00 wizylap daemon.info systemd[1]: Unmounted /sysroot/proc. 2023-07-23T20:15:03.278080+00:00 wizylap daemon.info systemd[1]: sysroot-run-keys.mount: Deactivated successfully. 2023-07-23T20:15:03.278084+00:00 wizylap daemon.info systemd[1]: Unmounted /sysroot/run/keys. 2023-07-23T20:15:03.278088+00:00 wizylap daemon.info systemd[1]: sysroot-run-wrappers.mount: Deactivated successfully. 2023-07-23T20:15:03.278092+00:00 wizylap daemon.info systemd[1]: Unmounted /sysroot/run/wrappers. 2023-07-23T20:15:03.278095+00:00 wizylap daemon.info systemd[1]: sysroot-sys.mount: Deactivated successfully. 2023-07-23T20:15:03.278099+00:00 wizylap daemon.info systemd[1]: Unmounted /sysroot/sys. 2023-07-23T20:15:03.278104+00:00 wizylap daemon.info systemd[1]: Unmounting /sysroot/dev... 2023-07-23T20:15:03.278107+00:00 wizylap daemon.info systemd[1]: Unmounting /sysroot/run... 2023-07-23T20:15:03.278127+00:00 wizylap daemon.info systemd[1]: sysroot-dev.mount: Deactivated successfully. 2023-07-23T20:15:03.278131+00:00 wizylap daemon.info systemd[1]: Unmounted /sysroot/dev. 2023-07-23T20:15:03.278134+00:00 wizylap daemon.info systemd[1]: sysroot-run.mount: Deactivated successfully. 2023-07-23T20:15:03.278138+00:00 wizylap daemon.info systemd[1]: Unmounted /sysroot/run. 2023-07-23T20:15:03.278146+00:00 wizylap daemon.info systemd[1]: Unmounting /sysroot... 2023-07-23T20:15:03.278169+00:00 wizylap daemon.info systemd[1]: sysroot.mount: Deactivated successfully. 2023-07-23T20:15:03.278173+00:00 wizylap daemon.info systemd[1]: Unmounted /sysroot. 2023-07-23T20:15:03.278279+00:00 wizylap daemon.err @ystemctl[1226]: Failed to switch root: Specified switch root path '/sysroot' does not seem to be an OS tree. os-release file is missing. 2023-07-23T20:15:03.278410+00:00 wizylap daemon.info systemd-sysroot-fstab-check[1239]: /sysroot should be mounted in the initrd, will request daemon-reload. 2023-07-23T20:15:03.278444+00:00 wizylap daemon.info systemd-sysroot-fstab-check[1239]: Requesting initrd-fs.target/start/replace... 2023-07-23T20:15:03.278449+00:00 wizylap daemon.info systemd-sysroot-fstab-check[1239]: Requesting swap.target/start/replace... 2023-07-23T20:15:03.278666+00:00 wizylap daemon.info systemd[1]: Mounting /sysroot... 2023-07-23T20:15:03.278670+00:00 wizylap daemon.info systemd[1]: Mounted /sysroot. 2023-07-23T20:15:03.278752+00:00 wizylap daemon.info systemd[1]: Unmounting /sysroot/dev... 2023-07-23T20:15:03.278755+00:00 wizylap daemon.info systemd[1]: Unmounting /sysroot/run... 2023-07-23T20:15:03.278771+00:00 wizylap daemon.info umount[1778]: umount: /sysroot/dev: target is busy. 2023-07-23T20:15:03.278787+00:00 wizylap daemon.info umount[1779]: umount: /sysroot/run: target is busy. 2023-07-23T20:15:03.278798+00:00 wizylap daemon.notice systemd[1]: sysroot-dev.mount: Mount process exited, code=exited, status=32/n/a 2023-07-23T20:15:03.278802+00:00 wizylap daemon.err systemd[1]: Failed unmounting /sysroot/dev. 2023-07-23T20:15:03.278814+00:00 wizylap daemon.notice systemd[1]: sysroot-run.mount: Mount process exited, code=exited, status=32/n/a 2023-07-23T20:15:03.278818+00:00 wizylap daemon.err systemd[1]: Failed unmounting /sysroot/run. 2023-07-23T20:15:03.278825+00:00 wizylap daemon.info systemd[1]: Unmounting /sysroot... 2023-07-23T20:15:03.278856+00:00 wizylap daemon.info umount[1780]: umount: /sysroot: target is busy. 2023-07-23T20:15:03.278865+00:00 wizylap daemon.notice systemd[1]: sysroot.mount: Mount process exited, code=exited, status=32/n/a 2023-07-23T20:15:03.278869+00:00 wizylap daemon.err systemd[1]: Failed unmounting /sysroot. ```
I've attached the full boot log below.
Steps To Reproduce
All I did was enable systemd in initrd option I mentioned above.
Expected behavior
I expected the boot process to complete without any errors or any requirement to press Enter.
I also expected to be able to login when systemd was in emergency mode, as it suggested I should.
Screenshots
See above.
Additional context
Here's the full boot log (well, up to a certain point), however, please note that the timestamps are misleading:
log.txt
Notify maintainers
cc @ElvishJerricco