Liquorix kernel doesn't work with systemd v255-rc1

meijkl commented 1 year ago

I installed systemd v255-rc1 a few days ago and experienced severe problems (among others: user processes didn't start). I reported the issues to the systemd developers, but the surprising result (I use the Liquorix kernel almost exclusively) was that the kernel caused the problems.

Could you please have a look at the issue: https://github.com/systemd/systemd/issues/29985

I asked mbiebl to give me a hint which change in v255 caused the issue and hope he does so.

Looking forward to using your kernel again! Klaus

damentz commented 1 year ago

Is there a way you can find out why the services in the linked issue failed to start? There should be some output or log that indicates why they can't initialize with v255.

meijkl commented 1 year ago

I've documented the output from several commands and logs under https://github.com/systemd/systemd/issues/29985, some at the request of the systemd developers. Could you please check the information and let me know whether further information is required?

So far I have no idea what the root cause of the issue may be. My best guess is that it is related to systemd user services since the user@.services are not active.

Output of "systemctl status user@$(id -u).service":

○ user@0.service - User Manager for UID 0
     Loaded: loaded (/usr/lib/systemd/system/user@.service; static)
    Drop-In: /usr/lib/systemd/system/user@0.service.d
             └─10-login-barrier.conf
     Active: inactive (dead)
       Docs: man:user@.service(5)

and "systemctl --user status pipewire.service":

Failed to get properties: Process org.freedesktop.systemd1 exited with status 1

More information is available under the link to the issue mentioned above.

Hope this helps! Klaus

meijkl commented 1 year ago

See also:

root@my-i7506:/home/my# systemctl --state=failed --all
  UNIT                      LOAD   ACTIVE SUB    DESCRIPTION
● cxl-monitor.service       loaded failed failed CXL Monitor Daemon
● ndctl-monitor.service     loaded failed failed Ndctl Monitor Daemon
● systemd-timesyncd.service loaded failed failed Network Time Synchronization
● user@1000.service         loaded failed failed User Manager for UID 1000
● user@133.service          loaded failed failed User Manager for UID 133
● wpa_supplicant.service    loaded failed failed WPA supplicant
● wsdd.service              loaded failed failed Web Services Dynamic Discovery host daemon

damentz commented 1 year ago

Over the last couple of releases today, I synchronized the security settings of Liquorix for Debian with Debian itself, and included an upcoming API that v255-rc1 added support for that seems related. It doesn't appear that this has resolved the issue as the maintainers of systemd would have hoped.

I actually think this is a systemd bug, but because they insist, we'll probably need to wait until this version of systemd hits a stable release. Most likely some Arch users will run into the problem and there'll be some hotfixes shortly after. Or, there is an issue with Liquorix I'm completely blind to and I'll need to take care of it before then.

Either way, I'll confirm whether this is still an issue when I release a v6.6 kernel, since I can reproduce it easily on Debian Unstable (with the experimental packages).

meijkl commented 1 year ago

Thanks for the information! I've tried to understand the changes in systemd, but honestly I'm simply too far away from any kernel development. So I'm depending on your capabilities :-) I'll keep my fingers crossed!

meijkl commented 1 year ago

FYI - have you seen the latest comment by mbiebl: https://github.com/systemd/systemd/issues/29985#issuecomment-1807357863 https://github.com/systemd/systemd/commit/adecfb3bc0be0def49433277fcad5333893756cc

damentz commented 1 year ago

Yes, thanks for linking.

Unless my addition of fchmodat2 was incorrect, I think something really odd may be going on with their new implementation.

assert(fd >= 0);

if (fchmodat2(fd, "", m, AT_EMPTY_PATH) >= 0)
        return 0;
if (!IN_SET(errno, ENOSYS, EPERM)) /* Some container managers block unknown syscalls with EPERM */
        return -errno;

My theory is that their additional assertion and the test for other errors probably caused the issue. Also maybe I'm crazy, but it's odd that they're using fchmodat2 without testing if it exists. It's only a new syscall that was added to 6.6 just recently (and backported to 6.5 Liquorix recently).

YHNdnzj commented 1 year ago

> My theory is that their additional assertion and the test for other errors probably caused the issue.

Assertion failure would immediately terminate the process, no? And fd needs to be valid anyway.

I can't seem to figure out what "test for other errors" is supposed to mean.

> Also maybe I'm crazy, but it's odd that they're using fchmodat2 without testing if it exists. It's only a new syscall that was added to 6.6 just recently (and backported to 6.5 Liquorix recently).

We use newly-added syscalls all the time, as long as a proper fallback path is in place. Here we fall back to going through /proc/self/fd/ if fchmodat2 returns one of the two errors that indicate it's not supported.

damentz commented 1 year ago

> Assertion failure would immediately terminate the process, no? And fd needs to be valid anyway.

I haven't looked at the surrounding code, but I figured there would be some type of try/catch around the invocation of do_fchmod. But you're right, what's the point if you have an invalid file descriptor?

> We use newly-added syscalls all the time, as long as proper fallback path is in place. Here we fall back to going through /proc/self/fd/ if fchmodat2 returns the two errors that indicate it's not supported.

Thanks for the clarification. It wasn't clear to me how error handling for unknown syscalls occurred; it looks like the error is simply captured and you decide what to do with it, but it doesn't terminate execution at that moment (unless I understood that wrong too).
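To illustrate the point about unknown syscalls: a call the kernel does not implement simply returns -1 with errno set to ENOSYS, and the caller carries on. The following is a minimal sketch, not code from systemd. The function name probe_fchmodat2_sketch is hypothetical, and the fallback syscall number 452 is an assumption that holds on x86_64 and other architectures using the generic numbering.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: 452 is the fchmodat2 number on x86_64 and generic-numbered
 * architectures; only used when the libc headers predate the syscall. */
#ifndef __NR_fchmodat2
#define __NR_fchmodat2 452
#endif
#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH 0x1000
#endif

/* Hypothetical probe, for illustration only.
 * Returns 1 if the kernel recognizes fchmodat2, 0 if it reports ENOSYS,
 * and -1 if the result is inconclusive (EPERM, e.g. from a seccomp filter). */
static int probe_fchmodat2_sketch(void) {
        /* fd -1 is deliberately invalid, so a kernel that implements the
         * call fails with an ordinary error and changes nothing. */
        if (syscall(__NR_fchmodat2, -1, "", 0, AT_EMPTY_PATH) >= 0)
                return 1; /* not expected with an invalid fd */
        if (errno == ENOSYS)
                return 0; /* the kernel does not know this syscall */
        if (errno == EPERM)
                return -1; /* possibly blocked by a container manager */
        return 1; /* any other error means the call was recognized */
}
```

In every case the process keeps running; the failure is just a return value the caller can inspect.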

YHNdnzj commented 1 year ago

> It wasn't clear to me how error handling for unknown syscalls occurred; it looks like the error is simply captured and you decide what to do with it, but it doesn't terminate execution at that moment (unless I understood that wrong too).

Sorry, but I failed to interpret this. But the logic is that if an error other than ENOSYS and EPERM is returned, we bail out immediately. Otherwise, we continue to try /proc/self/fd/ instead.
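The logic described above could be sketched roughly as follows. This is an illustration under stated assumptions, not systemd's actual implementation: chmod_fd_sketch is a hypothetical name, the raw syscall() invocation stands in for whatever wrapper systemd uses, and the fallback syscall number 452 is assumed for x86_64 and other generic-numbered architectures.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: 452 is the fchmodat2 number on x86_64 and generic-numbered
 * architectures; only used when the libc headers predate the syscall. */
#ifndef __NR_fchmodat2
#define __NR_fchmodat2 452
#endif
#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH 0x1000
#endif

/* Hypothetical helper: change the mode of an already-open fd.
 * Returns 0 on success or a negative errno value on failure. */
static int chmod_fd_sketch(int fd, mode_t m) {
        char path[64];

        assert(fd >= 0); /* an invalid fd is a programming error */

        /* First choice: the new syscall, operating on the fd itself. */
        if (syscall(__NR_fchmodat2, fd, "", m, AT_EMPTY_PATH) >= 0)
                return 0;

        /* Any error other than "not supported" is final: bail out. */
        if (errno != ENOSYS && errno != EPERM)
                return -errno;

        /* Fallback for older kernels or filtered syscalls: go through
         * the /proc/self/fd/ symlink, which requires /proc to be mounted. */
        snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
        if (chmod(path, m) < 0)
                return -errno;

        return 0;
}
```

A caller would use it like any other errno-style helper, for example `r = chmod_fd_sketch(fd, 0640);` followed by a check for `r < 0`.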

damentz commented 1 year ago

@meijkl can you try the latest kernel? I pushed out the first version based on v6.6.1 and I'm able to boot with systemd v255-rc2 on Liquorix.

meijkl commented 1 year ago

Just installed the 6.6.1 version and it is booting without problems! Thanks for your effort! Klaus

meijkl commented 1 year ago

P.S.: I've tried to understand where the most important changes are located - is it "Update patch to v6.5.11-lqx2"?

damentz commented 1 year ago

Glad it's working; this means we don't need to worry about the next stable release.

As far as what may have caused it, on the 6.6/master branch, I committed: https://github.com/damentz/liquorix-package/commit/82d906c10a878470f0f9a77155a34e97f7f679fc

This added the CPU shares stub that systemd can use. It's probably unrelated, though; more likely, something about how out-of-tree code was merged into the 6.5 branch for the Zen Kernel affected the code changes.

meijkl commented 1 year ago

Thanks for the explanation; that really required only a small but obviously vital change. With my own background in ERP and business processes, it is nonetheless always interesting to follow the development in other areas. But for now I can simply focus on using your kernel. ;-)

damentz / liquorix-package

Liquorix kernel doesn't work with systemd v255-rc1 #147