ErnWong / dotfiles-old-divnix

Ironically, not many files here are actually dotfiles. Bootstrapped from https://github.com/divnix/devos
MIT License

Laptop system is cooked - Polkit crash on start #12

Closed ErnWong closed 2 years ago

ErnWong commented 2 years ago

Sounds like a hardware problem... but we'll need to check. Had a weird issue where the machine went back to UEFI mode even though I had set it to legacy BIOS mode last time. Also had a problem before where the battery was not detected, but this time the battery icon doesn't even show up.

it's cooked 👨‍🍳 😕

ErnWong commented 2 years ago

Yes, I'm (ab)using my dotfiles repo as a hardware maintenance/troubleshooting diary.

ErnWong commented 2 years ago

Possibly unrelated, but here's another error I see when running journalctl --user:

kglobalaccel5[23970]: Invalid MIT-MAGIC-COOKIE-1 keyqt.qpa.xcb: could not connect to display :0
kglobalaccel5[23970]: qt.qpa.plugin: Could not load the Qt platform plugin "xcb" in "" even though it was found.
kglobalaccel5[23970]: This application failed to start because no Qt platform plugin could be initialized. Reinstalling the application may fix this problem.
kglobalaccel5[23970]: Available platform plugins are: wayland-egl, wayland, wayland-xcomposite-egl, wayland-xcomposite-glx, eglfs, linuxfb, minimal, minimalegl, offscreen, vnc, xcb.
...
systemd[1120]: plasma-kglobalaccel.service: Main process exited, code=dumped, status=6/ABRT
systemd[1120]: plasma-kglobalaccel.service: Failed with result 'core-dump'.
systemd[1120]: Failed to start KDE Global Shortcuts Server

ErnWong commented 2 years ago
org.kde.LogoutPrompt[22572]: file:///nix/store/[long hash lol]-plasma-workspace-5.21.5/share/plasma/look-and-feel/org.kde.breeze.desktop/contents/components/UserDelegate.qml:34: ReferenceError: model is not defined
org.kde.LogoutPrompt[22572]: qt.svg <input>:406:376: Could not add child element to parent element because the types are incorrect.
org.kde.LogoutPrompt[22572]: qt.svg <input>:407:130: Could not add child element to parent element because the types are incorrect.
org.kde.LogoutPrompt[22572]: qt.svg <input>:408:130: Could not add child element to parent element because the types are incorrect.
org.kde.LogoutPrompt[22572]: qt.svg <input>:408:393: Could not add child element to parent element because the types are incorrect.
org.kde.LogoutPrompt[22572]: qt.svg <input>:409:130: Could not add child element to parent element because the types are incorrect.
org.kde.LogoutPrompt[22572]: qt.svg <input>:410:129: Could not add child element to parent element because the types are incorrect.
org.kde.LogoutPrompt[22572]: qt.svg <input>:411:129: Could not add child element to parent element because the types are incorrect.
org.kde.LogoutPrompt[22572]: qt.svg <input>:412:129: Could not add child element to parent element because the types are incorrect.
org.kde.LogoutPrompt[22572]: qt.svg <input>:413:129: Could not add child element to parent element because the types are incorrect.
org.kde.LogoutPrompt[22572]: qt.svg <input>:413:379: Could not add child element to parent element because the types are incorrect.
org.kde.LogoutPrompt[22572]: qt.svg <input>:413:631: Could not add child element to parent element because the types are incorrect.
ErnWong commented 2 years ago

There are so many more errors that I don't want to copy them all out by hand. I should install a browser or Discord to transfer the error logs across. I should also look at the logs and test the behaviour on the live USB.

ErnWong commented 2 years ago

Trying to access the boot menu, but F12 no longer seems to trigger it. Changing the boot order in the legacy BIOS menu doesn't seem to work, and neither does switching to the USB port I used to install the OS in the first place.

ErnWong commented 2 years ago

Trying to reboot by running reboot from Konsole:

> reboot

Failed to set wall message, ignoring: Connection timed out
Failed to reboot system via logind: Connection timed out
Failed to talk to init daemon.

~ took 1m27s

ErnWong commented 2 years ago

> sudo reboot

We trust you have received the usual lecture from the local System Administrator. It usually boils down to these three things:

    #1) Respect the privacy of others.
    #2) Think before you type.
    #3) With great power comes great responsibility.

[sudo] password for nixos:

That worked...

Hey, the boot menu works this time, not sure why. It was probably bad timing or the wrong key last time. I'm not sure which key activated it this time, since I pressed several.

ErnWong commented 2 years ago

From live USB: There is a brief blank text screen between the boot log and the login screen showing up, but logging in is very fast this time.

Starting up the menu, I see the sleep, hibernate, restart and shut down buttons visible in the menu bar. Why weren't they visible in the installed OS?

Dolphin's window decoration is grey as expected (whereas it turned blue in the installed OS). Within Dolphin, opening a hard drive that requires authentication works as expected: it opens an "authentication required" polkit dialog box (whereas the installed OS just failed because the authorization manager, polkit, wasn't running).

Logging out is really fast (unlike the installed OS where it freezes for a very long while before jumping to the log out screen).

After logging back in, the KDE Plasma window decorations have turned blue (active) and black (unfocused)??? But the shut down buttons are still available.

Hmm. the cursor is glitchy - raised a new bug #14

dmesg: I still see nouveau having MMIO read faults, but I don't see authorisation manager problems.

journalctl --user:

  1. I still see some similar errors (kglobalaccel5, invalid MIT-MAGIC-COOKIE-1 / could not connect to display, LogoutPrompt)

One difference between the installed version and the live USB version is that root has a password in the installed version, while it is unset on the live USB... perhaps that could be a reason? But that seems unlikely: surely the system should work when root has a password.

ErnWong commented 2 years ago

Diagnosing the problem with "leave" not working...

Opened up Konsole and ran journalctl --follow to map my actions to error messages. Initially, no new error messages. Right click -> Leave (Ctrl+Alt+Del): KDE does not respond, but journalctl in Konsole shows "Failed to start Authorization Manager".

Transcribing the logs by manually typing them here: (could contain typos)

Dec 09 09:16:20 NixOS dbus-daemon[1173]: [session uid=1000 pid=1173] Activating service name='org.kde.LogoutPrompt' requested by ':1.17' (uid=1000 pid=1245 comm="/nix/store/c9134bkfplw85bfj5dwfxbah6hjm6vnm-plasma" label="kernel")
Dec 09 09:16:21 NixOS dbus-daemon[810]: [system] Activating via systemd: service name='org.freedesktop.PolicyKit1' unit='polkit.service' requested by ':1.5' (uid=0 pid=833 comm="/nix/store/71lqc2a8cslg4wxj6ypla7gvflphjhn0-system" label="kernel")
Dec 09 09:16:21 NixOS systemd[1]: Starting Authorization Manager...
Dec 09 09:16:21 NixOS polkitd[1683]: Started polkit version 0.118
Dec 09 09:16:21 NixOS kernel: polkitd[1683]: segfault at 7f00e5b5e9e8 ip 00007f00e52aefc0 sp 00007ffed8015888 error 7 in libmozjs-78.so[7f00e515e000+9f3000]
Dec 09 09:16:21 NixOS kernel: Code: <A bunch of zeros - no I'm not going to type it all out>
Dec 09 09:16:21 NixOS systemd[1]: Started Process Core Dump (PID 1685/UID 0).
Dec 09 09:16:21 NixOS systemd-coredump[1686]: Cannot resolve systemd-coredump user. Proceeding to dump core as root: No such process
Dec 09 09:16:21 NixOS systemd-coredump[1686]: [↗] Process 1683 (polkitd) of user 28 dumped core.
Dec 09 09:16:21 NixOS systemd[1]: polkit.service: Main process exited, code=dumped, status=11/SEGV
Dec 09 09:16:21 NixOS systemd[1]: polkit.service: Failed with result 'core-dump'.
Dec 09 09:16:21 NixOS systemd[1]: Failed to start Authorization Manager.
Dec 09 09:16:21 NixOS systemd[1]: systemd-coredump@21-1685-0.service: Succeeded.
Dec 09 09:16:45 NixOS dbus-daemon[1173]: [session uid=1000 pid=1173] Successfully activated service 'org.kde.LogoutPrompt'
Dec 09 09:16:46 NixOS dbus-daemon[810]: [system] Failed to activate service 'org.freedesktop.PolicyKit1': timed out (service_start_timeout=25000ms)

(Then activating via systemd for polkit repeats several times)

Clearly, polkit's segfault is the one causing or contributing to most of the delay.
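As a sanity check on the timing, here's a rough Python sketch that takes the timestamps from the two system-bus lines in the transcript (activation requested at 09:16:21, activation failed at 09:16:46) and compares the gap with the service_start_timeout=25000ms reported in the log. The helper name seconds_between is made up for illustration, the log lines are shortened copies, and since the log was typed out by hand the timestamps themselves could be off.

```python
from datetime import datetime


def seconds_between(start_line: str, end_line: str) -> float:
    """Return the gap in seconds between two journal lines.

    Assumes the 'Mon DD HH:MM:SS host ...' layout used in the transcript
    and that both lines are from the same day.
    """
    def clock(line: str) -> datetime:
        # The third whitespace-separated field is the HH:MM:SS time.
        return datetime.strptime(line.split()[2], "%H:%M:%S")

    return (clock(end_line) - clock(start_line)).total_seconds()


# Shortened copies of the two system-bus lines from the transcript.
requested = "Dec 09 09:16:21 NixOS dbus-daemon[810]: [system] Activating via systemd: ..."
timed_out = "Dec 09 09:16:46 NixOS dbus-daemon[810]: [system] Failed to activate service ..."

gap = seconds_between(requested, timed_out)
TIMEOUT_MS = 25000  # service_start_timeout from the log
print(gap, gap * 1000 == TIMEOUT_MS)
```

The gap matches the timeout exactly, which suggests that each failed polkit activation makes the caller wait out the full D-Bus activation timeout. The LogoutPrompt lines (09:16:20 to 09:16:45) show the same 25 second gap.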

Next step: figure out why polkit is crashing. Can we run a debug build of polkit? Do we have debugging symbols? Could I just run gdb? No gdb installed :( #15
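Even without gdb, the kernel's segfault line already hints at where in the library the crash happened. Here's a small Python sketch that pulls the instruction pointer and the mapping base out of that line and subtracts them. The function name fault_offset and the regular expression are my own, written against the line as transcribed; the result is an offset into the mapping the kernel reported, which is not necessarily the same as an offset into the .so file.

```python
import re

# Matches the kernel's "segfault at ... ip ... sp ... error N in lib[base+size]" line.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<error>\d+) in (?P<lib>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def fault_offset(kernel_line: str) -> tuple[str, int]:
    """Return (library name, offset of the faulting ip within the reported mapping)."""
    m = SEGFAULT_RE.search(kernel_line)
    if m is None:
        raise ValueError("not a kernel segfault line")
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    size = int(m["size"], 16)
    offset = ip - base
    if not 0 <= offset < size:
        raise ValueError("ip is outside the reported mapping")
    return m["lib"], offset


line = ("polkitd[1683]: segfault at 7f00e5b5e9e8 ip 00007f00e52aefc0 "
        "sp 00007ffed8015888 error 7 in libmozjs-78.so[7f00e515e000+9f3000]")
lib, offset = fault_offset(line)
print(lib, hex(offset))
```

For the transcribed line this gives offset 0x150fc0 inside libmozjs-78.so, so the crash is in SpiderMonkey's code rather than in polkitd itself.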

ErnWong commented 2 years ago

Searching the segfault error code 7:

image

This could be relevant, since NixOS likes to make certain areas read-only.

ErnWong commented 2 years ago

or not

ErnWong commented 2 years ago

Another possibility is to look into downgrading to some old stable version of something

ErnWong commented 2 years ago

https://github.com/ErnWong/dotfiles/commit/09bb4783d8fb0df4f1966b400066e6b1faf77087

Now that we finally have git configured and have debugging symbols and gdb, we can finally try running polkitd with gdb and see what happens.

ErnWong commented 2 years ago
dotfiles  main 「📁 」  ⎔ 
✖1 ❯ sudo gdb /nix/store/yg6l5s5y5wdkg5kj4if97pbcjmafqknx-polkit-0.118/lib/polkit-1/polkitd
[sudo] password for nixos: 
GNU gdb (GDB) 10.2
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /nix/store/yg6l5s5y5wdkg5kj4if97pbcjmafqknx-polkit-0.118/lib/polkit-1/polkitd...
(No debugging symbols found in /nix/store/yg6l5s5y5wdkg5kj4if97pbcjmafqknx-polkit-0.118/lib/polkit-1/polkitd)
(gdb) run
Starting program: /nix/store/yg6l5s5y5wdkg5kj4if97pbcjmafqknx-polkit-0.118/lib/polkit-1/polkitd 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/nix/store/gk42f59363p82rg2wv2mfy71jn5w4q4c-glibc-2.32-48/lib/libthread_db.so.1".
Successfully changed to user polkituser
[New Thread 0x7ffff44e4640 (LWP 94386)]
[Thread 0x7ffff44e4640 (LWP 94386) exited]

Thread 1 "polkitd" received signal SIGSEGV, Segmentation fault.
0x00007ffff7080fc0 in js::coverage::InitLCov() ()
   from /nix/store/6g0x5c8vrnc7s71h2v8g461rrxr4pfa1-spidermonkey-78.11.0/lib/libmozjs-78.so
(gdb) backtrace full
#0  0x00007ffff7080fc0 in js::coverage::InitLCov() ()
   from /nix/store/6g0x5c8vrnc7s71h2v8g461rrxr4pfa1-spidermonkey-78.11.0/lib/libmozjs-78.so
No symbol table info available.
#1  0x00007ffff70c4535 in JS::detail::InitWithFailureDiagnostic(bool) ()
   from /nix/store/6g0x5c8vrnc7s71h2v8g461rrxr4pfa1-spidermonkey-78.11.0/lib/libmozjs-78.so
No symbol table info available.
#2  0x00007ffff7dac5e2 in g_type_class_ref ()
   from /nix/store/9jvzb0zwl093dwj3i12ls068k4dv911z-glib-2.68.2/lib/libgobject-2.0.so.0
No symbol table info available.
#3  0x00007ffff7d961d8 in g_object_new_with_properties ()
   from /nix/store/9jvzb0zwl093dwj3i12ls068k4dv911z-glib-2.68.2/lib/libgobject-2.0.so.0
No symbol table info available.
#4  0x00007ffff7d96b71 in g_object_new ()
   from /nix/store/9jvzb0zwl093dwj3i12ls068k4dv911z-glib-2.68.2/lib/libgobject-2.0.so.0
No symbol table info available.
#5  0x000000000040d75a in polkit_backend_authority_get ()
No symbol table info available.
#6  0x000000000040be8e in main ()
No symbol table info available.
(gdb) 
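One quick consistency check between this gdb run and the earlier kernel log: mappings are page-aligned, so the low 12 bits of an instruction address stay the same no matter where the library gets loaded. A minimal Python sketch, assuming 4 KiB pages (the names PAGE_MASK and page_offset are mine):

```python
PAGE_MASK = 0xFFF  # assumes 4 KiB pages


def page_offset(address: int) -> int:
    """Return the offset of an address within its page."""
    return address & PAGE_MASK


kernel_ip = 0x00007F00E52AEFC0  # "ip" from the kernel segfault line in the journal
gdb_pc = 0x00007FFFF7080FC0     # frame #0 address from the gdb backtrace
print(hex(page_offset(kernel_ip)), hex(page_offset(gdb_pc)))
```

Both come out as 0xfc0, which is consistent with (though not proof of) gdb reproducing the same crash that systemd was hitting.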
ErnWong commented 2 years ago

The source code appears to be https://searchfox.org/mozilla-central/rev/dfc0dea63a16b73078a46b6ae49b2a626b8c11b5/js/src/vm/CodeCoverage.cpp#499-504

void InitLCov() {
  const char* outDir = getenv("JS_CODE_COVERAGE_OUTPUT_DIR");
  if (outDir && *outDir != 0) {
    EnableLCov();
  }
}

Possible segfault locations:

  1. Dereferencing outDir, but the pointer is clearly guarded by the null check.
  2. EnableLCov(), but self should still be in scope and available...
  3. getenv, but can getenv segfault??

ErnWong commented 2 years ago

Might just try upgrading polkit. Currently on polkit 0.118 (Sept 8 2020, https://gitlab.freedesktop.org/polkit/polkit/-/blob/master/NEWS) and libmozjs-78.

ErnWong commented 2 years ago

The strange thing is that segfault error 7 is an attempt to write to a region that is read-only, but I don't see any writes apart from the initialization of outDir.
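To double-check that reading of error 7, here's a small Python sketch that decodes the number as an x86 page-fault error code bit field (bit 0: the page was present, i.e. a protection violation rather than a missing page; bit 1: write; bit 2: user mode; bit 4: instruction fetch). The function name decode_page_fault is my own, and it assumes the "error N" in the kernel's segfault line is this hardware error code on x86-64.

```python
def decode_page_fault(error_code: int) -> list[str]:
    """Describe the low bits of an x86 page-fault error code."""
    flags = [
        # (bit, meaning when set, meaning when clear)
        (0x01, "protection violation (page present)", "page not present"),
        (0x02, "write access", "read access"),
        (0x04, "user mode", "kernel mode"),
    ]
    described = [when_set if error_code & bit else when_clear
                 for bit, when_set, when_clear in flags]
    if error_code & 0x10:
        described.append("instruction fetch")
    return described


print(decode_page_fault(7))
```

So 7 is a user-mode write to a page that is mapped but not writable, which fits the read-only theory. An error of 4 or 6 would have meant an unmapped address instead.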

ErnWong commented 2 years ago

The latest version available is polkit 0.120

https://github.com/NixOS/nixpkgs/blob/64065d76f434457073f5d255a3246658119e08ed/pkgs/development/libraries/polkit/default.nix#L41

ErnWong commented 2 years ago

Just finished updating to the unstable NixOS packages, and now polkit is running on version 0.120 without crashing (for now). Will now restart the system to see how everything works.

For some reason, activating home manager failed because apparently /home/nixos/.config/fontconfig/conf.d/10-hm-fonts.conf was in the way. Did KDE overwrite the file in some way? Not sure. Moved it out to the dotfiles directory (unstaged) for now.

ErnWong commented 2 years ago

Sweet, KDE Plasma now loads pretty quickly without problems, and the power buttons are finally back!

ErnWong commented 2 years ago

Hmm... KRunner doesn't show up with Alt+Space, but it does work when I start typing on the desktop, and I do see the shortcut "Alt+Space" configured in the system settings.

Closing this issue, as the original polkit problem seems solved.

ErnWong commented 2 years ago

Fix version: https://github.com/ErnWong/dotfiles/commit/9fa793fa562733c316a017b82e22ebd5e0c5abf3