elementary / greeter

Login and Lock Screen greeter for elementary OS and Pantheon, using LightDM
GNU General Public License v3.0
152 stars 39 forks source link

Greeter does not load correctly on boot #578

Closed damiante closed 2 years ago

damiante commented 2 years ago

What Happened?

When booting, some issue occurs between lightdm and greeter. I often see Greeter load with the lock screen for a brief moment, but it will crash to a black screen with a blinking cursor or my motherboard's manufacturer logo. From this screen TTY can be enabled with CTRL + ALT + F2 and restarting lightdm can solve the issue but does not consistently do so. This occurs on almost all cold boots and most reboots. After successfully launching the service everything works correctly.

Please refer to this stack overflow topic: https://elementaryos.stackexchange.com/questions/29090/how-to-make-lightdm-more-stable-on-boot There are some logs provided in the topic

Initial suspicion is that this is config-related but I don't have much to go on. Happy to provide additional logs if asked.

Steps to Reproduce

  1. Boot system
  2. Usually, I will land on a screen with my motherboard's logo
  3. Enter TTY with CTRL + ALT +F2 and check /var/log/syslog for evidence that lightdm has failed on startup
  4. do sudo service lightdm restart; this may require several attempts
  5. Greeter initialises and the OS/GUI can be used as normal

Expected Behavior

The system correctly launches lightdm + greeter on every boot

OS Version

6.x (Odin)

Software Version

Latest release (I have run all updates)

Log Output

Contents of doing `sudo lightdm --test-mode --debug`: 
https://wtools.io/paste-code/b75Y

Additional logs provided on request

Hardware Info

output of lshw -short

H/W path                  Device          Class          Description
====================================================================
                                          system         System Product Name (SKU)
/0                                        bus            TUF GAMING B550M-PLUS
/0/0                                      memory         64KiB BIOS
/0/2c                                     memory         16GiB System Memory
/0/2c/0                                   memory         Project-Id-Version: lshwReport-Msgid-Bugs-To: FULL NAME 
/0/2c/1                                   memory         8GiB DIMM DDR4 Synchronous Unbuffered (Unregistered) 2400 MHz (0.4 ns)
/0/2c/2                                   memory         Project-Id-Version: lshwReport-Msgid-Bugs-To: FULL NAME 
/0/2c/3                                   memory         8GiB DIMM DDR4 Synchronous Unbuffered (Unregistered) 2400 MHz (0.4 ns)
/0/2f                                     memory         384KiB L1 cache
/0/30                                     memory         3MiB L2 cache
/0/31                                     memory         32MiB L3 cache
/0/32                                     processor      AMD Ryzen 5 5600X 6-Core Processor
/0/100                                    bridge         Starship/Matisse Root Complex
/0/100/0.2                                generic        Starship/Matisse IOMMU
/0/100/1.2                                bridge         Starship/Matisse GPP Bridge
/0/100/1.2/0                              bus            Advanced Micro Devices, Inc. [AMD]
/0/100/1.2/0/0            usb1            bus            xHCI Host Controller
/0/100/1.2/0/0/6                          input          AURA LED Controller
/0/100/1.2/0/0/7                          bus            USB2.0 Hub
/0/100/1.2/0/0/7/4                        communication  Bluetooth wireless interface
/0/100/1.2/0/0/8                          input          USB Multimedia Keyboard
/0/100/1.2/0/0/9                          input          USB Optical Mouse
/0/100/1.2/0/1            usb2            bus            xHCI Host Controller
/0/100/1.2/0.1                            storage        Advanced Micro Devices, Inc. [AMD]
/0/100/1.2/0.2                            bridge         Advanced Micro Devices, Inc. [AMD]
/0/100/1.2/0.2/0                          bridge         Advanced Micro Devices, Inc. [AMD]
/0/100/1.2/0.2/4                          bridge         Advanced Micro Devices, Inc. [AMD]
/0/100/1.2/0.2/4/0                        storage        NVMe SSD Controller SM981/PM981/PM983
/0/100/1.2/0.2/4/0/0      /dev/nvme0      storage        Samsung SSD 970 EVO Plus 1TB
/0/100/1.2/0.2/4/0/0/1    /dev/nvme0n1    disk           1TB NVMe namespace
/0/100/1.2/0.2/4/0/0/1/1  /dev/nvme0n1p1  volume         263MiB Windows FAT volume (NB: Is this the Elementary boot partition? I don't think I've ever installed Windows on this disk, only booted from an image to it)
/0/100/1.2/0.2/4/0/0/1/2  /dev/nvme0n1p2  volume         931GiB EFI partition
/0/100/1.2/0.2/8                          bridge         Advanced Micro Devices, Inc. [AMD]
/0/100/1.2/0.2/8/0        wlp5s0          network        Wi-Fi 6 AX200
/0/100/1.2/0.2/9                          bridge         Advanced Micro Devices, Inc. [AMD]
/0/100/1.2/0.2/9/0        enp6s0          network        RTL8125 2.5GbE Controller
/0/100/3.1                                bridge         Starship/Matisse GPP Bridge
/0/100/3.1/0                              display        NVIDIA Corporation
/0/100/3.1/0.1                            multimedia     NVIDIA Corporation
/0/100/7.1                                bridge         Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B]
/0/100/7.1/0                              generic        Starship/Matisse PCIe Dummy Function
/0/100/8.1                                bridge         Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B]
/0/100/8.1/0                              generic        Starship/Matisse Reserved SPP
/0/100/8.1/0.1                            generic        Starship/Matisse Cryptographic Coprocessor PSPCPP
/0/100/8.1/0.3                            bus            Matisse USB 3.0 Host Controller
/0/100/8.1/0.3/0          usb3            bus            xHCI Host Controller
/0/100/8.1/0.3/0/4                        multimedia     C922 Pro Stream Webcam
/0/100/8.1/0.3/1          usb4            bus            xHCI Host Controller
/0/100/8.1/0.4                            multimedia     Starship/Matisse HD Audio Controller
/0/100/14                                 bus            FCH SMBus Controller
/0/100/14.3                               bridge         FCH LPC Bridge
/0/101                                    bridge         Starship/Matisse PCIe Dummy Host Bridge
/0/102                                    bridge         Starship/Matisse PCIe Dummy Host Bridge
/0/103                                    bridge         Starship/Matisse PCIe Dummy Host Bridge
/0/104                                    bridge         Starship/Matisse PCIe Dummy Host Bridge
/0/105                                    bridge         Starship/Matisse PCIe Dummy Host Bridge
/0/106                                    bridge         Starship/Matisse PCIe Dummy Host Bridge
/0/107                                    bridge         Starship/Matisse PCIe Dummy Host Bridge
/0/108                                    bridge         Matisse Device 24: Function 0
/0/109                                    bridge         Matisse Device 24: Function 1
/0/10a                                    bridge         Matisse Device 24: Function 2
/0/10b                                    bridge         Matisse Device 24: Function 3
/0/10c                                    bridge         Matisse Device 24: Function 4
/0/10d                                    bridge         Matisse Device 24: Function 5
/0/10e                                    bridge         Matisse Device 24: Function 6
/0/10f                                    bridge         Matisse Device 24: Function 7
/0/1                                      system         PnP device PNP0c01
/0/2                                      system         PnP device PNP0c02
/0/3                                      system         PnP device PNP0b00
/0/4                                      system         PnP device PNP0c02
/0/5                                      communication  PnP device PNP0501
/0/6                                      system         PnP device PNP0c02
/0/7                      scsi3           storage        
/0/7/0.0.0                /dev/sda        disk           2TB ST2000DM008-2FR1
/0/7/0.0.0/1              /dev/sda1       volume         1863GiB EXT4 volume
/1                        virbr0-nic      network        Ethernet interface
damiante commented 2 years ago

Every time I boot with this issue I have another look and see if I can find some more info :)

This time I found the following snippet in /var/log/syslog:

Nov  7 10:05:09 PorygonZ systemd[1]: Started Session c2 of user lightdm.
Nov  7 10:05:09 PorygonZ rtkit-daemon[1446]: Supervising 3 threads of 1 processes of 1 users.
Nov  7 10:05:09 PorygonZ rtkit-daemon[1446]: Successfully made thread 1699 of process 1437 owned by '1
14' RT at priority 5.
Nov  7 10:05:09 PorygonZ rtkit-daemon[1446]: Supervising 4 threads of 1 processes of 1 users.
Nov  7 10:05:09 PorygonZ rtkit-daemon[1446]: Supervising 3 threads of 1 processes of 1 users.
Nov  7 10:05:09 PorygonZ rtkit-daemon[1446]: Successfully made thread 1700 of process 1437 owned by '1
14' RT at priority 5.
Nov  7 10:05:09 PorygonZ rtkit-daemon[1446]: Supervising 4 threads of 1 processes of 1 users.
Nov  7 10:05:09 PorygonZ kernel: [   11.706589] show_signal: 30 callbacks suppressed
Nov  7 10:05:09 PorygonZ kernel: [   11.706591] traps: io.elementary.g[1663] general protection fault 
ip:7fbd496d1160 sp:7ffc87500c58 error:0 in libgobject-2.0.so.0.6400.6[7fbd496a3000+36000]
Nov  7 10:05:09 PorygonZ acpid: client 1629[0:0] has disconnected
Nov  7 10:05:09 PorygonZ rtkit-daemon[1446]: Supervising 3 threads of 1 processes of 1 users.
Nov  7 10:05:09 PorygonZ rtkit-daemon[1446]: Successfully made thread 1729 of process 1437 owned by '114' RT at priority 5.
Nov  7 10:05:09 PorygonZ rtkit-daemon[1446]: Supervising 4 threads of 1 processes of 1 users.
Nov  7 10:05:09 PorygonZ systemd[1]: lightdm.service: Main process exited, code=exited, status=1/FAILURE
Nov  7 10:05:09 PorygonZ systemd[1]: lightdm.service: Failed with result 'exit-code'.
Nov  7 10:05:10 PorygonZ systemd[1]: lightdm.service: Scheduled restart job, restart counter is at 2.
Nov  7 10:05:10 PorygonZ systemd[1]: Stopped Light Display Manager.
Nov  7 10:05:10 PorygonZ systemd[1]: Starting Detect the available GPUs and deal with any system changes...

I think the smoking gun here is: traps: io.elementary.g[1663] general protection fault ip:7fbd496d1160 sp:7ffc87500c58 error:0 in libgobject-2.0.so.0.6400.6[7fbd496a3000+36000]

Is there a way I can use these IDs to trace exactly what the issue is? Perhaps my GPU isn't initialising before lightdm tries to start? I'd find that a little surprising as I can see the greeter screen flicker to life briefly before crashing, one imagines that if my GPU was uninitialised it wouldn't be able to render at all (I don't have onboard graphics on my CPU).

damiante commented 2 years ago

Done even more digging with help from the ArchWiki's article on LightDM in /var/log/lighdm/seat0-greeter.log the following 2 errors are repeated over and over:

(io.elementary.greeter:1440): Handy-WARNING **: 07:38:01.609: Failed to load icon: Error opening file /home/damian/.face: No such file or directory

** (io.elementary.greeter:1440): CRITICAL **: 07:38:01.633: MainWindow.vala:358: prompt: `Password: ' (1)

I gathered that the first issue is related to not having a profile icon set; I went and set this in the system settings and also manually placed a 96x96 png file there for good measure after saving my system settings didn't appear to generate the file. I don't think this will solve the problem but I'll let you know if it does on my next boot :)

The second issue seems more insidious and is also more cryptic. Again, just looks like Greek to me as I don't know what to do with these numbers. My guess is that io.elementary.greeter:1440 is a PID, 07:38:01.633 is probably a timestamp, but what does MainWindow.vala:358 mean and why is it critical?

thmey commented 2 years ago

@damiante ** (io.elementary.greeter:1440): CRITICAL **: 07:38:01.633: MainWindow.vala:358: prompt:Password: ' (1)` this only seems to be a log message. https://github.com/elementary/greeter/blob/dd932147a1074c053fa21d163c18808877482702/src/MainWindow.vala#L357-L371

I think you are right here:

I think the smoking gun here is: traps: io.elementary.g[1663] general protection fault ip:7fbd496d1160 sp:7ffc87500c58 error:0 in libgobject-2.0.so.0.6400.6[7fbd496a3000+36000]

This explains how to interpret those addresses https://stackoverflow.com/questions/60275576/how-to-interpret-linux-kernel-trap-error-line

I already tried to get the function name via addr2line but it only return ??:0, event with libglib2.0-0-dbgsym installed.

@danrabbit I am not experienced with debugging vala or lightdm. Do you have any hints on how to tackle this? The only option I currently see is git bisect, which might take a very long time, since I can only reproduce this bug on my productive system.

Also #568 seems to be the same issue.

damiante commented 2 years ago

Thanks for your reply @thmey , I'll dig into this tomorrow. Interestingly I also have a multiple-monitor setup, which seems to be a common thread. Perhaps there is some misbehaviour associated with multi-mon on boot? I may also try booting with just one monitor and seeing if I can reliably reproduce the issue that way.

thmey commented 2 years ago

6f18c79c780582e43039032f6926816efa82e206 seems to cause the issue.

damiante commented 2 years ago

Hmm, I don't really understand the context of this issue. If this is the cause, is it something I can fix? Seems like I'd need to PR -> Merge -> remote build -> OS update, and that's assuming I knew what the issue was and how to fix it :P

thmey commented 2 years ago

@damiante

Seems like I'd need to PR -> Merge -> remote build -> OS update, and that's assuming I knew what the issue was and how to fix it :P

Thats right, but I think the issue is related to custom gtk themes. Any chance you installed custom themes? I just removed all custom themes and the issue seems to be gone. I ran sudo service lightdm restart multiple times to check.

damiante commented 2 years ago

I've never installed a custom gtk theme myself so if any of these count as "custom" then they must have come with something else unrelated that I've installed. Since I don't know what they're used for I'm not game to remove them unless I have to

damian@PorygonZ:~$ ls /usr/share/themes
Default                             io.elementary.stylesheet.lime
Emacs                               io.elementary.stylesheet.mint
io.elementary.stylesheet.banana     io.elementary.stylesheet.orange
io.elementary.stylesheet.blueberry  io.elementary.stylesheet.slate
io.elementary.stylesheet.bubblegum  io.elementary.stylesheet.strawberry
io.elementary.stylesheet.cocoa      Raleigh
io.elementary.stylesheet.grape

Incidentally I disconnected my second monitor and rebooted hot this morning and it worked when the cold boot with two monitors failed. Could be confirmation bias though :)

thmey commented 2 years ago

Those are the default themes. I might be wrong though regarding the themes causing the issue, since I both reinstalled the greeter and removed themes after git bisect.

@damiante could you confirm that the issue occurs at commit 6f18c79 and not before? You would have to compile and install the greeter yourself.

damiante commented 2 years ago

Not super confident with this but I presume after compiling I need to change the binary path somewhere in my config; doesn't seem to be in /usr/share/lightdm/lightdm.conf.d/40-io.elementary.greeter.conf (refers to a .desktop file as I recall) or /usr/share/xsessions/pantheon.desktop (refers to something called io.elementary.wingpanel). If you tell me what I need to replace with the compiled binary path I'll give it a crack.

thmey commented 2 years ago

To compile and install: meson build --prefix=/usr cd build ninja sudo ninja install

After that run sudo lightdm restart to test the greeter.

782191d1ed6bc9078d2ff3704b331699eb74702f This commit should not contain the bug and this 6f18c79c780582e43039032f6926816efa82e206 should contain it.

After testing run sudo apt install --reinstall io.elementary.greeter

damiante commented 2 years ago

@thmey you are spot on! I can't believe you managed to work it out and I'm so pleased! I can restart lightdm as much as I like on that build and it works fine, but as soon as I reinstall from apt and restart it breaks. This appears to be the problem. Thank you so much!

thmey commented 2 years ago

@damiante have you checked both commits to verify my findings? Unfortunately I don't know how whats causing the issue and how to fix it.

damiante commented 2 years ago

@thmey yes I built and swapped the installs of both and 782191d1ed6bc9078d2ff3704b331699eb74702f behaves as I would expect but 6f18c79c780582e43039032f6926816efa82e206 produces the issue. I guess for now I can keep using the 782191d1ed6bc9078d2ff3704b331699eb74702f version since actually consistently loading my GUI when I boot is more important to me than whatever features were added in between :)

topatlant commented 2 years ago

I'm having the same problem here, even without multiple monitors and no hybrid graphics / multiseat setup, I only have an integrated GPU.

My HW is Intel Skylake GT2 [HD Graphics 520] on a Thinkpad T460...

My experience is consistent with your descriptions, the loops seem to have started after the update of greeter to 6.0.1 came in. When waiting really long, the greeter loop eventually stops (sth like 40-100 restarts or so) and I have many gsd-sound and gsd-a11y-settings processes left running...

marmeladapk commented 2 years ago

I can confirm that reverting 6f18c79 fixes the problem.

oddveryodd commented 2 years ago

I poked around with this even though I am not too knowledgeable of much of anything here. I believe this is a use-after-free issue where logged_in_context is used in update_style(); commenting out that use causes the issue to go away for me.

6f18c79 adds a new var logged_in which is local to the construct{} function. logged_in.get_style_context() is saved as a class private, but I don't see where logged_in itself is saved. I changed logged_in to be a class private and I am no longer able to reproduce this issue. Someone who knows what they are doing should weigh in; I am happy to be wrong.

diff --git a/src/Cards/UserCard.vala b/src/Cards/UserCard.vala
index 83df22c..8cd917e 100644
--- a/src/Cards/UserCard.vala
+++ b/src/Cards/UserCard.vala
@@ -42,6 +42,7 @@ public class Greeter.UserCard : Greeter.BaseCard {
     private Gtk.Stack login_stack;
     private Greeter.PasswordEntry password_entry;

+    private SelectionCheck logged_in;
     private unowned Gtk.StyleContext logged_in_context;
     private weak Gtk.StyleContext main_grid_style_context;
     private weak Gtk.StyleContext password_entry_context;
@@ -214,7 +215,7 @@ public class Greeter.UserCard : Greeter.BaseCard {
         };
         avatar_overlay.add (avatar);

-        var logged_in = new SelectionCheck () {
+        logged_in = new SelectionCheck () {
             halign = Gtk.Align.END,
             valign = Gtk.Align.END
         };
bobby285271 commented 2 years ago

This issue also happens on NixOS as reported in https://github.com/NixOS/nixpkgs/issues/151609, and I can confirm the above patch fixes the issue on our side. @oddveryodd Do you mind file a pull request at your convenience?