ValveSoftware / SteamOS

SteamOS community tracker
Alienware Steam Machine freezing up in a kernel panic when Steam starts #685

Open sharkwouter opened 5 years ago

sharkwouter commented 5 years ago

Your system information

Please describe your issue in as much detail as possible:

When booting SteamOS on my Alienware Steam Machine (I think it is the first version; it has a GTX 860), the entire system seems to freeze for a while (at least 20 seconds, I didn't time it) once Steam has started. During this time all my USB peripherals shut off: all the lights on my keyboard go out, my wireless mouse stops working, and my Xbox 360 receiver goes dark as well.

After the freeze, the system seems to work like normal. The dmesg output seems to suggest it is a kernel panic related to the driver for the colored lights on the system. You can find the output here: https://pastebin.com/4tx6d3bP

In the Alien FX menu, changing the lights works normally, though. This is an Alienware Steam Machine with an i3 and 8 GB of memory, and I replaced the hard disk with an SSD.

Steps for reproducing this issue:

  1. Press the on button
  2. Wait for the startup animation of Steam to finish
  3. The system freezes

3vi1 commented 5 years ago

I turned my Alienware Steam Machine on for the first time in nearly a year and am seeing similar symptoms. However, mine doesn't work normally after the freeze ends: it reboots after I log in, and sometimes even before I log in. :\

Oct  5 19:37:23 steamos systemd[1]: Started Light Display Manager.
Oct  5 19:37:24 steamos kernel: [  458.394165] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000d0000-0x000d3fff window]
Oct  5 19:37:24 steamos kernel: [  458.394319] caller _nv001094rm+0xe3/0x1d0 [nvidia] mapping multiple BARs
Oct  5 19:37:24 steamos acpid: client connected from 2207[0:0]
Oct  5 19:37:24 steamos acpid: 1 client rule loaded
Oct  5 19:37:24 steamos systemd[1]: Starting Session 2 of user steam.
Oct  5 19:37:24 steamos systemd[1]: Started Session 2 of user steam.
Oct  5 19:37:37 steamos kernel: [  463.777985] ------------[ cut here ]------------
Oct  5 19:37:37 steamos kernel: [  463.777990] kernel BUG at /usr/src/packages/BUILD/mm/slub.c:3904!
Oct  5 19:37:37 steamos kernel: [  463.777995] invalid opcode: 0000 [#2] SMP PTI
Oct  5 19:37:37 steamos kernel: [  463.777998] CPU: 4 PID: 3080 Comm: alienware_wmi_c Tainted: P      D    O      4.19.0-0.steamos2.3-amd64 #1 Debian 4.19.45-1~steamos2.1
Oct  5 19:37:37 steamos kernel: [  463.778000] Hardware name: Alienware ASM100/0J8H4R, BIOS A04 07/14/2015
Oct  5 19:37:37 steamos kernel: [  463.778006] RIP: 0010:kfree+0x159/0x180
Oct  5 19:37:37 steamos kernel: [  463.778009] Code: ff ff 48 89 d9 48 89 da 41 b8 01 00 00 00 5b 5d 41 5c 4c 89 d6 e9 c7 f7 ff ff 49 8b 02 f6 c4 80 75 0a 49 8b 42 08 a8 01 75 02 <0f> 0b 49 8b 02 31 f6 f6 c4 80 74 05 41 0f b6 72 51 5b 5d 41 5c 4c
Oct  5 19:37:37 steamos kernel: [  463.778011] RSP: 0018:ffffa00d42907de8 EFLAGS: 00010246
Oct  5 19:37:37 steamos kernel: [  463.778013] RAX: fffff285c7b10f88 RBX: ffffffff82e3eecc RCX: 0000000000000000
Oct  5 19:37:37 steamos kernel: [  463.778015] RDX: 0000000000000000 RSI: ffff945496b255e0 RDI: ffffffff82e3eecc
Oct  5 19:37:37 steamos kernel: [  463.778017] RBP: 0000000000000000 R08: 00000000000255e0 R09: ffffffff8308b2bf
Oct  5 19:37:37 steamos kernel: [  463.778018] R10: fffff285c7b10f80 R11: 000000000000005f R12: ffffffffc017d3c4
Oct  5 19:37:37 steamos kernel: [  463.778020] R13: ffff94548da43180 R14: ffffa00d42907ee0 R15: ffff94548da431a0
Oct  5 19:37:37 steamos kernel: [  463.778023] FS:  00007fc33aceb700(0000) GS:ffff945496b00000(0000) knlGS:0000000000000000
Oct  5 19:37:37 steamos kernel: [  463.778024] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct  5 19:37:37 steamos kernel: [  463.778026] CR2: 00000000023e5000 CR3: 00000001aea20005 CR4: 00000000001606e0
Oct  5 19:37:37 steamos kernel: [  463.778028] Call Trace:
Oct  5 19:37:37 steamos kernel: [  463.778036]  alienware_wmax_command+0x74/0xc0 [alienware_wmi]
Oct  5 19:37:37 steamos kernel: [  463.778041]  ? _cond_resched+0x15/0x30
Oct  5 19:37:37 steamos kernel: [  463.778044]  ? __kmalloc+0x5c/0x210
Oct  5 19:37:37 steamos kernel: [  463.778047]  toggle_hdmi_source+0x4b/0xd0 [alienware_wmi]
Oct  5 19:37:37 steamos kernel: [  463.778051]  kernfs_fop_write+0x113/0x190
Oct  5 19:37:37 steamos kernel: [  463.778055]  vfs_write+0xb3/0x1a0
Oct  5 19:37:37 steamos kernel: [  463.778058]  ksys_write+0x5a/0xd0
Oct  5 19:37:37 steamos kernel: [  463.778063]  do_syscall_64+0x61/0x320
Oct  5 19:37:37 steamos kernel: [  463.778067]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Oct  5 19:37:37 steamos kernel: [  463.778069] RIP: 0033:0x7fc33a1d4c20
Oct  5 19:37:37 steamos kernel: [  463.778072] Code: 73 01 c3 48 8b 0d 68 92 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d bd eb 2c 00 00 75 10 b8 01 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 ce 8f 01 00 48 89 04 24
Oct  5 19:37:37 steamos kernel: [  463.778073] RSP: 002b:00007ffe51f87968 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
Oct  5 19:37:37 steamos kernel: [  463.778076] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fc33a1d4c20
Oct  5 19:37:37 steamos kernel: [  463.778077] RDX: 0000000000000004 RSI: 00000000023dd408 RDI: 0000000000000001
Oct  5 19:37:37 steamos kernel: [  463.778079] RBP: 00000000023dd408 R08: 000000000000000a R09: 00007fc33aceb700
Oct  5 19:37:37 steamos kernel: [  463.778081] R10: 00007ffe51f877c0 R11: 0000000000000246 R12: 00007fc33a49f2a0
Oct  5 19:37:37 steamos kernel: [  463.778082] R13: 0000000000000004 R14: 0000000000000001 R15: 0000000000000000
Oct  5 19:37:37 steamos kernel: [  463.778085] Modules linked in: uinput ctr ccm bnep binfmt_misc btrfs zstd_compress libcrc32c crc32c_generic zstd_decompress xxhash xor nls_ascii nls_cp437 vfat fat raid6_pq btusb btrtl btbcm btintel bluetooth joydev intel_rapl drbg ansi_cprng ecdh_generic snd_hda_codec_hdmi arc4 x86_pkg_temp_thermal intel_powerclamp coretemp iwlmvm kvm_intel mac80211 kvm snd_hda_codec_realtek irqbypass snd_hda_codec_generic crct10dif_pclmul snd_hda_intel crc32_pclmul ghash_clmulni_intel snd_hda_codec snd_hda_core snd_hwdep iTCO_wdt iTCO_vendor_support snd_pcm iwlwifi intel_cstate intel_uncore snd_timer intel_rapl_perf snd soundcore cfg80211 pcc_cpufreq mei_me pcspkr rfkill alienware_wmi wmi_bmof mei lpc_ich evdev efi_pstore efivars nvidia_drm(PO) drm_kms_helper drm nvidia_modeset(PO) nvidia(PO) ipmi_devintf ipmi_msghandler
Oct  5 19:37:37 steamos kernel: [  463.778132]  fuse autofs4 hid_steam ext4 crc16 mbcache jbd2 fscrypto uas usb_storage hid_generic usbhid hid sg sd_mod crc32c_intel ahci libahci ehci_pci aesni_intel xhci_pci aes_x86_64 ehci_hcd libata r8169 crypto_simd xhci_hcd realtek cryptd glue_helper scsi_mod i2c_i801 libphy usbcore usb_common fan thermal wmi video button
Oct  5 19:37:37 steamos kernel: [  463.778159] ---[ end trace aa2eca6aa3c09dac ]---
Oct  5 19:37:37 steamos kernel: [  472.132029] sched: RT throttling activated
Oct  5 19:37:37 steamos kernel: [  472.132750] hpet1: lost 534 rtc interrupts
Oct  5 19:37:46 steamos kernel: [  480.399162] hpet1: lost 528 rtc interrupts
Oct  5 19:37:54 steamos kernel: [  488.631586] hpet1: lost 525 rtc interrupts
Oct  5 19:38:02 steamos kernel: [  496.919534] iwlwifi 0000:04:00.0: Queue 0 is active on fifo 7 and stuck for 2500 ms. SW [113, 115] HW [115, 115] FH TRB=0x0700072
Oct  5 19:38:02 steamos kernel: [  496.920550] hpet1: lost 528 rtc interrupts
Oct  5 19:38:02 steamos systemd[1]: Started SteamOS Autorepair.
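
For anyone comparing their own logs against the trace above, here is a minimal Python sketch that pulls the key facts out of a kernel BUG report in syslog format: the source location of the BUG, the process name, the faulting function, and the reliable call-trace frames with their modules. The function name `summarize_oops` and the regular expressions are my own assumptions, written against the line layout shown in this log only; other kernel versions may format these lines differently.

```python
import re

# Patterns written against the log lines quoted in this thread (assumed layout).
BUG_RE = re.compile(r"kernel BUG at (\S+?):(\d+)!")
COMM_RE = re.compile(r"Comm: (\S+)")
RIP_RE = re.compile(r"RIP: [0-9a-f]{4}:([\w.]+)\+0x")
FRAME_RE = re.compile(
    r"\]\s+(\? )?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(\w+)\])?\s*$"
)


def summarize_oops(lines):
    """Return BUG location, process name, faulting function and call trace."""
    summary = {"bug_at": None, "comm": None, "rip": None, "frames": []}
    in_trace = False
    for line in lines:
        m = BUG_RE.search(line)
        if m:
            summary["bug_at"] = (m.group(1), int(m.group(2)))
        m = COMM_RE.search(line)
        if m and summary["comm"] is None:
            summary["comm"] = m.group(1)
        m = RIP_RE.search(line)
        if m and summary["rip"] is None:
            summary["rip"] = m.group(1)
        if "Call Trace:" in line:
            in_trace = True
            continue
        if in_trace:
            m = FRAME_RE.search(line)
            if m is None:
                # First line that is not a stack frame ends the trace.
                in_trace = False
                continue
            if m.group(1):
                # Frames prefixed with "?" are unreliable guesses; skip them.
                continue
            # (function name, module name or None for built-in code)
            summary["frames"].append((m.group(2), m.group(3)))
    return summary
```

Feeding it the lines of a saved log, for example `summarize_oops(open("syslog.txt"))` with a hypothetical file name, gives a quick view of which module sits at the top of the trace without reading the register dump.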

3vi1 commented 5 years ago

I restored the machine from a Clonezilla image I had made in mid-2016 and everything was working fine again. So, it's definitely not a hardware issue.

Updating the system to current packages re-introduces the kernel panics.

I'm surprised there aren't more Alienware users reporting issues.

3vi1 commented 5 years ago

Okay... I got the system back into a state where I can successfully log in and use it again (not sure yet if it's going to try to verify the installation every time) by doing the following:

1) Blacklist the alienware_wmi module.

This stops the kernel fault and long hang that was present before the login prompt, but then my system was still just sitting there loading after I selected my user login - never going to the main interface.
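
For reference, blacklisting a module usually means dropping a one-line file into /etc/modprobe.d. The Python sketch below writes such a file; the helper name `blacklist_module` and the file name `blacklist-<module>.conf` are assumptions chosen for illustration, while the `blacklist` directive itself is the standard modprobe.d syntax. It needs root to write to the real directory, and it only prevents automatic loading of the module.

```python
from pathlib import Path


def blacklist_module(module, conf_dir="/etc/modprobe.d"):
    """Write a modprobe.d file that stops `module` from being auto-loaded.

    The file name is a hypothetical choice; any *.conf name in the
    directory is read by modprobe.
    """
    conf_path = Path(conf_dir) / f"blacklist-{module}.conf"
    conf_path.write_text(f"blacklist {module}\n")
    return conf_path


if __name__ == "__main__":
    # Requires root when targeting the real /etc/modprobe.d directory.
    print(blacklist_module("alienware_wmi"))
```

Deleting the generated file and rebooting undoes the change.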

I looked in /home/steam/.xsession-errors and found messages indicating that something was trying to preload /usr/lib/i386-linux-gnu/libmodeswitch_inhibitor.so and, of course, failing because it's the wrong ELF class (mine's the top-end 64-bit Alienware system).
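
To check which ELF class a given library actually has, the fifth byte of the file (the EI_CLASS field of the ELF identification header) is enough: 1 means ELFCLASS32 and 2 means ELFCLASS64. A minimal Python sketch, where the function name `elf_class` is my own and not part of any SteamOS tooling:

```python
ELF_MAGIC = b"\x7fELF"
ELF_CLASSES = {1: "ELFCLASS32", 2: "ELFCLASS64"}


def elf_class(path):
    """Return the ELF class of a file, or None if it is not an ELF file."""
    with open(path, "rb") as f:
        ident = f.read(5)  # 4 magic bytes + EI_CLASS
    if len(ident) < 5 or ident[:4] != ELF_MAGIC:
        return None
    return ELF_CLASSES.get(ident[4], "unknown")


if __name__ == "__main__":
    # Path taken from the comment above; it may not exist on other systems.
    print(elf_class("/usr/lib/i386-linux-gnu/libmodeswitch_inhibitor.so"))
```

Running it against both the i386 and the 64-bit copy of the library shows which one a 64-bit process can preload.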

Since I have no idea where the preload comes in, I just did a dirty hack around it:

2) Rename /usr/lib/i386-linux-gnu/libmodeswitch_inhibitor.so and replace it with a symlink to the 64-bit shared library.

Now the system lets me log in.

I have to guess not many Alienware owners are still using their systems on brewmaster_beta, or they'd all have issues.

dubigrasu commented 5 years ago

The "/usr/bin/steamos-session" itself is preloading the modeswitch_inhibitor, and you don't need to rename it, the wrong ELF class will be just ignored.

3vi1 commented 5 years ago

> you don't need to rename it, the wrong ELF class will be just ignored.

That doesn't appear to be true: it was stuck in a loop, spitting failures to xsession-errors for 20 minutes, and would not proceed. First I tried renaming it so that it would just not be found, and that did nothing except change the xsession error to say it wasn't found. But as soon as I redirected it to the 64-bit library and the loop iterated, it proceeded to the Big Picture interface (without rebooting).

dubigrasu commented 5 years ago

The error messages "(wrong ELF class: ELFCLASS32): ignored" are normal though; I have them too in the xsession-errors file on my SteamOS installation. The steamos-session lists both architectures in its preload; it uses what it needs and discards (ignores) what it doesn't. If deleting/renaming the already ignored 32 bit lib somehow fixes your installation then something else is borked.

3vi1 commented 5 years ago

> If deleting/renaming the already ignored 32 bit lib somehow fixes your installation then something else is borked.

I have no doubt that's true - I still see "verifying installation" on every reboot (maybe that's normal and I just don't recall). But, not knowing too much about the Steam startup process, it was the most expedient way to get past the immediate problem.

There are now zero errors in any of the logs that I can see, so if there's a bigger problem - it's not giving any indicators. The real problem now is the alienware_wmi issue - I'm going to compile it myself from the latest kernel source and see if it has the same issue.

3vi1 commented 5 years ago

It works!

I grabbed https://github.com/torvalds/linux/blob/master/drivers/platform/x86/alienware-wmi.c, compiled it as a kernel module and put it in place over the existing 4.19 module. I de-blacklisted alienware-wmi, rebooted, and no longer have the kernel fault. I tested changing the light colors and it works fine too.

Homeshine commented 4 years ago

I'm having the exact same problem. The Steam client updated a couple of days ago and since then it just boots to black outer space and I can't do anything on the Steam Machine.

sharkwouter commented 4 years ago

After experiencing this issue for a while, I was no longer able to install a different operating system on my Steam Machine. The kernel panics dump logs into the EFI variable store, which fills it up to the point where it can no longer be written to safely (this means installing a new bootloader will fail). To fix this, go to /sys/firmware/efi/efivars and delete the dump-* files as root or with sudo.

It took me many hours to find this out. In the end I found these instructions on the Arch wiki: https://wiki.archlinux.org/index.php/Unified_Extensible_Firmware_Interface#Requirements_for_UEFI_variable_support
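
The dump-* cleanup described above can be sketched in Python as follows. The function names are hypothetical, the directory is the one named in this comment, and the script defaults to a dry run so it only lists what it would delete. Actually deleting requires root, and this assumes the dump-* entries are the only files that need to go.

```python
from pathlib import Path

EFIVARS_DIR = "/sys/firmware/efi/efivars"  # path from the comment above


def list_dump_vars(efivars_dir=EFIVARS_DIR):
    """Return the crash-dump variables (dump-* files) in the efivars directory."""
    return sorted(p for p in Path(efivars_dir).glob("dump-*") if p.is_file())


def remove_dump_vars(efivars_dir=EFIVARS_DIR, dry_run=True):
    """Delete dump-* files; with dry_run=True only report what would be removed."""
    names = []
    for path in list_dump_vars(efivars_dir):
        if not dry_run:
            path.unlink()
        names.append(path.name)
    return names


if __name__ == "__main__":
    # Dry run by default: prints the candidates without deleting anything.
    for name in remove_dump_vars():
        print(name)
```

Calling `remove_dump_vars(dry_run=False)` as root performs the deletion; other variables such as the Boot* entries are left untouched because only names starting with dump- are matched.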