ValveSoftware / steam-for-linux

Issue tracking for the Steam for Linux beta client

WorkThreadPool failed to shut down & game "corruption" on prefix creation #10058

Closed onegentig closed 1 year ago

onegentig commented 1 year ago

Two days ago (10. Sep), I found that Steam started behaving strangely – after closing, it could not be opened again (it had not fully closed before) and some games became not just unplayable, but also unmanageable (I couldn’t move them, uninstall them or even verify their integrity). I tried to narrow the issue down, and it strangely seems to be caused by Proton prefix creation for some games on some Proton versions…


Update: Duplicate of https://github.com/ValveSoftware/Proton/issues/6859. The cause seems to be overflow of address space leading to the corruption. Issue is known to devs but not resolved (as of 2. Oct). There are no workarounds, just gotta wait for NVIDIA or whoever to fix their crap.

Update 2: I got tired of waiting and started using Windows for games. If you don’t plan to wait for an eternity, this would be your best bet.


System Information

$ dnf list installed "*steam*"
```
Installed Packages
steam.i686            1.0.0.78-1.fc38    @rpmfusion-nonfree-steam
steam-devices.i686    1.0.0.78-1.fc38    @rpmfusion-nonfree-updates
```
$ dnf list installed "*nvidia*"
```
Installed Packages
akmod-nvidia.x86_64                           3:535.104.05-1.fc38    @rpmfusion-nonfree-nvidia-driver
kmod-nvidia-6.4.14-200.fc38.x86_64.x86_64     3:535.104.05-1.fc38    @@commandline
libva-nvidia-driver.x86_64                    0.0.10-3.fc38          @updates
nvidia-gpu-firmware.noarch                    20230804-153.fc38      @updates
nvidia-persistenced.x86_64                    3:535.104.05-1.fc38    @rpmfusion-nonfree-nvidia-driver
nvidia-settings.x86_64                        3:535.104.05-1.fc38    @rpmfusion-nonfree-nvidia-driver
xorg-x11-drv-nvidia.x86_64                    3:535.104.05-1.fc38    @rpmfusion-nonfree-nvidia-driver
xorg-x11-drv-nvidia-cuda.x86_64               3:535.104.05-1.fc38    @rpmfusion-nonfree-nvidia-driver
xorg-x11-drv-nvidia-cuda-libs.i686            3:535.104.05-1.fc38    @rpmfusion-nonfree-nvidia-driver
xorg-x11-drv-nvidia-cuda-libs.x86_64          3:535.104.05-1.fc38    @rpmfusion-nonfree-nvidia-driver
xorg-x11-drv-nvidia-kmodsrc.x86_64            3:535.104.05-1.fc38    @rpmfusion-nonfree-nvidia-driver
xorg-x11-drv-nvidia-libs.i686                 3:535.104.05-1.fc38    @rpmfusion-nonfree-nvidia-driver
xorg-x11-drv-nvidia-libs.x86_64               3:535.104.05-1.fc38    @rpmfusion-nonfree-nvidia-driver
xorg-x11-drv-nvidia-power.x86_64              3:535.104.05-1.fc38    @rpmfusion-nonfree-nvidia-driver
```
$ inxi -F
```
System:
  Host: fedora Kernel: 6.4.14-200.fc38.x86_64 arch: x86_64 bits: 64
  Desktop: GNOME v: 44.4 Distro: Fedora release 38 (Thirty Eight)
Machine:
  Type: Desktop System: Gigabyte product: Z270X-Ultra Gaming v: N/A serial:
  Mobo: Gigabyte model: Z270X-Ultra Gaming-CF v: x.x serial:
  UEFI: American Megatrends v: F8 date: 10/27/2017
CPU:
  Info: quad core model: Intel Core i7-6700 bits: 64 type: MT MCP cache: L2: 1024 KiB
  Speed (MHz): avg: 3818 min/max: 800/4000 cores: 1: 3826 2: 3756 3: 3849 4: 3885 5: 3877 6: 3756 7: 3793 8: 3802
Graphics:
  Device-1: NVIDIA GP107 [GeForce GTX 1050] driver: nvidia v: 535.104.05
  Display: x11 server: X.Org v: 1.20.14 with: Xwayland v: 22.1.9 driver: X: loaded: nvidia unloaded: fbdev,modesetting,nouveau,vesa gpu: nvidia,nvidia-nvswitch resolution: 1920x1080
  API: OpenGL v: 4.6.0 NVIDIA 535.104.05 renderer: NVIDIA GeForce GTX 1050/PCIe/SSE2
Audio:
  Device-1: Intel 200 Series PCH HD Audio driver: snd_hda_intel
  Device-2: NVIDIA GP107GL High Definition Audio driver: snd_hda_intel
  Device-3: Trust GXT 258 Microphone driver: hid-generic,snd-usb-audio,usbhid type: USB
  API: ALSA v: k6.4.14-200.fc38.x86_64 status: kernel-api
  Server-1: PipeWire v: 0.3.79 status: active
Network:
  Device-1: Intel Ethernet I219-V driver: e1000e
  IF: enp0s31f6 state: up speed: 1000 Mbps duplex: full mac: 1c:1b:0d:97:f8:f7
Bluetooth:
  Device-1: Cambridge Silicon Radio Bluetooth Dongle (HCI mode) driver: btusb type: USB
  Report: btmgmt ID: hci0 state: up address: 33:03:30:09:94:67 bt-v: 4.0
Drives:
  Local Storage: total: 2.69 TiB used: 1.3 TiB (48.2%)
  ID-1: /dev/sda vendor: Samsung model: SSD 870 QVO 1TB size: 931.51 GiB
  ID-2: /dev/sdb vendor: Western Digital model: WD1003FZEX-00K3CA0 size: 931.51 GiB
  ID-3: /dev/sdc vendor: SanDisk model: EMTEC X150 960GB size: 894.25 GiB
Partition:
  ID-1: / size: 143.62 GiB used: 25.3 GiB (17.6%) fs: ext4 dev: /dev/dm-0
  ID-2: /boot size: 877.5 MiB used: 288.7 MiB (32.9%) fs: ext4 dev: /dev/sda6
  ID-3: /boot/efi size: 99.8 MiB used: 17.3 MiB (17.4%) fs: vfat dev: /dev/sda2
  ID-4: /home size: 190.87 GiB used: 72.41 GiB (37.9%) fs: ext4 dev: /dev/dm-1
Swap:
  ID-1: swap-1 type: zram size: 8 GiB used: 512 KiB (0.0%) dev: /dev/zram0
  ID-2: swap-2 type: partition size: 4.66 GiB used: 0 KiB (0.0%) dev: /dev/sda8
Sensors:
  System Temperatures: cpu: 50.0 C mobo: N/A gpu: nvidia temp: 38 C
  Fan Speeds (rpm): N/A gpu: nvidia fan: 35%
Info:
  Processes: 359 Uptime: 44m Memory: total: 16 GiB available: 15.56 GiB used: 5.32 GiB (34.2%)
  Shell: Zsh inxi: 3.3.29
```

Description

Behaviour

Related to #6811 but with "extra steps" – similarly, Steam cannot shut down properly. When exiting the app, I would expect a message like [2023-09-12 04:32:44] Shutdown, but instead I get something like this:

src/common/pipes.cpp (885) : stalled cross-thread pipe.
09/11 23:34:50 Init: Installing breakpad exception handler for appid(steam)/version(1690583737)/tid(57003)
assert_20230911233450_36.dmp[58317]: Uploading dump (out-of-process)
/tmp/dumps/assert_20230911233450_36.dmp
src/clientdll/steamclient.cpp (901) : bufRet.TellPut() == sizeof(uint8)
assert_20230911233450_36.dmp[58317]: Finished uploading minidump (out-of-process): success = yes
assert_20230911233450_36.dmp[58317]: response: CrashID=bp-249e29b9-37bd-4fcb-bc6a-141592230911
assert_20230911233450_36.dmp[58317]: file ''/tmp/dumps/assert_20230911233450_36.dmp'', upload yes: ''CrashID=bp-249e29b9-37bd-4fcb-bc6a-141592230911''
Thread "CJobMgr::m_WorkThreadPool:0" (ID 57111) failed to shut down

Afterwards, the last line about a WorkThreadPool failing to shut down is repeated and Steam never really quits – it has to be killed manually with killall -9 steam. I do not observe a higher CPU usage, or at least I haven’t noticed. But not being able to shut down Steam would be the least of all the problems…
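To check a saved terminal log for this hang, the repeated line can be matched mechanically. The following is only a minimal Python sketch: the function name `stuck_threads` and the regular expression are illustrative assumptions derived from the single sample line above, not anything Steam provides.

```python
import re

# Pattern for the "failed to shut down" line quoted above.
# The exact format is an assumption based on that one sample.
_STUCK = re.compile(
    r'Thread "(?P<name>[^"]+)" \(ID (?P<tid>\d+)\) failed to shut down'
)


def stuck_threads(log_text):
    """Map thread ID -> (thread name, number of times it was reported)."""
    found = {}
    for match in _STUCK.finditer(log_text):
        tid = int(match.group("tid"))
        name, count = found.get(tid, (match.group("name"), 0))
        found[tid] = (name, count + 1)
    return found


# The line from the log above, repeated as it is in the terminal.
sample = 'Thread "CJobMgr::m_WorkThreadPool:0" (ID 57111) failed to shut down\n' * 3
print(stuck_threads(sample))
```

A non-empty result means Steam reported at least one worker thread that never exited; a count above one matches the repetition described here.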

Worst of all, most games cannot start anymore – pressing "Play" either stalls on "Launching" or switches to "Running" with no window opening (Proton runs in the background, but does nothing of note). Games affected by this problem cannot be uninstalled or moved, and their integrity check is broken (it never ends and never updates).

The only way out of this state for me was to wipe Steam’s folders ~/.steam and ~/.local/share/Steam (reinstalling without wiping these folders did not work). The clean install worked properly until it somehow broke again. (ADDENDUM: Games were installed in the default directory.)
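A less destructive variant of that reset is to rename the two folders instead of deleting them, so they can be restored or inspected later. This is a sketch of a swapped-in technique, not what was actually done here: the helper `move_aside` and the `.bak-` suffix are hypothetical, and it assumes Steam has already been stopped.

```python
import shutil
import time
from pathlib import Path


def move_aside(paths, suffix=None):
    """Rename each existing path to '<name>.bak-<suffix>' and return (old, new) pairs."""
    suffix = suffix or time.strftime("%Y%m%d-%H%M%S")
    moved = []
    for path in paths:
        path = Path(path).expanduser()
        if not path.exists() and not path.is_symlink():
            continue  # nothing to move for this entry
        target = path.with_name(f"{path.name}.bak-{suffix}")
        shutil.move(str(path), str(target))
        moved.append((path, target))
    return moved


# Example call (not executed here); the folders are the ones named above:
# move_aside(["~/.steam", "~/.local/share/Steam"])
```

Renaming keeps the old state on disk at the cost of space, which matters when games live in the default library inside ~/.local/share/Steam.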

Cause

It’s weird. My initial guess was a permission issue or something similar, but no logs pointed to it, and no adjustments, permission settings or even plainly removing all safety features of my PC did anything to resolve the issue. The issue was not spontaneous, though: a failed launch attempt (where the game did not start and then crash, but stalled on "Launching" or switched to "Running" without a window) told me that this Steam install was now "corrupt."

I did some experimenting, and it seems that the issue was caused by prefix creation for seemingly specific games on random Proton versions. It may not be the root cause, but that’s what I saw while switching Proton versions and watching the logs. Flower worked on 5.0-10 and 4.11-13, crashed but didn’t "corrupt" on 4.11-13, but versions 7.0-6, 6.3-8 and 5.13-6 caused Steam to get "corrupted". Proton 8.0-3 was listed as "worked," but exactly as I was writing this, I installed that version and it broke.

Why this game in particular? Because it was small, and I had to reinstall everything a lot (and I have a VERY sluggish internet connection). As another example, TES3: Morrowind had no issue with Proton 5.13-6, which Flower always "rejected". I tried several Proton versions with Morrowind, and even though some of them crashed, none "corrupted". On the other hand, Freefall Tournament, an ancient Flash game, "corrupted" Steam on Proton 8.0-3.

This problem also seemed to occur not while the game was running, but as early as prefix creation. Starting a game download and setting the Proton version to one I knew would cause "the corruption" would corrupt Steam during the "Finalising" stage, with no need to run the game itself. Finalisation would not finish on its own either – pressing "Pause" would make it seem like it was complete, but of course, the game would not start.

The "corruption" happened around the time this was logged to the terminal:

Proton: Upgrading prefix from None to 8.0-104 (/home/onegen/.local/share/Steam/steamapps/compatdata/966330/)
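When comparing many attempts, the interesting fields of this log line can be pulled out automatically. The sketch below is an illustration only: `parse_prefix_upgrade` is a hypothetical helper, the regular expression is inferred from this one line, and treating the last directory of the compatdata path as the game's app ID is an assumption.

```python
import re
from pathlib import PurePosixPath

# Format inferred from the single "Upgrading prefix" line above.
_UPGRADE = re.compile(
    r"Proton: Upgrading prefix from (?P<old>\S+) to (?P<new>\S+) \((?P<path>[^)]+)\)"
)


def parse_prefix_upgrade(line):
    """Return the prefix-upgrade details found in a log line, or None if absent."""
    match = _UPGRADE.search(line)
    if match is None:
        return None
    prefix_dir = PurePosixPath(match.group("path"))
    old = match.group("old")
    return {
        # Assumption: compatdata/<number>/ is named after the game's app ID.
        "app_id": prefix_dir.name,
        # "None" in the log means no prefix existed yet.
        "old_version": None if old == "None" else old,
        "new_version": match.group("new"),
        "prefix_dir": str(prefix_dir),
    }


line = (
    "Proton: Upgrading prefix from None to 8.0-104 "
    "(/home/onegen/.local/share/Steam/steamapps/compatdata/966330/)"
)
print(parse_prefix_upgrade(line))
```

An `old_version` of `None` marks a fresh prefix creation, which is the situation in which the "corruption" was observed.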

Potentially unrelated, but I noticed that this line gets printed during the prefix creation when it gets corrupted (unconfirmed, would need more testing to be sure): DISPROVEN

wine: RLIMIT_NICE is <= 20, unable to use setpriority safely

Ghostrunner Demo was the first game I noticed this error on, running on the default Proton Experimental. Due to its size, I did not test it further. I usually play the same games over and over, and only seldom, so this error could have started at any time. I remember that Warframe worked on Experimental (I would’ve noticed very soon if it had caused the problem).

I wish I had more to go on (specific games on specific versions, types of games etc.). I even wrote down which version failed on each try, but there just was no game-to-version consistency… :/
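The lack of consistency can be made concrete by tabulating the attempts. This is a small Python sketch using only results stated in this report; the outcome labels and the helper `versions_with_mixed_outcomes` are illustrative, and the entries marked in comments involve a judgment call about how to read the report.

```python
from collections import defaultdict

# (game, Proton version, outcome) as described in this report.
# Flower on 4.11-13 is omitted: the report lists it under two different outcomes.
RESULTS = [
    ("Flower", "5.0-10", "worked"),
    ("Flower", "7.0-6", "corrupted"),
    ("Flower", "6.3-8", "corrupted"),
    ("Flower", "5.13-6", "corrupted"),
    ("Flower", "8.0-3", "corrupted"),  # "it broke" is read here as corrupted
    ("TES3: Morrowind", "5.13-6", "worked"),
    ("Freefall Tournament", "8.0-3", "corrupted"),
]


def versions_with_mixed_outcomes(results):
    """Return {version: {game: outcome}} for versions whose outcome differs by game."""
    by_version = defaultdict(dict)
    for game, version, outcome in results:
        by_version[version][game] = outcome
    return {
        version: games
        for version, games in by_version.items()
        if len(set(games.values())) > 1
    }


print(versions_with_mixed_outcomes(RESULTS))
```

With these entries, only 5.13-6 shows up as mixed: it worked for Morrowind but "corrupted" with Flower, so the Proton version alone does not predict the failure.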

Steps for reproducing this issue

  1. Install a non-native game on Steam for Linux.
  2. Select some Proton version and await prefix creation.
  3. If not "corrupted", go to 2. (different version) or 1. (different game)

What I tried so far

kisak-valve commented 1 year ago

Hello @onegentig, you've described the issue being tracked at https://github.com/ValveSoftware/Proton/issues/6859 and Steam waiting on the first run setup worker task to complete.

Closing in favor of the older issue report.

4nexus5 commented 7 months ago

Any new updates or solutions? I am on an AMD GPU and the issue still persists. None of the solutions I found online worked for me (disabling shader pre-caching, reinstalling, downloading additional drivers etc.). I am on Arch Linux.