ValveSoftware / Proton

Compatibility tool for Steam Play based on Wine and additional components

Game launchers hang, show connection timeouts or render incorrectly #5882

Closed kakra closed 2 years ago

kakra commented 2 years ago

Proton version: Experimental and any 7.x I tried
Hardware: i7-12700K, NVIDIA GTX 1660 Ti 6GB, driver series 515 (and before)
OS: Gentoo Linux with custom-patched kernel (adding fsync support), 5.15.44
Desktop: KDE Plasma 5.25 with Kwin/X11
Monitors: 3840x2160, 3840x2160, 1920x1080

I'm seeing a lot of problems with different launchers that I cannot attribute to fsync/esync, although they look similar. After chatting with @ivyl I tried a few different things to narrow the problem down. This does not seem to be specific to one launcher or one game but rather something within Proton or the graphics stack, so I'm just linking the game-specific issues here and creating a new report.

Affected games:

Observation:

Uplay Connect will usually launch into a busy loop without showing any window. The CPU fans speed up to a clearly audible level. This situation never times out, and running with fsync/esync disabled does not solve it. The game needs to be force-closed from within Steam because no window is shown. Perf screenshots: image image

Rockstar Launcher will show a window, but will just sit there playing the nice "R-STAR" animation (from the first small window) until it eventually times out (usually after two or more minutes). CPU fans will speed up. After the error message, the Rockstar launcher cannot be closed from the system tray and has to be force-closed via top or via Steam. It's probably busy-looping, too.

Elite Launcher for Elite Dangerous Odyssey will actually launch and can also run the game, but the launcher won't render correctly: window controls (close, minimize), the launch button and the game edition selector won't render. The launcher web contents render just fine. This was already mentioned in https://github.com/ValveSoftware/Proton/issues/150#issuecomment-1052565192 where @redmcg suggested it could be a multi-GPU issue, but I run only a single GPU. The game itself is not affected and runs fine (at least by the standards of its current performance). The launcher does quit correctly when (blindly) clicking the close button.

image

TESO Launcher will usually launch but may hang for multiple minutes before it starts checking / downloading updates. This may lead to a timeout situation but if it manages to start the connection, it reliably works and starts the game. So this may be a different problem.

Steps to analyze:

Running a vanilla kernel doesn't fix the issues. Running with fewer CPU cores doesn't fix them either, nor does running with or without "simulate sched quantum". I'm running with some ClearLinux glibc patches, but running vanilla glibc does not fix the issues either.

Running with disabled e-cores (12th gen has asymmetric core architecture) doesn't fix the issues.

By chance, in 1 of 100 tries, all launchers will work flawlessly for some hours. This doesn't need a reboot. It just works and affects all launchers at the same time. In the same manner, the launchers will stop working all at the same time again for no obvious reason, completely out of the blue, no reboot required.

Running the games within gamescope will 100% fix all launchers.

Running the games with wine virtual desktop enabled will 100% fix all launchers. (protontricks APPID vd=3840x2160)

That gave me an interesting clue: I disabled all but one of my three monitors in nvidia-settings, and the launchers now work 100% of the time. Enabling at least one additional monitor makes the launchers fail again. This is 100% reproducible: disable all but one monitor = launchers work, enable more than one monitor = launchers won't work.

Using gamescope or wine virtual desktop exposes the launchers to just one monitor, so it seems that's basically the same as running native X11 with just one monitor. But this interferes with fullscreen mode (seems it overrides borderless mode) and captures the mouse into the game screen (and at least for Elite Dangerous, I need to be able to move it out of the game).

As vsync on nvidia seems broken, I tried a few other settings: changing kwin tearing modes, force-enabling vsync, using forced composition mode, disabling/enabling kwin composition, enabling/disabling framebuffer flipping. Nothing helped. This seems to be purely a multi-monitor issue. Launchers using Chrome components generally seem to be affected, but others may be, too.

Maybe related:

Rarely, launchers would flicker the screens when initializing graphics and then simply crash. It seems I can no longer reproduce this with current Proton versions. Usually, it does not flicker and then there's no crash. This may be related to enabling flipping mode in nvidia-settings and may be a driver issue.

History:

I'm pretty sure this worked fine before Proton 5.x. During 5.x it probably happened sometimes and looked like it might be related to fsync/esync, but in retrospect I'm not so sure. Since Proton 7, it permanently doesn't work (except for 1 of 100 tries). Other changes to the system seem not to affect the issue, although I cannot completely exclude the NVIDIA drivers here, as vsync has also been broken at least since I started using multiple monitors (but it is not fixed now when disabling all other monitors).

kakra commented 2 years ago

perf record -F max -g -- %command% for Assassin's Creed Origins running with the Proton Experimental debug build: perf.data.gz

perf report -g --stdio: perf.data.txt.gz

perf

(as recommended by @ivyl)

kakra commented 2 years ago

Update:

This seems to happen when at least two screens are configured in clone mode, read: they occupy the same desktop coordinates.

If I put all my three screens side-by-side, the launchers work. If I put two of the screens in clone mode, the launchers do not work.

ivyl commented 2 years ago

> This seems to happen when at least two screens are configured in clone mode, read: they occupy the same desktop coordinates.

Thank you for the report and narrowing it down to having mirrored display. I was able to get a repro. I'm looking into this :-)

kakra commented 2 years ago

> By chance, in 1 of 100 tries, all launchers will work flawlessly for some hours. This doesn't need a reboot. It just works and affects all launchers at the same time. In the same manner, the launchers will stop working all at the same time again for no obvious reason, completely out of the blue, no reboot required.

I found that my TV may go into some sort of deep sleep (just guessing) which completely disconnects it from the HDMI port (it's the port cloned to my 4k main monitor; I'm using this setup for easy migration to couch gaming). Every now and then, it may "wake up" (maybe checking for updates or something) and the HDMI output will reappear. This could explain why in 1 of 100 tries it worked for a few hours. I never checked for the presence of that device, but while analyzing this behavior recently, I left the nvidia-settings panel open and discovered that sometimes the HDMI port was simply gone. It probably also explains why the displays would sometimes flicker black for a brief second: probably the port just reconnected or disconnected due to those "sleep phases"?

@ivyl Can you repro this on both your AMD and NVIDIA setups?

ivyl commented 2 years ago

Yes. I've had Elite crashing on AMD, but that turned out to be somewhere in the Nvidia drivers. In /usr/lib/libEGL_nvidia.so.515.48.07 to be exact. After uninstalling all things Nvidia I now experience the same thing where the launcher doesn't fully render. I'm looking into this.

kakra commented 2 years ago

If you want me to try some patches, let me know.

redmcg commented 2 years ago

@kakra Just wondering if you've tried running: https://gist.github.com/redmcg/5b11d2d18ff29da5d9d4886b2e1699a1

I used it to diagnose a similar issue I had (the one you referenced above): https://github.com/doitsujin/dxvk/issues/1459

The issue was if D3D9 returned more adapters than EnumDisplayDevices, DotNet assumed a "mode change" and attempted initialisation again (getting stuck in a loop and never rendering part of the launcher).

I can see the same code that did this in DotNet is now in wine-mono: https://github.com/madewokherd/wpf/blob/main/src/Microsoft.DotNet.Wpf/src/WpfGfx/core/common/display.cpp#L1629

Could be the D3D9 and EnumDisplayDevices implementations handle the reporting of mirrored devices differently. The output from the Gist above will confirm.
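The failure mode described above can be sketched in a few lines. This is a simplified Python model, not the WPF or wine-mono code: `needs_reinit`, the device names and the input lists are hypothetical, and the only rule modeled is the one stated above (more D3D9 adapters than attached display devices is treated as a "mode change").

```python
def needs_reinit(d3d9_adapters, display_devices):
    """Return True when D3D9 reports more adapters than there are
    display devices attached to the desktop (the 'mode change' case)."""
    attached = [name for name, is_attached in display_devices if is_attached]
    return len(d3d9_adapters) > len(attached)


# Consistent view: two adapters, two attached devices, one detached.
consistent = needs_reinit(
    ["DISPLAY-A", "DISPLAY-C"],
    [("DISPLAY-A", True), ("DISPLAY-B", False), ("DISPLAY-C", True)],
)

# Hypothetical inconsistent view: D3D9 reports one adapter more
# than the device enumeration considers attached.
inconsistent = needs_reinit(
    ["DISPLAY-A", "DISPLAY-B"],
    [("DISPLAY-A", True)],
)

print(consistent, inconsistent)
```

If a launcher re-ran its initialisation every time such a check returned True, and the two enumerations never agreed, it would never finish initialising, which would match the symptom of a partly rendered window.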

kakra commented 2 years ago

> Just wondering if you've tried running: https://gist.github.com/redmcg/5b11d2d18ff29da5d9d4886b2e1699a1

@redmcg How do I compile and properly run it? Maybe you can provide a pre-compiled binary for convenience which I could run in a proton prefix with protontricks?

> getting stuck in a loop and never rendering part of the launcher

I don't think the ED launcher loops. There's virtually no CPU usage when it is open. But there is with other launchers (busy looping). Usually, they appear to get stuck in the main event loop while waiting for network replies, so it may look like a network issue, but they are probably waiting on another event that never happens (or got lost) in cloned multi-monitor setups. This may be related to your findings, but the ED launcher in particular does not seem to loop here: no CPU usage, and the GUI is still responsive (I can click the invisible buttons and it actually does something, while the other launchers cannot even quit).

redmcg commented 2 years ago

I've thrown a binary up here (GitHub wouldn't let me attach it to the Gist): https://github.com/redmcg/wine-mono/releases/tag/wine-mono-0.0.2

But you should be able to compile with: x86_64-w64-mingw32-g++ d3d9Test.cpp

which should produce the file: a.exe

Which you can then run with Wine (or, as you suggested, in a proton prefix with protontricks).

kakra commented 2 years ago

Thanks.

With one monitor disabled in nvidia-settings and no monitors cloned, it says:

00e0:err:ntoskrnl:ZwLoadDriver failed to create driver L"\\Registry\\Machine\\System\\CurrentControlSet\\Services\\WineUsd": c0000142
003c:fixme:service:scmdatabase_autostart_services Auto-start service L"WineUsd" failed to start: 1114
0110:fixme:ntdll:NtQuerySystemInformation info_class SYSTEM_PERFORMANCE_INFORMATION

D3D9 says you have 2 adapter(s):
 - \\.\DISPLAY1 (NVIDIA GeForce GTX 1660 Ti)
 - \\.\DISPLAY3 (NVIDIA GeForce GTX 1660 Ti)

EnumDisplayDevices:
- 0: \\.\DISPLAY1 (OK)
- 1: \\.\DISPLAY2 (not attached to desktop)
- 2: \\.\DISPLAY3 (OK)

EnumDisplayDevices says 2 device(s) are OK

DotNet should be happy

With all three monitors enabled, and two of them in clone mode, it says:

0110:fixme:ntdll:NtQuerySystemInformation info_class SYSTEM_PERFORMANCE_INFORMATION

D3D9 says you have 2 adapter(s):
 - \\.\DISPLAY1 (NVIDIA GeForce GTX 1660 Ti)
 - \\.\DISPLAY2 (NVIDIA GeForce GTX 1660 Ti)

EnumDisplayDevices:
- 0: \\.\DISPLAY1 (OK)
- 1: \\.\DISPLAY2 (OK)

EnumDisplayDevices says 2 device(s) are OK

DotNet should be happy

This looks a bit counter-intuitive: With two monitors enabled and one disabled, it sees all three monitors and one as not attached. With three monitors enabled (two of them mirrored) and none disabled, it sees only two monitors. The first case works and doesn't show the rendering problem.

There's also the difference with the wine fixmes.

redmcg commented 2 years ago

I've managed to recreate this issue locally too. I added debug output to NtUserEnumDisplayMonitors and found that the callback made into wine-mono is returning error code 0x88980006 (WGXERR_DISPLAYSTATEINVALID).

I believe that's coming from here.

I can see the callback is called twice when I mirror my displays, but only once when I do not. The error is only returned the second time, so I think the first call sets m_hMonitor and the second call then takes the path containing the following check: m_rcBounds.find(dpiContextValue) != m_rcBounds.end()

That check evidently evaluates to true, hence the WGXERR_DISPLAYSTATEINVALID error.

So I guess we need to understand if NtUserEnumDisplayMonitors should be making two calls to the callback, and if so, then why m_rcBounds.find(dpiContextValue) != m_rcBounds.end().

ivyl commented 2 years ago

I've been looking into this as well, just got a bit busy with 7.0-3 and experimental releases.

The bug is that EnumDisplayMonitors() claims that there are two perfectly overlapping monitors (same exact RECT). On Windows you get only one monitor in case of cloning. I'm looking into fixing that.
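To illustrate the behaviour described above (one monitor reported per cloned group), here is a minimal Python sketch. It is not the actual Wine change: `Monitor`, `collapse_clones` and the sample values are hypothetical, and the sketch assumes that "cloned" simply means "identical rectangle".

```python
from typing import NamedTuple


class Monitor(NamedTuple):
    handle: int
    name: str
    rect: tuple  # (left, top, right, bottom) in desktop coordinates


def collapse_clones(monitors):
    """Keep only the first monitor for each distinct rectangle."""
    seen = set()
    unique = []
    for mon in monitors:
        if mon.rect in seen:
            continue  # same RECT as a monitor already kept: treat as a clone
        seen.add(mon.rect)
        unique.append(mon)
    return unique


# Illustrative input: two mirrored outputs covering the same area.
reported = [
    Monitor(1, "DISPLAY-A", (0, 0, 1920, 1080)),
    Monitor(2, "DISPLAY-A", (0, 0, 1920, 1080)),
]

for mon in collapse_clones(reported):
    print(mon.handle, mon.name, mon.rect)
```

With the mirrored pair above, only the first entry survives, so a caller enumerating monitors would see a single one. Monitors placed side by side have different rectangles and are left untouched.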

redmcg commented 2 years ago

@ivyl Cool, thanks for sharing that. I was about to say I'd concluded the same thing.

I just updated my gist here: https://gist.github.com/redmcg/5b11d2d18ff29da5d9d4886b2e1699a1

to add the output from EnumDisplayMonitors. I ran it in both Windows and Linux and found that it is doing exactly as you say:

Here's the output on Windows (one Monitor listed):

D3D9 says you have 1 adapter(s):
 - \\.\DISPLAY1 (NVIDIA GeForce GTX 1050 Ti) 0000000000010001

EnumDisplayDevices:
- 0: \\.\DISPLAY1 (OK)
- 1: \\.\DISPLAY2 (not attached to desktop)
- 2: \\.\DISPLAY3 (not attached to desktop)
- 3: \\.\DISPLAY4 (not attached to desktop)
- 4: \\.\DISPLAY5 (not attached to desktop)
- 5: \\.\DISPLAY6 (not attached to desktop)
- 6: \\.\DISPLAY7 (not attached to desktop)

EnumDisplayDevices says 1 device(s) are OK

DotNet should be happy

EnumDisplayMonitors:
0000000000010001: \\.\DISPLAY1
0, 0, 1920, 1080

And Wine (two Monitors listed, with the same display name):


D3D9 says you have 1 adapter(s):
 - \\.\DISPLAY1 (NVIDIA GeForce GTX 1050 Ti) 0000000000000001

EnumDisplayDevices:
- 0: \\.\DISPLAY1 (OK)

EnumDisplayDevices says 1 device(s) are OK

DotNet should be happy

EnumDisplayMonitors:
0000000000000001: \\.\DISPLAY1
0, 0, 1920, 1080
0000000000000002: \\.\DISPLAY1
0, 0, 1920, 1080

And I looked into what m_rcBounds.find(dpiContextValue) != m_rcBounds.end() meant; m_rcBounds is a std::map. So the if condition validates that there is no existing entry (but there is; the first callback adds it because the display name is the same).
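The double-callback problem can be modeled with a small Python sketch. This is a toy stand-in, not the wine-mono code: `DisplayState`, `on_monitor` and `enumerate_monitors` are hypothetical names, a dict plays the role of the `std::map`, and keying the map by display name is an assumption based on the explanation above.

```python
S_OK = 0
WGXERR_DISPLAYSTATEINVALID = 0x88980006  # error code quoted in this thread


class DisplayState:
    """Toy stand-in for the state the monitor callback fills in."""

    def __init__(self):
        self.bounds = {}  # plays the role of the std::map m_rcBounds

    def on_monitor(self, key, rect):
        # Models `m_rcBounds.find(key) != m_rcBounds.end()`:
        # an entry that already exists is treated as an invalid state.
        if key in self.bounds:
            return WGXERR_DISPLAYSTATEINVALID
        self.bounds[key] = rect
        return S_OK


def enumerate_monitors(monitors, callback):
    """Call the callback once per reported monitor, collecting the results."""
    return [callback(key, rect) for key, rect in monitors]


rect = (0, 0, 1920, 1080)

# One monitor reported for the mirrored pair (the Windows output above).
windows_like = enumerate_monitors(
    [("DISPLAY1", rect)], DisplayState().on_monitor
)

# Two monitors reported with the same name (the Wine output above).
wine_like = enumerate_monitors(
    [("DISPLAY1", rect), ("DISPLAY1", rect)], DisplayState().on_monitor
)

print([hex(code) for code in windows_like])
print([hex(code) for code in wine_like])
```

In this model the first call always succeeds and only the second call with the same key fails, which matches the observation that the error is returned only on the second callback.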

ivyl commented 2 years ago

This is fixed both upstream and in Proton. The change in Proton will be released with the next 7.0 dash release and in the next experimental. In the meantime you can find it in bleeding-edge (Proton Experimental's beta available through Steam).

https://github.com/ValveSoftware/wine/commit/5a4a35389becdd9b0c17516888273f0ef41a5040

ivyl commented 2 years ago

this has just landed in experimental proper

kakra commented 2 years ago

Didn't get the update yet here but as soon as it lands here, I'll test and close this. Thanks a lot for working on this!

kakra commented 2 years ago

Yay! It works!