ValveSoftware / GameNetworkingSockets

Reliable & unreliable messages over UDP. Robust message fragmentation & reassembly. P2P networking / NAT traversal. Encryption.
BSD 3-Clause "New" or "Revised" License
8.25k stars 620 forks source link

SteamNetworkingSockets lock held for xxx ms. (Performance warning). #165

Closed fpauser closed 3 years ago

fpauser commented 3 years ago

Recently reports are popping up about spamming SteamNetworkingSockets lock held for xxx ms. (Performance warning). messages in csgo. Are you aware of any recent changes that could triggere these? What can be done to debug/solve these issues?

Related: https://github.com/ValveSoftware/csgo-osx-linux/issues/2483

felipe19930 commented 3 years ago

Also, although people have claimed this issue happens on both Windows and Linux, they were actually referring to different messages. On Windows, the more common message is:

Waited XX ms for SteamNetworkingSockets lock [ReceiveMessagesOnConnection]

On Linux, however, the message is this one:

SteamNetworkingSockets lock held for XX ms. (Performance warning).  SendMessageToConnection

And it comes from here:

https://github.com/ValveSoftware/GameNetworkingSockets/blob/db2d14b917d3624efd3c8a2402eee4de3b65704f/src/steamnetworkingsockets/clientlib/steamnetworkingsockets_lowlevel.cpp#L278

And since this one says "Performance warning" at the end, I'd image it has a greater impact than the first one. Also, a few hours ago, an user posted a video of this specific message being spammed on CS:GO's developer console, while he was playing deathmatch on Arch Linux (although this "bug" does not depend on game mode):

https://youtu.be/ot_xk1Px8J8?t=51

I understand that you guys don't work on CS:GO directly, but since you work for Valve, perhaps this issue could be investigated a little with the CS:GO team.

Thanks a lot.

zpostfacto commented 3 years ago

This can happen for a very large number of reasons. This is almost always saying that we have called into the OS and the OS went out to lunch for a while servicing the call. Another thing that can happen is that while holding the lock, our thread gets pre-empted by something else running on the system. It's just very difficult to say what the problem is without having access to the computer.

And obviously it could mean we have a bug in our crappy code. :) But when I've investigated these before, most often it is downstream of some other performance issue on the computer.

If someone is on Windows and wants to use UIforETW to capture a trace and get it to me, I could look into it. You'd need to capture a trace (see the instructions below for how to capture a good one), upload it to some web space you have access to, and then send me a link. (Because the trace might have sensitive info, email me at fletcherd at valvesoftware dot com.) Please note, I cannot be tech support for the entire CSGO community. But if you take the time to capture a trace and follow those instructions and send it to me, I will look at it.

Further reading:

https://github.com/google/UIforETW/ Use this tool to capture a trace

https://developer.valvesoftware.com/wiki/VTrace Some more instructions about how to capture a good trace. (vtrace was the first incarnation of UIforETW.) You won't upload it to us directly, you'll put it in your own web space somewhere and send me a link and details about what was happened and how long before you hit "save trace" the poor performance event occurred.

guihkx commented 3 years ago

I'm the guy who posted that CS:GO video earlier today. Just thought of adding a follow up to this issue, because I was able to find a "fix" (more of a workaround, actually) for this.

Keep in mind that I have zero experience with low-level stuff such as this library, so this is more of a guess of what the problem is, more than anything else. :)

Let me begin by saying that I have Windows 10 on dual boot (which is where I actually play competitive CS:GO), and I have put thousands of hours into the game and this SteamNetworkingSockets lock held for... message never showed up once there, but as soon as I reboot to Arch Linux, I start getting a lot of those, so my first guess is that this issue is, at the very least, more prone to happen on Linux.

After having this "lock held" message hammered in my head, it reminded me of this controversial blog post made by a Google developer, which actually made Linus Torvalds respond to it, which was testing many spinlock implementations against the default CPU scheduler of Linux and also against the CPU scheduler of Windows 10.

The post is quite lengthy, but in short, the author demonstrates how most spinlock implementations will behave very badly on Linux with its default CPU scheduler. The author tested some other schedulers as well, with the "least terrible one" being MuQSS, if I remember correctly. Then the author finishes the post by suggesting developers to use mutexes instead of spinlocks.

So, here's my second guess: This library is using a spinlock mechanism that sucks on Linux with the default CPU scheduler.

To test this, I've installed a custom Linux kernel that comes with MuQSS by default, called CK (Con Kolivas).

And, well, lo and behold:

https://www.youtube.com/watch?v=Gq69_xlWGok

Not a single message in the developer console anymore. It's like magic! 😄

The bad part is that your average Linux CS:GO player won't even know this is most likely happening with their game, because a) that message only shows up with the developer console enabled (and it's disabled by default) and b) it's not easy to install a custom kernel and then enable it in the bootloader.

Of course, to be 100% sure this is what's causing it, more people need to test it...

zpostfacto commented 3 years ago

Yeah, Linux makes it hard (impossible?) to adjust thread priorities if the process is not elevated. On Windows we make the network thread highest priority so that it won't get preempted. On linux, we just have to hope for the best..

This library uses std::recursive_timed_mutex for the lock. I am not sure what gcc's STL is doing under the hood, but if it is doing any spinning, it might be exacerbating the problem.

Also, I readily confess that I am not a super duper Linux kernel expert and if there are obvious things being done wrong, I welcome any suggestions.

One more comment is that I have a big change to locking and threading being baked right now that makes it so that Source engine networking code doesn't need to take the global lock to send and receive messages. It could have a really big impact on performance (although the debug output might continue to happen). This branch has the relevant changes:

https://github.com/ValveSoftware/GameNetworkingSockets/tree/fine-grained-locking

fpauser commented 3 years ago

By measuring performance via sudo perf top --sort comm,dso I found out that in my case a service (k3s-server aka kubernetes/container stuff) was spamming events . Stopping the service improved things a lot - no more performance warnings in the console. Nevertheless I have the feeling that cs did ran faster recently before - but this might be subjective.

zpostfacto commented 3 years ago

Tagged this with Steamworks, even though the version of the code this affects is actually CSGO.

The code in this repository has fine-grained locking, which I hope will significantly improve this problem (if indeed there really is one) on CSGO.

I'll update this issue and ask for a retest when CSGO has been updated with the same fixes.

zpostfacto commented 3 years ago

Today's update has a new version of the lib that implements the fine-grained locking. Let me know if it improves anything.

guihkx commented 3 years ago

I noticed this update modified libsteamnetworkingsockets.so so I immediately rebooted onto the default Arch Linux kernel to give it a try! :D

I've played only a couple of deathmatches, but so far I haven't had a single SteamNetworkingSockets lock held for XX ms. (Performance warning). SendMessageToConnection message in the dev console! Awesome! 🎉

I guess more people will have to test it for themselves, but at least for me, this update was a HUGE improvement (if not a definite fix)!

Just to compare with the previous video I've recorded, I made a new one (please ignore the small freeze @ 02:08, I think that was caused by OBS Studio):

https://www.youtube.com/watch?v=EHUoFSdtYyo

Thank you so much for your work @fletcherdvalve!

fpauser commented 3 years ago

Did a quick test and csgo feels really snappy again. I'd also say the fix solved this issue. Thanks man!

zpostfacto commented 3 years ago

Good enough for me

realies commented 5 months ago

@zpostfacto, I think something similar is happening when running via Whiskey/Wine on macOS: https://github.com/Whisky-App/Whisky/issues/987

Any idea if this is because of the emulation or the GameNetworkingSockets library? Wondering if there are any flags that can improve this.