Closed teSill closed 11 months ago
Does the exception happen server side or client side?
Generally a "broken pipe" refers to a service that uses named pipes/handles, such as your database client.
Where does this exception get caught in your code? Does it trace back to some certain code within this library? Can you produce a small snippet of code that will reliably reproduce the issue?
Cheers, Joel
It will be very important to confirm that this exception actually comes from this library, and not from some other part of your code, such as a database client.
@jchristn Thank you for the quick reply!
It happens server side. Here's the stacktrace it gave earlier (it was from an earlier version of WatsonTcp, hopefully makes no big difference):
Unable to write data to the transport connection: Broken pipe.:
   at System.Net.Sockets.NetworkStream.Write(Byte[] buffer, Int32 offset, Int32 size)
   at WatsonTcp.WatsonTcpServer.SendHeaders(ClientMetadata client, WatsonMessage msg)
   at WatsonTcp.WatsonTcpServer.SendInternal(ClientMetadata client, WatsonMessage msg, Int64 contentLength, Stream stream)
It gets picked up by the ExceptionEncountered event in WatsonTcp. I'm using .NET drivers for MongoDB and PlayFab's C# SDK, so I guess those could be the other possible culprits here. However, I'm not able to catch any exceptions from the code that's in charge of talking with those services.
I haven't been able to reproduce the issue at all which makes this a bit difficult. My WatsonTCP server acts as the backend for a multiplayer game and there's usually a couple of hundred connected clients online. The issue starts arising usually a few hours into the session but sometimes longer. I haven't been able to see any patterns in it, it happens seemingly randomly.
Thanks, this is definitely coming from this library. This is the first time I've seen "broken pipe" which is usually used for named services. Does it happen only under heavy load?
It definitely seems to happen more often with more players in the game, but I'm not certain if it's due to heavier load the server is under or if it just has higher odds of happening with more players going in and out of the game.
It did happen more often when there was a lot of network traffic in a short amount of time. To lay out the full scenario, the server was running hourly events that players could participate in, always at xx:00. This meant that hundreds of people were joining the events at once and a lot of network messages were being sent back and forth in a very short period of time. The broken pipe exception came up pretty often during those, but it didn't always lead to a "crash" like mentioned in the OP. I've since disabled those events and the broken pipe exception comes up less often, but it seems to eventually still creep up.
Is this running on a Windows or Linux machine? Which runtime version are you using?
.NET Core 3.1, it's running on an Ubuntu droplet on Digital Ocean.
I'm still not sure what the root cause is, but when it happens it seems to basically clog the client connection attempts. The server takes a really long time to process the connections and sometimes it's able to process them all at once after a sizeable delay, but usually not. The only exception that gets consistently printed out during this time is the broken pipe one.
Disconnects and other messaging between clients and the server work fine even in that state though, so already connected users aren't impacted at all and their disconnections are properly handled. Edit: apparently that doesn't apply to everyone. Some users aren't able to communicate with the server when that happens, but most seem to be.
In the past 3 or so hours it has happened a few times but the server has been able to recover on its own and let connections through. This is what seems to happen:
Broken pipe exceptions before this happens, but I can't say for sure if that's directly what's causing the server to act this way.
Broken pipe exception gets printed out. A lot.
This is one of those problems that is going to be impossible to diagnose at runtime. We really need a simple set of code that reproduces the issue. I don't know how to proceed otherwise.
It's a tough one. I didn't run into the problem at all when playtesting the game for weeks when CCU was less than 100, but now in the 300-600 CCU range it's happening frequently. Or if the problem did show up, the server always managed to recover on its own unlike now. I have no idea how to reproduce it.
I could share with you the repository if you'd like to have a closer look. I don't really know how to take smaller pieces out of it as I'm not sure what might or might not be relevant. But I realize that's asking for a lot from your end!
First thing I would check is the CPU utilization, memory utilization, number of processes (and thread count). Anything that wraps TCP is going to be vulnerable to system/load conditions affecting their operation. In your case you could try scaling up the hardware to see if the errors don't occur as quickly (or if you're able to handle a larger number of connections as a result).
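As a rough illustration of the checks above, here is a minimal sketch that logs CPU time, memory, and thread count from inside the server process. It uses only `System.Diagnostics`; the class name `ResourceSnapshot`, the output format, and the one-second interval are hypothetical choices, not part of WatsonTcp.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class ResourceSnapshot
{
    // Formats one line of process-level metrics for the current process.
    public static string Capture()
    {
        using Process proc = Process.GetCurrentProcess();
        proc.Refresh();                                        // re-read cached values
        long workingSetMb = proc.WorkingSet64 / (1024 * 1024); // resident memory in MB
        int threads = proc.Threads.Count;                      // OS threads in the process
        double cpuSeconds = proc.TotalProcessorTime.TotalSeconds;
        return $"cpu={cpuSeconds:F1}s mem={workingSetMb}MB threads={threads}";
    }

    public static void Main()
    {
        // Hypothetical: three samples, one second apart. A real server would
        // log this on a timer alongside its exception log.
        for (int i = 0; i < 3; i++)
        {
            Console.WriteLine($"{DateTime.UtcNow:O} {Capture()}");
            Thread.Sleep(1000);
        }
    }
}
```

Lining these samples up against the timestamps of the broken pipe exceptions should show whether the errors coincide with a spike in threads or memory.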
If you can produce code that simulates the load, I'd be happy to take a look at that specific code.
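A load simulation along these lines might start from something like the sketch below. It assumes plain `TcpClient` connections that arrive in a burst and drop right after a small write; the class name `BurstLoad`, the 64-byte payload, and the throwaway loopback listener are all hypothetical. Against a real WatsonTcp server the raw payload would need to be replaced with messages sent through the WatsonTcp client, since the bytes here are not valid framing.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

public static class BurstLoad
{
    // Opens clientCount TCP connections at once, writes a small payload on each,
    // then disposes the client. Returns how many clients finished without
    // a socket or I/O exception.
    public static async Task<int> RunAsync(string host, int port, int clientCount)
    {
        var tasks = Enumerable.Range(0, clientCount).Select(async _ =>
        {
            try
            {
                using var client = new TcpClient();
                await client.ConnectAsync(host, port);
                byte[] payload = new byte[64]; // placeholder bytes, not a real message
                await client.GetStream().WriteAsync(payload, 0, payload.Length);
                return true;
            }
            catch (Exception ex) when (ex is SocketException || ex is IOException)
            {
                return false;
            }
        });
        bool[] results = await Task.WhenAll(tasks);
        return results.Count(ok => ok);
    }

    public static async Task Main()
    {
        // Stand-in server for the sketch: accepts connections and drops them.
        var listener = new TcpListener(IPAddress.Loopback, 0);
        listener.Start();
        int port = ((IPEndPoint)listener.LocalEndpoint).Port;
        _ = Task.Run(async () =>
        {
            while (true)
            {
                TcpClient accepted = await listener.AcceptTcpClientAsync();
                accepted.Dispose();
            }
        });

        int completed = await RunAsync("127.0.0.1", port, 50);
        Console.WriteLine($"{completed}/50 clients completed");
        listener.Stop();
    }
}
```

Raising the client count toward the 300-600 range mentioned above, and running the burst repeatedly, would be the obvious next step.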
Restructured some things and started using Server.SendAsync instead of Server.Send and the issues appear a bit clearer now. It seems that there are 2 exceptions that lead to the "Broken pipe" and "Cannot access a disposed object" spam. More commonly it seems to be the "Connection timed out" exception and more rarely "No route to host". I haven't enabled Keepalives by the way. After running into one of them, the server logs the "Cannot access a disposed object" exception hundreds of times. Now using the asynchronous methods it doesn't clog up the entire traffic anymore and the problem doesn't appear to cause as many issues (if any).
Still trying to come up with a way to reproduce the issue outside of production, will post the code here when/if I succeed.
Thanks for the update @teSill
Hi @teSill closing this, since the next release will guide everyone into the async APIs. Please re-open if you disagree.
Created a discussion about this earlier as I figured it was something common enough that I've probably messed up on my end, but the issue has been bugging me for weeks now and I haven't been able to find the cause.
After observing it some more, at some point during the session there will appear an "Unable to write data to the transport connection: Broken pipe." exception. I don't know what exactly happens client-side for this exception to trigger, but when it happens it seems to massively snowball until the server is unable to handle connections and disconnections and eventually becomes unresponsive.
I'd love to hear any ideas about tackling this, and please let me know if there's any relevant connection related code I can post that would make sense to look into. It's becoming a massive problem in my game and requiring constant attention from me.
Thanks!