dotnet / WatsonTcp

WatsonTcp is the easiest way to build TCP-based clients and servers in C#.

Unable to write data to the transport connection: Broken pipe. #249

Closed: teSill closed this issue 11 months ago

teSill commented 1 year ago

I created a discussion about this earlier because I figured it was something common enough that I'd probably just messed up on my end, but the issue has been bugging me for weeks now and I haven't been able to find the cause.

After observing it some more: at some point during a session, an Unable to write data to the transport connection: Broken pipe. exception appears. I don't know exactly what happens client-side to trigger it, but once it does, it seems to snowball massively until the server is unable to handle connections and disconnections and eventually becomes unresponsive.

I'd love to hear any ideas about tackling this, and please let me know if there's any relevant connection-related code I could post that would be worth looking into. It's becoming a massive problem in my game and requires constant attention from me.

Thanks!

jchristn commented 1 year ago

Does the exception happen server side or client side?

Generally a "broken pipe" references a service that uses named pipes/handles, such as your database client.

Where does this exception get caught in your code? Does it trace back to specific code within this library? Can you produce a small snippet of code that will reliably reproduce the issue?

Cheers, Joel

jchristn commented 1 year ago

It will be very important to confirm that this exception actually comes from this library, and not from some other part of your code, such as a database client.

teSill commented 1 year ago

@jchristn Thank you for the quick reply!

It happens server side. Here's the stack trace it gave earlier (it was from an earlier version of WatsonTcp, hopefully that makes no big difference):

    Unable to write data to the transport connection: Broken pipe.
       at System.Net.Sockets.NetworkStream.Write(Byte[] buffer, Int32 offset, Int32 size)
       at WatsonTcp.WatsonTcpServer.SendHeaders(ClientMetadata client, WatsonMessage msg)
       at WatsonTcp.WatsonTcpServer.SendInternal(ClientMetadata client, WatsonMessage msg, Int64 contentLength, Stream stream)

It gets picked up by the ExceptionEncountered event in WatsonTcp. I'm using the .NET driver for MongoDB and PlayFab's C# SDK, so I guess those could be the other possible culprits here; however, I'm not able to catch any exceptions from the code that's in charge of talking to those services.

I haven't been able to reproduce the issue at all, which makes this a bit difficult. My WatsonTcp server acts as the backend for a multiplayer game, and there are usually a couple of hundred clients connected. The issue usually starts a few hours into the session, but sometimes later. I haven't been able to see any pattern in it; it happens seemingly at random.
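For reference, a minimal sketch of wiring up the server events so that the full exception object (including its stack trace) gets logged, which makes it easier to tell WatsonTcp errors apart from MongoDB or PlayFab ones. This assumes a recent WatsonTcp-style Events API; event and property names may differ between versions, and the host/port are placeholders.

```csharp
using System;
using WatsonTcp;

class Program
{
    static void Main()
    {
        // Placeholder listener address and port.
        WatsonTcpServer server = new WatsonTcpServer("0.0.0.0", 9000);

        // WatsonTcp requires a MessageReceived handler to be set before Start().
        server.Events.MessageReceived += (s, e) =>
            Console.WriteLine("Message received");

        // Log the whole exception object: the stack trace shows whether the
        // error originated inside WatsonTcp or somewhere else entirely.
        server.Events.ExceptionEncountered += (s, e) =>
            Console.WriteLine($"[WatsonTcp] {DateTime.UtcNow:o} {e.Exception}");

        server.Start();
        Console.WriteLine("Server started; press ENTER to exit.");
        Console.ReadLine();
    }
}
```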

jchristn commented 1 year ago

Thanks, this is definitely coming from this library. This is the first time I've seen "broken pipe", which is usually used for named services. Does it happen only under heavy load?

teSill commented 1 year ago

It definitely seems to happen more often with more players in the game, but I'm not certain whether it's due to the heavier load the server is under or whether it just has higher odds of happening with more players going in and out of the game.

It did happen more often when there was a lot of network traffic in a short amount of time. To lay out the full scenario: the server was running hourly events that players could participate in, always at xx:00. This meant that hundreds of people were joining the events at once and a lot of network messages were being sent back and forth in a very short period of time. The broken pipe exception came up pretty often during those, but it didn't always lead to a "crash" as mentioned in the OP. I've since disabled those events and the broken pipe exception comes up less often, but it still seems to creep up eventually.

jchristn commented 1 year ago

Is this running on a Windows or Linux machine? Which runtime version are you using?

teSill commented 1 year ago

.NET Core 3.1; it's running on an Ubuntu droplet on Digital Ocean.

teSill commented 1 year ago

I'm still not sure what the root cause is, but when it happens it seems to basically clog up client connection attempts. The server takes a really long time to process the connections; sometimes it's able to process them all at once after a sizeable delay, but usually not. The only exception that gets consistently printed out during this time is the broken pipe one.

Disconnects and other messaging between clients and the server work fine even in that state, though, so already connected users aren't impacted at all and their disconnections are properly handled. Edit: apparently that doesn't apply to everyone; some users aren't able to communicate with the server when that happens, but most seem to be.

teSill commented 1 year ago

In the past 3 or so hours it has happened a few times but the server has been able to recover on its own and let connections through. This is what seems to happen:

  1. After some time (this varies a lot: sometimes it starts 15 minutes after a server restart, sometimes 6 hours in), players are no longer able to connect to the server. The server doesn't seem to be able to process connections at all. There sometimes seem to be some Broken pipe exceptions before this happens, but I can't say for sure whether that's directly what's causing the server to act this way.
  2. When the server enters that stage, the Broken pipe exception gets printed out. A lot.
  3. Sometimes, after 5-10 or so minutes, the server unclogs, processes all the connections that were attempted during the clog, and everything seemingly returns to normal. Other times the server doesn't unclog, and connections don't go through until I restart the entire process.

jchristn commented 1 year ago

This is one of those problems that is going to be impossible to diagnose at runtime. We really need a simple set of code that reproduces the issue. I don't know how to proceed otherwise.

teSill commented 1 year ago

It's a tough one. I didn't run into the problem at all when playtesting the game for weeks with CCU below 100, but now, in the 300-600 CCU range, it's happening frequently. Or if the problem did show up, the server always managed to recover on its own, unlike now. I have no idea how to reproduce it.

I could share with you the repository if you'd like to have a closer look. I don't really know how to take smaller pieces out of it as I'm not sure what might or might not be relevant. But I realize that's asking for a lot from your end!

jchristn commented 1 year ago

The first thing I would check is CPU utilization, memory utilization, and the number of processes (and thread count). Anything that wraps TCP is going to be vulnerable to system/load conditions affecting its operation. In your case you could try scaling up the hardware to see whether the errors take longer to occur (or whether you're able to handle a larger number of connections as a result).
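One way to capture those numbers over time is to log them from inside the server process itself. Below is a minimal sketch (ResourceLogger is a hypothetical helper, not part of WatsonTcp) that records thread count, working set, and cumulative CPU time on a timer, so resource spikes can be lined up against the broken pipe bursts in the server log.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

static class ResourceLogger
{
    // Periodically log process-level stats so resource spikes can be
    // correlated with the exception bursts in the application log.
    public static void Start(TimeSpan interval)
    {
        Thread t = new Thread(() =>
        {
            Process proc = Process.GetCurrentProcess();
            while (true)
            {
                proc.Refresh();
                Console.WriteLine(
                    $"{DateTime.UtcNow:o} threads={proc.Threads.Count} " +
                    $"workingSetMB={proc.WorkingSet64 / (1024 * 1024)} " +
                    $"totalCpu={proc.TotalProcessorTime}");
                Thread.Sleep(interval);
            }
        });
        t.IsBackground = true; // don't keep the process alive just for logging
        t.Start();
    }
}

// Usage, e.g. right after the server starts:
// ResourceLogger.Start(TimeSpan.FromSeconds(30));
```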

If you can produce code that simulates the load, I'd be happy to take a look at that specific code.

teSill commented 1 year ago

I restructured some things and started using Server.SendAsync instead of Server.Send, and the issues appear a bit clearer now. It seems there are two exceptions that lead to the Broken pipe and Cannot access a disposed object spam: more commonly a Connection timed out exception, and more rarely No route to host. I haven't enabled Keepalives, by the way. After running into one of them, the server logs the Cannot access a disposed object exception hundreds of times. Now that I'm using the asynchronous methods, it no longer clogs up all the traffic, and the problem doesn't appear to cause as many issues (if any).
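A rough sketch of that send pattern is below. It assumes the ipPort-keyed SendAsync overload from the WatsonTcp versions discussed in this thread (newer releases key clients by Guid instead); depending on the version, send failures may be reported as a false return value plus an ExceptionEncountered event rather than by throwing, so the catch blocks are defensive rather than guaranteed to fire.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;
using WatsonTcp;

static class SafeSender
{
    // Send asynchronously and treat a false return or an exception as
    // "this client is effectively gone", instead of letting the failure
    // block or crash the calling code path.
    public static async Task SendSafeAsync(WatsonTcpServer server, string ipPort, string data)
    {
        try
        {
            bool sent = await server.SendAsync(ipPort, data);
            if (!sent)
            {
                Console.WriteLine($"Send to {ipPort} failed; client likely disconnected");
            }
        }
        catch (ObjectDisposedException)
        {
            // The client's metadata was already torn down (the "cannot access
            // a disposed object" case mentioned above).
            Console.WriteLine($"Client {ipPort} already disposed");
        }
        catch (IOException ex)
        {
            // "Broken pipe", "connection timed out", and "no route to host"
            // surface as IOExceptions on the underlying NetworkStream.
            Console.WriteLine($"I/O error sending to {ipPort}: {ex.Message}");
        }
    }
}
```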

I'm still trying to come up with a way to reproduce the issue outside of production; I'll post the code here when/if I succeed.
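In the meantime, a rough harness along these lines might approximate the hourly-event spike described earlier: connect a few hundred clients at roughly the same time and have each send a burst of messages. Host, port, and counts are placeholders, and the client API shape (Events, Connect, SendAsync) assumes a recent WatsonTcp release.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using WatsonTcp;

class LoadTest
{
    static async Task Main()
    {
        const string host = "127.0.0.1";    // placeholder
        const int port = 9000;              // placeholder
        const int clientCount = 300;        // roughly the CCU range from this thread
        const int messagesPerClient = 50;

        var tasks = Enumerable.Range(0, clientCount).Select(i => Task.Run(async () =>
        {
            using var client = new WatsonTcpClient(host, port);

            // WatsonTcp requires a MessageReceived handler before Connect().
            client.Events.MessageReceived += (s, e) => { };
            client.Connect();

            for (int m = 0; m < messagesPerClient; m++)
            {
                await client.SendAsync($"client {i} message {m}");
                await Task.Delay(100); // small gap between messages
            }
        })).ToArray();

        await Task.WhenAll(tasks);
        Console.WriteLine("Load test finished.");
    }
}
```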

jchristn commented 1 year ago

Thanks for the update @teSill

jchristn commented 11 months ago

Hi @teSill, closing this since the next release will guide everyone toward the async APIs. Please re-open if you disagree.