Kitura / Kitura-WebSocket

WebSocket support for Kitura
Apache License 2.0

Large Messages Dropped #79

Open · bridger opened 5 years ago

bridger commented 5 years ago

I've noticed that in production, large messages (200kb) sent by the client are sometimes never received by the WebSocket callback on the server. The connection remains open, and later messages are still received.

It seems to be a race condition because I can't always repro it. It may have to do with several connections at the same time, because I can only repro it on my production server and not when running in Docker locally.
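For context, here is a minimal sketch of the shape of such a server-side handler in Kitura-WebSocket. The service name, path, and method bodies are placeholders, not the production code; the `received(message:from:)` callbacks are the ones that sometimes never fire for large messages.

```swift
import Foundation
import KituraWebSocket

// Minimal sketch of a Kitura-WebSocket service - not the production code.
class ChatService: WebSocketService {

    func connected(connection: WebSocketConnection) {
        // A client completed the WebSocket upgrade.
    }

    func disconnected(connection: WebSocketConnection, reason: WebSocketCloseReasonCode) {
        // The connection closed.
    }

    func received(message: Data, from: WebSocketConnection) {
        print("Received binary message of \(message.count) bytes")
    }

    func received(message: String, from: WebSocketConnection) {
        print("Received text message of \(message.count) characters")
    }
}

// Registration against a path ("/chat" is a placeholder).
WebSocket.register(service: ChatService(), onPath: "/chat")
```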

I suspected a race condition so I've been hunting through the code for suspicious areas. These are some theories:

Here a socket handler is used without going through the shQueue: https://github.com/IBM-Swift/Kitura-net/blob/master/Sources/KituraNet/IncomingSocketManager.swift#L241 Is each socketHandler guaranteed to be accessed by only one thread? The code expects the socketHandler's readBuffer to have changed, which suggests multiple threads are involved.

[Removed incorrect comment here about WSSocketProcessor's process() function not making sense]

ianpartridge commented 5 years ago

Hi @bridger, thanks for reporting this. The expert on this code is @djones6, but I'm afraid he's on vacation this week.

bridger commented 5 years ago

Thanks @ianpartridge. I'll follow up again next week. There may still be a concurrency issue, but I now see I was wrong about the buffer with incomplete data. WSSocketProcessor has a byteIndex ivar that keeps track of how much of the buffer it has read. If it only partially parses the buffer, byteIndex records how far it got, ready for the next call.
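To illustrate the pattern described here (this is not the actual WSSocketProcessor code, and the length-prefixed frame format is hypothetical), a sketch of a parser that carries a byteIndex across calls so a partially received frame is resumed later:

```swift
import Foundation

// Illustrative sketch only (not the real WSSocketProcessor): a parser that keeps a
// byteIndex so that a partially received frame is picked up again on the next call.
// The frame format here is hypothetical: a 4-byte big-endian length prefix followed
// by that many payload bytes.
struct LengthPrefixedParser {
    private var byteIndex = 0   // first unparsed byte in the caller's buffer

    // Parse as many complete frames as the buffer currently holds. The buffer is
    // expected to keep accumulating bytes across calls; byteIndex remembers how far
    // previous calls got, so a partial frame is simply left for later.
    mutating func process(buffer: Data, onFrame: (Data) -> Void) {
        while true {
            let remaining = buffer.count - byteIndex
            guard remaining >= 4 else { return }                 // length prefix incomplete

            let lengthBytes = buffer.subdata(in: byteIndex..<(byteIndex + 4))
            let length = lengthBytes.reduce(0) { ($0 << 8) | Int($1) }

            guard remaining >= 4 + length else { return }        // payload incomplete: wait

            let start = byteIndex + 4
            onFrame(buffer.subdata(in: start..<(start + length)))
            byteIndex += 4 + length                              // resume here next time
        }
    }
}
```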

bridger commented 5 years ago

I tried building with the GCD_ASYNCH compiler flag to use GCD instead of epoll. That seemed to help, because I am no longer able to repro the issue. The downside appears to be CPU usage on the server: it used to peak around 11%, but now it can sometimes max out the CPU. Here is a graph that shows CPU usage this week; you can clearly see when I deployed the GCD_ASYNCH version.
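For reference, GCD_ASYNCH is a compile-time switch. Below is a sketch of the general pattern, assuming the flag is passed as a Swift compilation condition; the exact conditions inside Kitura-net differ.

```swift
// Sketch of how a compile-time switch like GCD_ASYNCH generally works in Swift;
// the actual conditions inside Kitura-net differ. One common way to pass the flag:
//   swift build -Xswiftc -DGCD_ASYNCH
#if GCD_ASYNCH
    // Dispatch-based socket monitoring: reads are driven by DispatchSources.
    let usingEpoll = false
#else
    // Default on Linux: dedicated epoll threads wake up and drive the reads.
    let usingEpoll = true
#endif

print("Using epoll: \(usingEpoll)")
```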

[Screenshot: server CPU usage graph for the week (captured 2019-02-26, 8:31 AM), showing the jump after the GCD_ASYNCH deployment]
djones6 commented 5 years ago

I've been having a look at this, and so far I haven't spotted how this could occur (a message going missing), particularly if the connection recovers afterwards. That implies that reading from the socket is performing correctly - otherwise we'd get out of sync and be reading arbitrary bytes while expecting the start of a message.

I think the shQueue usage is okay - that queue provides a read/write locking mechanism to guard against concurrent access to the socketHandlers dictionary while one of the threads is mutating it. IIRC, a handler's readBuffer should not be accessed concurrently - it should only be accessed by the thread that has woken up to perform the socket read (I'll need to read the code again to confirm this). In the case of epoll, that's always one of two dedicated epoll threads, whereas with GCD, it's a Dispatch thread processing a serial queue.
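As an illustration of the reader/writer idea being described (a sketch, not the actual Kitura-net code; the type and names below are made up), a concurrent DispatchQueue with barrier writes guarding a handler dictionary looks like this:

```swift
import Dispatch

// Sketch of the reader/writer pattern described above: concurrent reads, exclusive
// (barrier) writes, guarding a dictionary keyed by file descriptor. The names here
// are illustrative - this is not the actual Kitura-net implementation.
final class HandlerRegistry<Handler> {
    private var handlers = [Int32: Handler]()
    private let shQueue = DispatchQueue(label: "handlers.read-write", attributes: .concurrent)

    // Many readers may run at once on the concurrent queue.
    func handler(for fd: Int32) -> Handler? {
        return shQueue.sync { handlers[fd] }
    }

    // A writer takes the queue exclusively, so no reader can observe a half-mutated dictionary.
    func set(_ handler: Handler?, for fd: Int32) {
        shQueue.sync(flags: .barrier) {
            handlers[fd] = handler
        }
    }
}
```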

The CPU consumption problem with GCD is concerning in its own right and also warrants some investigation. So far, I've been unable to recreate either issue locally; @bridger, if you have any sample code you could share (such as an extract or something equivalent to your production code), that would be very helpful.

djones6 commented 5 years ago

I started off trying to recreate this issue with plain Kitura + Kitura-net, but haven't managed to reproduce either issue yet.

Running in a constrained Docker environment and stressing a POST route with 100,000 x 256k payloads across 10 concurrent clients, I saw zero requests dropped.

Running in GCD_ASYNCH mode at a low request rate (so that unexpected spikes in CPU consumption would be visible), both epoll and GCD have so far produced similar (and stable) CPU utilisation.

So I suspect we're looking at a problem specific to websockets.
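For completeness, here is a rough sketch of the kind of load driver such a stress test might use for the POST scenario described above. The URL, route, and request counts are placeholders; the actual test harness isn't shown in this thread.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking   // URLSession lives here on Linux
#endif

// Illustrative stress driver: N concurrent workers each POST a 256KB payload
// in a loop, one request in flight per worker. Host, route, and counts are placeholders.
let url = URL(string: "http://localhost:8080/echo")!
let payload = Data(repeating: 0x41, count: 256 * 1024)
let workers = 10
let requestsPerWorker = 10_000   // 100,000 requests in total, roughly as described above

let group = DispatchGroup()
for _ in 0..<workers {
    group.enter()
    DispatchQueue.global().async {
        let semaphore = DispatchSemaphore(value: 0)
        for _ in 0..<requestsPerWorker {
            var request = URLRequest(url: url)
            request.httpMethod = "POST"
            request.httpBody = payload
            URLSession.shared.dataTask(with: request) { _, _, error in
                if let error = error { print("request failed: \(error)") }
                semaphore.signal()
            }.resume()
            semaphore.wait()   // wait for this request before issuing the next
        }
        group.leave()
    }
}
group.wait()
print("done")
```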

bridger commented 5 years ago

Thank you so much for looking into this @djones6! This is a difficult one to repro. It makes sense that it could be specific to WebSockets. I was only able to repro it in my production environment, which makes me think it has to do with concurrent connections.

I'll message you on Slack with my source code so you can take a better look. Here are the details for when it was reproing

djones6 commented 5 years ago

@bridger Apologies for the radio silence. I'm afraid I have been unable to recreate this issue locally - initially I thought I'd found something, but that turned out to be a flaw in my own application.

I did end up stressing Kitura-net and WebSockets quite a lot while trying to recreate the loss of a message, including running in a constrained environment (a Docker container running with a 0.1 CPU allocation, which can sometimes flush out concurrency-related bugs), but all messages were received as expected by Kitura's handler. So that's good, but it doesn't help debug this issue.

You mentioned that the bug is that the client's large message is sometimes lost. Am I correct in thinking that this is a large message being sent by a (non-Kitura) client, which is not arriving at the Kitura handler?

bridger commented 5 years ago

Thank you so much for looking into the issue! Sorry it hasn’t revealed anything yet.

You are correct in your understanding. The large message is sent by an iPad app using the SwiftWebsocket library. The Kitura handler wasn’t called to receive this large message, but it did receive subsequent smaller messages.
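To illustrate the client side of this scenario, here is a minimal sketch that opens a connection, sends one large (~200kb) text message, then a small follow-up. Note this uses Apple's URLSessionWebSocketTask purely for illustration, not the SwiftWebsocket library the iPad app actually uses, and the URL is a placeholder.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking
#endif

// Client-side sketch (requires iOS 13+ / macOS 10.15+ or a recent corelibs-foundation).
let task = URLSession.shared.webSocketTask(with: URL(string: "ws://example.com/chat")!)
task.resume()

let largeMessage = String(repeating: "x", count: 200 * 1024)
task.send(.string(largeMessage)) { error in
    if let error = error {
        print("large send failed: \(error)")
        return
    }
    // In the reported bug, the large send appears to succeed on the client,
    // yet the server's received(message:from:) callback never fires for it,
    // while this smaller message does arrive.
    task.send(.string("small follow-up")) { error in
        if let error = error {
            print("small send failed: \(error)")
        }
    }
}

// Keep the process alive long enough for the asynchronous sends (sketch only).
Thread.sleep(forTimeInterval: 5)
```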

I’m hopeful this might be coincidentally fixed by NIO2. I’ve had it running for a few days locally (with permessage-deflate disabled because of one last bug), and when I’m brave I’ll deploy it and try to repro the bug. Unfortunately, I’ve never been able to repro this bug locally and only see it when deployed to production. (My theory is that this is because production has many concurrent connections.)

djones6 commented 5 years ago

When you reproduced the issue, would I be correct in thinking that you have a Kitura WebSocket server running remotely, and are accessing it via a (local) iPad? If so, I wonder whether the size of the message that you are sending is causing it to get lost (or dropped) somewhere in transit.

Since we're using TCP, it shouldn't be possible for it to get lost and for subsequent messages to be received, due to the TCP guarantees of delivery and ordering. So it would have to be dropped (perhaps by a proxy) on its way to the server.

The other possibility is that the websocket client is sometimes dropping the large message when it is transmitting over a 'real' network (localhost behaves quite differently).

What we could do to debug it is deploy a version of Kitura-net with much more verbose logging, so that we can see what data it is reading off the wire. That would tell us whether the data is arriving but somehow being ignored, or never arriving at all. If this is of interest, I can prepare a patched version of Kitura-net.
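As a sketch of what that extra instrumentation might look like (illustrative only, not the actual patch; the handleRead function below is hypothetical), using the LoggerAPI / HeliumLogger loggers from the Kitura ecosystem:

```swift
import Foundation
import LoggerAPI
import HeliumLogger

// Enable verbose logging in the application; HeliumLogger is Kitura's reference logger.
HeliumLogger.use(.verbose)

// Hypothetical instrumentation of a socket-read path (this function is illustrative,
// not actual Kitura-net code): log how many bytes each read returned and a short
// hex preview, so we can see whether the large frame ever arrives off the wire.
func handleRead(fileDescriptor: Int32, data: Data) {
    Log.verbose("fd \(fileDescriptor): read \(data.count) bytes")
    let preview = data.prefix(16).map { String(format: "%02x", $0) }.joined(separator: " ")
    Log.verbose("fd \(fileDescriptor): first bytes: \(preview)")
    // ...then hand the data on to the WebSocket frame parser as usual...
}
```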