celluloid / reel

UNMAINTAINED: See celluloid/celluloid#779 - Celluloid::IO-powered web server
https://celluloid.io
MIT License

Websocket hangs when >64 KiB packet is sent across it [Windows only] #64

Closed perlun closed 6 years ago

perlun commented 11 years ago

(Cross-posting of https://github.com/jeremyd/celluloid-websocket-client/issues/10 since I honestly don't know if this is a server or client issue.)

As the subject says, I'm seeing really, really weird behavior with the celluloid-websocket-client. It seems like a "big packet" (bigger than 64 KiB) somehow breaks the socket, making it malfunction and stop receiving any further packets.

This is with JRuby on Windows. On OSX, everything works as it should, both with small and big packets.

More specifically, these are the exact scenarios (quoting https://github.com/perlun/celluloid-websocket-client/blob/big_packets_bug/examples/roundtrip_client.rb#L19):

if ARGV[0] == 'small'
  # Packets less than 64 KiB, using 16-bit payload length.
  #
  # Works, both on Windows (client + server) and OSX (client + server). No problems encountered with this packet size.
  msg = '123456' * 10000
elsif ARGV[0] == 'big'
  # Packets bigger than 64 KiB, triggering a 64-bit payload length (RFC 6455, section 5.2, "Payload length")
  #
  # Works on OSX client + server
  # Fails on Windows client + server (crashes the websocket so that we never get any response on the second packet either).
  # Fails on Windows client + OSX server (likewise).
  # Fails on Windows server + OSX client (likewise).
  msg = '123456' * 100000
end
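For reference, the 16-bit versus 64-bit distinction in the comments above comes from how RFC 6455 section 5.2 encodes the payload length. The following is a minimal sketch of that encoding only; `payload_length_bytes` is a hypothetical helper name (it is not part of Reel or celluloid-websocket-client), and the mask bit that client frames set in the same byte is left out for simplicity.

```ruby
# Sketch of the RFC 6455 section 5.2 payload-length field (mask bit omitted).
# `payload_length_bytes` is a hypothetical name used only for illustration.
def payload_length_bytes(length)
  if length <= 125
    # Lengths up to 125 fit directly in the 7-bit field.
    [length].pack('C')
  elsif length <= 0xFFFF
    # Marker 126, followed by a 16-bit big-endian length.
    [126, length].pack('Cn')
  else
    # Marker 127, followed by a 64-bit big-endian length
    # (packed here as two 32-bit halves).
    [127, length >> 32, length & 0xFFFFFFFF].pack('CNN')
  end
end

small = ('123456' * 10000).bytesize   # 60_000 bytes, below 64 KiB
big   = ('123456' * 100000).bytesize  # 600_000 bytes, above 64 KiB

puts payload_length_bytes(small).bytesize # 3 (marker + 16-bit length)
puts payload_length_bytes(big).bytesize   # 9 (marker + 64-bit length)
```

So the 'small' message stays on the 16-bit path and the 'big' one is the only case that exercises the 64-bit path, which matches the split between the working and failing scenarios.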

To be able to properly isolate the issue, I've made a small test suite which reproduces it. You can find it here: https://github.com/perlun/celluloid-websocket-client/tree/big_packets_bug

(clone my repo and checkout the big_packets_bug branch.)

Run the server and the client like this (after bundle install --standalone etc), in two different windows:

jruby --1.9 -Ilib examples/roundtrip.rb # this starts the (Reel) server
jruby --1.9 -Ilib examples/roundtrip_client.rb # surprisingly enough, this starts the client...

I'm amazed and really don't have a clue as to what's happening. I've looked a bit at both ends, inserting puts statements here and there, but any constructive feedback would be incredibly helpful here. Naturally, I'll gladly help out with debugging this issue.

Many thanks in advance.

perlun commented 11 years ago

Hi,

Any ideas about how I should try and tackle this? I've thought of monkeypatching TCPSocket to try and get some more debug output about the packets being sent over the wire (and the state of the sockets). Other than that, I'm a bit out of ideas. I could wireshark the whole thing but it feels a bit overkill...

From what I saw when I tested (w/ Fiddler as man-in-the-middle), the client seemed to send the packets fine, even after a packet bigger than 64 KiB had been transmitted. But they never seemed to reach Reel, which suggests that something on the Reel side could be causing the socket to stall. The whole issue being Windows-only makes it all quite bizarre, but as I said, I'm more than willing to help out in debugging this, as long as I get some direction as to where I should proceed with my wandering.
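The socket-tracing idea mentioned above could be sketched roughly as follows. This is only an illustration under assumptions: `TracingIO` is a hypothetical wrapper class, not something provided by Celluloid::IO, Reel, or nio4r, and it only covers `write` and `readpartial`. The real code paths may go through other methods (for example non-blocking variants), which would need the same treatment. A wrapper is used here instead of reopening TCPSocket so that the tracing can be limited to the sockets under suspicion.

```ruby
require 'delegate'
require 'stringio'

# Hypothetical tracing wrapper: logs how many bytes go in and out of an
# IO-like object and forwards everything else to it unchanged.
class TracingIO < SimpleDelegator
  def initialize(io, log = $stderr)
    super(io)
    @log = log
  end

  def write(data)
    written = __getobj__.write(data)
    @log.puts "write: requested=#{data.bytesize} written=#{written}"
    written
  end

  def readpartial(maxlen)
    chunk = __getobj__.readpartial(maxlen)
    @log.puts "readpartial: max=#{maxlen} got=#{chunk.bytesize}"
    chunk
  end
end

# Usage with an in-memory stand-in for a socket; a real run would wrap
# the TCPSocket instead of a StringIO.
log = StringIO.new
io  = TracingIO.new(StringIO.new, log)
io.write('x' * 70_000)   # just over 64 KiB
io.rewind
io.readpartial(65_536)
puts log.string
```

Comparing the requested and written byte counts on each side should at least show whether the data stops at the sender or at the receiver.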

tarcieri commented 11 years ago

@perlun Windows-only definitely makes the problem much more difficult. I would say it's indicative of problems either internal to JRuby, Java NIO, or nio4r.

You might see if @halorgium's logging branch can get you more information about where things are stalling.

perlun commented 11 years ago

Yeah, I can understand that. Unfortunately we are still using Windows on some servers & dev machines (well, we are actually developing Windows software after all ;) so it's hard to get around in our case.

Thanks for the hints. What branch are you talking about more specifically? https://github.com/halorgium/reel/tree/hijacking or something else? I didn't find it now when looking for it briefly.

tarcieri commented 11 years ago

This branch: (cc @halorgium)

https://github.com/halorgium/celluloid/tree/logging

perlun commented 11 years ago

I gave it a try now briefly, but the signal to noise ratio was quite low. :) I guess I would have to tweak it a bit, or pipe the output to a file and start digging...

tarcieri commented 11 years ago

Yeah, it's definitely an event firehose. Perhaps you could try running it on both Windows and Linux then comparing the output for discrepancies?

perlun commented 6 years ago

Age old, closing.