Bobris / Nowin

Owin Web Server in pure .Net
MIT License
537 stars 78 forks source link

corruption/threading issues in transport layer? #70

Open mcleodia opened 7 years ago

mcleodia commented 7 years ago

Hi

tl/dr: reproducible example of instabilities: data corruption, assertions, null ref exceptions probably caused by bug within transport layer is attached.

I know you are deprecating this in favour of kestrel, but we were evaluating the library for use with an existing mono based server which needs to target quite a few obscure architectures (e.g. mono on qnap on various flavours of arm). I'm not sure that the kestrel stuff will work for us right now for that reason. A simple test of kestrel on mac using mono died horribly. I also didn't get as far as finding/porting the libuv stuff for qnap, so this project seemed like a good stopgap.

Anyway, long story short, some of our clients were experiencing receiving random connection closes from the server (we were never actively closing from the server!). So I wrote a stress test app (attached) and it seems to display quite a lot of instability.

I swapped out a couple of alternative server implementations (MSHttpListener and kestrel). They run stably, so I'm guessing that nowin is the culprit. The owin.websocket wrapper is fairly trivial and the errors all seem to be down at the transport layers.

Errors seen include null ref on _socket in SaeaLayerCallback - : _socket.ReceiveAsync(_receiveEvent); line 270.

I also saw an assertion followed by some corrupted websocket message data being passed up to the higher layers. Think the assertion was
Debug.Assert(StartBufferOffset + ReceiveBufferPos == offset || _waitingForRequest); in Transport2HttpHandler, line 1039.

Anyway, it's all a bit intermittent and probably all caused by a single low level bug down in the guts of the transport layers. I'll try and have a pop at it, but no guarantees I'll be able to find it :)

If you have any pointers towards helping me debug this / potentially weak areas etc I'd be really grateful for any input!

Cheers Iain

mcleodia commented 7 years ago

socketbug.zip

Bobris commented 7 years ago

Null ref on line 270 just solve by check:

            using (StopExecutionContextFlow())
            {
                var s = _socket;
                if (s == null)
                {
                    _handler.FinishReceive(null, 0, -1); // report closed socket
                    return;
                }
                willRaiseEvent = s.ReceiveAsync(_receiveEvent);
            }

Enable tracing (look at end of TraceSources.cs how to do it) and add tracing to send content like it is already in receive:

    TraceSources.CoreDebug.TraceInformation(Encoding.UTF8.GetString(buffer, offset, length));

Another option to try is to return something like 2+ years back in commits.

marius-klimantavicius commented 7 years ago

It is possible that there might some threading issues (FinishReceiveData and ReceiveAsync both call ParseWebSocketReceivedData and ReceiveAsync comes from consuming app thread, and FinishReceiveData from IO thread [IOCompleted] - I don't think this is an issue for Nowin [I have my own fork that tries to handle ping/pong and this causes threading issues]). I suggest modifying TraceSources to include (managed) thread id, you could also emit part of callstack in the trace sources to help to pin point the issues (by using StackFrame or StackTrace classes - for debugging only as they are slow).

Looking into corefx ClientWebSocket implementation (managed version) it seems that it tries to send Ping frames which might cause issues see #60 - try disabling KeepAlive in your test.

If you are only interested in WebSockets you could also look into other fully managed implementations (https://github.com/vtortola/WebSocketListener and https://github.com/StackExchange/NetGain seem to be the most promising). I am interested in having a lightweight code-only embeddable full http server thus I am sticking with Nowin (my fork though).