StormHub / NetUV

.Net standard/Core binding for Libuv
Other
76 stars 20 forks source link

SegFault on Linux when server actively terminates client connection #60

Closed oliverw closed 7 years ago

oliverw commented 7 years ago

I'm getting repeated reports of Segfaults on Linux which all seem to be related to situations where the server actively terminates client connections.

Inspecting the core dump using gdb revealed this crash location: https://github.com/libuv/libuv/blob/v0.10/src/unix/stream.c#L867

Due to optimization being turned on it might be the assert three lines below . Another bug report points to uv__write_callbacks.

Flow of events

  1. The client's subscription to the stream of incoming messages gets disposed. This may happen from any TaskPool-Thread
  2. Therefore the actual disposal is marshalled to Event-Loop's thread using an AsyncHandle (same file, Line #172
  3. Line 174 is executing on the Event-Loop thread and the Dispose on Line #174 invokes the actual connection shutdown
  4. There's usually nothing to report beyond this point 😢
StormHub commented 7 years ago

One thing you might want to try out. https://github.com/coinfoundry/miningcore/blob/master/src/MiningCore/JsonRpc/JsonRpcConnection.cs#L110 When the stream is ended (connection dropped), the completed will be called. You might want to send the a shutdown in the completed handler.

StormHub commented 7 years ago
  1. If the client connection interrupted, the server handle error will be called and send a shutdown on the server handle.
  2. If the client connection ended (closed gracefully), the server handle complete will be called and send a shutdown on the server handle The client side is similar, I tend not to explicitly use Async handle to issue a shutdown and wait for the "end of stream" signal (completed handler). It is guaranteed to be called once.
StormHub commented 7 years ago

Those are observations from my personal experiences working with libuv. I only use Async handle if I want to schedule something within libuv thread, e.g. send some data immediately.

oliverw commented 7 years ago

@StormHub Thanks a lot for the infos Johnny. I'm going to implement the changes and see how it goes.

StormHub commented 7 years ago

Please let me know how this goes. BTW does this happen on windows platform?

oliverw commented 7 years ago

@StormHub What's not entirely clear to me from your remarks is how the server forcefully terminates the client's connection. The OnCompleted and OnError handlers are called when the connection has ended but how do you initiate this in a safe way - avoiding the crashes we are talking about? The use case here is getting rid of a misbehaving client as quickly and safely as possible.

StormHub commented 7 years ago

Sorry for the confusion. What you can try first is to put the close handle into OnCompleted, it is always called when a stream ends.

oliverw commented 7 years ago

@StormHub Ok and leave the rest as it is?

StormHub commented 7 years ago

Do you actively close server connection? If you do, just use Async to trigger a Shutdown, then later on the completed will be called.

StormHub commented 7 years ago

BTW: if you share some gist to reprod the issue, I definitely can help you much more effectively :)

oliverw commented 7 years ago

@StormHub That's the problem it only happens on production systems under load. It just happened again with the fix you recommended (moving CloseHandle to the OnCompleted handler):

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `dotnet MiningCore.dll -c /config.json'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007ff09bdea9f1 in uv__write (stream=0x7fefac01f880)
    at /datadrive/teamcity/6a56927d0d6f3630/submodules/libuv/src/unix/stream.c:867
867     /datadrive/teamcity/6a56927d0d6f3630/submodules/libuv/src/unix/stream.c: No such file or directory.
[Current thread is 1 (Thread 0x7fefc6ffd700 (LWP 55))]
StormHub commented 7 years ago

From the stack trace, it looks like libuv attempts to write on some invalid request handle memory. Is this the server side or client side? or both running in the same loop?

StormHub commented 7 years ago

Any prior error/exception on the handle?

oliverw commented 7 years ago

The whole issue is the server side. The clients are not under my control and don't use LibUv at all.

oliverw commented 7 years ago

Nope no managed exceptions (I would included them for sure). Just the segfault out of nowhere:

Segmentation fault (core dumped)

StormHub commented 7 years ago

Could you show me the part where you send the write requests?

oliverw commented 7 years ago

I'm pretty sure the crash is triggered either here or in the same file Line 121.

The only way how writes are enter the picture are the two Send methods in the same source file. Line 141 and 156.

StormHub commented 7 years ago

I noticed that it is a straight byte[] writes, it might be worth experiment to see how that goes with the pooled buffer. For example: WritableBuffer buffer = tcp.Allocate(1024); buffer.WriteString("Somestring", Encoding.UTF8); tcp.QueueWriteStream(buffer, (h, e)=> { / Do some error checking/ buffer.Dispose(); } );

Net core Unsafe has some issues in linux with buffer copy.

StormHub commented 7 years ago

The pooled buffer will automatically expand, you just have to give it a reasonable size to begin.

oliverw commented 7 years ago

I noticed that it is a straight byte[] writes, it might be worth experiment to see how that goes with the pooled buffer.

Allright. I'm going to implement this. By the way I've also experimented with a work-around, but for some reason writesInFlight often did not reach a value of zero in Line 213.

Net core Unsafe has some issues in linux with buffer copy.

Interesting. Source?

StormHub commented 7 years ago

It is a way to check whether it is a buffer memory problem or handle problem. BTW: Where is the send gets called? is it within libuv thread? (it should be called libuv loop thread)

StormHub commented 7 years ago

For some Unsafe copy problem, see this commends (already reported to MS) https://github.com/StormHub/NetUV/blob/dev/src/NetUV.Core/Common/PlatformDependent.cs#L33 I am waiting for some sort of explanation at the moment.

oliverw commented 7 years ago

@StormHub You mean the two Send methods of JsonRpcConnection? Those are called from any TaskPool thread. Judging by your comment this is just plain wrong, is it? (I was under the impression the queuing logic behind your QueueWrite implementation would make sure that the actual writes happen on the Loop thread. Have I been wrong?

StormHub commented 7 years ago

Well, you have to use Async handle to marshal the send from within the libuv thread. Initially that was my intention to do this automatically but some of my fellow programmers complain I do not stick to libuv principles, it got removed. e.g. make the wrapper layer as 'thin' as possible, but not less.

oliverw commented 7 years ago

@StormHub Does the WritableBuffer also need to be allocated on the Loop thread?

oliverw commented 7 years ago

I'm asking because this implementation works, while this one starts throwing exceptions (see below) as soon as Send is called for the second time. It almost looks like as if there's a problem with the buffer pooling (the recylced buffer is somehow not correctly reset):

[2017-10-06 11:21:33.6887] [E] [xmr1] System.IndexOutOfRangeException: index: 16384, length: 238 (expected: range(0, 238)
at NetUV.Core.Buffers.AbstractArrayBuffer`1.CheckIndex(Int32 index, Int32 fieldLength)
at NetUV.Core.Buffers.AbstractArrayBuffer`1.CheckSrcIndex(Int32 index, Int32 length, Int32 srcIndex, Int32 srcCapacity)
at NetUV.Core.Buffers.PooledArrayBuffer`1.Set(Int32 index, T[] src, Int32 srcIndex, Int32 length)
at NetUV.Core.Buffers.AbstractArrayBuffer`1.Set(Int32 index, T[] src)
at NetUV.Core.Buffers.ByteBufferExtensions.SetString(IArrayBuffer`1 buffer, Int32 start, String value, Encoding encoding)
at NetUV.Core.Buffers.WritableBuffer.WriteString(String value, Encoding encoding)
at MiningCore.JsonRpc.JsonRpcConnection.Send(Object payload) in C:\Users\oweichhold\Projects\MiningCore\src\MiningCore\JsonRpc\JsonRpcConnection.cs:line 150
at MiningCore.Stratum.StratumClient`1.Notify[T](JsonRpcRequest`1 request) in C:\Users\oweichhold\Projects\MiningCore\src\MiningCore\Stratum\StratumClient.cs:line 97
at MiningCore.Stratum.StratumClient`1.Notify[T](String method, T payload) in C:\Users\oweichhold\Projects\MiningCore\src\MiningCore\Stratum\StratumClient.cs:line 88
at MiningCore.Blockchain.Monero.MoneroPool.<OnNewJob>b__9_0(StratumClient`1 client) in C:\Users\oweichhold\Projects\MiningCore\src\MiningCore\Blockchain\Monero\MoneroPool.cs:line 279
at MiningCore.Stratum.StratumServer`1.ForEachClient(Action`1 action) in C:\Users\oweichhold\Projects\MiningCore\src\MiningCore\Stratum\StratumServer.cs:line 250 System.IndexOutOfRangeException: index: 16384, length: 238 (expected: range(0, 238)
at NetUV.Core.Buffers.AbstractArrayBuffer`1.CheckIndex(Int32 index, Int32 fieldLength)
at NetUV.Core.Buffers.AbstractArrayBuffer`1.CheckSrcIndex(Int32 index, Int32 length, Int32 srcIndex, Int32 srcCapacity)
at NetUV.Core.Buffers.PooledArrayBuffer`1.Set(Int32 index, T[] src, Int32 srcIndex, Int32 length)
at NetUV.Core.Buffers.AbstractArrayBuffer`1.Set(Int32 index, T[] src)
at NetUV.Core.Buffers.ByteBufferExtensions.SetString(IArrayBuffer`1 buffer, Int32 start, String value, Encoding encoding)
at NetUV.Core.Buffers.WritableBuffer.WriteString(String value, Encoding encoding)
at MiningCore.JsonRpc.JsonRpcConnection.Send(Object payload) in C:\Users\oweichhold\Projects\MiningCore\src\MiningCore\JsonRpc\JsonRpcConnection.cs:line 150
at MiningCore.Stratum.StratumClient`1.Notify[T](JsonRpcRequest`1 request) in C:\Users\oweichhold\Projects\MiningCore\src\MiningCore\Stratum\StratumClient.cs:line 97
at MiningCore.Stratum.StratumClient`1.Notify[T](String method, T payload) in C:\Users\oweichhold\Projects\MiningCore\src\MiningCore\Stratum\StratumClient.cs:line 88
at MiningCore.Blockchain.Monero.MoneroPool.<OnNewJob>b__9_0(StratumClient`1 client) in C:\Users\oweichhold\Projects\MiningCore\src\MiningCore\Blockchain\Monero\MoneroPool.cs:line 279
at MiningCore.Stratum.StratumServer`1.ForEachClient(Action`1 action) in C:\Users\oweichhold\Projects\MiningCore\src\MiningCore\Stratum\StratumServer.cs:line 250    at NetUV.Core.Buffers.AbstractArrayBuffer`1.CheckIndex(Int32 index, Int32 fieldLength)
at NetUV.Core.Buffers.AbstractArrayBuffer`1.CheckSrcIndex(Int32 index, Int32 length, Int32 srcIndex, Int32 srcCapacity)
at NetUV.Core.Buffers.PooledArrayBuffer`1.Set(Int32 index, T[] src, Int32 srcIndex, Int32 length)
at NetUV.Core.Buffers.AbstractArrayBuffer`1.Set(Int32 index, T[] src)
at NetUV.Core.Buffers.ByteBufferExtensions.SetString(IArrayBuffer`1 buffer, Int32 start, String value, Encoding encoding)
at NetUV.Core.Buffers.WritableBuffer.WriteString(String value, Encoding encoding)
at MiningCore.JsonRpc.JsonRpcConnection.Send(Object payload) in C:\Users\oweichhold\Projects\MiningCore\src\MiningCore\JsonRpc\JsonRpcConnection.cs:line 150
at MiningCore.Stratum.StratumClient`1.Notify[T](JsonRpcRequest`1 request) in C:\Users\oweichhold\Projects\MiningCore\src\MiningCore\Stratum\StratumClient.cs:line 97
at MiningCore.Stratum.StratumClient`1.Notify[T](String method, T payload) in C:\Users\oweichhold\Projects\MiningCore\src\MiningCore\Stratum\StratumClient.cs:line 88
at MiningCore.Blockchain.Monero.MoneroPool.<OnNewJob>b__9_0(StratumClient`1 client) in C:\Users\oweichhold\Projects\MiningCore\src\MiningCore\Blockchain\Monero\MoneroPool.cs:line 279
at MiningCore.Stratum.StratumServer`1.ForEachClient(Action`1 action) in C:\Users\oweichhold\Projects\MiningCore\src\MiningCore\Stratum\StratumServer.cs:line 250
StormHub commented 7 years ago

Nope, the buffer pool is managed separately. Not part of libuv. The new buffer pool is in dev at the moment.

StormHub commented 7 years ago

I might just want push the new release out from dev.

StormHub commented 7 years ago

Could you just use byte[] for now to resolve the issue. Give me a bit of time to tidy up the new buffer pool and release it. It was something I had been working on for quite some time. It is the last missing piece before ssl.

oliverw commented 7 years ago

@StormHub Allright. Thanks for the heads up and all your help.

StormHub commented 7 years ago

@oliverw Thanks for letting me know. I found the issue, it was the wrong byte offset. But that does not matter much now because I have re written all of those. :)

oliverw commented 7 years ago

@StormHub Hehe. Just let me know when the new buffer stuff is production ready.

oliverw commented 7 years ago

@StormHub Totally weird. Even after the fix the app keeps crashing and crashing. It actually got worse. Now crashing every couple minutes. Users leaving the pool.

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `dotnet MiningCore.dll -c /config.json'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f4970000cc0 in ?? ()
[Current thread is 1 (Thread 0x7f4977fff700 (LWP 58))]
(gdb) bt
#0  0x00007f4970000cc0 in ?? ()
#1  0x00007f4a65dd5307 in uv__async_event (loop=0x7f4970000b20,
    w=0x7f4970000ce0, nevents=2)
    at /datadrive/teamcity/6a56927d0d6f3630/submodules/libuv/src/unix/async.c:98
#2  0x00007f4a65dd584d in uv__async_io (loop=0x7f4970000b20, w=0x7f4970000ce8,
    events=1)
    at /datadrive/teamcity/6a56927d0d6f3630/submodules/libuv/src/unix/async.c:138
#3  0x00007f4a65dee806 in uv__io_poll (loop=0x7f4970000b20, timeout=-1)
    at /datadrive/teamcity/6a56927d0d6f3630/submodules/libuv/src/unix/linux-core.c:382
#4  0x00007f4a65dd5fcf in uv_run (loop=0x7f4970000b20, mode=UV_RUN_DEFAULT)
    at /datadrive/teamcity/6a56927d0d6f3630/submodules/libuv/src/unix/core.c:352
#5  0x00007f4a09722bda in ?? ()
#6  0x4040000000000000 in ?? ()
#7  0x0000000063420588 in ?? ()
#8  0x00007f4a809b9488 in vtable for InlinedCallFrame ()
   from /usr/share/dotnet/shared/Microsoft.NETCore.App/2.0.0/libcoreclr.so
#9  0x00007f4977ffec58 in ?? ()
#10 0x00007f4a099590c8 in ?? ()
#11 0x00007f4a099562e0 in ?? ()

uv__async_event is probably the result of https://github.com/coinfoundry/miningcore/blob/master/src/MiningCore/JsonRpc/JsonRpcConnection.cs#L168

StormHub commented 7 years ago
  1. Keep reusing the same Async handle
  2. Save the Upstream into the Async.UserToken Also, could you set the trace level to verbose and capture the log?
StormHub commented 7 years ago

by the way, which libuv package are you using? is it the one built from AspNet?

StormHub commented 7 years ago

From the stack, it is clearly not a buffer issue because https://github.com/libuv/libuv/blob/v0.10/src/unix/async.c#L99 uses the local buffer to receive from io.

StormHub commented 7 years ago

Meanwhile, this is actually within libuv io poll, not even passed along to managed code yet.

oliverw commented 7 years ago

@StormHub Good morning Johnny. Sorry for responding late. I simply fell asleep after deploying an emergency fix involving a rewrite of all socket comms using plain old TcpListener just to get this thing running stable over night and getting some sleep after battling this the whole day. 😄

oliverw commented 7 years ago

by the way, which libuv package are you using? is it the one built from AspNet?

https://github.com/coinfoundry/miningcore/blob/master/src/MiningCore/MiningCore.csproj#L44

I'm using the one that satisfies NetUVs NuGet dependency.

StormHub commented 7 years ago

Sorry, I mean the libuv dependency?

oliverw commented 7 years ago

@StormHub I don't have a direct dependency on LibUv. I use what NetUVs references: https://github.com/StormHub/NetUV/blob/dev/src/NetUV.Core/NetUV.Core.csproj#L28

StormHub commented 7 years ago

Right. I am really interested in what happened on linux. Give me a bit time and let me push out the new version with some sort of automatic libuv loop thread management. I cannot change handle interfaces, too many references to those. I have to offer something on top of libuv wrapper. https://github.com/StormHub/NetUV/blob/dev/src/NetUV.Core/Channels/EventLoop.cs And let you try it out.

oliverw commented 7 years ago

@StormHub Looks interesting.

StormHub commented 7 years ago

Basically freeing user code from those marshalling code. That was originally in the stream handle. I pulled all out.

oliverw commented 7 years ago

@StormHub By the way, you wrote:

Keep reusing the same Async handle

I'm not sure how this would work here. I need a closure on the data parameter as there does not seem to be a way to pass state to the Async Handler through AsyncHandle.Send(). Did I miss something?

StormHub commented 7 years ago

How about use some wrapper task to hold the stream handle and data to send, set it to the Async handle UserToken. Now I probably see the problem here, the data lambda capture might be disposed on managed code.

oliverw commented 7 years ago

@StormHub You mean the crash?

StormHub commented 7 years ago

If you want the data to be held until sent, you have to get the handle to hold it. In managed code, delegate might be GC'd if you don't anything to hold it on the managed runtime.

StormHub commented 7 years ago

Yep, it only happens on intensive load, GC triggers much more often.

StormHub commented 7 years ago

Plus, you probably should us ServerGC on prod as well. That would make the GC behave a bit better under intensive load.