allright opened 5 years ago
I have just profiled what's happening at the "stall moment".
You can open perf-kernel.svg in any browser to view the performance graph: Perf.zip
Too many objects being released at the same moment block the event loop. Can we fix it? Possible workarounds:
Is it possible to schedule 50% of event loop time for handling all events except object releases, and 50% for other tasks? Maybe we need something like a managed garbage collector (or a "smooth object release manager", maybe something like DisposeBag in RxSwift?).
Tools used for perf monitoring: http://www.brendangregg.com/perf.html and http://www.brendangregg.com/perf.html#TimedProfiling
ouch, thanks @allright , we'll look into that
One more possible design is to provide a fast custom allocator/deallocator (like in C++'s std) for promises, one that has preallocated memory and does not actually call malloc/free every time an object is deallocated, or calls it once for a big group of objects. So my idea is to group allocations/deallocations: one alloc for 1000 promises, or one alloc/dealloc per second. We could then attach this custom allocator/deallocator to each EventLoop.
Another possible design is an object reuse pool. It could preallocate all the needed objects at app start and deallocate them only on app stop, or manage them automatically. A real server application is usually tuned in place for the maximum possible connections/speed, so we do not need real retain/dealloc during the app's life (only on start/stop).
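For illustration, a minimal sketch of what such a reuse pool could look like (all types here are hypothetical, not NIO API; it assumes pooled objects can be reset via a reinit() method):

// Hypothetical object reuse pool, sketching the idea above; not NIO API.
protocol Reusable: AnyObject {
    init()
    func reinit() // reset the object's state before it is handed out again
}

final class ReusePool<T: Reusable> {
    private var storage: [T] = []

    init(capacity: Int) {
        storage.reserveCapacity(capacity)
        for _ in 0..<capacity { storage.append(T()) } // preallocate at app start
    }

    func acquire() -> T {
        // hand out a pooled object if available, otherwise fall back to a real allocation
        return storage.popLast() ?? T()
    }

    func release(_ object: T) {
        object.reinit()
        storage.append(object) // keep the object for reuse instead of deallocating it
    }
}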
@weissi What do you think?
@allright Swift unfortunately doesn't let you choose the allocator. It will always use malloc. Also, from your profile it seems to be the reference counting rather than the allocations, right?
@weissi Yes, I think it is reference counting. We could still use special factories (object reuse pools) for lightweight creation/freeing of objects, attached to each EventLoop. In these pools, objects may not actually be deallocated, just reinitialized before the next allocation.
But the reference counting operations are inserted automatically by the Swift compiler. They happen whenever something is used. Let's say you write this:
func someFunction(_ foo: MyClass) { ... }
let object = MyClass()
someFunction(object)
object.doSomething()
then the Swift compiler might emit code like this:
let object = MyClass() // allocates it with reference count 1
object.retain() // ref count + 1, to pass it to someFunction
someFunction(object)
object.retain() // ref count + 1, for the .doSomething call
object.doSomething()
object.release() // ref count - 1, because we're out of someFunction again
object.release() // ref count - 1, because we're done with .doSomething
object.release() // ref count - 1, because we no longer need `object`
certain reference counting operations can be optimised away, but generally Swift is very noisy with ref counting operations and we can't remove them with object pools.
Yes, not all of them. But, for example, channel handlers could be allocated/deallocated using a factory:
let chFactory = currentEventLoop().getFactory() // or createFactoryWithCapacity(1000)
let channelHandler = chFactory.createChannelHandler() // real allocation here (or taken from the preallocated pool)
// use channelHandler
chFactory.release(channelHandler) // tell chFactory that this channelHandler may be reused
// no release or retain here!
let reusedChannelHandler = chFactory.createChannelHandler() // reinited channel handler
So this approach could be used for every object, like Promise, etc.
@allright sure, you could even implement this today for ChannelHandlers. The problem is the number of reference count operations will be the same.
Yes, the number of operations is the same, but the moment they happen is not. We could perform these operations at app exit, or when there is no load on the event loop. We would be able to prioritise operations and give the event loop time for the most important tasks (like sending/receiving messages on already-established connections).
Really, how do we use the swift-nio framework on big production servers with millions of connections? One event loop per 6000 handlers? We have to find the best solution.
Yes, the number of operations is the same, but the moment they happen is not.
That's not totally accurate. If you take a handler out of a pipeline, reference counts will change whether that handler will be re-used or not. Sure, if handlers are re-used, then you don't need to deallocate, which would cause even more reference count decreases.
We could perform these operations at app exit, or when there is no load on the event loop. We would be able to prioritise operations and give the event loop time for the most important tasks (like sending/receiving messages on already-established connections).
Really, how do we use the swift-nio framework on big production servers with millions of connections? One event loop per 6000 handlers? We have to find the best solution.
Totally agreed. I'm just saying that caching your handlers (which you can do today, you don't need anything from NIO) won't remove all reference count traffic when tearing down the pipeline.
I see. Let's try to fix what we can and test!) Even preventing massive deallocations will improve performance.
Also, from the performance graph, I think the reference count changes don't take a lot of time. There is __swift_retain_HeapObject, which means allocation, not just a reference count increment.
So let's optimise alloc/dealloc speed with reuse pools or some other way.
I see. Let's try to fix what we can and test!)
What you could do is store a thread-local NIOThreadLocal<CircularBuffer<MyHandler>> on every event loop. Then you can:
let threadLocalMyHandlers = NIOThreadLocal<CircularBuffer<MyHandler>>(value: .init(capacity: 32))

extension EventLoop {
    func makeMyHandler() -> MyHandler {
        if threadLocalMyHandlers.value.count > 0 {
            return threadLocalMyHandlers.value.removeFirst()
        } else {
            return MyHandler()
        }
    }
}
and in MyHandler:
func handlerRemoved(context: ChannelHandlerContext) {
    self.resetMyState()
    threadLocalMyHandlers.value.append(self)
}
(code not tested or compiled, just as an idea)
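Hooking that into a server might look like this (equally untested; assumes the NIO 1 APIs used elsewhere in this thread and an existing group):

let bootstrap = ServerBootstrap(group: group)
    .childChannelInitializer { channel in
        // take a cached handler from this event loop's pool instead of allocating a new one
        channel.pipeline.add(handler: channel.eventLoop.makeMyHandler())
    }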
Even preventing massive deallocations will improve performance.
agreed
good idea) will test later)
Also, from the performance graph, I think the reference count changes don't take a lot of time. There is __swift_retain_HeapObject, which means allocation, not just a reference count increment.
__swift_retain_...HeapObject is also just an increment of the reference count. Allocation is swift_alloc, swift_allocObject and swift_slowAlloc.
The reason HeapObject is in the symbol name of __swift_retain...HeapObject is that it's written in C++, and in C++ the parameter types are name-mangled into the symbol name.
hm ... CircularBuffer<MyHandler>
I have just tested this, but it is not enough (a lot of Promises cause retain/release, and those promises must be reused too). But I figured out that the stalls happen while handlerRemoved is called massively. So I think the best solution would be to automatically spread the invokeHandlerRemoved() calls over time. There should be no more than 100 invokeHandlerRemoved() invocations per second (for example), depending on CPU performance. Maybe add a special deferred queue for calling invokeHandlerRemoved()? It would be a smart garbage collector per EventLoop. @weissi is it possible to apply this workaround?
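For what it's worth, a rough sketch of that deferred-release idea (the DeferredReleaseQueue type and its 100-per-drain budget are hypothetical; EventLoop.scheduleTask is real NIO API):

import NIO

// Hypothetical per-EventLoop "smooth release" queue: instead of releasing
// thousands of objects in one tick, drain a bounded number per scheduled tick.
final class DeferredReleaseQueue {
    private let eventLoop: EventLoop
    private var pending: [() -> Void] = [] // queued release actions
    private var drainScheduled = false
    private let budgetPerDrain = 100       // e.g. at most 100 releases per drain

    init(eventLoop: EventLoop) {
        self.eventLoop = eventLoop
    }

    // must be called on `eventLoop`
    func enqueue(_ release: @escaping () -> Void) {
        self.pending.append(release)
        self.scheduleDrainIfNeeded()
    }

    private func scheduleDrainIfNeeded() {
        guard !self.drainScheduled && !self.pending.isEmpty else { return }
        self.drainScheduled = true
        _ = self.eventLoop.scheduleTask(in: .milliseconds(10)) {
            self.drainScheduled = false
            // run at most budgetPerDrain release actions in this tick
            for _ in 0..<min(self.budgetPerDrain, self.pending.count) {
                self.pending.removeLast()()
            }
            self.scheduleDrainIfNeeded() // anything left drains in a later tick
        }
    }
}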
handlerRemoved is invoked in a separate event loop tick by way of a deferred event loop execute call. Netty provides a hook to limit the number of outstanding tasks that execute in any event loop tick. While I don't know if we want the exact same API, we may want to investigate whether we should provide tools to prevent scheduled tasks from starving I/O operations.
"limit the number of outstanding tasks that execute in any event loop tick" Yes, EventLoop mechanics means that every operation is very small. And only prioritisation can help in this case. I think it is good Idea. Two not dependent ways for optimise:
In the real world: we have limited resources on the server. A simple example: 1 CPU core + 1 GB RAM (that may cover up to 100000 TCP connections, or 20000 SSL). So a real server will be tuned and limited to a maximum number of connections due to RAM & CPU limits. And.....
The server does not need dynamic memory allocation/deallocation during processing. A swift-nio pipeline:
EchoHandler() -> BackPressureHandler() -> IdleStateHandler() -> ... some other low-level handlers like TCP etc.
We can preallocate and reuse 100000 pipelines with everything they need, not only the handlers but all the promises too:
EchoHandler: 100000
BackPressureHandler: 100000
IdleStateHandler: 100000
Promise: 10 * 100000 = 1000000
It completely solves our problem: no massive allocations/deallocations during processing.
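As an illustration only, the sizing above expressed with the hypothetical ReusePool sketched earlier (the Reusable conformance of EchoHandler is an assumption, not something NIO provides):

// Hypothetical preallocation at server start, sized for the tuned connection limit.
let maxConnections = 100_000
let echoHandlerPool = ReusePool<EchoHandler>(capacity: maxConnections)
// BackPressureHandler, IdleStateHandler and the promises would need pools of their
// own; per the numbers above, promises alone need 10 * maxConnections = 1_000_000 slots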
Possible steps to implement:
P.S. I ran into slow accepting of incoming TCP connections in comparison with C++ boost::asio. So I think the reason is slow memory allocation.
I have run into an issue using Vapor, which is based on SwiftNIO (https://github.com/vapor/vapor/issues/1963). I guess it belongs to this issue. Does any workaround exist?
@AnyCPU your issue isn't related to this.
@weissi is it related to SwiftNIO?
is it related to SwiftNIO?
I don't think so but we'd need more information to be 100% sure. Let's discuss this on the Vapor issue tracker.
I think it is related to the Swift-NIO architecture. Too many Future/Promise allocs/deallocs per connection. The only way is to preallocate and defer deallocation of resources during processing. I think we also have a problem with accepting a lot of new incoming TCP connections at once, in comparison with ASIO (a C++ library).
I think it is related to the Swift-NIO architecture. Too many Future/Promise allocs/deallocs per connection. The only way is to preallocate and defer deallocation of resources during processing.
Your graph above shows that most of the overhead is in retain and release. That would not go away if we pre-allocated.
@AnyCPU your issue isn't related to this.
I recommend a workaround: create more threads (approximately no more than 5000 connections per thread).
I think it is related to the Swift-NIO architecture. Too many Future/Promise allocs/deallocs per connection. The only way is to preallocate and defer deallocation of resources during processing.
Your graph above shows that most of the overhead is in retain and release. That would not go away if we pre-allocated.
I don't think the atomic increment/decrement of the retain count takes a lot of time. But the very first malloc and the last free take a lot of time.
Could you test this hypothesis?
I don't think the atomic increment/decrement of the retain count takes a lot of time. But the very first malloc and the last free take a lot of time.
It's an atomic increment and decrement. Just check your own profile, the ZN14__swift_retain_... is just an atomic ++.
I don't think the atomic increment/decrement of the retain count takes a lot of time. But the very first malloc and the last free take a lot of time.
It's an atomic increment and decrement. Just check your own profile, the ZN14__swift_retain_... is just an atomic ++.
Could you provide a link to the implementation?
Even if it is only an increment & decrement, I think we can change the moment when the increment/decrement occurs. Also, my profile does not show malloc/free. I think these operations are under retain/release (not shown on the graph).
@allright btw, if you want faster (de)allocations, you might want to look at http://jemalloc.net
Swift & NIO should work out of the box with that
@weissi, do you think that this function is too slow? https://github.com/apple/swift/blob/48d8ebd1b051fba09d09e3322afc9c48fabe0921/stdlib/public/SwiftShims/RefCount.h#L736
Look at this comment: https://github.com/apple/swift/blob/48d8ebd1b051fba09d09e3322afc9c48fabe0921/benchmark/single-source/ObjectAllocation.swift#L15
It says the problem is in alloc/dealloc:
// 53% _swift_release_dealloc
// 30% _swift_alloc_object
// 10% retain/release
@weissi, do you think that this function is too slow? https://github.com/apple/swift/blob/48d8ebd1b051fba09d09e3322afc9c48fabe0921/stdlib/public/SwiftShims/RefCount.h#L736
Well 'slow' is relative but in your profile up top, that function took about 30% of the time.
Look at this comment: https://github.com/apple/swift/blob/48d8ebd1b051fba09d09e3322afc9c48fabe0921/benchmark/single-source/ObjectAllocation.swift#L15
Yes, but that very very much depends on the state of the CPU caches and what's on which cacheline etc.
// 53% _swift_release_dealloc
// 30% _swift_alloc_object
// 10% retain/release
Really, it means that 83% of the time is taken by malloc/free. So maybe http://jemalloc.net is a good decision.
@weissi Yes, I think it is reference counting. We could still use special factories (object reuse pools) for lightweight creation/freeing of objects, attached to each EventLoop. In these pools, objects may not actually be deallocated, just reinitialized before the next allocation.
But the reference counting operations are inserted automatically by the Swift compiler. They happen whenever something is used. Let's say you write this:
func someFunction(_ foo: MyClass) { ... }
let object = MyClass()
someFunction(object)
object.doSomething()
then the Swift compiler might emit code like this:
let object = MyClass() // allocates it with reference count 1
object.retain() // ref count + 1, to pass it to someFunction
someFunction(object)
object.retain() // ref count + 1, for the .doSomething call
object.doSomething()
object.release() // ref count - 1, because we're out of someFunction again
object.release() // ref count - 1, because we're done with .doSomething
object.release() // ref count - 1, because we no longer need `object`
certain reference counting operations can be optimised away, but generally Swift is very noisy with ref counting operations and we can't remove them with object pools.
Also, this code is not a problem if the retain/release in the middle takes only 10% of the time. So we can optimise malloc/free at the architecture level of Swift NIO, using preallocated pools of objects that are never freed. Usually a server does not need to free memory, but it must be benchmarked to understand how many connections it can handle. So if we need a server for 100000 connections, we allocate once at start, and never deallocate!
But we must avoid alloc/free for Futures/Promises in the pipeline. So:
- no ALLOC/FREE per Packet/Event.
@AnyCPU your issue isn't related to this.
I recommend a workaround: create more threads (approximately no more than 5000 connections per thread).
I have run the wrk tool using two profiles:
1) wrk -t 1 -d 15s -c 1000 http://localhost:8080
2) wrk -t 10 -d 15s -c 1000 http://localhost:8080
The issue occurs with the first option on the very first run. The issue does not occur with the second option. I tried running it many times.
I hope it will somehow help.
@AnyCPU your issue isn't related to this.
I recommend a workaround: create more threads (approximately no more than 5000 connections per thread).
I have run the wrk tool using two profiles:
- wrk -t 1 -d 15s -c 1000 http://localhost:8080
- wrk -t 10 -d 15s -c 1000 http://localhost:8080
The issue occurs with the first option on the very first run. The issue does not occur with the second option. I tried running it many times.
I hope it will somehow help.
I mean that you can increase the number of threads in the EventLoopGroup:
let numberOfThreads = 32
let group = MultiThreadedEventLoopGroup(numberOfThreads: numberOfThreads)
Also, do not run wrk on the same machine: it consumes CPU and influences your swift-nio server's behaviour, making the tests invalid.
@AnyCPU & @allright Please, let's separate the issues here. @AnyCPU your issue is unrelated to this here. @AnyCPU let's discuss your issue on the Vapor bug tracker.
But we must avoid alloc/free for Futures/Promises in the pipeline. So:
- no ALLOC/FREE per Packet/Event.
The only allocations that happen per packet/event are the buffer for the bytes. There are no futures/promises allocated. If you set a custom ByteBufferAllocator on the Channel, you can also remove those allocations by ByteBuffers. So per packet/event, no allocations are necessary.
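For reference, the per-Channel allocator is configured via a channel option; a minimal sketch (ChannelOptions.allocator is real NIO API; whether buffers are actually reused depends on the allocator you pass in):

import NIO

// Sketch: every accepted child channel allocates its ByteBuffers through this allocator.
let group = MultiThreadedEventLoopGroup(numberOfThreads: 1)
let bootstrap = ServerBootstrap(group: group)
    .childChannelOption(ChannelOptions.allocator, value: ByteBufferAllocator())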
But we must avoid alloc/free for Futures/Promises in the pipeline. So:
- no ALLOC/FREE per Packet/Event.
The only allocations that happen per packet/event are the buffer for the bytes. There are no futures/promises allocated.
Yes, no problem there. The problem is the massive deallocations during massive channel closing. So we have to prevent it on channel/socket close.
The problem is the massive deallocations (and retaining!!! if you look at my graph) during massive channel closing. So we have to prevent it on channel/socket close.
We can't really do anything about the retaining unfortunately, that's due to ARC. And with jemalloc I wouldn't expect that in a real-world application the deallocations lead to massive issues. Closing a (Socket)Channel is always expensive because we need to talk to the kernel to close the file descriptor, deregister from kqueue/epoll, etc.
Yes, closing a socket is expensive. But the graph shows a retain/release problem, not socket close. And I'm not sure this is not a massive issue. A high-load HTTP server opens and closes connections very often. For example, if you have a 1 Gbit network (or even 10 Gbit!), you can open 1,000,000 connections! How do you handle that? People will use C++, not Swift. But why? Swift is fast enough; even in C++ ASIO we can have alloc/dealloc problems. Swift == C++ with std::shared_ptr everywhere. So let's think about how to use it correctly. In my opinion it is not a Swift problem, but a swift-nio architecture problem. Really, I compared with C++ ASIO, and for ASIO there is no problem opening/closing 3000...5000 new connections per second and more on ONE THREAD without affecting other connections.
So for Swift-NIO, the current normal closing/opening speed is < 1000 connections per second per thread. The real limit for a thread is 5000 RPS (because a massive open/close can suspend the thread for about 5..10 seconds to close 5000 connections, and affect the other connection handlers processed by this thread).
Yes, this benchmark is good, but I think it can be even better in Swift.
And the next steps may be:
I hope I will dig into this issue during the next several months. Right now I have no time for it :(
But the graph shows a retain/release problem
Again, we can't do much about retain being expensive. And retain alone is about 30%. Release will be as expensive as retain, but it's harder to judge because a release might trigger a deallocation. Looking at retain is easier because it only ever retains the object.
Swift == C++ with std::shared_ptr everywhere.
Except that ARC at the moment inserts a lot more retains/releases than you'd typically see in C++.
If you find time, what would be really interesting to see is:
- EventLoopGroup: at the moment, Swift doesn't have a memory model so we need to use locks to implement the event loop. We hope to move to an atomic dequeue there
- does -assume-single-threaded make a difference (obviously only use it with MultiThreadedEventLoopGroup(numberOfThreads: 1))
- the master branch: the first version of Semantic ARC has landed there and it makes a big difference
- does jemalloc make a difference
Thanks @weissi. I'll keep that in mind.
Expected behavior
No stalls on socket close.
Actual behavior
Stalls for 10-30 seconds, up to being disconnected by timeout.
Steps to reproduce
Video: https://yadi.sk/i/ZmAu8La5zLWfSg (download the file to view it in the best quality)
Sources: https://github.com/allright/swift-nio-load-testing/tree/master/swift-nio-echo-server
Commit: https://github.com/allright/swift-nio-load-testing/commit/a461c72f2adce2e6fabbb981307166178ac2e397
VPS: 1 CPU, 512 MB RAM, Ubuntu 16.04
root@us-san-gate0:~/swift-nio-load-testing/swift-nio-echo-server# cat Package.resolved
{
  "object": {
    "pins": [
      {
        "package": "swift-nio",
        "repositoryURL": "https://github.com/apple/swift-nio.git",
        "state": {
          "branch": "nio-1.13",
          "revision": "29a9f2aca71c8afb07e291336f1789337ce235dd",
          "version": null
        }
      },
      {
        "package": "swift-nio-zlib-support",
        "repositoryURL": "https://github.com/apple/swift-nio-zlib-support.git",
        "state": {
          "branch": null,
          "revision": "37760e9a52030bb9011972c5213c3350fa9d41fd",
          "version": "1.0.0"
        }
      }
    ]
  },
  "version": 1
}
Swift version 4.2.3 (swift-4.2.3-RELEASE)
Target: x86_64-unknown-linux-gnu
Linux us-san-gate0 4.14.91.mptcp #12 SMP Wed Jan 2 17:51:05 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
P.S. The same echo server implemented in C++ ASIO does not have this problem. I can provide the source code (C++) and a video if needed.