allright opened 5 years ago
I have just profiled what's happening at the "stall moment".
You can open perf-kernel.svg in any browser to view the performance graph: Perf.zip
Too many objects being released at the same moment block the event loop. Can we fix it? Possible workarounds:
Is it possible to schedule 50% of event loop time for handling all events except object releases, and 50% for other tasks? Maybe we need something like a managed garbage collector (or a "smooth object release manager", maybe something like DisposeBag in RxSwift?).
Tools used for perf monitoring: http://www.brendangregg.com/perf.html and http://www.brendangregg.com/perf.html#TimedProfiling
ouch, thanks @allright , we'll look into that
One more possible design is to provide a fast custom allocator/deallocator (like in C++'s std) for promises, one that has preallocated memory and does not actually call malloc/free every time an object is deallocated, or calls it once for a big group of objects. So my idea is to group allocations/deallocations: one alloc for 1000 promises, or one alloc/dealloc per second. We could then attach this custom allocator/deallocator to each EventLoop.
Another possible design is an object reuse pool. It could preallocate all the needed objects at app start and deallocate them only on app stop, or manage them automatically. A real server application is usually tuned in place for the maximum possible connections/speed, so we do not need real retain/dealloc during the app's life (only on start/stop).
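For illustration, a minimal sketch of what such a reuse pool could look like (all types here are hypothetical, not NIO API; it assumes pooled objects can be reset via a reinit() method):

// Hypothetical object reuse pool, sketching the idea above; not NIO API.
protocol Reusable: AnyObject {
    init()
    func reinit() // reset the object's state before it is handed out again
}

final class ReusePool<T: Reusable> {
    private var storage: [T] = []

    init(capacity: Int) {
        storage.reserveCapacity(capacity)
        for _ in 0..<capacity { storage.append(T()) } // preallocate at app start
    }

    func acquire() -> T {
        // hand out a pooled object if available, otherwise fall back to a real allocation
        return storage.popLast() ?? T()
    }

    func release(_ object: T) {
        object.reinit()
        storage.append(object) // keep the object for reuse instead of deallocating it
    }
}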
@weissi What do you think?
@allright Swift unfortunately doesn't let you choose the allocator. It will always use malloc. Also, from your profile it seems to be the reference counting rather than the allocations, right?
@weissi Yes, I think it is reference counting. We could still use special factories (object reuse pools) for lightweight creation/freeing of objects, attached to each EventLoop. In these pools, objects may not actually be deallocated, just reinitialized before the next allocation.
But the reference counting operations are inserted automatically by the Swift compiler. They happen whenever something is used. Let's say you write this:
func someFunction(_ foo: MyClass) { ... }
let object = MyClass()
someFunction(object)
object.doSomething()
then the Swift compiler might emit code like this:
let object = MyClass() // allocates it with reference count 1
object.retain() // ref count + 1, to pass it to someFunction
someFunction(object)
object.retain() // ref count + 1, for the .doSomething call
object.doSomething()
object.release() // ref count - 1, because we're out of someFunction again
object.release() // ref count - 1, because we're done with .doSomething
object.release() // ref count - 1, because we no longer need `object`
certain reference counting operations can be optimised away, but generally Swift is very noisy with ref counting operations and we can't remove them with object pools.
Yes, not all of them. But, for example, channel handlers could be allocated/deallocated using a factory:
let chFactory = currentEventLoop().getFactory() // or createFactoryWithCapacity(1000)
let channelHandler = chFactory.createChannelHandler() // real allocation here (or taken from the preallocated pool)
// use channelHandler
chFactory.release(channelHandler) // tell chFactory that this channelHandler may be reused
// no release or retain here!
let reusedChannelHandler = chFactory.createChannelHandler() // reinited channel handler
So this approach could be used for every object, like Promise, etc.
@allright sure, you could even implement this today for ChannelHandlers. The problem is the number of reference count operations will be the same.
Yes, the number of operations is the same, but the moment they happen is not. We could perform these operations at app exit, or when there is no load on the event loop. We would be able to prioritise operations and give the event loop time for the most important tasks (like sending/receiving messages on already-established connections).
Really, how do we use the swift-nio framework on big production servers with millions of connections? One event loop per 6000 handlers? We have to find the best solution.
Yes, the number of operations is the same, but the moment they happen is not.
That's not totally accurate. If you take a handler out of a pipeline, reference counts will change whether that handler will be re-used or not. Sure, if handlers are re-used, then you don't need to deallocate, which would cause even more reference count decreases.
We could perform these operations at app exit, or when there is no load on the event loop. We would be able to prioritise operations and give the event loop time for the most important tasks (like sending/receiving messages on already-established connections).
Really, how do we use the swift-nio framework on big production servers with millions of connections? One event loop per 6000 handlers? We have to find the best solution.
Totally agreed. I'm just saying that caching your handlers (which you can do today, you don't need anything from NIO) won't remove all reference count traffic when tearing down the pipeline.
I see. Let's try to fix what we can and test!) Even preventing massive deallocations will improve performance.
Also, from the performance graph, I think the reference count changes don't take a lot of time. There is __swift_retain_HeapObject, which means allocation, not just a reference count increment.
So let's optimise alloc/dealloc speed with reuse pools or some other way.
I see. Let's try to fix what we can and test!)
What you could do is store a thread-local NIOThreadLocal<CircularBuffer<MyHandler>> on every event loop. Then you can:
let threadLocalMyHandlers = NIOThreadLocal<CircularBuffer<MyHandler>>(value: .init(capacity: 32))

extension EventLoop {
    func makeMyHandler() -> MyHandler {
        if threadLocalMyHandlers.value.count > 0 {
            return threadLocalMyHandlers.value.removeFirst()
        } else {
            return MyHandler()
        }
    }
}
and in MyHandler:
func handlerRemoved(context: ChannelHandlerContext) {
    self.resetMyState()
    threadLocalMyHandlers.value.append(self)
}
(code not tested or compiled, just as an idea)
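Hooking that into a server might look like this (equally untested; assumes the NIO 1 APIs used elsewhere in this thread and an existing group):

let bootstrap = ServerBootstrap(group: group)
    .childChannelInitializer { channel in
        // take a cached handler from this event loop's pool instead of allocating a new one
        channel.pipeline.add(handler: channel.eventLoop.makeMyHandler())
    }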
Even preventing massive deallocations will improve performance.
agreed
good idea) will test later)
Also, from the performance graph, I think the reference count changes don't take a lot of time. There is __swift_retain_HeapObject, which means allocation, not just a reference count increment.
__swift_retain_...HeapObject is also just an increment of the reference count. Allocation is swift_alloc, swift_allocObject and swift_slowAlloc.
The reason HeapObject is in the symbol name of __swift_retain...HeapObject is that it's written in C++, and in C++ the parameter types are name-mangled into the symbol name.
hm ... CircularBuffer<MyHandler>
I have just tested this, but it is not enough (a lot of Promises cause retain/release, and those promises must be reused too). But I figured out that the stalls happen while handlerRemoved is called massively. So I think the best solution would be to automatically spread the invokeHandlerRemoved() calls over time. There should be no more than 100 invokeHandlerRemoved() invocations per second (for example), depending on CPU performance. Maybe add a special deferred queue for calling invokeHandlerRemoved()? It would be a smart garbage collector per EventLoop. @weissi is it possible to apply this workaround?
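For what it's worth, a rough sketch of that deferred-release idea (the DeferredReleaseQueue type and its 100-per-drain budget are hypothetical; EventLoop.scheduleTask is real NIO API):

import NIO

// Hypothetical per-EventLoop "smooth release" queue: instead of releasing
// thousands of objects in one tick, drain a bounded number per scheduled tick.
final class DeferredReleaseQueue {
    private let eventLoop: EventLoop
    private var pending: [() -> Void] = [] // queued release actions
    private var drainScheduled = false
    private let budgetPerDrain = 100       // e.g. at most 100 releases per drain

    init(eventLoop: EventLoop) {
        self.eventLoop = eventLoop
    }

    // must be called on `eventLoop`
    func enqueue(_ release: @escaping () -> Void) {
        self.pending.append(release)
        self.scheduleDrainIfNeeded()
    }

    private func scheduleDrainIfNeeded() {
        guard !self.drainScheduled && !self.pending.isEmpty else { return }
        self.drainScheduled = true
        _ = self.eventLoop.scheduleTask(in: .milliseconds(10)) {
            self.drainScheduled = false
            // run at most budgetPerDrain release actions in this tick
            for _ in 0..<min(self.budgetPerDrain, self.pending.count) {
                self.pending.removeLast()()
            }
            self.scheduleDrainIfNeeded() // anything left drains in a later tick
        }
    }
}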
handlerRemoved is invoked in a separate event loop tick by way of a deferred event loop execute call. Netty provides a hook to limit the number of outstanding tasks that execute in any event loop tick. While I don't know if we want the exact same API, we may want to investigate whether we should provide tools to prevent scheduled tasks from starving I/O operations.
"limit the number of outstanding tasks that execute in any event loop tick" Yes, EventLoop mechanics means that every operation is very small. And only prioritisation can help in this case. I think it is good Idea. Two not dependent ways for optimise:
In the real world: we have limited resources on the server. A simple example: 1 CPU core + 1 GB RAM (that may cover up to 100000 TCP connections, or 20000 SSL). So a real server will be tuned and limited to a maximum number of connections due to RAM & CPU limits. And.....
The server does not need dynamic memory allocation/deallocation during processing. A swift-nio pipeline:
EchoHandler() -> BackPressureHandler() -> IdleStateHandler() -> ... some other low-level handlers like TCP etc.
We can preallocate and reuse 100000 pipelines with everything they need, not only the handlers but all the promises too:
EchoHandler: 100000
BackPressureHandler: 100000
IdleStateHandler: 100000
Promise: 10 * 100000 = 1000000
It completely solves our problem: no massive allocations/deallocations during processing.
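As an illustration only, the sizing above expressed with the hypothetical ReusePool sketched earlier (the Reusable conformance of EchoHandler is an assumption, not something NIO provides):

// Hypothetical preallocation at server start, sized for the tuned connection limit.
let maxConnections = 100_000
let echoHandlerPool = ReusePool<EchoHandler>(capacity: maxConnections)
// BackPressureHandler, IdleStateHandler and the promises would need pools of their
// own; per the numbers above, promises alone need 10 * maxConnections = 1_000_000 slots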
Possible steps to implement:
P.S. I ran into slow accepting of incoming TCP connections in comparison with C++ boost::asio. So I think the reason is slow memory allocation.
I have run into an issue using Vapor, which is based on SwiftNIO (https://github.com/vapor/vapor/issues/1963). I guess it belongs to this issue. Does any workaround exist?
@AnyCPU your issue isn't related to this.
@weissi is it related to SwiftNIO?
is it related to SwiftNIO?
I don't think so but we'd need more information to be 100% sure. Let's discuss this on the Vapor issue tracker.
I think it is related to the Swift-NIO architecture. Too many Future/Promise allocs/deallocs per connection. The only way is to preallocate and defer deallocation of resources during processing. I think we also have a problem with accepting a lot of new incoming TCP connections at once, in comparison with ASIO (a C++ library).
I think it is related to the Swift-NIO architecture. Too many Future/Promise allocs/deallocs per connection. The only way is to preallocate and defer deallocation of resources during processing.
Your graph above shows that most of the overhead is in retain and release. That would not go away if we pre-allocated.
@AnyCPU your issue isn't related to this.
I recommend a workaround: create more threads (approximately no more than 5000 connections per thread).
I think it is related to the Swift-NIO architecture. Too many Future/Promise allocs/deallocs per connection. The only way is to preallocate and defer deallocation of resources during processing.
Your graph above shows that most of the overhead is in retain and release. That would not go away if we pre-allocated.
I don't think the atomic increment/decrement of the retain count takes a lot of time. But the very first malloc and the last free take a lot of time.
Could you test this hypothesis?
I don't think the atomic increment/decrement of the retain count takes a lot of time. But the very first malloc and the last free take a lot of time.
It's an atomic increment and decrement. Just check your own profile, the ZN14__swift_retain_... is just an atomic ++.
I don't think the atomic increment/decrement of the retain count takes a lot of time. But the very first malloc and the last free take a lot of time.
It's an atomic increment and decrement. Just check your own profile, the ZN14__swift_retain_... is just an atomic ++.
Could you provide a link to the implementation?
Even if it is only an increment & decrement, I think we can change the moment when the increment/decrement occurs. Also, my profile does not show malloc/free. I think these operations are under retain/release (not shown on the graph).
@allright btw, if you want faster (de)allocations, you might want to look at http://jemalloc.net
Swift & NIO should work out of the box with that
@weissi, do you think that this function is too slow? https://github.com/apple/swift/blob/48d8ebd1b051fba09d09e3322afc9c48fabe0921/stdlib/public/SwiftShims/RefCount.h#L736
Look at this comment: https://github.com/apple/swift/blob/48d8ebd1b051fba09d09e3322afc9c48fabe0921/benchmark/single-source/ObjectAllocation.swift#L15
It says the problem is in alloc/dealloc:
// 53% _swift_release_dealloc
// 30% _swift_alloc_object
// 10% retain/release
@weissi, do you think that this function is too slow? https://github.com/apple/swift/blob/48d8ebd1b051fba09d09e3322afc9c48fabe0921/stdlib/public/SwiftShims/RefCount.h#L736
Well 'slow' is relative but in your profile up top, that function took about 30% of the time.
Look at this comment: https://github.com/apple/swift/blob/48d8ebd1b051fba09d09e3322afc9c48fabe0921/benchmark/single-source/ObjectAllocation.swift#L15
Yes, but that very very much depends on the state of the CPU caches and what's on which cacheline etc.
// 53% _swift_release_dealloc
// 30% _swift_alloc_object
// 10% retain/release
Really, it means that 83% of the time is taken by malloc/free. So maybe http://jemalloc.net is a good decision.
@weissi Yes, I think it is reference counting. We could still use special factories (object reuse pools) for lightweight creation/freeing of objects, attached to each EventLoop. In these pools, objects may not actually be deallocated, just reinitialized before the next allocation.
But the reference counting operations are inserted automatically by the Swift compiler. They happen whenever something is used. Let's say you write this:
func someFunction(_ foo: MyClass) { ... }
let object = MyClass()
someFunction(object)
object.doSomething()
then the Swift compiler might emit code like this:
let object = MyClass() // allocates it with reference count 1
object.retain() // ref count + 1, to pass it to someFunction
someFunction(object)
object.retain() // ref count + 1, for the .doSomething call
object.doSomething()
object.release() // ref count - 1, because we're out of someFunction again
object.release() // ref count - 1, because we're done with .doSomething
object.release() // ref count - 1, because we no longer need `object`
certain reference counting operations can be optimised away, but generally Swift is very noisy with ref counting operations and we can't remove them with object pools.
Also, this code is not a problem if the retain/release in the middle takes only 10% of the time. So we can optimise malloc/free at the architecture level of Swift NIO, using preallocated pools of objects that are never freed. Usually a server does not need to free memory, but it must be benchmarked to understand how many connections it can handle. So if we need a server for 100000 connections, we allocate once at start, and never deallocate!
But we must avoid alloc/free for Futures/Promises in the pipeline. So:
- no ALLOC/FREE per Packet/Event.
@AnyCPU your issue isn't related to this.
I recommend a workaround: create more threads (approximately no more than 5000 connections per thread).
I have run the wrk tool using two profiles:
1) wrk -t 1 -d 15s -c 1000 http://localhost:8080
2) wrk -t 10 -d 15s -c 1000 http://localhost:8080
The issue occurs with the first option on the very first run. The issue does not occur with the second option. I tried running it many times.
I hope it will somehow help.
@AnyCPU your issue isn't related to this.
I recommend a workaround: create more threads (approximately no more than 5000 connections per thread).
I have run the wrk tool using two profiles:
- wrk -t 1 -d 15s -c 1000 http://localhost:8080
- wrk -t 10 -d 15s -c 1000 http://localhost:8080
The issue occurs with the first option on the very first run. The issue does not occur with the second option. I tried running it many times.
I hope it will somehow help.
I mean that you can increase the number of threads in the EventLoopGroup:
let numberOfThreads = 32
let group = MultiThreadedEventLoopGroup(numberOfThreads: numberOfThreads)
Also, do not run wrk on the same machine: it consumes CPU and influences your swift-nio server's behaviour, making the tests invalid.
@AnyCPU & @allright Please, let's separate the issues here. @AnyCPU your issue is unrelated to this here. @AnyCPU let's discuss your issue on the Vapor bug tracker.
But we must avoid alloc/free for Futures/Promises in the pipeline. So:
- no ALLOC/FREE per Packet/Event.
The only allocations that happen per packet/event are the buffer for the bytes. There are no futures/promises allocated. If you set a custom ByteBufferAllocator on the Channel, you can also remove those allocations by ByteBuffers. So per packet/event, no allocations are necessary.
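For reference, the per-Channel allocator is configured via a channel option; a minimal sketch (ChannelOptions.allocator is real NIO API; whether buffers are actually reused depends on the allocator you pass in):

import NIO

// Sketch: every accepted child channel allocates its ByteBuffers through this allocator.
let group = MultiThreadedEventLoopGroup(numberOfThreads: 1)
let bootstrap = ServerBootstrap(group: group)
    .childChannelOption(ChannelOptions.allocator, value: ByteBufferAllocator())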
But we must avoid alloc/free for Futures/Promises in the pipeline. So:
- no ALLOC/FREE per Packet/Event.
The only allocations that happen per packet/event are the buffer for the bytes. There are no futures/promises allocated.
Yes, no problem there. The problem is the massive deallocations during massive channel closing. So we have to prevent it on channel/socket close.
The problem is the massive deallocations (and retaining!!! if you look at my graph) during massive channel closing. So we have to prevent it on channel/socket close.
We can't really do anything about the retaining unfortunately, that's due to ARC. And with jemalloc I wouldn't expect that in a real-world application the deallocations lead to massive issues. Closing a (Socket)Channel is always expensive because we need to talk to the kernel to close the file descriptor, deregister from kqueue/epoll, etc.
Yes, closing a socket is expensive. But the graph shows a retain/release problem, not socket close. And I'm not sure this is not a massive issue. A high-load HTTP server opens and closes connections very often. For example, if you have a 1 Gbit network (or even 10 Gbit!), you can open 1,000,000 connections! How do you handle that? People will use C++, not Swift. But why? Swift is fast enough; even in C++ ASIO we can have alloc/dealloc problems. Swift == C++ with std::shared_ptr everywhere. So let's think about how to use it correctly. In my opinion it is not a Swift problem, but a swift-nio architecture problem. Really, I compared with C++ ASIO, and for ASIO there is no problem opening/closing 3000...5000 new connections per second and more on ONE THREAD without affecting other connections.
So for Swift-NIO, the current normal closing/opening speed is < 1000 connections per second per thread. The real limit for a thread is 5000 RPS (because a massive open/close can suspend the thread for about 5..10 seconds to close 5000 connections, and affect the other connection handlers processed by this thread).
Yes, this benchmark is good, but I think it can be even better in Swift.
And the next steps may be:
I hope I will dig into this issue during the next several months. Right now I have no time for it :(
But the graph shows a retain/release problem
Again, we can't do much about retain being expensive. And retain alone is about 30%. Release will be as expensive as retain, but it's harder to judge because a release might trigger a deallocation. Looking at retain is easier because it only ever retains the object.
Swift == C++ with std::shared_ptr everywhere.
Except that ARC at the moment inserts a lot more retains/releases than you'd typically see in C++.
If you find time, what would be really interesting to see is:
- EventLoopGroup: at the moment, Swift doesn't have a memory model so we need to use locks to implement the event loop. We hope to move to an atomic dequeue there
- does -assume-single-threaded make a difference (obviously only use it with MultiThreadedEventLoopGroup(numberOfThreads: 1))
- the master branch: the first version of Semantic ARC has landed there and it makes a big difference
- does jemalloc make a difference
Thanks @weissi. I'll keep that in mind.
Expected behavior
No stalls on socket close.
Actual behavior
Stalls for 10-30 seconds, up to being disconnected by timeout.
Steps to reproduce
Video: https://yadi.sk/i/ZmAu8La5zLWfSg (download the file to view it in the best quality)
Sources: https://github.com/allright/swift-nio-load-testing/tree/master/swift-nio-echo-server
Commit: https://github.com/allright/swift-nio-load-testing/commit/a461c72f2adce2e6fabbb981307166178ac2e397
VPS: 1 CPU, 512 MB RAM, Ubuntu 16.04
root@us-san-gate0:~/swift-nio-load-testing/swift-nio-echo-server# cat Package.resolved
{
  "object": {
    "pins": [
      {
        "package": "swift-nio",
        "repositoryURL": "https://github.com/apple/swift-nio.git",
        "state": {
          "branch": "nio-1.13",
          "revision": "29a9f2aca71c8afb07e291336f1789337ce235dd",
          "version": null
        }
      },
      {
        "package": "swift-nio-zlib-support",
        "repositoryURL": "https://github.com/apple/swift-nio-zlib-support.git",
        "state": {
          "branch": null,
          "revision": "37760e9a52030bb9011972c5213c3350fa9d41fd",
          "version": "1.0.0"
        }
      }
    ]
  },
  "version": 1
}
Swift version 4.2.3 (swift-4.2.3-RELEASE)
Target: x86_64-unknown-linux-gnu
Linux us-san-gate0 4.14.91.mptcp #12 SMP Wed Jan 2 17:51:05 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
P.S. The same echo server implemented in C++ ASIO does not have this problem. I can provide the source code (C++) and a video if needed.