apple / swift-nio

Event-driven network application framework for high performance protocol servers & clients, non-blocking.
https://swiftpackageindex.com/apple/swift-nio/documentation
Apache License 2.0

Add EventLoop APIs for simpler scheduling of callbacks #2759

Open simonjbeaumont opened 3 months ago

simonjbeaumont commented 3 months ago

Motivation

The current scheduleTask APIs make use of both callbacks and promises, which leads to confusing semantics. For example, on cancellation, users are notified in two ways: once via the promise and once via the callback. Additionally, the way the API is structured results in unavoidable allocations (for the closures and the promise) that could be avoided if the API were structured differently.

Modifications

This PR introduces new protocol requirements on EventLoop:

```swift
protocol EventLoop {
    // ...
    @discardableResult
    func scheduleCallback(at deadline: NIODeadline, handler: some NIOScheduledCallbackHandler) throws -> NIOScheduledCallback

    @discardableResult
    func scheduleCallback(in amount: TimeAmount, handler: some NIOScheduledCallbackHandler) throws -> NIOScheduledCallback

    func cancelScheduledCallback(_ scheduledCallback: NIOScheduledCallback)
}
```
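To make the intended usage concrete, here is a toy sketch of the shape of the new API. These are simplified stand-in types invented for illustration; the exact requirements of `NIOScheduledCallbackHandler` and `NIOScheduledCallback` in the PR may differ.

```swift
// Simplified stand-ins for the PR's types, for illustration only.
protocol ScheduledCallbackHandler {
    func handleScheduledCallback()
    func didCancelScheduledCallback()
}

struct ScheduledCallback {
    let id: UInt64
}

// Toy "event loop" that stores handlers and fires them on demand,
// just to exercise the shape of the API.
final class ToyEventLoop {
    private var nextID: UInt64 = 0
    private var pending: [UInt64: any ScheduledCallbackHandler] = [:]

    @discardableResult
    func scheduleCallback(handler: any ScheduledCallbackHandler) -> ScheduledCallback {
        nextID += 1
        pending[nextID] = handler
        return ScheduledCallback(id: nextID)
    }

    func cancelScheduledCallback(_ callback: ScheduledCallback) {
        // Note the single notification path: cancellation is reported via
        // the handler only, not via both a handler and a promise.
        if let handler = pending.removeValue(forKey: callback.id) {
            handler.didCancelScheduledCallback()
        }
    }

    func runDueCallbacks() {
        let due = pending
        pending.removeAll()
        for (_, handler) in due { handler.handleScheduledCallback() }
    }
}

// The same long-lived handler object can be passed for every schedule,
// which is what allows the amortized zero-allocation behaviour.
final class IdleTimer: ScheduledCallbackHandler {
    private(set) var fired = 0
    private(set) var cancelled = 0
    func handleScheduledCallback() { fired += 1 }
    func didCancelScheduledCallback() { cancelled += 1 }
}

let loop = ToyEventLoop()
let timer = IdleTimer()
loop.scheduleCallback(handler: timer)
let second = loop.scheduleCallback(handler: timer)
loop.cancelScheduledCallback(second)
loop.runDueCallbacks()
print(timer.fired, timer.cancelled)  // prints "1 1"
```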

Default implementations are provided that call through to EventLoop.scheduleTask(in:_:), so existing EventLoop implementations are not broken, although these defaults will be (at least) as slow as using scheduleTask(in:_:) directly.

The API is structured to allow EventLoop implementations to provide a custom implementation as an optimization point. This PR provides a custom implementation for SelectableEventLoop, so that MultiThreadedEventLoopGroup can benefit from a faster implementation.

Finally, this PR adds benchmarks to measure the performance of setting a simple timer using both scheduleTask(in:_:) and scheduleCallback(in:_:) APIs using a MultiThreadedEventLoopGroup.

Result

A simpler and more coherent API surface.

There is also a small performance benefit for heavy users of this API, e.g. protocols that make extensive use of timers: when using MTELG to repeatedly set a timer with the same handler, switching from scheduleTask(in:_:) to scheduleCallback(in:_:) eliminates almost all allocations (amortizing to zero) and is roughly twice as fast.

MTELG.scheduleCallback(in:_:)
╒═══════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕
│ Metric                │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │
╞═══════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ Malloc (total) *      │       0 │       0 │       0 │       0 │       0 │       0 │       0 │    1109 │
╘═══════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛

MTELG.scheduleTask(in:_:)
╒═══════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕
│ Metric                │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │
╞═══════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ Malloc (total) *      │       4 │       4 │       4 │       4 │       4 │       4 │       4 │     576 │
╘═══════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛
simonjbeaumont commented 3 months ago

@swift-server-bot test this please

simonjbeaumont commented 2 months ago

In order to support cancellation, I'm gonna wait to land https://github.com/apple/swift-nio/pull/2800.

simonjbeaumont commented 2 months ago

OK, cancellation support added. FYI @glbrntt @Lukasa @FranzBusch

weissi commented 2 months ago

Motivation

Scheduling a timer is currently an expensive operation because this is done using EventLoop.scheduleTask(in:_:), which results in several allocations. A simple timer use case does not typically need all the functionality of this API and use cases that set many timers (e.g. for timeouts) pay this cost repeatedly.

@simonjbeaumont Could you add information here on why EventLoop.scheduleTask results in more allocations than this? It shouldn't, maybe that's fixable? EventLoop.scheduledTask is meant to kinda be the simplest timer API but the PR contrasts it with "simple timers" which suggests that I'm missing something here.

Also how many allocations are we talking? And do you have something real-worldish where that difference is observable?

simonjbeaumont commented 2 months ago

Motivation

Scheduling a timer is currently an expensive operation because this is done using EventLoop.scheduleTask(in:_:), which results in several allocations. A simple timer use case does not typically need all the functionality of this API and use cases that set many timers (e.g. for timeouts) pay this cost repeatedly.

@simonjbeaumont Could you add information here on why EventLoop.scheduleTask results in more allocations than this? It shouldn't, maybe that's fixable? EventLoop.scheduledTask is meant to kinda be the simplest timer API but the PR contrasts it with "simple timers" which suggests that I'm missing something here.

Great questions, and sorry if the write up in the PR description wasn't detailed enough.

IIUC, scheduleTask will allocate a bunch because it takes closures and needs promises. For cases where you just want a callback to an object you own, e.g. for setting repeated timers, this can be wasteful.

Also how many allocations are we talking?

I have added benchmarks and their results to the PR description. We're talking 4 allocations for using scheduleTask and none for scheduleCallback when using SelectableEventLoop.

Note that the default implementation for event loops of scheduleCallback is backed by scheduleTask so there won't be a win here, but it provides an opportunity for event loops to provide a custom implementation, which this PR has for SelectableEventLoop.

And do you have something real-worldish where that difference is observable?

A real-world use case would be something like QUIC, where IIUC it makes use of lots of timers in the protocol. I also think that grpc-swift is likely to benefit from this.

One thing this discussion has prompted me to do is to benchmark repeated scheduleCallback vs using scheduleRepeatedTask to see if there is some amortised benefit to the latter that I've overlooked.

FranzBusch commented 2 months ago

@simonjbeaumont Could you add information here on why EventLoop.scheduleTask results in more allocations than this? It shouldn't, maybe that's fixable? EventLoop.scheduledTask is meant to kinda be the simplest timer API but the PR contrasts it with "simple timers" which suggests that I'm missing something here.

FWIW, I already tried in the past to make EL.scheduleTask as cheap as possible and to minimise allocations. IIRC I hit a point where I couldn't remove any more of the allocations because that would have had public API impact. One of them, for example, is the future that we allocate, which we can completely avoid with this new API.

Lukasa commented 2 months ago

Candidly, there's almost no way to make scheduleTask fast in the current API. scheduleTask is required to return a Scheduled<T>, which looks like this:

https://github.com/apple/swift-nio/blob/28f9cae996fd25c157251ced7205f399bd38cc0b/Sources/NIOCore/EventLoop.swift#L26-L53

This has the minimum cost of 1 promise, which is 1 allocation. However, in SelectableEventLoop it requires two allocations, as our cancellationTask closes over self and a taskId, so we have to heap-allocate the closure context:

https://github.com/apple/swift-nio/blob/28f9cae996fd25c157251ced7205f399bd38cc0b/Sources/NIOPosix/SelectableEventLoop.swift#L322-L331

But it's worse! The internal data structure that SelectableEventLoop uses to store the scheduled task is a ScheduledTask (natch). That looks like this:

https://github.com/apple/swift-nio/blob/28f9cae996fd25c157251ced7205f399bd38cc0b/Sources/NIOPosix/MultiThreadedEventLoopGroup.swift#L510-L534

Note that we have two more closures. How many of them heap allocate? At least one:

https://github.com/apple/swift-nio/blob/28f9cae996fd25c157251ced7205f399bd38cc0b/Sources/NIOPosix/SelectableEventLoop.swift#L306-L312

This is because all closures that close over other closures must allocate.

All this scaffolding means that this timer cannot help but cause 3 allocations. None of these can go away without losing the ability to handle cancellation, or breaking the existing API surface: the promise is clearly necessary, the cancel function is API and can't be removed (though we could do a deprecation cycle to remove it, probably), and the mere existence of the promise in alloc 1 forces alloc 3.

Additionally, the nature of .scheduleTask as an API encourages you to take another allocation, because it takes a closure. With careful design you can make that go away (by closing only over a class reference), but most users won't notice this.
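The capture discipline described above can be sketched as follows. This is illustrative code, not NIO's; the names are invented, and whether a given closure context is actually heap-allocated is ultimately up to the compiler.

```swift
// Illustrative sketch of careful vs careless closure capture for timers.

final class ConnectionState {
    var timeoutsFired = 0
    var deadlineSeconds = 30
    func onTimeout() { timeoutsFired += 1 }
}

// Careless: captures several independent values, so every schedule builds a
// fresh closure context that has to hold all of them.
func makeCarelessTimer(state: ConnectionState, deadlineSeconds: Int, label: String) -> () -> Void {
    return {
        if deadlineSeconds > 0 {
            state.onTimeout()
            _ = label
        }
    }
}

// Careful: closes over exactly one class reference; everything the timer
// needs lives on the object, which already exists for the connection anyway.
func makeCarefulTimer(state: ConnectionState) -> () -> Void {
    return { state.onTimeout() }
}

let state = ConnectionState()
makeCarefulTimer(state: state)()
makeCarelessTimer(state: state, deadlineSeconds: state.deadlineSeconds, label: "idle")()
print(state.timeoutsFired)  // prints "2"
```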

weissi commented 2 months ago

Candidly, there's almost no way to make scheduleTask fast in the current API.

Would it be worth thinking about improving the API with something that is very similar but doesn't suffer all these issues?

I have added benchmarks and their results to the PR description. We're talking 4 allocations for using scheduleTask and none for scheduleCallback when using SelectableEventLoop.

Right: in a naive benchmark, 4 mallocs & 4 frees of small, short-lived allocations come to about 59 nanoseconds on my machine. Please don't interpret that as saying saving allocations isn't worthwhile, quite the opposite! But inventing duplicate APIs should be well motivated, in my opinion.

```swift
// run as swiftc -O test.swift && time ./test
import Darwin

@inline(never)
@_optimize(none)
func blackhole(
    _ a: UnsafeMutableRawPointer,
    _ b: UnsafeMutableRawPointer,
    _ c: UnsafeMutableRawPointer,
    _ d: UnsafeMutableRawPointer
) {
    free(a)
    free(b)
    free(c)
    free(d)
}

for _ in 0..<1_000_000_000 {
    guard let a = malloc(24),
          let b = malloc(32),
          let c = malloc(48),
          let d = malloc(48)
    else { preconditionFailure() }
    blackhole(a, b, c, d)
}

// $ swiftc -O test.swift && time ./test
//
// real  0m59.349s
// user  0m58.970s
// sys   0m0.360s
```

A real-world use case would be something like QUIC, where IIUC it makes use of lots of timers in the protocol. I also think that grpc-swift is likely to benefit from this.

I meant a real world use case where the ScheduledTask overhead is large enough that it's actually observable. I understand that QUIC, gRPC and many others use a lot of scheduled tasks, but 59 ns is very very little. Add an order of magnitude, let's call it 500 ns, that's still very little. So we need to be confident that any processing we would do in the timers doesn't totally dwarf the allocation time.

Lukasa commented 2 months ago

Why are we adding only one order of magnitude? Each of these timers is kept per-connection, so the correct order of multiplicand is "number of connections / number of cores".

This API can be made very similar to scheduleTask, by having it take a closure instead. In that case, it's an almost perfect replacement, except that it drops the problematic promise. This does open the user up to the sharp edge of the closure context, but expert users can avoid that sharp edge.

But I'd argue this makes your objection worse, not better. That API is much closer to the existing one than the current proposal, which increases confusion instead of decreasing it. We cannot deprecate the existing API, because it's a protocol requirement: event loops must implement it, and indeed it must be used as the fallback implementation for the new API.

Lukasa commented 2 months ago

As a related sidebar, these timers are typically set on the event loop, so they are a source of latency across the loop as a whole: we can't do I/O for any time where we're allocating memory for these timers.

weissi commented 2 months ago

Why are we adding only one order of magnitude? Each of these timers is kept per-connection, so the correct order of multiplicand is "number of connections / number of cores".

I meant adding an order of magnitude for each. In a best-case scenario the 4 allocs & frees are 59ns (all four together, not each). I'm suggesting we assume the 4 allocs & 4 frees come in at 500ns (again, the sum of the 4 allocs & 4 frees).

This API can be made very similar to scheduleTask, by having it take a closure instead. In that case, it's an almost perfect replacement, except that it drops the problematic promise. This does open the user up to the sharp edge of the closure context, but expert users can avoid that sharp edge.

Yeah, that's what I mean, why not add an API that's the same as the current one except for no ScheduledTask return (and other niche things).

But I'd argue this makes your objection worse, not better. That API is much closer to the existing one than the current proposal, which increases confusion instead of decreasing it. We cannot deprecate the existing API, because it's a protocol requirement: event loops must implement it, and indeed it must be used as the fallback implementation for the new API.

Just so I get it right, your argument is that having a new scheduleTaskFireAndForget(in: TimeAmount, body: @escaping () -> Void) -> Void (name tbd) is confusing? I can sympathise with that but I think we can find a resolution.

My thinking was/is: let's start with a use case that makes an overhead of 50 to 500ns per schedule actually observable, then see what that use case needs. Maybe it needs cancellation support, maybe it doesn't, who knows. It's easier to design a new API if we actually know precisely what's necessary.

For example: I mostly do need the returned ScheduledTask so that I'm able to clean up. And I usually also need the .futureResult so that I can schedule the next timer (if it's not expressible as a repeated scheduled task).

The protocol requirement isn't an issue IMHO. The current API is quite expressive and supports a lot of use cases. But I won't deny that there are use cases where we don't need all features. (I have had those too)

As a related sidebar, these timers are typically set on the event loop, so they are a source of latency across the loop as a whole: we can't do I/O for any time where we're allocating memory for these timers.

Completely. But there's much more overhead than just the allocations. Literally the only thing I'm suggesting is a motivation that shows this is an actual win. I know it's a theoretical win, but just saving the allocations might be small enough that it's not observable.

And finally: something that takes 500ns (the 10x'd cost of our allocations) can be done about 2M times per second per core (though that would fully consume the core). So that's a lot.

Lukasa commented 2 months ago

I'm struggling a bit with this feedback, because when I read it it seems like you're asking for two separate things, without clarifying whether they're an AND or an OR.

Yeah, that's what I mean, why not add an API that's the same as the current one except for no ScheduledTask return (and other niche things).

The API is almost identical in function to the current one. It's not fire-and-forget. It's not even that we don't return a ScheduledTask: this API supports cancellation by returning a token, it notifies the callee about that cancellation so we can handle EL quiescing, so it's just as capable of being cancelled as the existing API. The only thing it doesn't have is a Promise.

This is because the Promise on the current one is weird. When we complete or cancel a Scheduled right now, we don't tell you once, we tell you twice: once via Promise and once via callback. That's a very confused idea: why are there two ways? These are much more naturally two APIs: one that fires a callback, and one that takes a promise:

```swift
protocol EventLoop {
    func scheduleTask(in: TimeAmount, _ body: @Sendable @escaping () -> Void, onCancel: @Sendable @escaping () -> Void) -> TaskHandle

    func scheduleTask(in: TimeAmount, notifying promise: EventLoopPromise<Void>) -> TaskHandle
}

struct TaskHandle {
    func cancel()
}
```

Of course, each of these APIs could easily be implemented in terms of the other.
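To make that concrete, here is a toy sketch of the promise-taking form layered on the callback-taking one. The types are simplified stand-ins (a trivial promise, an immediate-fire "schedule"), not NIO's real API.

```swift
// Toy sketch: the callback form as the primitive, and the promise form
// layered on top of it. All names here are illustrative, not NIO's.

final class ToyPromise<Value> {
    private(set) var value: Value?
    private(set) var isCancelled = false
    func succeed(_ value: Value) { self.value = value }
    func cancel() { isCancelled = true }
}

struct TaskHandle {
    let cancel: () -> Void
}

// Callback form: the primitive. (Toy behaviour: fires immediately rather
// than at a deadline.)
func scheduleTask(_ body: @escaping () -> Void, onCancel: @escaping () -> Void) -> TaskHandle {
    body()
    return TaskHandle(cancel: onCancel)
}

// Promise form, implemented in terms of the callback form.
func scheduleTask(notifying promise: ToyPromise<Void>) -> TaskHandle {
    scheduleTask({ promise.succeed(()) }, onCancel: { promise.cancel() })
}

let promise = ToyPromise<Void>()
let handle = scheduleTask(notifying: promise)
print(promise.value != nil)  // prints "true"
_ = handle
```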

But the current API is weirdly both of these at the same time. You pass a callback, but you also get a promise. The two are both notified in the same way. That's weird, and hard to justify.

The new API says "screw it, just do the callback". But it acknowledges that we already have an API and it can do quite a lot, so in an attempt to differentiate itself it becomes far more restrictive. Essentially, it forces the user into a pattern that will allow amortized zero-allocation timers, by ensuring that you always give us something to call rather than us just invoking a closure.

But as you know very well, there is no difference between a delegate and a pair of closures, so we could take the pair of closures instead. The downside is that this API gets very close to obviating ScheduledTask, and it gets quite hard for us to tell users when they should use one or the other. I honestly thought the better thing was to provide a more differentiated API, not a less differentiated one.
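The delegate/closure-pair equivalence can be sketched like this (toy types invented for illustration, not the PR's actual API): a pair of closures can always be boxed into a conforming object, and a delegate's bound methods can always be handed out as a pair of closures.

```swift
// Toy illustration that a delegate and a pair of closures are interchangeable.

protocol CallbackHandler {
    func handleCallback()
    func didCancel()
}

// Closures -> delegate: box the pair in a conforming class.
final class ClosureHandler: CallbackHandler {
    private let onFire: () -> Void
    private let onCancel: () -> Void

    init(onFire: @escaping () -> Void, onCancel: @escaping () -> Void) {
        self.onFire = onFire
        self.onCancel = onCancel
    }

    func handleCallback() { onFire() }
    func didCancel() { onCancel() }
}

// Delegate -> closures: hand out the bound methods as a pair.
func closurePair(from handler: any CallbackHandler) -> (fire: () -> Void, cancel: () -> Void) {
    (handler.handleCallback, handler.didCancel)
}

var events: [String] = []
let handler = ClosureHandler(
    onFire: { events.append("fired") },
    onCancel: { events.append("cancelled") }
)
let pair = closurePair(from: handler)
pair.fire()
pair.cancel()
print(events)  // prints ["fired", "cancelled"]
```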

While we're here, the 59ns argument isn't interesting because Si has already written a benchmark, which is available in his original post but I will reproduce here:

MTELG.scheduleTask(in:_:)
╒═════════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕
│ Metric                  │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │
╞═════════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ Malloc (total) *        │       4 │       4 │       4 │       4 │       4 │       4 │       4 │     520 │
├─────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Time (total CPU) (ns) * │     447 │     719 │     773 │     834 │     908 │    1098 │    1348 │     520 │
╘═════════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛
MTELG.setTimer(for:_:) without NIOCustomTimerImplementation conformance
╒═════════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕
│ Metric                  │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │
╞═════════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ Malloc (total) *        │       5 │       5 │       5 │       5 │       5 │       5 │       5 │     476 │
├─────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Time (total CPU) (ns) * │     482 │     760 │     823 │     905 │     966 │    1119 │    1364 │     476 │
╘═════════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛
MTELG.setTimer(for:_:) with NIOCustomTimerImplementation conformance
╒═════════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕
│ Metric                  │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │
╞═════════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ Malloc (total) *        │       0 │       0 │       0 │       0 │       0 │       0 │       0 │    1071 │
├─────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Time (total CPU) (ns) * │     178 │     278 │     327 │     383 │     434 │     535 │     664 │    1071 │
╘═════════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛

P50 saves 446ns from a 773ns operation, p90 saves 474ns from a 908ns operation, and p99 saves 563ns from a 1.1us operation.

All of these are things that you can get away with doing millions of times in a second, but all of these are things that are likely to be done many thousands of times per second. As an example, QUIC connections will typically set at least two per-packet timers per connection (idle timer and probe timer). Using ScheduledTask in this context gives little room to move, and the EL offers no other timer solution. If we need a cheaper one, we can work out how to do that, but I don't think we have many ways to get cheaper than the implementation offered here.

weissi commented 2 months ago

Thanks @Lukasa for writing this. I think this is the missing motivation. I don't think the real motivation is the savings:

P50 saves 446ns from a 773ns operation, p90 saves 474ns from a 908ns operation, and p99 saves 563ns from a 1.1us operation.

That's nice and exactly in line with the prediction of 59ns just for the allocs, which we upped by 10x (to account for worse cases and the other overhead, like initialising the classes in those allocs) to 500ns. Of course it's fantastic to save this, but without a use case where it makes even a 1% difference I didn't (and don't) think it's worth inventing new API today, before that use case actually exists.

But your post does give the real reason (which is much, much better than the performance): the current API is weird and not as performant as it could be. I still think a real-world use case motivating doing the change now would be nice (as opposed to waiting until the use case is there, which would provide more information). But I think your reasoning is good. My feedback to @simonjbeaumont is then: copy Cory's post as the motivation (weird API), with the nice side effect that your programs may also run a tiny bit faster if you schedule loads of timers.

But as you know very well, there is no difference between a delegate and a pair of closures, so we could take the pair of closures instead. The downside is that this API gets very close to obviating ScheduledTask, and it gets quite hard for us to tell users when they should use one or the other. I honestly thought the better thing was to provide a more differentiated API, not a less differentiated one.

I would've thought making it more aligned is better as it makes it easier for people to migrate from weird API to better API but I can see different arguments here.

Last question: Do we think we need to keep the old API (apart for SemVer reasons) or can we make the new API a complete replacement?

simonjbeaumont commented 2 months ago

OK, @weissi; I updated the PR description to emphasise the API clarity and demote the performance win as a nice side-effect for the use cases for which it would matter.

weissi commented 2 months ago

OK, @weissi; I updated the PR description to emphasise the API clarity and demote the performance win as a nice side-effect for the use cases for which it would matter.

Awesome, that makes much more sense to me (maybe adjust the title too).

simonjbeaumont commented 1 month ago

Looks like @weissi is happy now with the motivation (not sure if that included a code review), and @glbrntt has approved.

@Lukasa, @FranzBusch: this looks like it's waiting for a re-review from one/both of you.

simonjbeaumont commented 1 month ago

Gentle ping. 🥺