`NIOAsyncChannel.executeThenClose` can lead to lost writes

FranzBusch commented 3 months ago

When using the NIOAsyncChannel.executeThenClose it is easy to lose writes. The following code is prone to this

try await channel.executeThenClose { inbound, outbound in
    try await outbound.write(...)
}

The problem here is that while write is async it doesn't wait for the write to actually hit the socket. We did this to avoid allocating a promise for every single write. However, this also means that the write might be buffered in the pipeline or the channel itself. Once we return from the closure of executeThenClose we are calling channel.close() which leads to a forceful closure of the channel. This means that any potentially buffered write might get dropped.

Even when using outbound half closure this can happen since outbound.finish is not async and not waiting for the half closure to be written out.

A few options that I thought about so far:

Allocate a promise for every write. This would trade correctness for performance.
Introduce a new soft close mode in the channel pipeline that allows all writes to get flushed and then close

weissi commented 3 months ago

This is clearly a bug. You have to wait for the write to complete or else you'll violate Structured Concurrency as well the possibility of losing writes.

There's no way out of this.

Note, you could use an empty writeAndFlush() at close and wait for that. That would fix this bug but introduce other bugs (like when the user actually does want to just close) & still violates structured concurrency.

There's no fire & forget in Swift Concurrency and we shouldn't invent one.

FranzBusch commented 3 months ago

I agree with that. I am wondering if we should have two methods write and writeWithoutWaiting so that users can choose which one to pick. The latter would have the potential of dropping writes but if a developer knows he is going to write something afterwards anyways then it would help them avoid promise allocations.

weissi commented 3 months ago

I agree with that. I am wondering if we should have two methods write and writeWithoutWaiting so that users can choose which one to pick. The latter would have the potential of dropping writes but if a developer knows he is going to write something afterwards anyways then it would help them avoid promise allocations.

There's not really room for that. The method can't be synchronous (may need to hop threads to enqueue) and async methods should follow Structured Concurrency. Swift Concurrency chose to not support that, I don't think we should work around it. If fire & forget is something that Swift has appetite for, it should provide a facility for that.

Lukasa commented 3 months ago

Is there any reason this shouldn't use the existing close ratchet functionality we have to default to half-closure? We already know the closure state of the system, and half-closure is supposed to be well-ordered with writes (though it isn't always, that's a bug IMO).

weissi commented 3 months ago

Is there any reason this shouldn't use the existing close ratchet functionality we have to default to half-closure? We already know the closure state of the system, and half-closure is supposed to be well-ordered with writes (though it isn't always, that's a bug IMO).

That should also be the case. But regardless, we need to get the correct error at the correct time. So if you try await ...write(...) then this should get you the syscall error.

This also hits an interesting difference between Netty & NIO: Netty will not send outbound I/O errors through the (inbound) errorCaught. The only way to get outbound errors in Netty is via the write future, NIO behaves differently of course (outbound errors usually also come through errorCaught). Regardless, if a user writes in async land they should get their error and we should have attempted to syscall.

Lukasa commented 3 months ago

So the downside with this proposal is that naive use of the async/await APIs will be very slow: essentially all writes count as a full flush, and cannot progress until the write completes. That's not the way NIO works typically: in the ChannelHandler API you keep writing until the writability state changes, and then delay. Because of NIO's use of promises, these two spellings can be separated, but in async/await they can't.

I'm not confident that the cure isn't worse than the disease here.

FranzBusch commented 3 months ago

Is there any reason this shouldn't use the existing close ratchet functionality we have to default to half-closure? We already know the closure state of the system, and half-closure is supposed to be well-ordered with writes (though it isn't always, that's a bug IMO).

The problem with that is that it only works in half closure cases. If you want to do a write, flush then close it's not going to work. Now arguably that's the same in the NIO world where we often do

context.writeAndFlush(..., promise: nil)
context.close()

The above has the same problem that any intermediate handler could buffer the write and the flush and then we just lose the write because we did a hard close at the end.

I'm not confident that the cure isn't worse than the disease here.

I agree with this though. Making every single write require a promise and a flush is going to make the async interfaces incredibly costly. The only thing that I could come up with so far that addresses both the performance + the guarantee that the sys call to write happened is making it async all the way down i.e. #2648

Lukasa commented 3 months ago

I don't think that's accurate either. If you do write/flush/close(mode: .all) then semantically you've already asked for us to drop all the bytes. I appreciate that's a surprise to many users, but it's the logical implication of the API.

If, however, you do write/flush/close(mode: .output) then it is an error for a handler to reorder that close with the write/flush.

Lukasa commented 3 months ago

While we're here, merely guaranteeing that the write syscall happened is still not enough. If write has completed, and then you call close(mode: .all), the kernel may choose to drop the writes.

If you write, and then want to wait for the bytes to go out, the only supported mode is close(mode: .output). Nothing else works, including #2068.

FranzBusch commented 3 months ago

Right, different proposal then. What if we make writer.finish() async? And we attach a promise to the close(mode: .output). That should give us the guarantee back that the writes made it out.

Lukasa commented 3 months ago

Not really enough either. In the case where the local entity begins the shutdown, merely having shutdown(SHUT_WR) return doesn't provide a sufficient guarantee. We still need to wait for ourselves to read EOF. The shutdown ratchet is fundamentally the right API design: if nothing has been closed at the point where executeAndClose fires, we should close mode output and wait for inbound half closure. If the inbound side is already closed, that will upgrade to immediate closure when close(mode: .output) completes, assuming we attach a promise.

We can also rely on cancellation to override theabove "don't lose data" pattern to upgrade it to "shut up and drop the damn thing", though [robably we don't want to wait forever, some amount of "graceful shutdown timeout" should happen, and will need to be configurable.

Ultimately, the original design of NIOAsyncChannel aimed to make half-closure the default behaviour, and we should continue to aim for it as it's the only thing that prevents this kind of issue.

weissi commented 3 months ago

Not really enough either. In the case where the local entity begins the shutdown, merely having shutdown(SHUT_WR) return doesn't provide a sufficient guarantee.

Sure, I understand. But we would still communicate everything the kernel tells us. Yes, that doesn't mean the other side has received it but as least we didn't eat up a return value we could've provided.

Ultimately, the original design of NIOAsyncChannel aimed to make half-closure the default behaviour, and we should continue to aim for it as it's the only thing that prevents this kind of issue.

Yes, half-closure is great but that doesn't imply full fire and forget mode

Lukasa commented 3 months ago

Yes, half-closure is great but that doesn't imply full fire and forget mode

Right, but that's my argument. Back when NIOAsyncChannel used deinit-based cleanup, it would do half-closure more-or-less by default. executeAndClose discarded that choice, which I'm arguing is wrong.

FranzBusch commented 3 months ago

I think we have two orthogonal but related points here.

executeThenClose is currently prone to dropping writes. We should try to ratchet this down by doing half closure
outbound.write isn't surfacing the I/O errors instead they come in through inbound.next()

For 1. we have a solution in mind but 2. is difficult without creating a promise for every write. However, even in pure NIO we rarely create promises for writes and instead rely on the error coming through errorCaught. That's why we went with bubbling up the error in inbound.next().

weissi commented 3 months ago

is difficult without creating a promise for every write.

I don't think there can be a correct solution without a promise for every write. Why don't we want a promise for every write? Swift Concurrency's model forces us into doing this and one extra allocation isn't killing anybody here. We'll be doing a syscall anyway which is orders of magnitude more expensive.

For multi-write things where the allocation would indeed suck we can find different options.

Lukasa commented 3 months ago

How do we handle the inefficient write pattern caused by promise on every write?

weissi commented 3 months ago

How do we handle the inefficient write pattern caused by promise on every write?

A syscall will come in at ~10us, an allocation will come in at maybe 50ns in the p50 case (so almost 3 orders of mag difference). Given that almost every write will cause a syscall anyway, I think that's okay.

If you have multiple things to send I assume we do or will provide a batch-writing API anyway, right?

And if one thing to write gets split into multiple writes by lower-level handlers like the HTTP encoder you won't suffer. So yes, a HTTP response will be +1 allocations in total but I really think that's okay. The model forces us into that.

Lukasa commented 3 months ago

If you have multiple things to send I assume we do or will provide a batch-writing API anyway, right?

Yes, we do.

And if one thing to write gets split into multiple writes by lower-level handlers like the HTTP encoder you won't suffer. So yes, a HTTP response will be +1 allocations in total but I really think that's okay. The model forces us into that.

Sure, but a HTTP response shouldn't be just one write. The body has to be streamed, and so it's n writes. Yes, for users who write their programs carefully the cost of this can be mitigated, but for those who don't this will get pretty gnarly pretty fast.

FranzBusch commented 3 months ago

Sure, but a HTTP response shouldn't be just one write. The body has to be streamed, and so it's n writes. Yes, for users who write their programs carefully the cost of this can be mitigated, but for those who don't this will get pretty gnarly pretty fast.

I am specifically thinking about proxy use-cases. They would consume an inbound async sequence containing the individual body parts and then write them out one by one. Since async sequences are currently not capable of batching elements it will lead to single writes.

weissi commented 3 months ago

Sure, but a HTTP response shouldn't be just one write. The body has to be streamed, and so it's n writes. Yes, for users who write their programs carefully the cost of this can be mitigated, but for those who don't this will get pretty gnarly pretty fast.

I am specifically thinking about proxy use-cases. They would consume an inbound async sequence containing the individual body parts and then write them out one by one. Since async sequences are currently not capable of batching elements it will lead to single writes.

I mean sure, but I don't think that's an argument why we should do something incorrect. If you need the highest possible performance, you'll drop down anyway. Currently, the proxy use case will anyway be slow because of the thread hops which also cost ~3 orders of magnitude more than that single promise. I really don't think this argument holds.

FranzBusch commented 3 months ago

Currently, the proxy use case will anyway be slow because of the thread hops which also cost ~3 orders of magnitude more than that single promise.

The thread hops can be completely avoided with Swift 6 and TaskExecutors. (Task executors currently allocate heavily during runtime which makes them practically worse then taking the hops but that can be fixed)

weissi commented 3 months ago

Currently, the proxy use case will anyway be slow because of the thread hops which also cost ~3 orders of magnitude more than that single promise.

The thread hops can be completely avoided with Swift 6 and TaskExecutors.

It's getting better, yes but I've yet to see an example where this fully works.

(Task executors currently allocate heavily during runtime which makes them practically worse then taking the hops but that can be fixed)

Again, allocations are almost 3 orders of magnitude cheaper than hops. In other words, you can probably allocate 1,000 times to make up a thread hop.

Bottom line: I don't see why we're arguing that saving ~50ns is worth doing the wrong thing. Especially given that 99% of the time we're doing a syscall after.

Lukasa commented 3 months ago

I think it does. I believe all three of us in this conversation believe the thread hops are a resolvable issue, either with task executors or by taking over the global executor. Our stated goal is to get to a place where async/await is not unavoidably slower than the NIO pipeline, and it strikes me as a mistake to do things we know will have the effect of pessimising that performance.

As for "doing something incorrect", I don't think it's easy to see how this is any more incorrect than what NIO programs tend to do by default. The overwhelming majority of NIO programs don't attach promises to their writes, instead allowing the general error handling pattern to pick them up and terminate. Certainly the ones that care about maximum performance do.

A more useful framing of this conversation might be the flipping of a default. The original API of NIO defaults to vector writes: write does nothing, and only flush triggers I/O. If you do write(someData).wait() you get hangs (a bug that almost all NIO users eventually hit). Conversely, the API of NIOAsyncChannel strongly favours scalar writes. Yes, you can do vector writes, but you have to gather them yourself. Any API we added to make this easier (e.g. try await .batch { $0.write(); $0.write(); $0.write() }) would still be less convenient than simply calling try await write three times in a row.

So maybe a more radical redesign is worth considering. How about we remove .write altogether, and only offer a batch style write API. Essentially, force all users to issue write batches.

FranzBusch commented 3 months ago

So maybe a more radical redesign is worth considering. How about we remove .write altogether, and only offer a batch style write API. Essentially, force all users to issue write batches.

IMO the fact that currently the NIOAsyncChannel.outbound has both scalar and vector writes isn't the problem. I am happy to promote the batch writes APIs more but having single scalar writes is never going away. I also don't fully see yet how we can do the batch writes API nicely without introducing allocs. In the end, we need an intermediate storage for the batched writes and then have to go to the writer and write them all at once. We can of course back this by a dynamic array that stack allocates up to a certain number of elements.

Assuming we promote the batch writes API more. Are we in agreement that this API should be backed by a promise for the last write for correctness? If we are then that also means scalar writes should be backed by a promise.

In the end, getting correctness is the most important piece here. Currently, it is too easy to drop writes on the ground even though the code looks like you did the right thing.

Lukasa commented 3 months ago

In the end, we need an intermediate storage for the batched writes and then have to go to the writer and write them all at once.

Do we? We already have intermediate storage in the async writer. I think it is very possible to produce an implementation of this API that does not require temporary storage, though we may choose to use some anyway.

Are we in agreement that this API should be backed by a promise for the last write for correctness?

Yes.

If we are then that also means scalar writes should be backed by a promise.

I don't dispute this, but I do think having scalar writes be easier and more natural than vector writes makes them an attractive nuisance.

weissi commented 3 months ago

As for "doing something incorrect", I don't think it's easy to see how this is any more incorrect than what NIO programs tend to do by default. The overwhelming majority of NIO programs don't attach promises to their writes, instead allowing the general error handling pattern to pick them up and terminate.

Yes, but NIO isn't structured concurrency. It has fire&forget and it has do&(a)wait. Swift Concurrency doesn't have fire & forget so this is incorrect in my view.

A more useful framing of this conversation might be the flipping of a default. The original API of NIO defaults to vector writes: write does nothing, and only flush triggers I/O. If you do write(someData).wait() you get hangs (a bug that almost all NIO users eventually hit). Conversely, the API of NIOAsyncChannel strongly favours scalar writes. Yes, you can do vector writes, but you have to gather them yourself. Any API we added to make this easier (e.g. try await .batch { $0.write(); $0.write(); $0.write() }) would still be less convenient than simply calling try await write three times in a row.

I see an argument for adding a func enqueue(...) async or something API, maybe even one that doesn't throw. Just to communicate "I'm not sending this, I'm just enqueuing this".

So maybe a more radical redesign is worth considering. How about we remove .write altogether, and only offer a batch style write API. Essentially, force all users to issue write batches.

Also an idea yeah

FranzBusch commented 3 months ago

Do we? We already have intermediate storage in the async writer. I think it is very possible to produce an implementation of this API that does not require temporary storage, though we may choose to use some anyway.

I think I see what you mean. It would require us to take the lock for each write but we would have a separate internal flush method that we call at the end of the batched write closure.

I don't dispute this, but I do think having scalar writes be easier and more natural than vector writes makes them an attractive nuisance.

Probably a minor point but how are we going to keep scalar writes from not being as attractive. In the end doing writer.batched { $0.write() } isn't nice and we should provide a writer.unbatchedWrite() API at least. Maybe the name is enough to convey that developers should think of batching when possible.

adam-fowler commented 3 months ago

I would be happy with a writer.batched { $0.write() } style API. It means I can write a whole HTTP request regardless of content and know it will be written once I return but at the same time I only pay for one promise allocation.

Or maybe

outbound.enqueue(.head(response.head))
for try await buffer in response.body {
    outbound.enqueue(.body(buffer))
}
try await outbound.write(.end(nil))

Where the promise generated for the write at the end is enough for me to know the rest of the request has been written.

Lukasa commented 3 months ago

Probably a minor point but how are we going to keep scalar writes from not being as attractive. In the end doing writer.batched { $0.write() } isn't nice

I mean...that's the point :wink:. Sounds like it's working as designed!

You're right though: making a nice scalar API will almost always make it nicer than a batched API. I'm suggesting, though not necessarily very strongly, that we should at least consider simply not doing that.

To @adam-fowler's point, I'm a touch nervous about enqueue, because it reintroduces a fairly common NIO bug today: "I forgot to flush". The nice thing about a .with or .batched style API is you can't forget to flush (or not flush, if you throw an error).

adam-fowler commented 3 months ago

The nice thing about a .with or .batched style API is you can't forget to flush (or not flush, if you throw an error).

I'd be happy with batched or batchedWrite, it aligns with similar APIs elsewhere.

FranzBusch commented 3 months ago

I just want to point out that the following code that one might come up with is super prone to the I forgot to flush NIO bug. We are potentially waiting for an infinite amount of body data. So in the case this is proxying we might only flush out data once we have seen the whole request/response.

So the batched APIs do push developers to think about batches but they also might provide a foot gun when they start to stream inside a single batch.

writer.batched { batch in
  for try await buffer in response.body {
      batch.append(buffer)
  }
}

weissi commented 3 months ago

[...] I'm a touch nervous about enqueue, because it reintroduces a fairly common NIO bug today: "I forgot to flush".

Indeed! To me, the nicest property about Swift Structured Concurrency is that it's one of the very few concurrency systems that makes it hard to build up unbounded queues by accident.

Non-reentrant actors for example make it nice & easy to program a complex state flow without having to use an explicit state machine. But their mailboxes are essentially unbounded queues as the send message operation is often a fire&forget one (e.g. Erlang). Swift deliberately doesn't allow this -- which of course comes at a cost (need state machines in actors if you have complex operations that await inside).

This however hinges on one important requirement (which I don't think is as widely known as it should be) that you must never have a loop around group.addTask { ... } unless you also impose a fixed limit on how many concurrent subtasks you spawn (e.g. a max of 100 group.addTasks before waiting for one to finish). (and avoid Task.*, if you need, also limit how many you do in parallel).

If you follow this principle and your code uses Structured Concurrency, then you profit:

Every single task only does one thing at a time. If it triggers background/concurrent work, it'll do so in a bounded way and it will await the completion before doing more background work
Adding a child task is explicit via group.addTask and should always be limited to a fixed number in any given piece of code. async let is automatically limited because you have a finite number of lines of code :)
For a server that means that by just limiting the number of connections/streams you'll accept in parallel will limit a lot of the resources you spend on that

enqueue or a fire&forget write which IMHO definitely violates Structured Concurrency start to undo the guarantees above because they don't fully play by the rules. The return from an async function without actually having done all the work that they triggered... For write() async throws I think this is simply unacceptable. For enqueue() async I also share Cory's concerns but I think it could be argued that this may be a (temporarily) necessary evil which picks a name (enqueue) which makes it clear that we're building a (potentially unbounded) queue here. Ideally we'd avoid this but maybe we need it? I think this is still tbd.

weissi commented 3 months ago


writer.batched { batch in
  for try await buffer in response.body {
      batch.append(buffer)
  }
}

Well, Structured Concurrency doesn't mean "your code is bug free" or "your code is free of unbounded queues". It means that your resources follow the structure of your code.

So yes, the code you wrote is not ideal but under Structured Concurrency you could at least use the standard tools (swift inspect dump-concurrency, I know slight non-Darwin problem here :( ) to spot where you messed up.

If you forget a flush() in regular NIO, there's no standard tooling that can tell you what code enqueued the stuff. But in Structured Concurrency there will be because you know that the code structure that's responsible is still alive!

FranzBusch commented 3 months ago

"your code is free of unbounded queues"

Well is it in the above? This surely includes an unbounded queue in the batch.append since it is non-asynchronous and non-throwing.

writer.batched { batch in
  for try await buffer in response.body {
      batch.append(buffer)
  }
}

weissi commented 3 months ago

"your code is free of unbounded queues"

Well is it in the above? This surely includes an unbounded queue in the batch.append since it is non-asynchronous and non-throwing.

What I wrote was

Well, Structured Concurrency doesn't mean "your code is bug free" or "your code is free of unbounded queues".

Note the doesn't.

But it does allow you to find the culprit because you have a "guarantee" (if everybody plays by the Structured Concurrency rules which this code does) that the offending piece of code is still "on (async) stack".

apple / swift-nio

`NIOAsyncChannel.executeThenClose` can lead to lost writes #2795