dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Consider adding Scatter/Gather IO on Stream #25344

Open geoffkizer opened 6 years ago

geoffkizer commented 6 years ago

The Socket class has this already, in the form of Send/Receive operations that take an IList<ArraySegment<byte>>. This isn't exposed up through Stream classes, however.

One specific scenario where this would be useful is writing chunked-encoded request bodies in SocketsHttpHandler. When the user calls the request Stream's Write[Async] method, we need to first write a chunk header with the chunk length followed by \r\n, then write the chunk contents from the user's buffer to the underlying stream (which is either SslStream or NetworkStream). This requires us to either do two writes (which means two kernel calls and means the chunk header will likely be sent in a separate packet), or to copy into a local buffer so we can do a single write. It would be more performant (and simpler) to do a gather write where we pass two buffers, the first containing the chunk header, the second containing the chunk data, passed directly through.
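
As a rough sketch, the chunked write path could then look something like the following, assuming the gathered WriteAsync overload proposed below (the class, header formatting, and reused list here are illustrative only):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

class ChunkedWriter
{
    // Reused for every chunk so there is no per-write list allocation.
    private readonly List<ReadOnlyMemory<byte>> _buffers = new List<ReadOnlyMemory<byte>>(2);

    public async ValueTask WriteChunkAsync(Stream stream, ReadOnlyMemory<byte> userBuffer, CancellationToken cancellationToken)
    {
        // "<hex length>\r\n" chunk header.
        byte[] header = Encoding.ASCII.GetBytes($"{userBuffer.Length:X}\r\n");

        _buffers.Clear();
        _buffers.Add(header);       // chunk header
        _buffers.Add(userBuffer);   // chunk data, passed through without copying

        // One gathered write instead of two writes or a copy into a local buffer.
        await stream.WriteAsync(_buffers, cancellationToken).ConfigureAwait(false);
    }
}
```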

There are various examples in HTTP2 where scatter/gather IO would be useful as well. (Mostly gather, since gather helps deal with the "framing" scenario above. Scatter reads are less useful since you don't control the framing yourself.)

Proposed API:

        public virtual int Read(IReadOnlyList<Memory<byte>> buffers);
        public virtual void Write(IReadOnlyList<ReadOnlyMemory<byte>> buffers);
        public virtual ValueTask<int> ReadAsync(IReadOnlyList<Memory<byte>> buffers, CancellationToken cancellationToken = default);
        public virtual ValueTask WriteAsync(IReadOnlyList<ReadOnlyMemory<byte>> buffers, CancellationToken cancellationToken = default);

The base Stream class would provide default implementations on top of existing Read/Write APIs.
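
One plausible shape for those fallbacks (a sketch only, not the actual implementation) is to simply loop over the buffers using the existing single-buffer overloads; a derived stream that supports true scatter/gather would override these:

```csharp
// Sketch of possible Stream base-class fallbacks; not the actual implementation.
public virtual async ValueTask WriteAsync(IReadOnlyList<ReadOnlyMemory<byte>> buffers, CancellationToken cancellationToken = default)
{
    for (int i = 0; i < buffers.Count; i++)
    {
        await WriteAsync(buffers[i], cancellationToken).ConfigureAwait(false);
    }
}

public virtual async ValueTask<int> ReadAsync(IReadOnlyList<Memory<byte>> buffers, CancellationToken cancellationToken = default)
{
    // A minimal fallback only fills the first non-empty buffer per call.
    for (int i = 0; i < buffers.Count; i++)
    {
        if (!buffers[i].IsEmpty)
        {
            return await ReadAsync(buffers[i], cancellationToken).ConfigureAwait(false);
        }
    }
    return 0;
}
```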

geoffkizer commented 6 years ago

cc @stephentoub

geoffkizer commented 6 years ago

cc @davidfowl -- I can imagine this might be interesting for you as well.

davidfowl commented 6 years ago

Oh nice! Yes I like this. Should we also consider adding overloads for ReadOnlySequence<byte>?

scalablecory commented 5 years ago

This could work for Sockets cross-plat, and on Linux/Mac for file I/O. On Windows we could emulate it trivially for files. There would be a long period of writing optimized methods for SslStream, compression, etc., but that could be done piecemeal.

Other considerations:

scalablecory commented 4 years ago

We may need these for efficient HTTP/3 if the QUIC implementation doesn't implement some form of nagle/cork. We should coordinate QuicStream additions with this proposal. CC @jkotalik

scalablecory commented 4 years ago

Following up re: HTTP/3. We ended up implementing this in our QuicStream, as it does not support nagle/cork. We used ROM<ROM<byte>> but IReadOnlyList<ROM<byte>> would work just as well.

It allows us to gather headers, framing, and user data in a single I/O:

https://github.com/dotnet/runtime/blob/a0c51bf56106b5e1ab62ec3450923f2a17e79e69/src/libraries/System.Net.Http/src/System/Net/Http/SocketsHttpHandler/Http3RequestStream.cs#L372-L422

We would use this in HTTP/1 and HTTP/2 in a similar way.

pepone commented 4 years ago

Having gathered writes would allow optimizing the SSL record size in SslStream; currently, if you have several buffers and use separate Write calls, it creates separate records, adding more overhead.

See https://hpbn.co/transport-layer-security-tls/#optimize-tls-record-size

geoffkizer commented 3 years ago

We have defined some gather write operations for QUIC on the QuicStream class. Ideally, these would not be QUIC-specific; i.e. we'd add them to Stream and then implement them in QuicStream.

scalablecory commented 3 years ago

@carlossanlop any questions here or can we move it to API review?

stephentoub commented 3 years ago

I would like to see a prototype of how these will be consumed (both read and write) and perf numbers to demonstrate it's meaningful enough to warrant adding more virtuals to Stream. Where are we going to use the scattered read methods? How much does the gathered write support improve http/2 throughput? How are callers going to avoid allocating IReadOnlyList instances?

I'm not against adding these. But we should take any new virtuals on Stream seriously and do our due diligence.

davidfowl commented 3 years ago

How are callers going to avoid allocating IReadOnlyList instances?

This in particular is something I was worried about. Also kestrel would use this for writes but not reads.

@tmds 's custom transport also wanted to use these.

Why not something directly on Socket rather than virtuals on Stream?

geoffkizer commented 3 years ago

How are callers going to avoid allocating IReadOnlyList instances?

The idea is you use the same List/array for every send/receive. So you only need one allocation per socket, which doesn't seem like an issue.

Why not something directly on Socket rather than virtuals on Stream?

Socket already supports this. https://github.com/dotnet/runtime/blob/master/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketTaskExtensions.cs#L38

The point of this would be to support it in non-Socket cases, like SslStream or QuicStream.
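
For reference, the existing Socket gather overloads can be used roughly like this, reusing one list per socket as described above (the wrapper class and framing buffers are illustrative):

```csharp
using System;
using System.Collections.Generic;
using System.Net.Sockets;
using System.Threading.Tasks;

class GatheredSender
{
    // Reused for every send, so the only allocation is one list per socket.
    private readonly List<ArraySegment<byte>> _segments = new List<ArraySegment<byte>>(2);

    public async Task SendFrameAsync(Socket socket, byte[] header, byte[] payload)
    {
        _segments.Clear();
        _segments.Add(new ArraySegment<byte>(header));
        _segments.Add(new ArraySegment<byte>(payload));

        // Gathered send: both buffers go out in a single socket operation.
        await socket.SendAsync(_segments, SocketFlags.None);
    }
}
```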

geoffkizer commented 3 years ago

How much does the gathered write support improve http/2 throughput?

This is a hard question to answer authoritatively, because there's always a way to avoid using scatter/gather by simply coalescing buffers. That said, coalescing has some cost itself, which can vary based on a variety of factors; and also it's a bit of a pain to do.

@scalablecory said above:

It allows us to gather headers, framing, and user data in a single I/O [for HTTP3]:

And that's a nice convenience, but it's not clear to me that gathered writes on QuicStream are necessary to achieve this. We already achieve this in HTTP1.1/HTTP2 by managing our buffers to specifically enable this.

scalablecory commented 3 years ago

I prototyped this in HTTP/1 and I saw a measurable but very small impact. I don't remember the exact numbers but it was less than 1-2%.

With HTTP/3, I use it to avoid buffering every data frame when content length isn't known, just to prepend a handful of bytes. I haven't bothered to measure in this case because it was such an obvious and easy win.

geoffkizer commented 3 years ago

I prototyped this in HTTP/1

For chunked encoding, I'm guessing?

It does seem like chunked encoding and HTTP3 framing when length is unknown are pretty much ideal use cases for this. It avoids a potentially large buffer copy.

scalablecory commented 3 years ago

I prototyped this in HTTP/1

For chunked encoding, I'm guessing?

I believe I was also deferring sending headers buffer until sending content when possible, to send them out in one go. There were a few opportunities.

(With nodelay on, this can result in 1 packet instead of 2 for some requests)

tmds commented 3 years ago

The biggest cost is in the number of syscalls. Copying data is cheap compared to syscalls. So coalescing in userspace is probably going to give you most of the gain, without introducing dedicated APIs.

When implementing a protocol using System.IO.Pipelines, you end up copying anyway because there is no way to pass Memory from one Pipe to another.

stephentoub commented 3 years ago

The biggest cost is in the number of syscalls. Copying data is cheap compared to syscalls. So coalescing in userspace is probably going to give you most of the gain, without introducing dedicated APIs.

This is exactly why I was looking for concrete perf data.

Socket already supports this

To do this well, NetworkStream wouldn't be able to use those existing APIs (mismatch of input and output types), so either it would need to access new internal methods, or Socket would need new public ones.

geoffkizer commented 3 years ago

To do this well, NetworkStream wouldn't be able to use those existing APIs (mismatch of input and output types)

True. It would be nice to add Memory overloads to Socket for these cases anyway. But it does mean more work to wire this up all the way.

scalablecory commented 3 years ago

It would be nice to add Memory overloads to Socket for these cases anyway. But it does mean more work to wire this up all the way.

The Windows path is actually fairly easy (I did a minimal version of this for my prototyping), but the Linux PAL is scary huge.

geoffkizer commented 3 years ago

There's a lot of code in the Linux PAL stuff, but it should be relatively straightforward.

scalablecory commented 3 years ago

Confirmed this morning that we can implement this in our SChannel code. This would give us more efficient TLS, as each SslStream.Write() needs to write not just data but an envelope.

stephentoub commented 3 years ago

Numbers?

davidfowl commented 3 years ago

I think we should do this and get some numbers to back it up. We'd need a basic minimal implementation in order to compare apples to apples. Here's the logic that SslStream uses to write TLS record frames (which have a max size of 16K) to the wrapped Stream.

https://github.com/dotnet/runtime/blob/79ae74f5ca5c8a6fe3a48935e85bd7374959c570/src/libraries/System.Net.Security/src/System/Net/Security/SslStream.Implementation.cs#L623-L632

This could change into logic that stored a list of 16K buffers and passed it to the next layer (which should be more efficient).
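
A rough sketch of the shape that change could take (illustrative names only, not the actual SslStream internals; it assumes the proposed gathered WriteAsync overload):

```csharp
// Illustrative only; not the actual SslStream implementation.
private async ValueTask WriteRecordsAsync(ReadOnlyMemory<byte> buffer, CancellationToken cancellationToken)
{
    var records = new List<ReadOnlyMemory<byte>>();

    while (!buffer.IsEmpty)
    {
        int chunk = Math.Min(buffer.Length, MaxDataSize);        // MaxDataSize: ~16K of payload per TLS record (hypothetical constant)
        byte[] record = EncryptRecord(buffer.Slice(0, chunk));   // hypothetical helper producing one encrypted record
        records.Add(record);
        buffer = buffer.Slice(chunk);
    }

    // One gathered write to the wrapped stream instead of one write per record.
    await _innerStream.WriteAsync(records, cancellationToken).ConfigureAwait(false);
}
```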

geoffkizer commented 3 years ago

I think we should do this and get some numbers to back it up.

+1

This could change into logic that stored a list of 16K buffers and passed it to the next layer (which should be more efficient).

"More efficient" in what sense? I'm trying to understand what we are proving out here.

davidfowl commented 3 years ago

"More efficient" in what sense? I'm trying to understand what we are proving out here.

2 ways off the top of my head:

I don't think we would add the APIs if they didn't have any performance improvements; it's just that we need to actually add them to measure.

geoffkizer commented 3 years ago

I understand the potential benefits here. I'm just not sure what specific experiment you are proposing here, and how it would demonstrate these. What's the A vs B here?

davidfowl commented 3 years ago

I understand the potential benefits here. I'm just not sure what specific experiment you are proposing here, and how it would demonstrate these. What's the A vs B here?

scalablecory commented 3 years ago

I'm working on proving this out right now.

The scenarios I have in mind for writes are:

And for reads:

The metrics I'm gathering are:

From prior experience, I can say that in C++ using scatter/gather is a very easy win. The main concern I have here is how much overhead our current PALs are adding; there might be some vague line where copying is cheaper than managing multiple buffers.

It's also easy to say "memory bandwidth is cheap so let's just not complicate the API". We should certainly weigh that cost, though my opinion is generally that -- .NET being a cloud-native platform -- we should try to avoid leaving perf on the table when it comes to I/O.

tmds commented 3 years ago

For the proposed API, there is no indication to the user of how many buffers it makes sense to provide for reading.

System.IO.Pipelines has some support for multi-buffer reading (as ReadResult), and writing (via PipeWriter). It does come at the cost of copying from pipe to pipe.

If you care mostly about saving syscalls for Socket, and less about there being copies, NetworkStream can handle this without extending Stream API. On a recvmsg it can provide some additional buffers. Next calls to Read from the NetworkStream can deplete these buffers before making a new syscall. For writing, NetworkStream can act as a BufferedStream backed by multiple buffers. On Flush any buffered data gets sent.

davidfowl commented 3 years ago

These APIs IMO make more sense for writing than for reading. The server doesn't use NetworkStream, so only adding it there won't benefit the server stack. There are more APIs that support gathering IO for writes (IIS and HTTP.sys, for example, support this for writing). It's needed at the Stream layer because of things like SslStream that wrap another stream and need to chunk up writes into 16K pieces before writing to the underlying stream.

This would also be added to the PipeWriter abstraction so that PipeWriterStream could take advantage of it (without copying). We could pass the segment list to the underlying stream in a single write. The DefaultPipeWriter would of course need to copy, but that's beside the point (it also copies today for single-buffer writes).

davidfowl commented 3 years ago

To hammer home that this isn't just a NetworkStream concern:

scalablecory commented 3 years ago

Okay, I'm done benchmarking on Windows for now. I can go further and create more scenarios if needed.

I tested four different I/O strategies:

Some takeaways:

POST with two HttpContent Writes():

| I/O strategy | Number of I/Os | Speed |
| --- | --- | --- |
| Buffered + Gathered | 1 | 136,502 (+464%) |
| Buffered | 2 | 76,269 (+215%) |
| Unbuffered + Gathered | 2 | 74,539 (+208%) |
| Unbuffered | 7 | 24,172 (+0%) |

POST with one HttpContent Write():

| I/O strategy | Number of I/Os | Speed |
| --- | --- | --- |
| Buffered | 1 | 160,789 (+370%) |
| Unbuffered + Gathered | 1 | 152,958 (+347%) |
| Buffered + Gathered | 1 | 150,085 (+339%) |
| Unbuffered | 5 | 34,183 (+0%) |

Both of these also provide a meaningful approximation of coalescing writes in HTTP/2 vs buffering. Beyond avoiding copying everything, this will provide some scalability benefit by improving concurrency between streams.

Code here.

edit: updated with TCP_NODELAY turned on, which cleaned up some numbers but didn't meaningfully change the results.

scalablecory commented 3 years ago

For the proposed API, there is no indication to the user how many buffers makes sense to provide for reading.

I don't expect many people to use the scattered read API -- it's a lot less generally applicable than gathered writes. It will mostly be beneficial for segmented buffering strategies, especially ones that recycle buffers.

This is the buffering strategy I'm using for HTTP/2 in LLHTTP and I think Pipelines (@davidfowl correct me if I'm wrong) does the same thing.

You have three segments that make up your read buffer, and the 1st segment is partially filled:

▓▓▓▓░ ░░░░░ ░░░░░

You can make a read to fill in what is left of the 1st buffer, but in that case you're incurring the cost of an I/O for what might be a small number of bytes. And then two more I/Os for the subsequent buffers.

OR

You can do a scattered read to fill in all of that last chunk AND the remaining free buffers in a single I/O.

I've also used this when filling a single-array ring buffer -- you can read into the back half and the front half in a single I/O.
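
In code, the segmented case above could look roughly like this with the proposed scattered ReadAsync (the segment bookkeeping is hypothetical):

```csharp
// Hypothetical bookkeeping: _segments are rented byte[] buffers and _filled is
// how many bytes of the first segment are already occupied.
var buffers = new List<Memory<byte>>
{
    _segments[0].AsMemory(_filled),  // remainder of the partially filled segment
    _segments[1],                    // next free segment
    _segments[2],                    // next free segment
};

// One I/O fills the tail of segment 0 and as much of segments 1 and 2 as the
// transport has available, instead of up to three separate reads.
int bytesRead = await stream.ReadAsync(buffers, cancellationToken);
```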

davidfowl commented 3 years ago

That's a nice transparent update that could be made when using the StreamPipeReader today if we added the read overload.

The other thing I've come to notice while investigating concurrent connection memory is the fact that each layer of the stack tries to buffer to avoid doing multiple writes when it can. For example, when you do a websocket send, we allocate a buffer internally, copy the user buffer into it, and add framing before the underlying send:

https://github.com/dotnet/runtime/blob/82ca681cbac89d813a3ce397e0c665e6c051ed67/src/libraries/System.Net.WebSockets/src/System/Net/WebSockets/ManagedWebSocket.cs#L534

This is expensive for larger buffers but we don't change logic here based on how big the payload is.
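
With a gathered write, the small frame header could be sent alongside the caller's payload without that intermediate copy. A hedged sketch (not the actual ManagedWebSocket code; the helper and constant names are illustrative):

```csharp
// Sketch only; helper names and constants here are illustrative.
byte[] header = new byte[MaxFrameHeaderLength];
int headerLength = WriteFrameHeader(header, payload.Length, fin: true, opcode);

var buffers = new ReadOnlyMemory<byte>[]
{
    header.AsMemory(0, headerLength),
    payload,                         // user buffer passed through, no copy
};

// Assumes the proposed gathered WriteAsync overload on the underlying stream.
await _stream.WriteAsync(buffers, cancellationToken).ConfigureAwait(false);
```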

This pattern also happens with TLS: we internally allocate a buffer that's payload + frame header, copy the user buffer into it again, and write to the underlying Stream:

https://github.com/dotnet/runtime/blob/bcc7e0e0ce1758322015675356efc121b211e312/src/libraries/System.Net.Security/src/System/Net/Security/SslStream.Implementation.cs#L637

Kestrel isn't absolved in this either; it also copies the buffers written by SslStream into its internal buffers and then writes a big batch to the socket.

The Socket APIs then internally copy from the user-mode buffer to the kernel buffer. We save on the number of syscalls here by using a single write call with many buffers, but we still pay the price of the copy.

I was sending 1MB payloads and the time spent in memcpy is astonishing 😄 .

I think this API could have a big impact in networking scenarios like this, where we can remove internal buffers by preserving the N buffers for as long as possible until we hit a part of the stack that has to buffer, merge, or transform.

geoffkizer commented 3 years ago

I was sending 1MB payloads and the time spend in memcpy is astonishing

Got data?

davidfowl commented 3 years ago

https://github.com/dotnet/aspnetcore/issues/31110

geoffkizer commented 3 years ago

I'm trying to figure out what conclusions to draw from this data.

An obvious one is, fewer IOs are better.

But when performing the same # of IOs, buffering seems faster than gathering (either buffered + gathered or unbuffered + gathered).

Doesn't this imply we should just do buffering?

geoffkizer commented 3 years ago

BTW, I ran the "two HttpContent Writes" scenario with an 8K write buffer (instead of 4K). Here's what I got:

| I/O strategy | Number of I/Os | Speed |
| --- | --- | --- |
| Buffered | 1 | 114,048 |
| Buffered + Gathered | 1 | 104,672 |
| Unbuffered + Gathered | 2 | 57,353 |
| Unbuffered | 7 | 18,398 |

Again, # of IOs is most important; again, Buffered beats both Gathered strategies.

What's also interesting here, though, is that Unbuffered + Gathered suffers hugely because it now has to perform two separate writes. And that's because gathering does not allow coalescing across user writes, whereas buffering does.

davidfowl commented 3 years ago

You don't want to buffer at all of the layers because that's wasted CPU cycles copying. Especially for larger buffers, it's wasteful. The APIs we expose today require the caller to provide the buffer, so we need to do something with it, and the options are to copy or to write directly (when there's no transform of the data itself required). I can see 2 approaches:

As an example of this, consider a chunked encoding write implementation. WriteAsync wants to wrap the user provided data with a header and a footer:

If data is small enough (say < 2K), then it makes sense to copy that into an internal buffer before sending. But let's say the user wrote 65K (large file transfer or something). Then doing multiple writes or a single write with multiple buffers likely makes more sense.
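
A hedged sketch of that heuristic for a chunked-encoding write (the 2K cutoff, helper names, and fields are illustrative, and it assumes the proposed multi-buffer overload):

```csharp
// Illustrative heuristic, not the actual SocketsHttpHandler code.
private const int CopyThreshold = 2 * 1024;

public async ValueTask WriteChunkAsync(ReadOnlyMemory<byte> data, CancellationToken cancellationToken)
{
    ReadOnlyMemory<byte> header = FormatChunkHeader(data.Length);  // "<hex length>\r\n", hypothetical helper
    ReadOnlyMemory<byte> footer = CrlfBytes;                       // "\r\n"

    if (data.Length < CopyThreshold)
    {
        // Small payload: copying into the connection's write buffer is cheaper
        // than managing multiple buffers.
        CopyToWriteBuffer(header, data, footer);
        await FlushWriteBufferAsync(cancellationToken).ConfigureAwait(false);
    }
    else
    {
        // Large payload: gather header + user data + footer into a single write
        // using a reused list, pushing the data down without an intermediate copy.
        _gather.Clear();
        _gather.Add(header);
        _gather.Add(data);
        _gather.Add(footer);
        await _stream.WriteAsync(_gather, cancellationToken).ConfigureAwait(false);
    }
}
```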

BTW, multi-buffer writes are just a way of pushing that eventual copy further down the stack. They don't stop the eventual copy, just the intermediate ones.

davidfowl commented 3 years ago

@geoffkizer with those numbers did you max out the CPU? You should also try it with multiple buffer sizes to see where the tipping point is.

geoffkizer commented 3 years ago

@geoffkizer with those numbers did you max out the CPU?

No, I'm simply taking Cory's test from above and modifying it slightly.

I do agree it would be better to do a real throughput test here and max out CPU, but that's not what the code does currently.

davidfowl commented 3 years ago

Based on what I said above, I think there are some scenarios where this is clearly going to be a win. We need to find the right heuristic though, because doing it all the time definitely isn't.

Do we want to invest the effort here exploring this? Measuring something real means making changes to the stack and trying them out.

The scenarios that stick out to me so far are websockets and chunked encoding of large buffers. These are write scenarios.

I'm not convinced it'll help for TLS as much because the data needs to be transformed...

geoffkizer commented 3 years ago

The scenarios that stick out to me so far are websockets and chunked encoding of large buffers. These are write scenarios.

I agree

I'm not convinced it'll help for TLS as much because the data needs to be transformed...

I agree

Do we want to invest the effort here exploring this?

It would certainly be good to get more data here.

Edit: In particular, we've called out that nonencrypted, chunked responses with large chunks are a scenario that could benefit.

My suggestion would be to do a real WRK (or similar) based throughput test across machines (no loopback) for this scenario.

The test server would be a simple socket-based server that reads the request up to \r\n\r\n, then sends the response, in each of the following ways:

(1) Three separate writes for (a) response header + chunk prefix; (b) chunk contents; (c) chunk suffix + final chunk
(2) Single buffered write, where the buffer is constructed each time (i.e. copy (a)+(b)+(c) into a single buffer and send)
(3) Single gathered write where we pass (a)/(b)/(c) as the list of gathered buffers

Then run variations with different chunk sizes.

scalablecory commented 3 years ago

Lots to unpack here 😄.

Just to keep me from getting too myopic and only thinking of HTTP, I'll start with a goal and a non-goal for this API:

Stream's lack of guarantees around efficiency makes it difficult to use when you have multiple buffers you want to write.

And the non-goal: gathering is not an alternative to buffering. It would only replace buffering on a per-Write basis, not as a general strategy. As demonstrated, it actually works in conjunction with buffering by further reducing I/O count in the scenario where you'd need to flush immediately after overflowing your buffer. It won't stop a user's HttpContent from inefficiently making a bunch of small writes to a Stream, for instance.

And specifically for HttpClient/others:

We should decide if we want to introduce buffering. Buffering has some costs, of course -- the one I'm most interested in understanding is the extra per-connection memory usage we'd introduce.

Either way, gathering would be able to help. If we decide we don't want additional memory usage and don't buffer, it'll help. If we decide we do want to buffer, it will help us extract better perf out of that buffer, or may lead to us using a smaller buffer size than we otherwise would.

At least for SChannel, when considering gathered VS perfectly buffered, I don't think gathering will be a major win (read: reduced I/O counts) beyond the generic benefits described above.

The little test app I wrote doesn't require localhost; I'll run it on my network to see how it affects things.

scalablecory commented 3 years ago

The test server would be a simple socket-based server that reads the request up to \r\n\r\n, then sends the response, in each of the following ways:

(1) Three separate writes for (a) response header + chunk prefix; (b) chunk contents; (c) chunk suffix + final chunk
(2) Single buffered write, where the buffer is constructed each time (i.e. copy (a)+(b)+(c) into a single buffer and send)
(3) Single gathered write where we pass (a)/(b)/(c) as the list of gathered buffers

These three scenarios are tested by my code: (1) is unbuffered, (2) is buffered or buffered+gathered, (3) is unbuffered gathered -- what is it you're trying to look at?

davidfowl commented 3 years ago

We should decide if we want to introduce buffering. Buffering has some costs, of course -- the one I'm most interested in understanding is the extra per-connection memory usage we'd introduce.

I'm spending lots of my 6.0 dev time reducing this :smile:. Just don't stash stuff on the types, rent them just in time. Though for reads that's difficult. I suggest you employ the zero-byte read tactic as well. I haven't looked at YARP memory usage as yet.

geoffkizer commented 3 years ago

These three scenarios are tested by my code: (1) is unbuffered, (2) is buffered or buffered+gathered, (3) is unbuffered gathered -- what is it you're trying to look at?

I agree; I basically stole them from your code :) In general I think the scenarios you implemented are fine. My concerns are more about how the code runs.

Specifically:
(1) It's basically measuring linear elapsed time instead of overall throughput (i.e. maxing CPU with many concurrent requests). The latter is a better measure of what we really care about. And it can sometimes hit issues like lock contention or cache problems that you wouldn't see in a simple linear test.
(2) It reports the best run of a set of runs. I'd rather see an average reported. I saw a fair amount of variance in my runs, and I'm guessing it may have something to do with this.
(3) It would be good to run remotely instead of over loopback. The amount of work happening in these tests is pretty small overall and I'd expect the actual cost of networking IO to play an important role in overall throughput. Loopback is highly optimized to just shuffle bits around and not actually construct IP packets etc.

We have a bunch of server perf tests that run in crank using WRK, and we have a bunch of infrastructure around running these automatically and getting various measurements from them etc. That's why I suggested doing something similar here.

geoffkizer commented 3 years ago

Stream's lack of guarantees around efficiency make it difficult to use when you have multiple buffers you want to write.

Oh, I completely agree here. You have a good list of problems. I'd like to do better here, but I'm not sure how.

Gathered writes avoid all of this by giving every layer full control over the entire write rather than just a piece of it.

This is where I start to disagree. Gathered writes help, but I'm not sure they really help all that much. I think it makes a few specific scenarios somewhat better. Scenarios like the one discussed above -- chunked encoding with large chunks and no SSL. It doesn't help with Content-Length encoding, it doesn't help with SSL, it doesn't help with HTTP2 or HTTP3, etc.

That said, it would be good to get better data about all of these scenarios so we can quantify the potential benefits here.

Ultimately this comes down to a prioritization exercise. How much benefit vs how much cost?

scalablecory commented 3 years ago

(1) It's basically measuring linear elapsed time instead of overall throughput (i.e. maxing CPU with many concurrent requests). The latter is a better measure of what we really care about. And it can sometimes hit issues like lock contention or cache problems that you wouldn't see in a simple linear test.
(2) It reports the best run of a set of runs. I'd rather see an average reported. I saw a fair amount of variance in my runs, and I'm guessing it may have something to do with this.
(3) It would be good to run remotely instead of over loopback. The amount of work happening in these tests is pretty small overall and I'd expect the actual cost of networking IO to play an important role in overall throughput. Loopback is highly optimized to just shuffle bits around and not actually construct IP packets etc.

Hrm, that's strange; it maxes out a core at 100% on mine. Also, it can run over the network just fine; you can call it with an IP to bind to.