Open geoffkizer opened 6 years ago
cc @stephentoub
cc @davidfowl -- I can imagine this might be interesting for you as well.
Oh nice! Yes I like this. Should we also consider adding overloads for ReadOnlySequence\<byte>?
This could work for Sockets cross-plat, and on Linux/Mac for file I/O. On Windows we could emulate it trivially for files. There would be a long period of writing optimized methods for SslStream, compression, etc., but that could be done piecemeal.
Other considerations:
- `IReadOnlyList<ReadOnlyMemory<byte>>` for efficiency, but might consider adding convenience methods for `ReadOnlySequence<byte>` for symmetry?

We may need these for efficient HTTP/3 if the QUIC implementation doesn't implement one of nagle/cork. We should coordinate `QuicStream` additions with this proposal. CC @jkotalik
Following up re: HTTP/3. We ended up implementing this in our `QuicStream`, as it does not support nagle/cork. We used `ROM<ROM<byte>>` but `IReadOnlyList<ROM<byte>>` would work just as well.
It allows us to gather headers, framing, and user data in a single I/O:
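For illustration, a minimal sketch of what that single-I/O gather looks like. The `ReadOnlyMemory<ReadOnlyMemory<byte>>` overload is the shape described above, and the method/buffer names are hypothetical, not a final API:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch (not the final API): gather headers, framing, and
// user data into a single I/O via the ROM<ROM<byte>> overload described above.
static async Task SendRequestAsync(
    QuicStream stream,
    ReadOnlyMemory<byte> headers,   // encoded request headers
    ReadOnlyMemory<byte> framing,   // HTTP/3 frame header
    ReadOnlyMemory<byte> userData,  // user content
    CancellationToken cancellationToken)
{
    ReadOnlyMemory<ReadOnlyMemory<byte>> buffers = new ReadOnlyMemory<byte>[]
    {
        headers, framing, userData
    };
    // One syscall instead of three; no coalescing copy needed.
    await stream.WriteAsync(buffers, cancellationToken);
}
```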
We would use this in HTTP/1 and HTTP/2 in a similar way.
Having gathered writes will allow us to optimize the TLS record size in SslStream. Currently, if you have several buffers and use separate Write calls, it will create separate records, adding more overhead.
See https://hpbn.co/transport-layer-security-tls/#optimize-tls-record-size
We have defined some gather write operations for QUIC on the QuicStream class. Ideally, these would not be QUIC-specific; i.e. we'd add them to Stream and then implement them in QuicStream.
@carlossanlop any questions here or can we move it to API review?
I would like to see a prototype of how these will be consumed (both read and write) and perf numbers to demonstrate it's meaningful enough to warrant adding more virtuals to Stream. Where are we going to use the scattered read methods? How much does the gathered write support improve HTTP/2 throughput? How are callers going to avoid allocating IReadOnlyList instances?
I'm not against adding these. But we should take any new virtuals on Stream seriously and do our due diligence.
How are callers going to avoid allocating IReadOnlyList instances?
This in particular is something I was worried about. Also kestrel would use this for writes but not reads.
@tmds 's custom transport also wanted to use these.
Why not something directly on Socket rather than virtuals on Stream?
How are callers going to avoid allocating IReadOnlyList instances?
The idea is you use the same List/array for every send/receive. So you only need one allocation per socket, which doesn't seem like an issue.
Why not something directly on Socket rather than virtuals on Stream?
Socket already supports this. https://github.com/dotnet/runtime/blob/master/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketTaskExtensions.cs#L38
The point of this would be to support it in non-Socket cases, like SslStream or QuicStream.
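For reference, a minimal sketch of the existing Socket gathered-send path (the linked `SendAsync` overload taking `IList<ArraySegment<byte>>`); the buffer names are placeholders:

```csharp
using System;
using System.Collections.Generic;
using System.Net.Sockets;
using System.Threading.Tasks;

// Socket already supports gathered sends via IList<ArraySegment<byte>>.
static async Task<int> SendGatheredAsync(
    Socket socket, byte[] headerBytes, byte[] payloadBytes)
{
    var segments = new List<ArraySegment<byte>>
    {
        new ArraySegment<byte>(headerBytes),  // e.g. framing/header buffer
        new ArraySegment<byte>(payloadBytes), // e.g. user data
    };
    // Reusing the same list instance across sends keeps this to one
    // allocation per socket, as noted above.
    return await socket.SendAsync(segments, SocketFlags.None);
}
```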
How much does the gathered write support improve http/2 throughput?
This is a hard question to answer authoritatively, because there's always a way to avoid using scatter/gather by simply coalescing buffers. That said, coalescing has some cost itself, which can vary based on a variety of factors; and also it's a bit of a pain to do.
@scalablecory said above:
It allows us to gather headers, framing, and user data in a single I/O [for HTTP3]:
And that's a nice convenience, but it's not clear to me that gathered writes on QuicStream are necessary to achieve this. We already achieve this in HTTP1.1/HTTP2 by managing our buffers to specifically enable this.
I prototyped this in HTTP/1 and I saw a measurable but very small impact. I don't remember the exact numbers but it was less than 1-2%.
With HTTP/3, I use it to avoid buffering every data frame when content length isn't known, just to prepend a handful of bytes. I haven't bothered to measure in this case because it was such an obvious and easy win.
I prototyped this in HTTP/1
For chunked encoding, I'm guessing?
It does seem like chunked encoding and HTTP3 framing when length is unknown are pretty much ideal use cases for this. It avoids a potentially large buffer copy.
I prototyped this in HTTP/1
For chunked encoding, I'm guessing?
I believe I was also deferring sending headers buffer until sending content when possible, to send them out in one go. There were a few opportunities.
(With nodelay on, this can result in 1 packet instead of 2 for some requests)
The biggest cost is in the number of syscalls. Copying data is cheap compared to syscalls. So coalescing in userspace is probably going to give you most of the gain, without introducing dedicated APIs.
When implementing a protocol using `System.IO.Pipelines`, you end up copying anyway because there is no way to pass `Memory` from one `Pipe` to another.
The biggest cost is in the number of syscalls. Copying data is cheap compared to syscalls. So coalescing in userspace is probably going to give you most of the gain, without introducing dedicated APIs.
This is exactly why I was looking for concrete perf data.
Socket already supports this
To do this well, NetworkStream wouldn't be able to use those existing APIs (mismatch of input and output types), so either it would need to access new internal methods, or Socket would need new public ones.
To do this well, NetworkStream wouldn't be able to use those existing APIs (mismatch of input and output types)
True. It would be nice to add Memory overloads to Socket for these cases anyway. But it does mean more work to wire this up all the way.
It would be nice to add Memory overloads to Socket for these cases anyway. But it does mean more work to wire this up all the way.
The Windows path is actually fairly easy (I did a minimal version of this for my prototyping), but the Linux PAL is scary huge.
There's a lot of code in the Linux PAL stuff, but it should be relatively straightforward.
Confirmed this morning that we can implement this in our SChannel code. This would give us more efficient TLS, as each SslStream.Write() needs to write not just data but an envelope.
Numbers?
I think we should do this and get some numbers to back it up. We'd need a basic minimal implementation in order to compare apples to apples. Here's the logic that SslStream uses to write TLS record frames (which have a max size of 16K) to the wrapped Stream.
This could change into logic that stored a list of 16K buffers and passed it to the next layer (which should be more efficient).
I think we should do this and get some numbers to back it up.
+1
This could change into logic that stored a list of 16K buffers and passed it to the next layer (which should be more efficient).
"More efficient" in what sense? I'm trying to understand what we are proving out here.
"More efficient" in what sense? I'm trying to understand what we are proving out here.
Two ways off the top of my head:
I don't think we would add the APIs if they didn't have any performance improvements; it's just that we need to actually add them to measure.
I understand the potential benefits here. I'm just not sure what specific experiment you are proposing here, and how it would demonstrate these. What's the A vs B here?
I understand the potential benefits here. I'm just not sure what specific experiment you are proposing here, and how it would demonstrate these. What's the A vs B here?
I'm working on proving this out right now.
The scenarios I have in mind for writes are:
And for reads:
The metrics I'm gathering are:
From prior experience, I can say that in C++ using scatter/gather is a very easy win. The main concern I have here is how much overhead our current PALs are adding, there might be some vague line where copying is cheaper than managing multiple buffers.
It's also easy to say "memory bandwidth is cheap so let's just not complicate the API". We should certainly weigh that cost, though my opinion is generally that -- .NET being a cloud-native platform -- we should try to avoid leaving perf on the table when it comes to I/O.
For the proposed API, there is no indication to the user how many buffers makes sense to provide for reading.
`System.IO.Pipelines` has some support for multi-buffer reading (as `ReadResult`), and writing (via `PipeWriter`). It does come at the cost of copying from pipe to pipe.

If you care mostly about saving syscalls for `Socket`, and less about there being copies, `NetworkStream` can handle this without extending the `Stream` API.
On a `recvmsg` it can provide some additional buffers. Subsequent calls to `Read` on the `NetworkStream` can deplete these buffers before making a new syscall.

For writing, `NetworkStream` can act as a `BufferedStream` backed by multiple buffers. On `Flush`, any buffered data gets sent.
These APIs IMO make more sense for writing than for reading. Servers don't use NetworkStream, so only adding it there won't benefit the server stack. There are more APIs that support gathered I/O for writes (IIS and HTTP.sys, for example, support this for writing). It's needed at the Stream layer because of things like SslStream that wrap another stream and need to chunk up writes into 16K pieces before writing to the underlying stream.
This would also be added to the PipeWriter abstraction so that PipeWriterStream could take advantage of it (without copying). We could pass the segment list to the underlying stream in a single write. The DefaultPipeWriter would of course need to copy, but that's beside the point (it also copies today for single buffer writes).
To hammer home that this isn't just a NetworkStream concern:
Okay, I'm done benchmarking on Windows for now. I can go further and create more scenarios if needed.
I tested four different I/O strategies: buffered, unbuffered, buffered + gathered, and unbuffered + gathered (see the tables below). The buffered strategy approximates what `SocketsHttpHandler` does today with request content: headers and enveloping are write-combined when possible, but request content is not.

Some takeaways:
POST with two HttpContent Writes():
I/O strategy | Number of I/Os | Speed |
---|---|---|
Buffered + Gathered | 1 | 136,502 (+464%) |
Buffered | 2 | 76,269 (+215%) |
Unbuffered + Gathered | 2 | 74,539 (+208%) |
Unbuffered | 7 | 24,172 (+0%) |
POST with one HttpContent Write():
I/O strategy | Number of I/Os | Speed |
---|---|---|
Buffered | 1 | 160,789 (+370%) |
Unbuffered + Gathered | 1 | 152,958 (+347%) |
Buffered + Gathered | 1 | 150,085 (+339%) |
Unbuffered | 5 | 34,183 (+0%) |
Both of these also provide a meaningful approximation of coalescing writes in HTTP/2 vs buffering. Beyond avoiding copying everything, this will provide some scalability benefit by improving concurrency between streams.
edit: updated with TCP_NODELAY turned on, which cleaned up some numbers but didn't meaningfully change the results.
For the proposed API, there is no indication to the user how many buffers makes sense to provide for reading.
I don't expect many people to use the scattered read API -- it's a lot less generally applicable than gathered writes. It will mostly be beneficial for segmented buffering strategies, especially ones that recycle buffers.
This is the buffering strategy I'm using for HTTP/2 in LLHTTP and I think Pipelines (@davidfowl correct me if I'm wrong) does the same thing.
You have three segments that make up your read buffer, and the 1st segment is partially filled:
▓▓▓▓░ ░░░░░ ░░░░░
You can make a read to fill in what is left of the 1st buffer, but in that case you're incurring the cost of an I/O for what might be a small number of bytes. And then two more I/Os for the subsequent buffers.
OR
You can do a scattered read to fill in all of that last chunk AND the remaining free buffers in a single I/O.
I've also used this when filling a single-array ring buffer -- you can read into the back half and the front half in a single I/O.
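A sketch of that scattered read, using a hypothetical multi-buffer `ReadAsync` overload (the overload and the segment layout are assumptions, not an existing Stream API):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical scatter-read: fill the remainder of the partially filled
// segment plus the free segments in a single I/O.
static async Task<int> FillSegmentsAsync(
    Stream stream,
    Memory<byte> segment1, int bytesFilled, // segment1 is partially filled
    Memory<byte> segment2,
    Memory<byte> segment3,
    CancellationToken cancellationToken)
{
    IReadOnlyList<Memory<byte>> freeSpace = new[]
    {
        segment1.Slice(bytesFilled), // tail of the partially filled segment
        segment2,                    // next free segment
        segment3,                    // next free segment
    };
    // Assumed overload from this proposal, not an existing Stream API.
    return await stream.ReadAsync(freeSpace, cancellationToken);
}
```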
That's a nice transparent update that could be made when using the `StreamPipeReader` today if we added the read overload.
The other thing I've come to notice while investigating concurrent connection memory is the fact that each layer of the stack tries to buffer to avoid doing multiple writes when they can. For example, when you do a websocket send, we allocate a buffer internally, copy the user buffer into it, and add framing before the underlying send:
This is expensive for larger buffers but we don't change logic here based on how big the payload is.
This pattern also happens with TLS: we internally allocate a buffer that's payload + frame header, copy the user buffer again into it, and write to the underlying `Stream`:
Kestrel isn't absolved in this; it also copies the buffers written by SslStream into its internal buffers and then writes a big batch to the socket.
The Socket APIs then internally copy from the user-mode buffer to the kernel buffer. We save on the number of syscalls here by using a single call to write with many buffers, but we still pay the price of the copy.
I was sending 1MB payloads and the time spent in memcpy is astonishing 😄 .
I think this API could have a big impact in networking scenarios like this, where we can remove internal buffers and instead preserve the N buffers for as long as possible, until we hit a part of the stack that has to buffer, merge, or transform.
I was sending 1MB payloads and the time spent in memcpy is astonishing
Got data?
I'm trying to figure out what conclusions to draw from this data.
An obvious one is, fewer IOs are better.
But when performing the same # of IOs, buffering seems faster than gathering (either buffered + gathered or unbuffered + gathered).
Doesn't this imply we should just do buffering?
BTW, I ran the "two HttpContent Writes" scenario with an 8K write buffer (instead of 4K). Here's what I got:
I/O strategy | Number of I/Os | Speed |
---|---|---|
Buffered | 1 | 114,048 |
Buffered + Gathered | 1 | 104,672 |
Unbuffered + Gathered | 2 | 57,353 |
Unbuffered | 7 | 18,398 |
Again, # of IOs is most important; again, Buffered beats both Gathered strategies.
What's also interesting here, though, is that Unbuffered + Gathered suffers hugely because it now has to perform two separate writes. And that's because gathering does not allow coalescing across user writes, whereas buffering does.
You don't want to buffer at all of the layers because that's wasted CPU cycles copying. Especially for larger buffers, it's wasteful. The APIs we expose today require the caller to provide the buffer, so we need to do something with it, and the options are to copy or to write directly (when there's no transform of the data itself required). I can see 2 approaches:
As an example of this, consider a chunked encoding write implementation. WriteAsync wants to wrap the user provided data with a header and a footer:
If data is small enough (say < 2K), then it makes sense to copy that into an internal buffer before sending. But let's say the user wrote 65K (large file transfer or something). Then doing multiple writes or a single write with multiple buffers likely makes more sense.
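A sketch of that heuristic, assuming the proposed gathered-write overload on the underlying stream; the 2K threshold, field names, and helper methods are made up for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Illustrative heuristic for a chunked-encoding WriteAsync.
async ValueTask WriteChunkAsync(ReadOnlyMemory<byte> data, CancellationToken ct)
{
    const int CopyThreshold = 2048; // illustrative cutoff

    if (data.Length < CopyThreshold)
    {
        // Small write: copy header + data + trailer into one internal
        // buffer and send with a single existing WriteAsync.
        await WriteCoalescedChunkAsync(data, ct); // hypothetical helper
    }
    else
    {
        // Large write: pass header/data/trailer down as a gathered write
        // instead of copying 65K of user data.
        IReadOnlyList<ReadOnlyMemory<byte>> buffers = new[]
        {
            EncodeChunkHeader(data.Length), // e.g. "10000\r\n" (hypothetical)
            data,
            ChunkTrailerBytes               // "\r\n" (hypothetical field)
        };
        await _innerStream.WriteAsync(buffers, ct); // proposed overload
    }
}
```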
BTW, the multi-buffer writes is just a way of pushing that eventual copy further down the stack. It doesn't stop the eventual copy, just the intermediate ones.
@geoffkizer with those numbers did you max out the CPU? You should also try it with multiple buffer sizes to see where the tipping point is.
@geoffkizer with those numbers did you max out the CPU?
No, I'm simply taking Cory's test from above and modifying it slightly.
I do agree it would be better to do a real throughput test here and max out CPU, but that's not what the code does currently.
Based on what I said above, I think there are some scenarios where this is clearly going to be a win. We need to find the right heuristic though, because doing it all the time definitely isn't.
Do we want to invest the effort here exploring this? Measuring something real means making changes to the stack and trying them out.
The scenarios that stick out to me so far is websockets and chunked encoding of large buffers. These are write scenarios.
I'm not convinced it'll help for TLS as much because the data needs to be transformed...
The scenarios that stick out to me so far is websockets and chunked encoding of large buffers. These are write scenarios.
I agree
I'm not convinced it'll help for TLS as much because the data needs to be transformed...
I agree
Do we want to invest the effort here exploring this?
It would certainly be good to get more data here.
Edit: In particular, we've called out that nonencrypted, chunked responses with large chunks are a scenario that could benefit.
My suggestion would be to do a real WRK (or similar) based throughput test across machines (no loopback) for this scenario.
The test server would be a simple socket-based server that reads the request up to \r\n\r\n, then sends the response, in each of the following ways:
1. Three separate writes for (a) response header + chunk prefix; (b) chunk contents; (c) chunk suffix + final chunk
2. Single buffered write, where the buffer is constructed each time (i.e. copy (a)+(b)+(c) into a single buffer and send)
3. Single gathered write where we pass (a)/(b)/(c) as the list of gathered buffers
Then run variations with different chunk sizes.
Lots to unpack here 😄.
Just to keep me from getting too myopic and only thinking of HTTP, I'll start with a goal and a non-goal for this API:
The goal: `Stream`'s lack of guarantees around efficiency makes it difficult to use when you have multiple buffers you want to write. SslStream's 16KB SChannel buffer, HTTP/2 framing, TLS framing, etc. -- it is very easy to increase buffer sizes until you find one that works "well", but it's very challenging to find an actually optimal size.

And the non-goal: gathering is not an alternative to buffering. It would only replace buffering on a per-Write basis, not as a general strategy. As demonstrated, it actually works in conjunction with buffering by further reducing I/O count in the scenario where you'd need to flush immediately after overflowing your buffer. It won't stop a user's `HttpContent` from inefficiently making a bunch of small writes to a `Stream`, for instance.
And specifically for `HttpClient`/others:
We should decide if we want to introduce buffering. Buffering has some costs, of course -- the one I'm most interested in understanding is the extra per-connection memory usage we'd introduce.
Either way, gathering would be able to help. If we decide we don't want additional memory usage and don't buffer, it'll help. If we decide we do want to buffer, it will help us extract better perf out of that buffer, or may lead to us using a smaller buffer size than we otherwise would.
At least for SChannel, when considering gathered vs. perfectly buffered, I don't think gathering will be a major win (read: reduced I/O counts) beyond the generic benefits described above.
The little test app I wrote doesn't require localhost; I'll run it on my network to see how it affects things.
The test server would be a simple socket-based server that reads the request up to \r\n\r\n, then sends the response, in each of the following ways:
1. Three separate writes for (a) response header + chunk prefix; (b) chunk contents; (c) chunk suffix + final chunk
2. Single buffered write, where the buffer is constructed each time (i.e. copy (a)+(b)+(c) into a single buffer and send)
3. Single gathered write where we pass (a)/(b)/(c) as the list of gathered buffers
These three scenarios are tested by my code: (1) is unbuffered, (2) is buffered or buffered+gathered, (3) is unbuffered gathered -- what is it you're trying to look at?
We should decide if we want to introduce buffering. Buffering has some costs, of course -- the one I'm most interested in understanding is the extra per-connection memory usage we'd introduce.
I'm spending lots of my 6.0 dev time reducing this :smile:. Just don't stash stuff on the types, rent them just in time. Though for reads that's difficult. I suggest you employ the zero-byte read tactic as well. I haven't looked at YARP memory usage as yet.
These three scenarios are tested by my code: (1) is unbuffered, (2) is buffered or buffered+gathered, (3) is unbuffered gathered -- what is it you're trying to look at?
I agree; I basically stole them from your code :) In general I think the scenarios you implemented are fine. My concerns are more about how the code runs.
Specifically:

1. It's basically measuring linear elapsed time instead of overall throughput (i.e. maxing CPU with many concurrent requests). The latter is a better measure of what we really care about. And it can sometimes hit issues like lock contention or cache problems that you wouldn't see in a simple linear test.
2. It reports the best run of a set of runs. I'd rather see an average reported. I saw a fair amount of variance in my runs, and I'm guessing it may have something to do with this.
3. It would be good to run remotely instead of over loopback. The amount of work happening in these tests is pretty small overall and I'd expect the actual cost of networking IO to play an important role in overall throughput. Loopback is highly optimized to just shuffle bits around and not actually construct IP packets etc.
We have a bunch of server perf tests that run in crank using WRK, and we have a bunch of infrastructure around running these automatically and getting various measurements from them etc. That's why I suggested doing something similar here.
Stream's lack of guarantees around efficiency make it difficult to use when you have multiple buffers you want to write.
Oh, I completely agree here. You have a good list of problems. I'd like to do better here, but I'm not sure how.
Gathered writes avoid all of this by giving every layer full control over the entire write rather than just a piece of it.
This is where I start to disagree. Gathered writes help, but I'm not sure they really help all that much. I think it makes a few specific scenarios somewhat better. Scenarios like the one discussed above -- chunked encoding with large chunks and no SSL. It doesn't help with Content-Length encoding, it doesn't help with SSL, it doesn't help with HTTP2 or HTTP3, etc.
That said, it would be good to get better data about all of these scenarios so we can quantify the potential benefits here.
Ultimately this comes down to a prioritization exercise. How much benefit vs how much cost?
1. It's basically measuring linear elapsed time instead of overall throughput (i.e. maxing CPU with many concurrent requests). The latter is a better measure of what we really care about. And it can sometimes hit issues like lock contention or cache problems that you wouldn't see in a simple linear test.
2. It reports the best run of a set of runs. I'd rather see an average reported. I saw a fair amount of variance in my runs, and I'm guessing it may have something to do with this.
3. It would be good to run remotely instead of over loopback. The amount of work happening in these tests is pretty small overall and I'd expect the actual cost of networking IO to play an important role in overall throughput. Loopback is highly optimized to just shuffle bits around and not actually construct IP packets etc.
Hrm that's strange; it maxes out a core 100% on mine. Also it can run over network just fine; you can call it with IP to bind to.
The Socket class has this already, in the form of Send/Receive operations that take an `IList<ArraySegment<byte>>`. This isn't exposed up through Stream classes, however.

One specific scenario where this would be useful is writing chunked-encoded request bodies in SocketsHttpHandler. When the user calls the request Stream's Write[Async] method, we need to first write a chunk header with the chunk length followed by \r\n, then write the chunk contents from the user's buffer to the underlying stream (which is either SslStream or NetworkStream). This requires us to either do two writes (which means two kernel calls, and means the chunk header will likely be sent in a separate packet), or to copy into a local buffer so we can do a single write. It would be more performant (and simpler) to do a gather write where we pass two buffers, the first containing the chunk header, the second containing the chunk data, passed directly through.
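A sketch of that gather write for the chunked-encoding case, assuming the proposed `Stream` overload (the header encoding shown is illustrative):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

// Instead of two kernel calls or a local-buffer copy, pass both buffers
// down to SslStream/NetworkStream in one gathered write.
static async ValueTask WriteChunkAsync(
    Stream innerStream, ReadOnlyMemory<byte> userBuffer, CancellationToken ct)
{
    // Chunk header: hex length followed by \r\n.
    byte[] chunkHeader = Encoding.ASCII.GetBytes($"{userBuffer.Length:X}\r\n");
    IReadOnlyList<ReadOnlyMemory<byte>> buffers = new ReadOnlyMemory<byte>[]
    {
        chunkHeader,
        userBuffer
    };
    await innerStream.WriteAsync(buffers, ct); // proposed gather overload
}
```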
There are various examples in HTTP2 where scatter/gather IO would be useful as well. (Mostly gather, since gather helps deal with the "framing" scenario above. Scatter reads are less useful since you don't control the framing yourself.)
Proposed API:
The base Stream class would provide default implementations on top of existing Read/Write APIs.
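A sketch of what that might look like; the signatures are illustrative only and would go through API review:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public partial class Stream
{
    // Gathered write: default implementation falls back to looping over
    // the existing single-buffer WriteAsync.
    public virtual async ValueTask WriteAsync(
        IReadOnlyList<ReadOnlyMemory<byte>> buffers,
        CancellationToken cancellationToken = default)
    {
        for (int i = 0; i < buffers.Count; i++)
        {
            await WriteAsync(buffers[i], cancellationToken).ConfigureAwait(false);
        }
    }

    // Scattered read: default implementation reads into the first
    // non-empty buffer using the existing single-buffer ReadAsync.
    public virtual ValueTask<int> ReadAsync(
        IReadOnlyList<Memory<byte>> buffers,
        CancellationToken cancellationToken = default)
    {
        for (int i = 0; i < buffers.Count; i++)
        {
            if (!buffers[i].IsEmpty)
            {
                return ReadAsync(buffers[i], cancellationToken);
            }
        }
        return new ValueTask<int>(0);
    }
}
```

Derived streams like NetworkStream, SslStream, and QuicStream could then override these to perform a true scatter/gather I/O.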