cloudflare / quiche

🥧 Savoury implementation of the QUIC transport protocol and HTTP/3
https://docs.quic.tech/quiche/
BSD 2-Clause "Simplified" License

QUIC (curl+quiche) unable to match performance of TCP in simulated network #1400

Open ygina opened 1 year ago

ygina commented 1 year ago

I am running some experiments comparing the performance of QUIC vs TCP in a simulated network, and wondering why QUIC is unable to match the performance of TCP, or at least come much closer. I am using curl compiled with quiche and sending either an HTTP/3 or HTTP/1.1 POST request with a variable-size payload (x-axis) to a webserver with one hop in between (Fig. 1). The near subpath has 100 Mbps bandwidth and 2 ms RTT, the far subpath has 10 Mbps and 150 ms RTT, and the egress queue at each host has a maximum size of 140 packets. The queue size was calculated to allow at least 1 BDP: (0.152 s × 10 Mbps) / (8 bits/byte × 1500 bytes/packet) = 0.152 × 10,000,000 / 8 / 1500 ≈ 127 packets.
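
As a quick sanity check of that arithmetic (a standalone sketch, not code from the experiment):

    // Queue sizing: 1 BDP at the bottleneck subpath.
    fn main() {
        let rtt_s: f64 = 0.152;           // ~2 ms + 150 ms path RTT
        let rate_bps: f64 = 10_000_000.0; // 10 Mbps bottleneck
        let packet_bytes: f64 = 1500.0;   // MTU-sized packets
        let bdp_bytes = rtt_s * rate_bps / 8.0;     // 190,000 bytes
        let bdp_packets = bdp_bytes / packet_bytes; // ~127 packets
        println!("BDP = {bdp_bytes:.0} bytes = {bdp_packets:.0} packets");
    }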

The optimal throughput based on the bottleneck link is 10 Mbps = 1.25 MB/s. TCP achieves this but QUIC is far below, and I am trying to figure out why. This difference feels too large to explain by userspace overheads alone.

I am observing that QUIC sets a very conservative cwnd compared to TCP (Fig. 2). I plotted this graph by logging each time the cwnd changed in the quiche code for a single 20 MB transfer, and a TCP iperf3 connection for the equivalent amount of time. The optimal cwnd should be ~1 BDP = 127 packets × 1500 bytes ≈ 190 kB. The TCP cwnd hovers above this, and the QUIC cwnd sits much below.

*Fig. 1: HTTP POST request throughput (left). Fig. 2: cwnd for a single 20 MB request (right).*

Previously, I have also observed that while TCP pretty much always sends MTU-sized packets (~1500 bytes), QUIC often sends a little less. Since the queue limit is based on number of packets rather than bytes, maybe this is artificially reducing the cwnd? But why wouldn't QUIC just send MTU-sized packets? Would love to hear any thoughts.

ygina commented 1 year ago

In these experiments, I also increased the MTU from 1200 to 1460 in quiche to be comparable to TCP.

LPardue commented 1 year ago

Do you see similar behaviour when using the quiche-client application for an HTTP/3 upload?

LPardue commented 1 year ago

> In these experiments, I also increased the MTU from 1200 to 1460 in quiche to be comparable to TCP.

Do you mean you changed curl? https://github.com/curl/curl/blob/fc8ad0b23c9fd922ebdb89348d467df037da1775/lib/vquic/curl_quiche.c#LL423C21-L423C21

junhochoi commented 1 year ago

@ygina can you describe your test environment a little more?

ygina commented 1 year ago

> Do you mean you changed curl?

Yup, I added a line in curl here: `quiche_config_set_max_send_udp_payload_size(qs->cfg, 1460)`.
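
(For reference, the curl code uses quiche's C FFI; a sketch of the equivalent knob in quiche's Rust API, assuming a freshly created config:)

    // Equivalent Rust-side configuration; the C call in curl wraps
    // this quiche::Config setter.
    let mut config = quiche::Config::new(quiche::PROTOCOL_VERSION)?;
    config.set_max_send_udp_payload_size(1460); // cap outgoing UDP payloads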

> can you describe your test environment a little more?

    http {
        server {
            # Enable QUIC and HTTP/3.
            listen 443 quic reuseport;

            # Enable HTTP/2 (optional).
            listen 443 ssl http2;

            ssl_certificate      <cert.pem>;
            ssl_certificate_key  <key.pem>;

            # Enable all TLS versions (TLSv1.3 is required for QUIC).
            ssl_protocols TLSv1 TLSv1.1 TLSv1.2 TLSv1.3;

            # Add Alt-Svc header to negotiate HTTP/3.
            add_header alt-svc 'h3=":443"; ma=86400';

            location = "/" {
                client_max_body_size 100M;
                client_body_buffer_size 100M;
                client_body_in_single_buffer on;
                client_body_temp_path /tmp/client_body_temp;

                proxy_pass http://127.0.0.1:8000;
            }
        }
    }


* OS is Ubuntu 22.04.1 LTS (Jammy Jellyfish), but running in mininet as described above.

> any noticeable difference between quiche default mtu, or 1460 mtu?

image

This is only one trial, but it looks like 1460 MTU is slightly better than the default 1200.

junhochoi commented 1 year ago

> > Do you mean you changed curl?
>
> Yup, I added a line in curl here: `quiche_config_set_max_send_udp_payload_size(qs->cfg, 1460)`.

I guess you also need to update the spot @LPardue pointed to in curl, in addition to your change, because that code will limit the outgoing packet size regardless of what you've set using quiche_config_set_max_send_udp_payload_size(). There are probably other places to look. Trying to upload with quiche-client/quiche-server would be helpful too.

ygina commented 1 year ago

> I guess you also need to update the spot @LPardue pointed to in curl, in addition to your change, because that code will limit the outgoing packet size regardless of what you've set using quiche_config_set_max_send_udp_payload_size(). There are probably other places to look.

@junhochoi Oh yeah, I'd changed that too: https://github.com/ygina/curl/blob/2masot-sidecar/lib/vquic/quiche.c#L549. I checked in Wireshark that the maximum packet sizes had increased.

> Trying to upload with quiche-client/quiche-server would be helpful too.

It seems like the current quiche-server doesn't actually read the HTTP request payload (https://github.com/cloudflare/quiche/blob/master/apps/src/common.rs#L1555)? But I was able to use quiche-client to POST requests to the nginx server using the --body option (sorry the colors aren't consistent):

image

Wonder what's happening at 10MB there...

LPardue commented 1 year ago

10 MB sounds like the example flow-control window size. What values does the server have configured? Possibly defaults based on your nginx conf.

LPardue commented 1 year ago

@ygina please open a separate issue about POST handling in quiche-server; we should properly consume and drain the stream to help tests like this. It should be a straightforward change.

ygina commented 1 year ago

> 10 MB sounds like the example flow-control window size. What values does the server have configured? Possibly defaults based on your nginx conf.

I'm not immediately sure what you mean by "example flow-control window size." Mind elaborating?

LPardue commented 1 year ago

I should have been clearer, sorry; the description in https://github.com/cloudflare/quiche/tree/master/nginx#http3_initial_max_data should help. Unless you specify something different in your configuration, you'll be applying a flow control limit of 10 MB.
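
(For anyone reading along, these nginx directives map onto quiche's transport limits; a hedged sketch of the equivalent Rust-side settings, using the 100 MB connection / 1 MB stream figures discussed in this thread:)

    // Per-connection and per-stream flow control credit in quiche's
    // Rust API: what http3_initial_max_data and
    // http3_initial_max_stream_data configure in the nginx patch.
    let mut config = quiche::Config::new(quiche::PROTOCOL_VERSION)?;
    config.set_initial_max_data(100_000_000);                  // connection-wide
    config.set_initial_max_stream_data_bidi_remote(1_000_000); // per stream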

ygina commented 1 year ago

It's looking a little better! At least, there is no sudden dropoff at 10MB.

image

I'm guessing that's because otherwise, once we exceed the per-connection flow control limit, the server spends resources negotiating flow control limit increases with the client? What does this process look like? In the plots, quiche-client is represented as quic-quiche-<http3_initial_max_data>-<http3_initial_max_stream_data>, and curl as quic-1460.

And why is the default per-stream flow control limit only 1m? I tried setting it to 100m (is that megabytes?) but the results were worse.

I reran the first experiment using curl and a flow control limit of 100m. Maybe it's a little better, but still much worse than TCP:

image

ygina commented 1 year ago

I came across this Fastly article describing the tricks they use to get QUIC's performance on par with TCP/TLS. Wondering if it's necessary to, say, implement the Delayed ACK extension, or to coalesce packets with GSO.

It could have to do with my setup as well. Have you guys run experiments comparing quiche HTTP/3 to TCP with TLS?

LPardue commented 1 year ago

Flow control is intended to constrain local state commitment. Imagine a server that handles thousands of connections; they can't all have access to all of the memory. Flow control exerts back pressure. It's typically linked to how fast you can drain incoming data from the transport and pass it on.

Picking a flow control size is an application and deployment choice. Being conservative with defaults avoids nasty surprises.
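
(To illustrate in quiche terms: a receiver keeps credit flowing by draining readable streams promptly. A minimal sketch, where process() is a hypothetical application callback:)

    // Draining incoming data lets quiche advance its MAX_DATA /
    // MAX_STREAM_DATA limits, i.e. release the back pressure above.
    let mut buf = [0u8; 65535];
    for stream_id in conn.readable() {
        while let Ok((read, fin)) = conn.stream_recv(stream_id, &mut buf) {
            // The faster this handoff happens, the sooner the peer
            // receives more flow-control credit.
            process(stream_id, &buf[..read], fin);
        }
    }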

junhochoi commented 1 year ago

@ygina what's your Linux client's/server's value of /proc/sys/net/ipv4/tcp_no_metrics_save? If it's 0, can you try with 1?

ygina commented 1 year ago

@LPardue If I understand the parameter correctly, setting http3_initial_max_data to 100m should be plenty of space for this single-connection setup without needing to increase the per-connection limit further. This change makes the dropoff from the 4th graph in this issue disappear, but quiche-client still settles into a throughput that is much lower than TCP's when the data size is larger. In Wireshark, even though many of the packets start out MTU-sized (1392 bytes here), at the tail end of the connection the packets become much smaller, which does not seem efficient for a large data transfer. In comparison, TCP basically only sends MTU-sized packets.

@junhochoi It was 0 before. I set it to 1, but the tcp line is the same.

LPardue commented 1 year ago

@ygina, if you run curl with QLOGDIR= it will generate some qlogs that you can zip up and share here, and/or analyse via qvis: https://qvis.quictools.info/#/files.

This will help us get an idea of exactly what is happening in those packets.

ygina commented 1 year ago

Cool! Here's the qlog from a 20M transfer using curl: curl_20M.zip. I'm currently checking out the visualization.

ygina commented 1 year ago

I've been stuck on why quiche sends so many below-MTU-sized packets, and I narrowed it down to these lines. Essentially, when the remaining available cwnd is smaller than an MTU, quiche decides to send a packet anyway, but the per-packet processing overheads of QUIC are quite high. I replaced the lines with the following, so that quiche waits for the available cwnd to grow and, if it has enough data to send, only sends MTU-sized packets (https://github.com/ygina/quiche/commit/3e461b600e2264361e136312cf1a2b9b3d7bc480):

let cwnd_available = self.paths
    .get(send_pid)?
    .recovery
    .cwnd_available()
    .saturating_sub(overhead)
    .saturating_sub(left_before_packing_ack_frame - left); /* bytes consumed by ACK frames */
if cwnd_available < left {
    left = 0;
}

I managed to get QUIC's performance much closer (satisfactorily close, for me) to TCP's using both curl and quiche-client (see below). Actually, curl outperforms quiche-client now!

image

Is this still to spec? A bug? Or is there some other reason these lines are the way they are currently?

ygina commented 1 year ago

I increased http3_initial_max_stream_data from 1m to 50m and now QUIC matches TCP. That makes me happy. Strange, since that same configuration change hurt performance before the cwnd MTU fix.

junhochoi commented 1 year ago

Looks like that change was introduced in https://github.com/cloudflare/quiche/pull/1385. @ygina can you try with 5a753c25e43e1d7b32df9d8c284fe0cab2b19cb4 (the commit just before this PR)? You'll need to rebuild only the clients, not the server. Just curious whether the same problem exists before it.

ygina commented 1 year ago

@junhochoi Yes, I tested with https://github.com/cloudflare/quiche/commit/5a753c25e43e1d7b32df9d8c284fe0cab2b19cb4 and the performance is the same as quic-quiche-1000m-1m. Probably because the changes in #1385 don't affect the non-ACK packets, which make up the majority?

ygina commented 1 year ago

> I've been stuck on why quiche sends so many below-MTU-sized packets, and I narrowed it down to these lines. Essentially, when the remaining available cwnd is smaller than an MTU, quiche decides to send a packet anyway, but the per-packet processing overheads of QUIC are quite high. I replaced the lines with the following, so that quiche waits for the available cwnd to grow and, if it has enough data to send, only sends MTU-sized packets (ygina@3e461b6):

I believe this is the problem, but it's possible this isn't the right fix. It will affect anything that depends on the number of packets, such as queue sizes limited by packet counts as opposed to bytes, and congestion control algorithms that operate in terms of packet counts.

LPardue commented 1 year ago

There's an inherent tradeoff between throughput and latency.

A sender that holds off from emitting a packet, in order to either fill an MTU-sized packet or wait for the CWND to grow (via receiving the peer's ACK), can introduce additional latency. If the outgoing packet itself contains an ACK for the peer, which the peer would use to control its own CWND, then things get a bit weird.

This all sounds a bit like https://en.wikipedia.org/wiki/Nagle%27s_algorithm

There's a possibility that tuning could improve things. But we have to be aware that different applications using QUIC will have different needs to trade off.

keithw commented 1 year ago

(Hi all, I'm working with @ygina on this.)

> A sender that holds off from emitting a packet, in order to either fill an MTU-sized packet or wait for the CWND to grow (via receiving the peer's ACK), can introduce additional latency.

I'm a bit confused by this. TCP is often implemented (e.g. in Linux) with a cwnd maintained in MSS-sized increments, so this problem basically doesn't occur. It's true that if a flow is application-limited, you do have to decide whether to "hold off" from emitting a partial segment now vs. waiting for the user to supply more data. This is what Nagle's algorithm is meant to address.

But here the flow isn't application limited; the application is trying to send tens of megabytes that's all queued up already, and waiting for room in the cwnd to transmit. So I think the only decision is about whether to send a lot of small packets (e.g. one small packet now, and then more small packets when some data is ACKed or cwnd grows), vs. sending MTU-sized packets as long as the flow remains non-application-limited. This seems like more of a "silly window syndrome" issue than something related to Nagle's algorithm.
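
(To make the contrast concrete, a toy sketch of the two send gates; this is illustrative only, not code from Linux or quiche:)

    const MTU: usize = 1500;

    // Packet-counted window (Linux-TCP-style): room always opens in
    // whole-segment units, so a sub-MTU send never arises.
    fn can_send_segment(cwnd_pkts: usize, in_flight_pkts: usize) -> bool {
        in_flight_pkts < cwnd_pkts
    }

    // Byte-counted window: the leftover cwnd_bytes - in_flight_bytes
    // can be smaller than an MTU, inviting a small packet unless the
    // sender deliberately waits.
    fn sendable_bytes(cwnd_bytes: usize, in_flight_bytes: usize) -> usize {
        cwnd_bytes.saturating_sub(in_flight_bytes).min(MTU)
    }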

> If the outgoing packet itself contains an ACK for the peer, which the peer would use to control its own CWND, then things get a bit weird.

My understanding had been that in both TCP and QUIC, ACK-only packets can be sent at any time, irrespective of cwnd (also the case in quiche post-#1385). So I don't think the timing of ACK transmission has to be greatly affected by these issues.

ygina commented 1 year ago

let cwnd_available = self.paths
    .get(send_pid)?
    .recovery
    .cwnd_available()
    .saturating_sub(overhead)
    .saturating_sub(left_before_packing_ack_frame - left); /* bytes consumed by ACK frames */
if cwnd_available < left {
    left = 0;
}

What if, when both cwnd_available < left (the remaining cwnd is less than an MTU) and the remaining amount of data to send exceeds cwnd_available, we set left = 0? That is, wait for room in the cwnd to transmit as long as the flow remains non-application-limited. Otherwise set left = cmp::min(cwnd_available, left).
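
(A sketch of that refinement against the snippet above; remaining_data is a hypothetical stand-in for "bytes still queued on this connection", not an existing quiche variable, and cmp::min assumes `use std::cmp;` as in quiche's send path:)

    if cwnd_available < left && remaining_data > cwnd_available {
        // Less than an MTU of window is left and the flow is not
        // app-limited: wait for the window to open instead of
        // emitting a sub-MTU packet.
        left = 0;
    } else {
        // App-limited (or enough window): send what fits now.
        left = cmp::min(cwnd_available, left);
    }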

LPardue commented 1 year ago

Hi @keithw, appreciate the comments. Lots of possible things to discuss here :smile: Junho can probably weigh in too.

> TCP is often implemented (e.g. in Linux) with a cwnd maintained in MSS-sized increments, so this problem basically doesn't occur.

Does that apply to all congestion control algorithms?

Generally the best way for quiche to work is to send a flight of paced packets. Assuming we are not app limited, we'd ideally use something like GSO to send a full CWND's worth of data, and then wait for the ACKs to come back so we can repeat. Reading a complete set of ACKs before attempting to send again will mean that the entire CWND is available to use.
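
(For context, a minimal sketch of that flight-of-packets model using quiche's public API; schedule_udp_send() is a hypothetical pacing-aware transmit helper:)

    let mut out = [0u8; 1350];
    loop {
        match conn.send(&mut out) {
            Ok((write, send_info)) => {
                // send_info.at is quiche's pacing hint: the earliest
                // time this datagram should hit the wire (e.g. via
                // SO_TXTIME or a timer).
                schedule_udp_send(&out[..write], send_info.to, send_info.at);
            }
            Err(quiche::Error::Done) => break, // flight complete; wait for ACKs
            Err(e) => {
                eprintln!("send failed: {:?}", e);
                break;
            }
        }
    }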

> My understanding had been that in both TCP and QUIC, ACK-only packets can be sent at any time, irrespective of cwnd -- is that right? If so I don't think I see the problem.

QUIC packets that contain only ACKs do not count towards CWND. But there's no requirement for them to be sent that way. Sending small, ACK-only packets adds overhead in the form of packet construction, encryption and syscalls, but they could improve responsiveness. Deciding on the right tradeoff is hard! Some implementations might prefer to bundle the ACKs with other frame types to amortize the costs; techniques like GSO rely on slicing data into same-sized payloads, with optionally one smaller overflow at the end. In a case where there is a high rate of data exchange in both directions, batching can be a useful optimization. Sending even more ACK-only packets separately is entirely possible, but would have to be weighed against batching. Also, it sounds like it wouldn't help in a scenario like yours that is counting packets. But I don't know how common that is.

Unlike TCP, QUIC has multistreaming and in-band control frames. In situations where the timely delivery of something like a RESET_STREAM, MAX_STREAM_DATA, or PATH_CHALLENGE frame is important, not sending those things when there is sufficient CWND to do so could be harmful. Mostly that's not a problem if they get batched in with the application data, though; they should go out in the first packet in our flight-of-packets model.

At the application level, app performance can be enhanced by sending data for different concurrent streams even when not app-limited and CWND < MTU. For instance, dispatching HTTP requests ASAP, rather than waiting for the CWND to grow to MTU size.

I'm not saying everything in quiche is perfectly designed. There may be improvements or ways to expose tuning to apps so they can use the connection having made their own tradeoff decisions. So discussions like this are really useful, thanks.

keithw commented 1 year ago

> Hi @keithw, appreciate the comments. Lots of possible things to discuss here :smile: Junho can probably weigh in too.

> > TCP is often implemented (e.g. in Linux) with a cwnd maintained in MSS-sized increments, so this problem basically doesn't occur.
>
> Does that apply to all congestion control algorithms?

On Linux, yes, that's my understanding. The individual congestion-control algorithms adjust cwnd in different ways, but the logic that interprets cwnd is part of the common tcp.h and tcp_output.c, and it's interpreted as a packet count. See https://github.com/torvalds/linux/blob/master/net/ipv4/tcp_output.c#L2714 and https://github.com/torvalds/linux/blob/master/include/net/tcp.h#L1204-L1228, as well as RFC 5681:

   The RECOMMENDED way to increase cwnd during congestion avoidance is
   to count the number of bytes that have been acknowledged by ACKs for
   new data.  (A drawback of this implementation is that it requires
   maintaining an additional state variable.)  When the number of bytes
   acknowledged reaches cwnd, then cwnd can be incremented by up to SMSS
   bytes.
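
(A minimal sketch of that byte-counting rule, RFC 5681-style; a standalone illustration, not Linux or quiche code:)

    // Congestion avoidance with byte counting: cwnd grows by at most
    // one SMSS per window's worth of acknowledged bytes.
    struct CongAvoid {
        cwnd: usize,        // congestion window, in bytes
        acked_bytes: usize, // bytes ACKed since the last increase
        smss: usize,        // sender maximum segment size
    }

    impl CongAvoid {
        fn on_ack(&mut self, newly_acked: usize) {
            self.acked_bytes += newly_acked;
            if self.acked_bytes >= self.cwnd {
                self.acked_bytes -= self.cwnd;
                self.cwnd += self.smss;
            }
        }
    }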

I do wonder what other QUIC CUBIC implementations do (I haven't looked myself yet).

> Generally the best way for quiche to work is to send a flight of paced packets. Assuming we are not app limited, we'd ideally use something like GSO to send a full CWND's worth of data, and then wait for the ACKs to come back so we can repeat. Reading a complete set of ACKs before attempting to send again will mean that the entire CWND is available to use.

Hmm, I'm not sure this is so desirable for a long-running bulk flow. Generally speaking, I think one would expect the ACKs to come back pretty continuously (in TCP, for every 2 segments delivered) as data-carrying segments are delivered to the receiver at the bottleneck link rate. The sender probably should be keeping the window close to full by sending more segments almost as soon as space opens up in the cwnd (i.e., even with a constant cwnd, the sender is transmitting one MTU-sized segment for every MTU-sized segment that gets ACKed). I may have misunderstood you; I don't think you'd want the sender to wait for a "complete set" of ACKs back (letting packets in flight go to zero and the cwnd open up completely?) before attempting to send anything again. The bottleneck buffer would go empty in that situation and you'd be wasting the link for some period of time. I'm sure GSO is useful, but hopefully it's being done on smaller units than "an entire RTT's worth of packets."

In any event, advisable or not, it doesn't seem to be what @ygina was observing in these long-running uploads to nginx -- I think she's seeing quiche sending packets of ~400 bytes or so.

> > My understanding had been that in both TCP and QUIC, ACK-only packets can be sent at any time, irrespective of cwnd -- is that right? If so I don't think I see the problem.
>
> QUIC packets that contain only ACKs do not count towards CWND. But there's no requirement for them to be sent that way. Sending small, ACK-only packets adds overhead in the form of packet construction, encryption and syscalls.

Sure, but... I'm a little confused why this would be a concern in the first place. AFAIK, in the case of a full-throttle flow, the data sender basically never has a need or desire to send a small ACK-only packet.

> At the application level, app performance can be enhanced by sending data for different concurrent streams even when not app-limited and CWND < MTU. For instance, dispatching HTTP requests ASAP, rather than waiting for the CWND to grow to MTU size.

I think I'm a little confused about the concern about "waiting for cwnd to grow to MTU size." If you're always sending MTU-sized segments, then room in the window is always going to open up in MTU-sized segments anyway, and it doesn't really matter whether cwnd growth is done gradually or in MTU-sized units. So I don't think you really have to worry that the sender will just be sitting around waiting for the window to open gradually byte-by-byte until it reaches 1 MTU of available room (or for cwnd itself to grow gradually byte-by-byte), even if quiche did things the Linux TCP / RFC 5681 way. (I mean Linux TCP is pretty widely used...)

What do you think is the right way to proceed here? If you think it would be helpful, we could try to submit a pull request to quiche (probably along the lines of @ygina's proposal in https://github.com/cloudflare/quiche/issues/1400#issuecomment-1414520725 ?). The broader context is that we're trying to write a paper that includes measurements of QUIC CUBIC (represented by quiche) and TCP CUBIC (represented by Linux) over the same emulated paths, and we'd like to benchmark each one as fairly as possible. I don't think it makes sense for us to show implausibly bad results for QUIC (because I don't think that's realistic or useful for the reader), but it also doesn't really seem right to make a substantial change locally that we weren't able to persuade the broader quiche community to accept, and then report this as representing "QUIC". :-/

LPardue commented 1 year ago

> I may have misunderstood you; I don't think you'd want the sender to wait for a "complete set" of ACKs back (letting packets in flight go to zero and the cwnd open up completely?) before attempting to send anything again. The bottleneck buffer would go empty in that situation and you'd be wasting the link for some period of time. I'm sure GSO is useful, but hopefully it's being done on smaller units than "an entire RTT's worth of packets."

I agree. The approach hinges on batching and pacing. There's no point blasting out a large number of packets and then waiting on them, so picking the right number to send is critical. By offloading the UDP send parts to the kernel, the application is able to read incoming UDP and decide when to prepare the next batch.

I'm a bit rusty on the curl code, but I don't believe its upload behaviour is very advanced compared to what is possible.

> In any event, advisable or not, it doesn't seem to be what @ygina was observing in these long-running uploads to nginx -- I think she's seeing quiche sending packets of ~400 bytes or so.

Agreed. The qlog does seem a bit odd. We see those regular small data-bearing packets, and from the client's perspective, the server's ACKs lag way behind the packets the client is sending. Not sure if there is some impedance mismatch with the event loop there. How does the download case behave for a similarly sized file?

There are also batches of STREAM_DATA_BLOCKED frames with the same limit value, which I thought we'd fixed.

> What do you think is the right way to proceed here? If you think it would be helpful, we could try to submit a pull request to quiche (probably along the lines of @ygina's proposal in https://github.com/cloudflare/quiche/issues/1400#issuecomment-1414520725 ?).

I think some more investigation into the matter will help us decide. There are some clues to look into further.

> The broader context is that we're trying to write a paper that includes measurements of QUIC CUBIC (represented by quiche) and TCP CUBIC (represented by Linux) over the same emulated paths, and we'd like to benchmark each one as fairly as possible. I don't think it makes sense for us to show implausibly bad results for QUIC (because I don't think that's realistic or useful for the reader), but it also doesn't really seem right to make a substantial change locally that we weren't able to persuade the broader quiche community to accept, and then report this as representing "QUIC". :-/

Your research sounds interesting! And trying to give fair representation to each sounds great. We want these protocols to perform as close to each other as possible. I think investigating internal and external factors that might be contributing to the observations will make it clearer what to do next. And we can help with that.

mwelzl commented 1 year ago

Having read all this (many thanks for your help, @LPardue!), I wanted to add my 2 cents, as I stumbled over this paragraph and wanted to make sure it doesn't reflect a misunderstanding:

> Unlike TCP, QUIC has multistreaming and in-band control frames. In situations where the timely delivery of something like a RESET_STREAM, MAX_STREAM_DATA, or PATH_CHALLENGE frame is important, not sending those things when there is sufficient CWND to do so could be harmful. Mostly that's not a problem if they get batched in with the application data, though; they should go out in the first packet in our flight-of-packets model.

Just to be clear: yes, these messages should be sent, but our concern here isn't that the payload per packet might fluctuate. @ygina (as was mentioned with the ~400-byte packets) really sees the entire packet being small in Wireshark, not just the payload per packet. That just shouldn't happen, I believe, when there's a large enough buffer of data that wants to get out.

LPardue commented 1 year ago

Agreed that the saturated-upload case is straightforward but demonstrating some oddness. I'd like to understand the root causes better in order to inform our decision about changes in the library. Since these are controlled conditions, it doesn't feel like an impossible task to figure out. Appreciate the work you're doing and the data being shared.

junhochoi commented 1 year ago

It'd be good to have the same kind of trace for downloading a same-sized file, to see whether it shows the same pattern (small packets) as the upload tested here. If it doesn't, this could be upload-specific. I don't remember quiche having a similar issue with download, but I could be wrong.

Ryenum commented 10 months ago

Hello, I encountered some problems when configuring the QUIC service of nginx. My configuration is the same as the official website's, but I still cannot use the QUIC protocol when accessing the server; the h2 protocol is used instead. Here is my configuration:

    server {
        # Enable QUIC and HTTP/3.
        listen 443 quic reuseport;
        server_name  test.cn;

        # Enable HTTP/2 (optional).
        listen 443 ssl http2;

        ssl_certificate      /usr/local/nginx/conf/cert/test.pem;
        ssl_certificate_key  /usr/local/nginx/conf/cert/test.key;

        # Enable all TLS versions (TLSv1.3 is required for QUIC).
        ssl_protocols TLSv1 TLSv1.1 TLSv1.2 TLSv1.3;

        # Add Alt-Svc header to negotiate HTTP/3.
        add_header alt-svc 'h3=":443"; ma=86400';
    }

Can you help me?

ygina commented 10 months ago

@Ryenum That looks similar to my configuration above. How are you accessing the server? Are you sure your client is using the quic protocol?

pplabs-fute commented 10 months ago

@ygina

image

Your plots are very nice. How did you make them?

ygina commented 10 months ago

@pplabs-fute I just use matplotlib. :)