cloudflare / quiche

🥧 Savoury implementation of the QUIC transport protocol and HTTP/3
https://docs.quic.tech/quiche/
BSD 2-Clause "Simplified" License

Quiche-server can't handle more than one connection #1554

Open DerLary opened 1 year ago

DerLary commented 1 year ago

Hello,

I am currently working on my Bachelor's Thesis, and we plan to use the quiche client and server application to test QUIC stacks. When opening a single connection from the client to the server, everything works fine. However, when opening multiple connections simultaneously to the server, only one request gets completed correctly, while the others seem to be starved.

Specifically, we have a testbed of machines connected to each other via Gigabit Ethernet. We use one machine to start the server and another machine to start the client application. On the server machine, I have placed two files (file1 and file2) of the same size, 100MB each, which are then requested by the clients. In most cases, only one request is completed correctly, while the others end with a timeout.

Steps to reproduce:

  1. Place two 100MB files on the server (file1 and file2).
  2. Start the server: /home/test/quiche/target/debug/quiche-server --cert /home/test/quiche/apps/src/bin/cert.crt --key /home/test/quiche/apps/src/bin/cert.key --listen 0.0.0.0:4000 --root /home/test/server_files/ --cc-algorithm cubic
  3. Start the clients: /home/test/quiche/target/debug/quiche-client --cc-algorithm cubic --no-verify https://137.226.59.170:4000/file1 --dump-responses /home/test/req and the same command with file2 for another client on the same machine.

Note: The clients are started manually, directly one after another. After approximately 8 seconds, one of the clients terminates correctly (not necessarily the one that was started first), while the other terminates after an additional 30 seconds with the error: [ERROR quiche_apps::common] connection timed out after 37.510056927s and only completed 0/1 requests. Inspecting the requested files shows that the timed-out connection received only about 12% of its file, specifically 12,872,599 bytes out of 104,857,600 bytes in this exact run.

The same error occurs when starting the clients from different machines against the same server.

I have attached the sqlog files for both the client and server sides. The server-096b5... corresponds to the correct client request of client-e633c..., and the server-fb153... corresponds to the timed-out client-fc2582. sqlogs.zip

Visualization

Additionally, we have modified the quiche stack to include the Spin Bit logic (spinning the Spin Bit once every RTT to measure the cwnd). Using this modification, we conducted tests by emulating a bandwidth of 50 Mbps, a 5 ms RTT, and a drop-tail queue with a size equal to the BDP. We placed a classifier in the middle to track the cwnd and compare the results with the values in the logging file. The following graph shows the cwnd in bytes per RTT (one Spin Bit cycle): quiche_NOT_successfull.pdf. Here we can also observe that one connection starves until it gets terminated.
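For reference, the queue size implied by "equal to the BDP" can be worked out from the stated emulation parameters. The following is a minimal sketch; the function name `bdp_bytes` is hypothetical and not part of quiche.

```rust
/// Bandwidth-delay product in bytes.
/// `bits_per_sec` is the link bandwidth, `rtt_micros` the round-trip time
/// in microseconds. Hypothetical helper, not a quiche API.
fn bdp_bytes(bits_per_sec: u64, rtt_micros: u64) -> u64 {
    // bits in flight over one RTT, converted to bytes, then from microseconds
    bits_per_sec * rtt_micros / 8 / 1_000_000
}
```

With 50 Mbps and a 5 ms RTT, `bdp_bytes(50_000_000, 5_000)` gives 31,250 bytes, so the emulated queue holds only a few dozen full-size packets.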

Could there be any flaws in handling multiple connections in /quiche/apps/src/bin/quiche-server.rs, such as when the break statement is used within the for loop over all clients, potentially leaving the loop without considering the remaining clients?

Thank you in advance for your assistance.

Best regards, Lars

LPardue commented 1 year ago

Hi @DerLary . Thanks for the detailed issue report, this really helps.

> Could there be any flaws in handling multiple connections in /quiche/apps/src/bin/quiche-server.rs, such as when the break statement is used within the for loop over all clients, potentially leaving the loop without considering the remaining clients?

In a nutshell, yes: there is an oversight. A few places break out of the loop over all clients where they should actually continue with the next client. A straight swap might work, but I suspect a little more refactoring and testing is required.
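The difference between the two control-flow choices can be illustrated with a small, self-contained sketch. This is not the quiche-server code: `Client`, `OnBlocked` and `flush_pass` are hypothetical stand-ins, and the `blocked` flag merely simulates a client that cannot make progress in the current pass.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-connection state; not quiche's real client type.
struct Client {
    /// Bytes still queued for this client.
    pending: usize,
    /// Simulates a client that cannot make progress in this pass.
    blocked: bool,
}

/// What the loop does when it meets a blocked client.
#[derive(Clone, Copy)]
enum OnBlocked {
    Break,
    Continue,
}

/// One pass over all clients, sending up to `chunk` bytes to each.
/// Returns the ids of the clients that were actually serviced.
fn flush_pass(
    clients: &mut BTreeMap<u64, Client>,
    chunk: usize,
    policy: OnBlocked,
) -> Vec<u64> {
    let mut serviced = Vec::new();
    for (id, client) in clients.iter_mut() {
        if client.blocked {
            match policy {
                // Leaves the whole loop: every later client is skipped.
                OnBlocked::Break => break,
                // Skips only this client: later clients still get a turn.
                OnBlocked::Continue => continue,
            }
        }
        let n = chunk.min(client.pending);
        client.pending -= n;
        serviced.push(*id);
    }
    serviced
}
```

With a blocked client 1 and a ready client 2, a pass using `OnBlocked::Break` services nobody, while `OnBlocked::Continue` services client 2. Repeated over many passes, the first behaviour is one way a healthy connection could starve as described in the report.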

icing commented 1 year ago

After some reported weirdness, I revisited testing in curl and can see that quiche is not able to handle more than one connection at a time in curl as well.

I have a test that does 200 transfers over 2 connections and see ~40 of them stalling. Also, after a short while quiche seems to deliver poll events again that were already handled, e.g. we get a HEADERS event for a stream that has already FINISHED.

LPardue commented 1 year ago

@icing this original report is about the example quiche-server application, while https://github.com/curl/curl/issues/11449 is about cloudflare-quic.com which uses an entirely different driving application based on nginx. Seems like an independent matter to me.

icing commented 1 year ago

@LPardue I can open a new ticket if this one is indeed restricted to quiche-server. I have put the curl issue on hold until we have curl+quiche running cleanly against a local nghttpx with 2 connections.

Update: found the bug in my code, sorry for the noise.😬

LPardue commented 1 year ago

Yeah, that sounds good. Happy to help debug the curl one too, but let's avoid crossed wires.

Jazkiel commented 5 months ago

Hello, just wondering if this issue was fixed. I tried it recently and ran into the same issue described in this thread.

LPardue commented 5 months ago

My previous comments still apply; some refactoring of quiche-server would need to be done.

Jazkiel commented 5 months ago

Even if I replace the break with continue so that quiche-server can handle multiple clients, it still iterates over a list of clients one by one. Shouldn't it serve each client in parallel?

LPardue commented 5 months ago

quiche-server is predominantly a demo app for the API. Scalability is best achieved by applications that can use it as an example and integrate with whatever runtime and processing model they prefer.