MatrixAI / js-quic

QUIC Networking for TypeScript & JavaScript
https://matrixai.github.io/js-quic/
Apache License 2.0

Test connection multiplexing across multiple servers and clients #14

Closed tegefaulkes closed 1 year ago

tegefaulkes commented 1 year ago

Specification

We need to create tests that create an arbitrary number of connections across an arbitrary number of servers and clients.

The ultimate goal is to test that connection multiplexing works and that connections function properly when run concurrently.

All the tests should send at least 1 stream of random data.

Additional context

Tasks

tegefaulkes commented 1 year ago

QUICSocket.registerClient is a function that isn't implemented yet. What is it for? Do we need to track which clients are using the socket? I suppose we'd need it if stopping the socket cleans up anything using it, such as the server and clients.

Thoughts @CMCDragonkai ?

CMCDragonkai commented 1 year ago

Yes you need to be able to keep track of clients too. You have to track all clients and servers of a given socket...

CMCDragonkai commented 1 year ago

Lifecycle rules too. Can't let a socket be destroyed with clients that exist.
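
As a rough sketch only (assuming a simple Set of registered clients and a guard in stop(); everything beyond the registerClient name is illustrative, not the real QUICSocket), the tracking and lifecycle rule could look something like this:

    // Illustrative sketch, not the actual QUICSocket; registerClient is the
    // unimplemented method being discussed, the rest is assumed for the example.
    class QUICSocketSketch {
      protected clients: Set<object> = new Set();
      protected server: object | null = null;

      public registerClient(client: object): void {
        this.clients.add(client);
      }

      public deregisterClient(client: object): void {
        this.clients.delete(client);
      }

      public registerServer(server: object): void {
        this.server = server;
      }

      public async stop(): Promise<void> {
        // Lifecycle rule: refuse to stop while clients or a server still use the socket
        if (this.clients.size > 0 || this.server != null) {
          throw new Error('QUICSocket cannot be stopped while clients or a server are registered');
        }
        // ... close the underlying UDP socket here
      }
    }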

tegefaulkes commented 1 year ago

So far, for any connection-level tests where I need to test a connection to a server, I've been putting them in the QUICClient.test.ts test file. This file is getting pretty large, so I may want to split it up into files that test certain aspects, such as the server/client concurrency testing.

CMCDragonkai commented 1 year ago

You can separate it with QUICClient.X.test.ts, where X names the different variants.

tegefaulkes commented 1 year ago

Destroying a connection might be hanging if no streams were created. It might be test specific; I'll need to dig into this.

tegefaulkes commented 1 year ago

Found a bug. In certain scenarios client streams are not fully cleaning up when they are meant to be ended. This is similar to the previous one I fixed, where streams are not listed as readable or writable after they have finished, which means we are not checking them after the last state change. Previously this was fixed by adding streams to a short list to be checked when destroying.

The problem here however is that we're not entering the destroying state, we're waiting for the readable side to end during normal operation. So the previous fix to the logic doesn't apply here.

Given this, I think we need to check all active streams indiscriminately. This means an O(n) loop every time we receive data on a connection. This is frustratingly not ideal from a performance perspective, but it's still linear time, so not the end of the world.

tegefaulkes commented 1 year ago

Notably, the examples don't maintain state for the streams, but they do for the connections. Every time the socket receives a packet, it handles the packet and processes any events on the streams, readable and writable.

After processing all incoming packets, it processes all clients for outgoing packets, then checks all clients again to see if they have closed so it can clean them up.

I think that means the only solution here is to check every active stream for a state change after we process a recv for that connection, but we still need to check the readable list for new streams.

The available methods here for processing events are fine for the example since it processes reads and writes statelessly. If a stream has ended, then it just isn't processed as part of the readable or writable list.

Ultimately the solution here is to process the readable list so we can create state for new streams. We can then iterate over all the existing streams checking if they're writable and process any state changes.

There is an unknown here: if a stream has ended with a shutdown before any data was sent to create remote state, is that stream considered readable? Would state for it be created? Do we care if it isn't?

Anyway, changes to be made include the following (a rough sketch is given after the list)...

  1. For a recv call on a connection, we need to iterate over all the readables to create state for new streams and process any received stream data.
  2. After that we need to iterate over all existing streams, processing isWritable(), check the status of the readable and writable state, and trigger any state changes for the QUICStream.
  3. I need to double check some expected logic. What does the stream state look like for a stream that was destroyed before any data was sent? That would look like a STREAM_RESET event for the readable. Would that stream be readable at all? Would we create state for it? Would there be anything to process?
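
As a rough sketch of the plan above (not the actual js-quic code; conn.readable() iterating readable stream IDs follows the quiche API, while the stream map, its methods, and createStream are simplified stand-ins):

    // Sketch only: process streams after a recv on a connection.
    type StreamId = number;

    interface StreamState {
      read(): void;        // drains streamRecv, handles fin / RESET_STREAM
      write(): void;       // flushes pending sends if writable
      checkState(): void;  // detects finished/closed streams and cleans up
    }

    function processStreamsAfterRecv(
      conn: { readable(): Iterable<StreamId> },
      streamMap: Map<StreamId, StreamState>,
      createStream: (id: StreamId) => StreamState,
    ): void {
      // 1. Create state for any new streams and process their readable data
      for (const streamId of conn.readable()) {
        let stream = streamMap.get(streamId);
        if (stream == null) {
          stream = createStream(streamId);
          streamMap.set(streamId, stream);
        }
        stream.read();
      }
      // 2. Check every existing stream for a state change (the O(n) loop),
      //    since a fin or reset may arrive without the stream showing up as
      //    readable or writable again.
      for (const stream of streamMap.values()) {
        stream.write();
        stream.checkState();
      }
    }
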
tegefaulkes commented 1 year ago

I'm just confused now. I've made changes so that after every receive and send event we check the state for every existing stream. So we shouldn't be missing any state changes now.

I can see one side sending the fin frame in all cases. But sometimes, for no reason I can tell, the receiving side never sees this fin frame. Whether this happens at all can be very random.

Assuming all packets are being properly sent and processed then I have no idea why this is still happening.

tegefaulkes commented 1 year ago

I'm going to take a step back from this for a little bit and focus on some of the other issues. Some solutions may come while I mull it over.

tegefaulkes commented 1 year ago

This sounds pretty similar to my problem.

https://github.com/cloudflare/quiche/issues/740

tegefaulkes commented 1 year ago

Posted an upstream issue https://github.com/cloudflare/quiche/issues/1507

tegefaulkes commented 1 year ago

Things to try:

  1. Log out isFinished and error message for the stream. Indicate clearly that the stream is not finishing.
  2. Try to serialise the streams starting and ending in stages and isolate whether it's a concurrency problem.

CMCDragonkai commented 1 year ago

Here are some ideas as to how to debug this.

We need to use stream_finished to confirm to us whether the stream is in fact finished or not. Here's an example:

      // THIS WILL BE TRUE IF fin or RESET_STREAM
      // Either graceful or errored
      console.log(
        'STREAM RECV GRACEFUL OR ERROR',
        this.streamId,
        this.conn.streamFinished(this.streamId)
      );

We should do this after calling this.conn.streamRecv(this.streamId, buf);.

This just confirms to us even if we get an exception, whether the stream is really finished.
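
As a rough sketch of where that would sit in the stream's read path (same style as the snippet above, so this.conn and this.streamId are assumed; streamRecv returning a [bytesRead, fin] pair follows quiche's stream_recv, and how Done surfaces is glossed over here):

    const buf = new Uint8Array(1024);
    try {
      // Assuming streamRecv returns [bytesRead, fin] like quiche's stream_recv
      const [bytesRead, fin] = this.conn.streamRecv(this.streamId, buf);
      console.log('STREAM RECV', this.streamId, bytesRead, fin);
    } catch (e) {
      // Done (nothing to read right now) or a stream error such as a reset
      console.log('STREAM RECV THREW', this.streamId, e);
    }
    // This will be true if a fin or RESET_STREAM has been seen, even when
    // streamRecv threw above, so it tells us whether quiche really considers
    // the stream finished
    console.log(
      'STREAM RECV GRACEFUL OR ERROR',
      this.streamId,
      this.conn.streamFinished(this.streamId),
    );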

Next thing is to do a more controlled test.

Set up your 2 servers with many clients. Maybe reduce this to lots of clients to 1 server.

Create a bunch of streams. Then sequentially start closing the streams. On each step, check if the stream is finished on the other side. Then also consider doing it in batches, like 2 by 2, 3 by 3, 50 by 50, and so on.

We need to ensure that it's not our code that's causing the problem.
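
One way to structure that, as a sketch only: createStream is a hypothetical factory standing in for however the test harness opens a client stream, and finishedOnPeer stands in for checking streamFinished on the other side. A batchSize of 1 gives the sequential case.

    // Sketch of the batched-close strategy; the helpers are hypothetical
    // placeholders for the real test harness.
    type TestStream = {
      close: () => Promise<void>;             // close the stream on the client side
      finishedOnPeer: () => Promise<boolean>; // e.g. the peer's streamFinished check
    };

    async function batchedCloseTest(
      createStream: () => Promise<TestStream>,
      streamCount: number,
      batchSize: number,
    ): Promise<void> {
      const streams = await Promise.all(
        Array.from({ length: streamCount }, () => createStream()),
      );
      for (let i = 0; i < streams.length; i += batchSize) {
        const batch = streams.slice(i, i + batchSize);
        // Close this batch concurrently, then check the peer saw each fin
        await Promise.all(batch.map((s) => s.close()));
        for (const s of batch) {
          if (!(await s.finishedOnPeer())) {
            throw new Error(`A stream in the batch starting at ${i} never finished on the peer`);
          }
        }
      }
    }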

CMCDragonkai commented 1 year ago

1 client with a lot of streams to 1 server.

Many clients with each 1 stream to the 1 server.

Many clients with multiple streams to 1 server.

CMCDragonkai commented 1 year ago

Choose only 1 side to close the stream. Just close it on the client side. Reduce the amount of randomness.

@tegefaulkes your current theory is that when the client closes the stream, instead of the server receiving a fin, it just ends up throwing an exception as Done. This is what we want to prove. To do this we must reduce randomness, and sequentially close or batch up the closes on the client until we see a problem.

If the client side doesn't result in this problem, then do it on the server side.


Scenarios to prove:

tegefaulkes commented 1 year ago

Found a new oddity. When the problem happens, we're not receiving ALL of the messages. Running tests with low congestion does not have this problem.

This is VERY odd. A quiche stream is meant to maintain data integrity. So is the problem happening between the quiche stream and our QUICStream? But this partially explains the problem: if we're losing random messages then we're randomly losing our finish message.

I'll dig into this further and post my other observations in a moment.

CMCDragonkai commented 1 year ago

Make sure you're testing with 64 bytes and 64 bytes. There's no one-to-one equivalence between the number of stream writes and stream reads. Use buffer comparison instead.
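
As a sketch, comparing bytes rather than message counts could look like this (the chunk arrays are whatever each side of the test collected; the expect line is just an illustration of the jest assertion):

    // Concatenate the chunks each side saw and compare the raw bytes,
    // rather than counting writes against reads.
    function concatChunks(chunks: Array<Uint8Array>): Uint8Array {
      const total = chunks.reduce((sum, c) => sum + c.byteLength, 0);
      const out = new Uint8Array(total);
      let offset = 0;
      for (const chunk of chunks) {
        out.set(chunk, offset);
        offset += chunk.byteLength;
      }
      return out;
    }

    // In the test, something like:
    //   expect(concatChunks(receivedChunks)).toEqual(concatChunks(sentChunks));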

tegefaulkes commented 1 year ago

Notes:

  1. For 1 stream with a large amount of messages: I could not trigger the problem.
  2. For many streams, 1 message each:
    1. 100 streams: all streams clean up but the client fails to clean up?
    2. The client connection is pretty consistently not cleaning up at about 100 streams.
    3. At 200 streams the server connection doesn't clean up.
  3. 20 streams, 20 messages each:
    1. Seems to work fine.
  4. 40 streams, 20 messages each:
    1. Looks like I'm triggering it.
  5. We're not actually receiving all messages? How does that work?
  6. We're not actually testing that all data sent is all data received. Not thoroughly.
  7. Confirmed: we're missing bytes on the receiving end; need to check which ones now.

tegefaulkes commented 1 year ago

The missing bytes are always at the end of the messages sent.

I was expecting the message with the fin frame to be empty, but during congestion this is not the case: the final packet can contain part of the message. This is a simple fix; I just wasn't writing the message that arrived with the fin frame to the stream.
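
A rough sketch of the shape of the fix (streamRecv returning [bytesRead, fin] follows the quiche API; the controller is a stand-in for however QUICStream surfaces readable data, not the exact js-quic code):

    // The read that reports fin can still carry bytes, so write them out
    // before ending the readable side instead of dropping them.
    const buf = new Uint8Array(1024);
    const [bytesRead, fin] = this.conn.streamRecv(this.streamId, buf);
    if (bytesRead > 0) {
      // Previously this chunk was dropped whenever fin was true
      controller.enqueue(buf.subarray(0, bytesRead));
    }
    if (fin) {
      controller.close();
    }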

That may be part of the problem but it doesn't explain the streams not cleaning up. I'm still digging.

tegefaulkes commented 1 year ago

Ok, things are fixed up so that I can see all of the data is being sent and received.

In one example with the stream closing problem I can see the following.

  1. The fin frame is being sent for the stream.
  2. The stream receives all 100 expected bytes but not the fin frame.
  3. Other streams have their fin frames sent, processed, and streams cleaned up, all after the one bad stream. This implies the packet with the frame was sent before these and processed.
  4. During this time there are plenty of attempts to read the bad stream; these should've processed the fin frame or stream error in this time.

So after fixing the missing data, we're back to the original problem.

tegefaulkes commented 1 year ago

If I send some data along with the fin frame, in this case !END!, I can't seem to trigger the problem even after doubling the number of streams. I can't really say for certain that this is a fix though.

You mentioned https://github.com/cloudflare/quiche/commit/cc98af06cb1240fb1b1be3557166ca443b1d8653 where they patched the quiche code to allow fin frames with a zero-length buffer to be sent.

It's possible they missed an edge case where they're not sending or processing a fin frame. I think I can verify this pretty easily if I can modify the quiche source code with some debug outputs.

If including data within the fin frame does prevent the problem, then that is a stop-gap solution where we provide a finish message and remove it on the receiving side.
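
A sketch of what that stop-gap could look like; the '!END!' marker mirrors the one mentioned above, and the streamSend line is a commented placeholder for wherever the final write happens:

    // Always terminate with a known marker so the fin never rides on an
    // empty write, and strip the marker on the receiving side.
    const FIN_MESSAGE = new TextEncoder().encode('!END!');

    // Sending side (placeholder): conn.streamSend(streamId, FIN_MESSAGE, true);

    // Receiving side: drop the trailing marker before handing the data up
    function stripFinMessage(data: Uint8Array, fin: boolean): Uint8Array {
      if (fin && data.byteLength >= FIN_MESSAGE.byteLength) {
        const tail = data.subarray(data.byteLength - FIN_MESSAGE.byteLength);
        if (tail.every((b, i) => b === FIN_MESSAGE[i])) {
          return data.subarray(0, data.byteLength - FIN_MESSAGE.byteLength);
        }
      }
      return data;
    }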

tegefaulkes commented 1 year ago

I'm reasonably sure that adding data to the finish message is a stop-gap fix for this. That means I can move forward with fixing the tests and resolving this issue.

There is still an underlying problem that needs to be investigated, a reproducible example to be found, and an issue to be posted upstream.

For the sake of bookkeeping I'll complete this issue and create a new issue for tracking and working on the remaining problem.

CMCDragonkai commented 1 year ago

Did you post upstream?

CMCDragonkai commented 1 year ago

Also use a single null byte if you need to do that.

CMCDragonkai commented 1 year ago

Give me a reproducible rust example to post upstream before closing this issue. We can use the temporary fix for now, but the upstream issue should be posted.

tegefaulkes commented 1 year ago

  1. I haven't posted upstream yet, I need to create a rust example first.
  2. I'll make it a null byte.
  3. I will split out the problem into a new issue. This one is still technically for concurrency testing and can be finished with the temp fix.
tegefaulkes commented 1 year ago

Almost ready to merge changes. Just need to clean up commit history.

After that I'll create a new issue for tracking the fin frame/stream clean-up problem and start on the reproducible rust example and the upstream issue.

tegefaulkes commented 1 year ago

New issue created at #20 for tracking the remaining problem.

tegefaulkes commented 1 year ago

Merged this into master now, resolving.

I'll start on the remaining problem #20 now.