Closed tegefaulkes closed 1 year ago
QUICSocket.registerClient is a function that isn't implemented yet. What is it for? Do we need to track what clients are using the socket? I suppose we'd need it if stopping the socket cleans up anything using it, such as the server and clients.
Thoughts @CMCDragonkai ?
Yes you need to be able to keep track of clients too. You have to track all clients and servers of a given socket...
Lifecycle rules too. Can't let a socket be destroyed with clients that exist.
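A minimal sketch of what that tracking and lifecycle rule could look like. The names here (`TrackingSocket`, `Stoppable`) are illustrative stand-ins, not the actual js-quic API:

```typescript
// Illustrative sketch only; the real QUICSocket API may differ.
type Stoppable = { stop(): void };

class TrackingSocket {
  protected clients: Set<Stoppable> = new Set();

  public registerClient(client: Stoppable): void {
    this.clients.add(client);
  }

  public deregisterClient(client: Stoppable): void {
    this.clients.delete(client);
  }

  public stop(force: boolean = false): void {
    if (this.clients.size > 0) {
      if (!force) {
        // Lifecycle rule: cannot destroy a socket with live clients
        throw new Error('QUICSocket is still in use by clients');
      }
      // With force, stop dependents before tearing down the socket
      for (const client of this.clients) client.stop();
      this.clients.clear();
    }
  }
}
```

A server registry would follow the same pattern; clients deregister themselves on their own stop.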
So far, for any tests on the connection level where I need to test a connection to a server, I've been putting them in the QUICClient.test.ts test file. This file is getting pretty large; I may want to split it up into files that test certain aspects, such as the server/client concurrency testing.
You can separate it with QUICClient.X.test.ts, where X are the different variants.
Destroying a connection might be hanging if no streams were created. It might be test specific, I'll need to dig into this.
Found a bug. In certain scenarios client streams are not fully cleaning up when they are meant to be ended. This is similar to the previous one I fixed where streams are not listed as readable or writable after they have finished. This means we are not checking them after the last state change. Previously this was fixed by adding streams to a short list to be checked when destroying.
The problem here however is that we're not entering the destroying state, we're waiting for the readable side to end during normal operation. So the previous fix to the logic doesn't apply here.
I think given this we need to check all active streams indiscriminately. This means an O(n) loop every time we receive data on a connection. This is frustratingly not ideal from a performance perspective, but it's still linear time, so not the end of the world.
Notably, the examples don't maintain state for the streams, but they do for the connections. Every time the socket receives a packet, it handles the packet and processes any events on the streams, readable and writable. After processing all incoming packets, it processes all clients for outgoing packets, then checks all clients again to see if they have closed so it can clean them up.
I think that means the only solution here is to check every active stream for a state change after we process a recv for that connection, but we still need to check the readable list for new streams. The available methods for processing events are fine for the example since it processes reads and writes statelessly. If the stream has ended, it's simply no longer processed as part of the readable or writable list.
Ultimately the solution here is to process the readable list so we can create state for new streams. We can then iterate over all the existing streams checking if they're writable and process any state changes.
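A sketch of that post-recv processing. The `Conn` interface here is a minimal illustrative subset of the quiche connection wrapper (`readable()`, `streamFinished()`), not the real js-quic types:

```typescript
// Minimal stand-in for the subset of the quiche connection API used here.
interface Conn {
  readable(): Iterable<number>;
  streamFinished(streamId: number): boolean;
}

class StreamTable {
  protected streams: Map<number, { finished: boolean }> = new Map();

  // Called after every conn.recv(): O(n) over all active streams.
  public processStreams(conn: Conn): void {
    // First, the readable list creates state for new streams
    for (const streamId of conn.readable()) {
      if (!this.streams.has(streamId)) {
        this.streams.set(streamId, { finished: false });
      }
    }
    // Then check every existing stream indiscriminately for state changes,
    // since finished streams no longer appear in readable()/writable()
    for (const [streamId, state] of this.streams) {
      if (conn.streamFinished(streamId)) {
        state.finished = true;
        this.streams.delete(streamId); // clean up ended stream state
      }
    }
  }

  public size(): number {
    return this.streams.size;
  }
}
```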
There is an unknown here: if the stream has ended with a shutdown before any data was sent to create remote state, is that stream considered readable? Would state for it be created? Do we care if it isn't?
Anyway, changes to be made include: use isWritable() to check the status of the readable and writable state, and trigger any state change for the QUICStream.
There's still the question of a STREAM_RESET event for the readable side. Would that stream be readable at all? Would we create state for it? Would there be anything to process? I'm just confused now.
I've made changes so that after every receive and send event we check the state of every existing stream. So we shouldn't be missing any state changes now.
I can see one side sending the fin frame in all cases. But sometimes, for no reason I can tell, the receiving side never sees this fin frame. Whether this happens at all can be very random.
Assuming all packets are being properly sent and processed then I have no idea why this is still happening.
I'm going to take a step back from this for a little bit and focus on some of the other issues. Some solutions may come while I mull it over.
This sounds pretty similar to my problem.
Posted an upstream issue https://github.com/cloudflare/quiche/issues/1507
Things to try:
Check isFinished and the error message for the stream. Indicate clearly that the stream is not finishing.
Here are some ideas as to how to debug this.
We need to use stream_finished
to confirm to us whether the stream is in fact finished or not. Here's an example:
```typescript
// THIS WILL BE TRUE IF fin or RESET_STREAM
// Either graceful or errored
console.log(
  'STREAM RECV GRACEFUL OR ERROR',
  this.streamId,
  this.conn.streamFinished(this.streamId),
);
```
We should do this after calling this.conn.streamRecv(this.streamId, buf). This just confirms to us, even if we get an exception, whether the stream is really finished.
Next thing is to do a more controlled test.
Set up your 2 servers with many clients. Maybe reduce this to lots of clients to 1 server.
Create a bunch of streams. Then sequentially start closing the streams. On each step, check if the stream is finished on the other side. Then also consider doing it in batches, like 2 by 2, 3 by 3, 50 by 50... so on.
We need to ensure that it's not our code that's causing the problem.
1 client with a lot of streams to 1 server.
Many clients with each 1 stream to the 1 server.
Many clients with multiple streams to 1 server.
Choose only 1 side to close the stream. Just close it on the client side. Reduce the amount of randomness.
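A hypothetical harness sketch for that controlled test: close streams in batches (batch size 1 gives the fully sequential case) and verify after each batch that the other side saw the stream finish. `TestStream` is a stand-in, not a real js-quic class:

```typescript
// Stand-in for a client stream paired with a view of the server side.
interface TestStream {
  close(): void;
  peerFinished(): boolean;
}

// Close streams batch by batch, checking after each batch whether the
// receiving side saw the fin. Returns indices of streams the peer never
// saw as finished, which is exactly what we are trying to catch.
function closeInBatches(
  streams: Array<TestStream>,
  batchSize: number,
): Array<number> {
  const unfinished: Array<number> = [];
  for (let i = 0; i < streams.length; i += batchSize) {
    const batch = streams.slice(i, i + batchSize);
    for (const s of batch) s.close();
    batch.forEach((s, j) => {
      if (!s.peerFinished()) unfinished.push(i + j);
    });
  }
  return unfinished;
}
```

Running this with batch sizes 1, 2, 3, 50... as suggested above reduces the randomness to a single variable.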
@tegefaulkes your current theory is that when the client closes the stream, instead of the server receiving a fin, it just ends up throwing an exception as Done. This is what we want to prove. To do this we must reduce randomness, and just sequentially or batch up the closes on the client until we see a problem.
If the client side doesn't reproduce the problem, then do it on the server side.
Scenarios to prove:
Found a new oddity. When the problem happens, we're not receiving ALL of the messages, and this message loss only occurs when the problem happens. Running tests with low congestion does not have this problem.
This is VERY odd. A quiche stream is meant to maintain data integrity, so is the problem happening between the quiche stream and our QUICStream? But this partially explains the problem: if we're losing random messages then we're randomly losing our finish message.
I'll dig into this further and post my other observations in a moment.
Make sure you're testing with 64 bytes and 64 bytes. There's no equivalence between the number of stream writes and stream reads. Use buffer comparison instead.
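A sketch of that buffer comparison, in plain TypeScript with no js-quic dependency. QUIC preserves the bytes of a stream but not write/read boundaries, so the chunks are concatenated before comparing:

```typescript
// Concatenate the recorded chunks from each side into one buffer.
function concatChunks(chunks: Array<Uint8Array>): Uint8Array {
  const out = new Uint8Array(chunks.reduce((n, c) => n + c.byteLength, 0));
  let offset = 0;
  for (const c of chunks) {
    out.set(c, offset);
    offset += c.byteLength;
  }
  return out;
}

// Compare total payloads rather than counting reads/writes.
function sameBytes(
  sent: Array<Uint8Array>,
  received: Array<Uint8Array>,
): boolean {
  const a = concatChunks(sent);
  const b = concatChunks(received);
  return a.byteLength === b.byteLength && a.every((v, i) => v === b[i]);
}
```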
Notes:
The missing bytes are always at the end of the messages sent.
I was expecting the message with the fin frame to be empty, but during congestion this is not the case; the final packet can contain part of the message. This is a simple fix, I just wasn't writing the message carried with the fin frame to the stream.
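A sketch of that fix. Here `streamRecv` is a stand-in for the quiche wrapper's stream read call, assumed to return `[bytesRead, fin]` or `null` when there is no more buffered data (Done); the real signature may differ:

```typescript
// Drain a stream, keeping any data that arrives in the same read as fin.
function readAll(
  streamRecv: (buf: Uint8Array) => [number, boolean] | null,
): Uint8Array {
  const bytes: Array<number> = [];
  const buf = new Uint8Array(1024);
  while (true) {
    const result = streamRecv(buf);
    if (result == null) break; // Done: nothing more buffered
    const [read, fin] = result;
    // The fix: keep the chunk even when fin is set; previously the
    // message arriving with the fin frame was dropped
    for (let i = 0; i < read; i++) bytes.push(buf[i]);
    if (fin) break;
  }
  return new Uint8Array(bytes);
}
```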
That may be part of the problem but it doesn't explain the streams not cleaning up. I'm still digging.
Ok, things are fixed up so that I can see all of the data is being sent and received.
In one example with the stream closing problem I can see the following.
So after fixing the missing data, we're back to the original problem.
If I send some data along with the fin frame, in this case !END!, I can't seem to trigger the problem even after doubling the number of streams. I can't really say for certain that this is a fix though.
You mentioned https://github.com/cloudflare/quiche/commit/cc98af06cb1240fb1b1be3557166ca443b1d8653 where they patched the quiche code to allow 0 buffer fin frames to be sent.
It's possible they missed an edge case where they're not sending or processing a fin frame. I think I can verify this pretty easily if I can modify the quiche source code with some debug outputs.
If including data within the fin frame does prevent the problem, then that is a stop-gap solution where we provide a finish message and remove it on the receiving side.
I'm reasonably sure that adding data to the finish message is a stop-gap fix for this. That means I can move forward with fixing the tests and resolving this issue.
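A sketch of what that stop-gap could look like, using a single null byte as the sentinel (as suggested below). This is a workaround, not necessarily how js-quic implements it, and a real implementation would need to handle payloads that legitimately end in a null byte (e.g. by escaping):

```typescript
// Never send an empty fin frame: append a sentinel byte with the fin,
// and strip it on the receiving side.
const FIN_SENTINEL = 0;

function withSentinel(finalData: Uint8Array): Uint8Array {
  const out = new Uint8Array(finalData.byteLength + 1);
  out.set(finalData);
  out[finalData.byteLength] = FIN_SENTINEL;
  return out; // to be sent with fin = true
}

function stripSentinel(data: Uint8Array, fin: boolean): Uint8Array {
  if (
    fin &&
    data.byteLength > 0 &&
    data[data.byteLength - 1] === FIN_SENTINEL
  ) {
    return data.slice(0, data.byteLength - 1);
  }
  return data;
}
```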
There is still an underlying problem that needs to be investigated, a reproducible example to be found, and an issue posted upstream.
For the sake of bookkeeping I'll complete this issue and create a new issue for tracking and working on the remaining problem.
Did you post upstream?
Also use a single null byte if you need to do that.
Give me a reproducible rust example to post upstream before closing this issue. We can use the temporary fix for now but upstream issue should be posted.
Almost ready to merge changes. Just need to clean up commit history.
After that I'll create a new issue for tracking the fin frame/stream clean-up problem, and start on the reproducible Rust example and the upstream issue.
New issue created at #20 for tracking the remaining problem
Merged this into master now, resolving.
I'll start on the remaining problem #20 now.
Specification
We need to create tests that create an arbitrary number of connections across an arbitrary number of servers and clients.
The ultimate goal is to test connection multiplexing and that connections function properly when run concurrently.
All the tests should be sending at least 1 stream of random data.
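One way to structure that could be a generated test matrix over client counts, server counts, and streams per connection; the helper and counts here are illustrative, not the actual test layout:

```typescript
// Generate the cartesian product of test dimensions; every case sends
// at least 1 stream (of random data) per connection.
function testMatrix(
  clientCounts: Array<number>,
  serverCounts: Array<number>,
  streamCounts: Array<number>,
): Array<{ clients: number; servers: number; streams: number }> {
  const cases: Array<{ clients: number; servers: number; streams: number }> =
    [];
  for (const clients of clientCounts) {
    for (const servers of serverCounts) {
      for (const streams of streamCounts) {
        cases.push({ clients, servers, streams: Math.max(1, streams) });
      }
    }
  }
  return cases;
}
```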
Additional context
Tasks