CI is failing because of this test, close_succeeds_if_only_fin_ack_dropped:
let mut recv_stream = recv_stream_handle.await.unwrap();
match timeout(EXPECTED_IDLE_TIMEOUT * 2, recv_stream.close()).await {
    Ok(Ok(_)) => {} // The receive stream should close successfully: only FIN-ACK is missing
    Ok(Err(e)) => panic!("Error closing receive stream: {:?}", e),
    Err(e) => {
        panic!("The recv stream did not timeout on close() fast enough, giving up after: {e:?}")
    }
};
It fails on the arm Ok(Err(e)) => panic!("Error closing receive stream: {:?}", e), with the error Error closing receive stream: Kind(NotConnected).
The issue is that the connection is already closed, because we do one-way FINs now. So when we call .close() we get this error, since we are trying to close an already-closed stream. I assume we should just remove the close check on recv_stream.
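Something along these lines might be enough (just a sketch on my part, reusing the snippet above; I'm assuming close() surfaces a std::io::Error, which the Kind(NotConnected) debug output suggests, and this isn't necessarily the final shape of the fix):

match timeout(EXPECTED_IDLE_TIMEOUT * 2, recv_stream.close()).await {
    Ok(Ok(_)) => {} // closed normally
    // Tolerate a stream that the one-way FIN path already tore down.
    Ok(Err(e)) if e.kind() == std::io::ErrorKind::NotConnected => {}
    Ok(Err(e)) => panic!("Error closing receive stream: {:?}", e),
    Err(e) => {
        panic!("The recv stream did not close fast enough, giving up after: {e:?}")
    }
};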
@carver thoughts?
Is something like the test change in 9d4b185 what you are imagining?
Can you run cargo test --test socket locally? I'm finding that on master I get:
finished real udp load test of 1000 simultaneous transfers, in 8.156128821s, at a rate of 981 Mbps
On this PR I get timeout failures locally. Ah I see it now in CI too.
on_packet() is called a lot. Maybe the three separate destructuring matches on self.state have a measurable performance impact? Or maybe the PR introduces some kind of issue in a packet-dropping context.
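For illustration only (the state type and fields below are invented, not our actual connection state): the concern is re-checking the discriminant of self.state several times per packet instead of destructuring it once.

// Purely hypothetical state type, just to show the shape of the concern.
enum State {
    Established { seq_nr: u16, ack_nr: u16 },
    Closing,
}

// Three separate destructurings: the discriminant is re-checked each time.
fn handle_with_three_matches(state: &mut State) {
    if let State::Established { seq_nr, .. } = state { *seq_nr = seq_nr.wrapping_add(1); }
    if let State::Established { ack_nr, .. } = state { *ack_nr = ack_nr.wrapping_add(1); }
    if let State::Established { seq_nr, ack_nr } = state { let _ = (seq_nr, ack_nr); }
}

// One destructuring per call.
fn handle_with_one_match(state: &mut State) {
    if let State::Established { seq_nr, ack_nr } = state {
        *seq_nr = seq_nr.wrapping_add(1);
        *ack_nr = ack_nr.wrapping_add(1);
    }
}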
@carver ok, the PR should be ready for another review now. I thought about it a bit after the call, and I think the suggestions you made about a middle ground and the ideas around it were pretty good, so I implemented them here. Having a dedicated read thread for UDP made a big difference in performance as well, and from my reading it seems to be a common practice. I think the numbers speak for themselves. The big thing, though, is that it improves overall consistency, which to me is more important than the improvement in throughput.
With this middle ground in place I think we are in a position where we can get this PR in.
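The dedicated-reader idea, very roughly (a sketch only; the names, the channel choice, and using a tokio task rather than an OS thread are my assumptions here, not the PR's actual code): one loop does nothing but recv_from() and hands raw datagrams to the rest of the socket machinery over a channel, so packet processing never stalls the OS receive path.

use std::net::SocketAddr;
use std::sync::Arc;
use tokio::net::UdpSocket;
use tokio::sync::mpsc;

// Spawn a task whose only job is pulling datagrams off the socket.
fn spawn_reader(socket: Arc<UdpSocket>) -> mpsc::UnboundedReceiver<(Vec<u8>, SocketAddr)> {
    let (tx, rx) = mpsc::unbounded_channel();
    tokio::spawn(async move {
        let mut buf = [0u8; 64 * 1024];
        loop {
            match socket.recv_from(&mut buf).await {
                Ok((n, src)) => {
                    // Forward the datagram; stop if the consumer went away.
                    if tx.send((buf[..n].to_vec(), src)).is_err() {
                        break;
                    }
                }
                Err(_) => break,
            }
        }
    });
    rx
}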
Converting this to a draft, since after some reading I don't think one-way ACKs are that much of a priority in the grand scheme of changes and improvements that can be made. As I read and learned more, my understanding changed, and the only case this would really fix is why our bridge using trace_gossip isn't working with clients other than Trin.
read_to_eof() should return once the received FIN is fully acknowledged. Currently this takes two calls to process_reads() instead of one, which is redundant because of how the read_buffer gets data in a lock-step fashion; my change, which removes that lock step, can be merged to resolve that.
Currently the bridge isn't working correctly because it relies on the return status of close to check whether a connection worked correctly. But Trin receiving data from any client doesn't rely on the status of close, so any latency added by the closing logic wouldn't affect how quickly we get the data from the connection. There is some merit to having better interop on closing, but it isn't as big of a priority as other changes I found we could make which would actually improve congestion handling, etc.
:partying_face: One-way ACKs are back on!!!! I have figured out the missing links and also reverted the main blocker for one-way ACKs: sending local_fin after all data packets are already ACKed. With the realization I made today, we can do one-way ACKs without the compromise of delaying the closing of our connection!
The fix which made this possible came from two findings:
- ethereum/utp
- we weren't sending reset packets at all :scream:. That is a good find, and I fixed it!

I have tested it quite a bit and this PR is performing as I would expect, so I am opening it for review.
The main way I am testing these changes is with Portal Hive. Our trin-bridge tests replicate the issue of us sleeping 60 seconds when interoping with Fluffy; with this PR that issue is gone.
The other benchmark for success is whether we pass the pseudo-benchmark we have in this repo, which tests whether we properly handle high-stress situations; this PR is handling that well.
Why
Every other uTP implementation uses one-way FINs.
We, on the other hand, use two-way FINs, like TCP. This has some advantages in some cases, but every use case of uTP so far has been sending data one way.
Because we use two-way FINs and Fluffy uses one-way FINs, we hit a problem where we only handle a one-way FIN on the timeout of our connection, which delays finishing the uTP transfer. On Trin this timeout is currently set to 1.25 seconds. The delay might also grow, but that detail doesn't really matter: any delay is enough justification to make this change. This delay will happen whenever we interact with Fluffy or Ultralight, and if we want adoption that is unacceptable.
This is also currently causing interop issues with our bridge. While we could add another timeout exception for this case, it wouldn't fix the 1.25s delay, so we might as well just implement one-way FINs.
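For concreteness, the two close styles look roughly like this (my shorthand for the behaviour described above, not a quote from the spec):

one-way FIN (Fluffy, Ultralight, other implementations):
    data sender:    DATA ... DATA, FIN
    data receiver:  ACKs the data and the FIN, sends no FIN of its own, and considers the transfer done

two-way FIN (our current TCP-like behaviour):
    data sender:    DATA ... DATA, FIN
    data receiver:  ACKs the FIN, then sends its own FIN and waits for it to be ACKed
    data sender:    ACKs the receiver's FIN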
https://portal-hive.ethdevops.io/?page=v-pills-results-tab&suite=1701568288-22f1ef76b380011a7f14fa3c85205088.json Here is an example where these tests are failing because Trin thinks the uTP transfer is timing out while Fluffy thinks it succeeded. In reality Fluffy got the data, and we are waiting for no reason. We also spam Fluffy with 8 FIN packets for no reason, which is another reason to do one-way FINs.
I tested this change on portal-hive and all tests passed using Trin and Fluffy. I ran the trin-bridge test again between Fluffy and Trin and it worked, so it definitely fixes the issue I was seeing. In the process I also double-checked the uTP logs to make sure there wasn't anything funny.
Here are the results with the change; the link above shows them all failing.
What did I do
- Updated on_packet() for the sender and receiver cases to handle one-way ACKs.
- Changed how the write-buffer was written (see the sketch below). The while pending_read.pop_front loop didn't work, because it relied on a lock-step system, so it was impossible for it to loop more than once: the loop in stream.rs only added a new oneshot channel when it was done handling the last one. So the while loop was functionally useless in process_reads. I also needed to do it this way so we would be able to flush the recv-buffer before closing the connection.
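The shape of the change, as I understand it (a sketch only; names like pending_reads and read_buffer mirror the discussion above, not necessarily the crate's exact fields): process_reads drains every queued read it can satisfy in one pass, instead of assuming at most one read is ever pending.

use std::collections::VecDeque;
use tokio::sync::oneshot;

struct Reads {
    // Queued read requests: requested length plus a channel to answer on.
    pending_reads: VecDeque<(usize, oneshot::Sender<Vec<u8>>)>,
    // Bytes received from the remote, waiting to be handed to readers.
    read_buffer: Vec<u8>,
}

impl Reads {
    fn process_reads(&mut self) {
        while let Some((len, tx)) = self.pending_reads.pop_front() {
            if self.read_buffer.is_empty() {
                // Nothing left to hand out: requeue the request and stop.
                self.pending_reads.push_front((len, tx));
                break;
            }
            let n = len.min(self.read_buffer.len());
            let chunk: Vec<u8> = self.read_buffer.drain(..n).collect();
            let _ = tx.send(chunk);
        }
    }
}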