ethereum / utp

uTorrent transport protocol
MIT License

feat: switch to one way acks and fix bugs #118

Closed KolbyML closed 8 months ago

KolbyML commented 11 months ago

Why

Every other uTP implementation uses one way fins.

We, on the other hand, use two way fins, like TCP. This has advantages in some cases, but every use case of uTP so far has sent data one way.

Because we use two way fins and Fluffy uses one way fins, we only handle a one way fin when our connection times out, which delays finishing the uTP connection. In Trin this timeout is currently set to 1.25 seconds. The delay might also grow, but that semantic interaction doesn't matter: any delay is enough justification to make this change. The delay happens whenever we interact with Fluffy or Ultralight, and if we want adoption this is unacceptable.

This is also currently causing interop issues with our bridge. While we could add another timeout exception for this case, it wouldn't fix the 1.25s delay, so we might as well just implement one way fins.
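The difference between the two closing styles can be sketched as a small decision function. This is only an illustration under simplifying assumptions, not the code in this PR: `CloseMode`, `FinState`, and `is_closed` are hypothetical names, and the sketch assumes that a received fin has already been acked and that all data before it has arrived.

```rust
/// Hypothetical closing policy. These names are illustrative only and are
/// not taken from the utp crate.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum CloseMode {
    /// TCP-like: each peer sends its own fin and both must complete.
    TwoWay,
    /// A single fin, once fully handled, ends the connection for both peers.
    OneWay,
}

/// Simplified view of the fin handshake from one peer's perspective.
/// Assumption: `remote_fin_received` implies we already acked that fin and
/// hold all data that precedes it.
#[derive(Clone, Copy, Debug, Default)]
pub struct FinState {
    pub local_fin_acked: bool,
    pub remote_fin_received: bool,
}

/// Returns true when the connection can be considered closed without
/// waiting for an idle timeout.
pub fn is_closed(mode: CloseMode, state: FinState) -> bool {
    match mode {
        // Both directions must finish, so a one way peer leaves us waiting.
        CloseMode::TwoWay => state.local_fin_acked && state.remote_fin_received,
        // Either direction finishing is enough.
        CloseMode::OneWay => state.local_fin_acked || state.remote_fin_received,
    }
}
```

With a sender whose fin was acked by a peer that never sends its own fin, `is_closed(CloseMode::TwoWay, ..)` stays false until a timeout intervenes, while `is_closed(CloseMode::OneWay, ..)` is true immediately. That gap is the delay described above.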

https://portal-hive.ethdevops.io/?page=v-pills-results-tab&suite=1701568288-22f1ef76b380011a7f14fa3c85205088.json Here is an example where these tests fail because Trin thinks the uTP transfer is timing out while Fluffy thinks it succeeded. In reality Fluffy got the data, and we are waiting for no reason. We also spam Fluffy with 8 fin packets for no reason, which is another reason to do one way fins.

I tested this change on portal-hive, and all tests passed with both Trin and Fluffy. I reran the trin-bridge test between Fluffy and Trin and it worked, so it definitely fixes the issue I was seeing. In the process I also double-checked the uTP logs to make sure there wasn't anything funny.

INFO[12-02|22:09:03] hiveproxy started                        container=814974ee4b19 addr=172.17.0.2:8081
INFO[12-02|22:09:03] API: suite started                       suite=0 name=trin-bridge-tests
INFO[12-02|22:09:03] API: test started                        suite=0 test=1 name="Trin bridge tests"
INFO[12-02|22:09:03] API: test started                        suite=0 test=2 name="Bridge test. A:fluffy --> B:fluffy"
INFO[12-02|22:09:04] API: client fluffy started               suite=0 test=2 container=5907afe4
INFO[12-02|22:09:04] API: client fluffy started               suite=0 test=2 container=6567d36d
INFO[12-02|22:09:04] API: client trin-bridge started          suite=0 test=2 container=fccdaf5f
INFO[12-02|22:09:18] API: test ended                          suite=0 test=2 pass=true
INFO[12-02|22:09:18] API: test started                        suite=0 test=3 name="Bridge test. A:fluffy --> B:trin"
INFO[12-02|22:09:18] API: client fluffy started               suite=0 test=3 container=f34274b2
INFO[12-02|22:09:19] API: client trin started                 suite=0 test=3 container=169b4b63
INFO[12-02|22:09:19] API: client trin-bridge started          suite=0 test=3 container=e28e7288
INFO[12-02|22:09:33] API: test ended                          suite=0 test=3 pass=true
INFO[12-02|22:09:33] API: test started                        suite=0 test=4 name="Bridge test. A:trin --> B:fluffy"
INFO[12-02|22:09:33] API: client trin started                 suite=0 test=4 container=80aad6e7
INFO[12-02|22:09:34] API: client fluffy started               suite=0 test=4 container=420b1cbc
INFO[12-02|22:09:34] API: client trin-bridge started          suite=0 test=4 container=ac997bce
INFO[12-02|22:09:47] API: test ended                          suite=0 test=4 pass=true
INFO[12-02|22:09:47] API: test started                        suite=0 test=5 name="Bridge test. A:trin --> B:trin"
INFO[12-02|22:09:48] API: client trin started                 suite=0 test=5 container=e9f6a740
INFO[12-02|22:09:48] API: client trin started                 suite=0 test=5 container=4bd2c4c7
INFO[12-02|22:09:49] API: client trin-bridge started          suite=0 test=5 container=532f876e
INFO[12-02|22:10:02] API: test ended                          suite=0 test=5 pass=true
INFO[12-02|22:10:02] API: test ended                          suite=0 test=1 pass=true
INFO[12-02|22:10:02] API: suite ended                         suite=0
INFO[12-02|22:10:03] simulation trin-bridge finished          suites=1 tests=5 failed=0

Those are the results with the change; the link above shows the same tests all failing.

what did I do

KolbyML commented 11 months ago

CI is failing because of this test, close_succeeds_if_only_fin_ack_dropped:

    let mut recv_stream = recv_stream_handle.await.unwrap();

    match timeout(EXPECTED_IDLE_TIMEOUT * 2, recv_stream.close()).await {
        Ok(Ok(_)) => {} // The receive stream should close successfully: only FIN-ACK is missing
        Ok(Err(e)) => panic!("Error closing receive stream: {:?}", e),
        Err(e) => {
            panic!("The recv stream did not timeout on close() fast enough, giving up after: {e:?}")
        }
    };

It panics on the line Ok(Err(e)) => panic!("Error closing receive stream: {:?}", e), with the message Error closing receive stream: Kind(NotConnected)

The connection is already closed by that point because we do one way fins now. So when we call .close() we get this error, since we are trying to close a stream that is already closed. I assume we should remove the close check on recv_stream. @carver thoughts?
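As an alternative to removing the check entirely, a test could accept `NotConnected` as a valid outcome of closing a stream that already finished. The helper below is a hypothetical sketch of that idea, not part of this PR or of the utp crate; it assumes that close returns an `io::Result<()>`, which is consistent with the `Kind(NotConnected)` output above.

```rust
use std::io;

/// Hypothetical helper: treats `NotConnected` from a close attempt as
/// success, on the assumption that it only means the stream was already
/// closed by a one way fin. Every other error is passed through unchanged.
pub fn accept_already_closed(result: io::Result<()>) -> io::Result<()> {
    match result {
        Err(e) if e.kind() == io::ErrorKind::NotConnected => Ok(()),
        other => other,
    }
}
```

In the test above, the result of `recv_stream.close()` could be passed through such a helper before matching on it, so that only unexpected errors reach the panic arm.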

carver commented 11 months ago

Is something like the test change in 9d4b185 what you are imagining?

Can you test cargo test --test socket locally? I'm finding that on master I get:

finished real udp load test of 1000 simultaneous transfers, in 8.156128821s, at a rate of 981 Mbps

On this PR I get timeout failures locally. Ah I see it now in CI too.

on_packet() is called a lot. Maybe the three separate destructuring calls to self.state have a measurable performance impact? Or maybe the PR introduces some kind of issue that only shows up when packets are being dropped.

KolbyML commented 10 months ago

@carver ok, the PR should be ready for another review now. I thought about it a bit after the call, and I think the suggestions you made about a middle ground and the ideas around it were pretty good, so I implemented them here. Having a dedicated read thread for UDP made a big difference in performance as well, and from my reading it seems to be common practice. I think the numbers speak for themselves. The big thing, though, is that it improves overall consistency, which to me is more important than the improvement in throughput.

With this middle ground in place I think we are in a position where we can get this PR in.
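The dedicated read thread idea can be sketched with the standard library alone. This is a minimal illustration, not the implementation in this PR, which may be structured differently (for example around an async runtime); `spawn_udp_reader` is a hypothetical name.

```rust
use std::io;
use std::net::{SocketAddr, UdpSocket};
use std::sync::mpsc::{self, Receiver};
use std::thread;

/// Hypothetical sketch: spawns a thread whose only job is to read datagrams
/// from `socket` and forward them over a channel, so that packet processing
/// elsewhere never delays draining the socket.
pub fn spawn_udp_reader(socket: UdpSocket) -> Receiver<(Vec<u8>, SocketAddr)> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // Large enough for any UDP datagram.
        let mut buf = [0u8; 65_535];
        loop {
            match socket.recv_from(&mut buf) {
                Ok((len, from)) => {
                    // Stop once the consumer has dropped its receiver.
                    if tx.send((buf[..len].to_vec(), from)).is_err() {
                        break;
                    }
                }
                // Retry if the blocking read was interrupted.
                Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
                // Any other socket error ends the reader thread.
                Err(_) => break,
            }
        }
    });
    rx
}
```

The consumer pulls `(bytes, source address)` pairs from the returned receiver at its own pace. One limitation of this sketch is that the thread only notices a dropped receiver after the next datagram arrives, because `recv_from` blocks.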

KolbyML commented 10 months ago

Converting this to a draft: after some more reading, I don't think one-way-acks is that much of a priority in the grand scheme of changes or improvements which can be made. As I read and learnt more, my understanding changed, and the only case this would really fix is our bridge using trace_gossip not working with clients other than Trin.

read_to_eof() should return once the recv fin is fully acknowledged. Currently this takes 2 calls to process_reads() instead of 1, which is redundant and comes from how the read_buffer gets data in lock step. My change which removes that lock step can be merged to resolve that.

Currently the bridge isn't working correctly because it relies on the return status of close to check whether a connection worked. But Trin receiving data from any client doesn't rely on the status of close, so any delay caused by the closing logic wouldn't affect the latency of getting the data from the connection. There is some merit to having better interop on closing, but it isn't as big a priority as other changes I found we could make, which would actually improve congestion handling and similar things.

KolbyML commented 8 months ago

:partying_face: 1 way acks is back on!!!! I have figured out the missing links and also reverted the main blocker for 1 way acks: sending local_fin after all data packets are already acked.

With the realization I made today, we can do 1 way acks without the compromise of delaying the closing of our connection!!!

The fix which made this possible came from 2 findings

KolbyML commented 8 months ago

I have tested it quite a bit and this PR is performing as I would expect. So I am opening it for review.

KolbyML commented 8 months ago

The main way I am testing these changes is with Portal Hive. Our trin-bridge tests replicate the issue of us sleeping 60 seconds when interoperating with Fluffy. With this PR that issue is gone.

The other benchmark for success is whether we pass the pseudo-benchmark in this repo, which tests whether we properly handle high-stress situations. This PR handles it well.