eclipse-zenoh / zenoh

zenoh unifies data in motion, data in-use, data at rest and computations. It carefully blends traditional pub/sub with geo-distributed storages, queries and computations, while retaining a level of time and space efficiency that is well beyond any of the mainstream stacks.
https://zenoh.io

Use TCP MSS as TCP link MTU #1214

Closed. Mallets closed this pull request 2 months ago.

Mallets commented 3 months ago

This PR retrieves the configured TCP MSS (Maximum Segment Size) for TCP-based streams (i.e. TCP and TLS) and uses the MSS value as the advertised link MTU. This allows choosing the Zenoh frame size more wisely based on the reported information.
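For reference, here is a minimal sketch (not the PR's actual code) of how the negotiated MSS of a connected stream can be read on Linux through the TCP_MAXSEG socket option, assuming the libc crate is available; for TLS links the same query would be issued on the underlying TCP socket.

use std::net::TcpStream;
use std::os::unix::io::AsRawFd;

// Read the kernel-reported TCP MSS (TCP_MAXSEG) of a connected stream.
// Sketch only: assumes Linux and the libc crate, not the PR's actual code.
fn tcp_mss(stream: &TcpStream) -> std::io::Result<u32> {
    let mut mss: libc::c_int = 0;
    let mut len = std::mem::size_of::<libc::c_int>() as libc::socklen_t;
    // SAFETY: valid fd, correctly sized out-parameter and length.
    let ret = unsafe {
        libc::getsockopt(
            stream.as_raw_fd(),
            libc::IPPROTO_TCP,
            libc::TCP_MAXSEG,
            &mut mss as *mut _ as *mut libc::c_void,
            &mut len,
        )
    };
    if ret == 0 {
        Ok(mss as u32)
    } else {
        Err(std::io::Error::last_os_error())
    }
}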

kydos commented 3 months ago

I am not sure we should merge this change, as it leads to too much fragmentation at the Zenoh level. I think that for the time being the only thing we should do is slightly reduce the default batch size, setting it to 64512.

Based on my experiments, this gives the best performance.

Mallets commented 3 months ago

I believe 64512 is an arbitrary number that may lead to suboptimal behaviour depending on the actual TCP configuration. Moreover, default MSS values are defined in RFC 6691 (https://datatracker.ietf.org/doc/rfc6691/), which is what this PR falls back to in case mss() values can't be retrieved because of platform limitations.

A better approach would be to compute the largest multiple of the MSS that is smaller than the largest possible batch size (see the sketch below). By doing so, TCP-based streams are:

The PR has already been updated with the proposed approach. @kydos
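A rough sketch of that computation (hypothetical names and constants, not the PR's actual code): pick the largest multiple of the MSS that fits in the maximum batch size, and fall back to a conservative default such as the 536-byte IPv4 default MSS discussed in RFC 6691 when the MSS cannot be queried.

// Sketch of the batch-size computation described above; the names and the
// 65535-byte maximum batch size are assumptions, not the PR's actual code.
const MAX_BATCH_SIZE: u16 = u16::MAX; // assumed protocol maximum
const FALLBACK_MSS: u32 = 536;        // IPv4 default MSS per RFC 6691

fn link_mtu(mss: Option<u32>) -> u16 {
    let mss = mss.unwrap_or(FALLBACK_MSS).max(1);
    // Largest multiple of the MSS not exceeding the maximum batch size,
    // so each Zenoh batch maps onto a whole number of TCP segments.
    let batch = (MAX_BATCH_SIZE as u32 / mss) * mss;
    // Clamp for the (unlikely) case where the MSS exceeds the maximum batch size.
    batch.max(mss.min(MAX_BATCH_SIZE as u32)) as u16
}

For example, with a typical Ethernet MSS of 1448 bytes this yields 45 x 1448 = 65160 bytes per batch, so batches end exactly on segment boundaries instead of leaving a small trailing segment.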

kydos commented 3 months ago

I agree with this approach; based on some early tests it does eliminate the problem of "left-over" packets and does not degrade performance. Let's do another round of validation on our test bed and then merge.

Mallets commented 2 months ago

Tests run on:

To avoid memory allocations at reception, I've increased the pre-allocated RX buffer size. Config file rx.json5:

{
  mode: "peer",
  transport: {
    link: {
      rx: {
        buffer_size: 2000000, // 2MB
      },
    },
  },
}

Throughput

Run subscriber:

./target/release/examples/z_sub_thr -c ~/zenoh/rx.json5 -l tcp/127.0.0.1:7447 --no-multicast-scouting -s 1000 -n 4000

Run publisher:

./target/release/examples/z_pub_thr 1000000 -e tcp/127.0.0.1:7447 --no-multicast-scouting

Results on dev/1.0.0:

6420.894773036098 msg/s
6438.685549234518 msg/s
6420.968613035673 msg/s
6413.0449645886 msg/s
6428.729333087685 msg/s
6422.802902837155 msg/s
6407.282199146353 msg/s
6423.25416357302 msg/s
6421.126801676679 msg/s
6404.145121354668 msg/s

Results of this PR:

8607.574068244794 msg/s
8293.492466765769 msg/s
8458.31621555157 msg/s
8437.033314071094 msg/s
8379.92313963826 msg/s
8427.817862762067 msg/s
8298.712109174901 msg/s
8028.470111113705 msg/s
8157.1628992629685 msg/s
7914.323240501607 msg/s

Same test but on 100Gbps fiber.

Results on dev/1.0.0:

7173.104176182686 msg/s
7052.261711018855 msg/s
7068.782212749803 msg/s
7203.251661611427 msg/s
7168.573337219441 msg/s
7212.233055060379 msg/s
7151.278130419619 msg/s
7180.462932595089 msg/s
7102.579467401252 msg/s
7054.187927908252 msg/s

Results of this PR:

7370.13108596221 msg/s
7330.315336219838 msg/s
7282.944568629472 msg/s
7329.346178784537 msg/s
7293.924903806076 msg/s
7335.100513084046 msg/s
7265.2923820018295 msg/s
7226.501236347771 msg/s
7208.940834081519 msg/s
7197.633343301336 msg/s

Latency

Run pong:

./target/release/examples/z_pong -c rx.json5 -l tcp/127.0.0.1:7447 --no-multicast-scouting 

Run ping:

./target/release/examples/z_ping 1000000 -c rx.json5  -e tcp/127.0.0.1:7447 --no-multicast-scouting

Results on dev/1.0.0:

1000000 bytes: seq=90 rtt=316µs lat=158µs
1000000 bytes: seq=91 rtt=326µs lat=163µs
1000000 bytes: seq=92 rtt=314µs lat=157µs
1000000 bytes: seq=93 rtt=357µs lat=178µs
1000000 bytes: seq=94 rtt=325µs lat=162µs
1000000 bytes: seq=95 rtt=327µs lat=163µs
1000000 bytes: seq=96 rtt=316µs lat=158µs
1000000 bytes: seq=97 rtt=331µs lat=165µs
1000000 bytes: seq=98 rtt=309µs lat=154µs
1000000 bytes: seq=99 rtt=333µs lat=166µs

Results of this PR:

1000000 bytes: seq=90 rtt=282µs lat=141µs
1000000 bytes: seq=91 rtt=247µs lat=123µs
1000000 bytes: seq=92 rtt=294µs lat=147µs
1000000 bytes: seq=93 rtt=261µs lat=130µs
1000000 bytes: seq=94 rtt=285µs lat=142µs
1000000 bytes: seq=95 rtt=250µs lat=125µs
1000000 bytes: seq=96 rtt=292µs lat=146µs
1000000 bytes: seq=97 rtt=274µs lat=137µs
1000000 bytes: seq=98 rtt=302µs lat=151µs
1000000 bytes: seq=99 rtt=259µs lat=129µs

Same test but on 100Gbps fiber.

Results on dev/1.0.0:

1000000 bytes: seq=90 rtt=503µs lat=251µs
1000000 bytes: seq=91 rtt=475µs lat=237µs
1000000 bytes: seq=92 rtt=472µs lat=236µs
1000000 bytes: seq=93 rtt=531µs lat=265µs
1000000 bytes: seq=94 rtt=521µs lat=260µs
1000000 bytes: seq=95 rtt=520µs lat=260µs
1000000 bytes: seq=96 rtt=513µs lat=256µs
1000000 bytes: seq=97 rtt=526µs lat=263µs
1000000 bytes: seq=98 rtt=525µs lat=262µs
1000000 bytes: seq=99 rtt=511µs lat=255µs

Results of this PR:

1000000 bytes: seq=90 rtt=469µs lat=234µs
1000000 bytes: seq=91 rtt=458µs lat=229µs
1000000 bytes: seq=92 rtt=477µs lat=238µs
1000000 bytes: seq=93 rtt=460µs lat=230µs
1000000 bytes: seq=94 rtt=463µs lat=231µs
1000000 bytes: seq=95 rtt=460µs lat=230µs
1000000 bytes: seq=96 rtt=470µs lat=235µs
1000000 bytes: seq=97 rtt=456µs lat=228µs
1000000 bytes: seq=98 rtt=467µs lat=233µs
1000000 bytes: seq=99 rtt=458µs lat=229µs

Overhead

Total network overhead for 1000 ping/pong samples, including all payloads and headers from Ethernet to TCP. The total number of bytes also includes all TCP ACKs.

dev/1.0.0: 1495350279 bytes, i.e. ~1.5 GB of data.
This PR: 1187915314 bytes, i.e. ~1.2 GB of data, roughly 20% less.

Based on the above results I believe it is safe to merge this PR.

kydos commented 2 months ago

Looks good!