bilibili / quiche

Apache License 2.0
198 stars 44 forks

Have you measured the maximum throughput quiche can reach? #10

Open adcen0107 opened 2 years ago

ktprime commented 2 years ago

The test is limited by CPU performance. On a 9700 I get roughly 540-600 MB/s in a ping-pong test; TCP can reach 1800 MB/s.

adcen0107 commented 2 years ago

That is still a pretty big gap. Did you use a quiche demo built specifically for throughput testing?

ktprime commented 2 years ago

A heavily optimized demo built specifically for testing QUIC performance. The performance gap comes mainly from packet sending (sendto, sendmmsg), with part of it in the gquic protocol stack.

Rouzip commented 2 years ago

> The test is limited by CPU performance. On a 9700 I get roughly 500 MB/s in a ping-pong test; TCP can reach 1800 MB/s.

How was this 500 MB/s measured? Is it effective traffic measured through a QUIC proxy, or raw UDP packet traffic on the NIC? And under what conditions: a single stream on a single connection?

ktprime commented 2 years ago

> The test is limited by CPU performance. On a 9700 I get roughly 540-600 MB/s in a ping-pong test; TCP can reach 1800 MB/s.

> How was this 500 MB/s measured? Is it effective traffic measured through a QUIC proxy, or raw UDP packet traffic on the NIC? And under what conditions: a single stream on a single connection?

A QUIC client and server demo I wrote myself. The bandwidth figure counts effective stream payload, over one connection with one stream (multiple streams do not improve ping-pong performance). The server binds to 127.0.0.1; after connecting, the client pre-sends 10-1000 KB of data (encryption is disabled during handshake negotiation), and the server echoes everything it receives straight back to the client.

Compiled with LTO and PGO enabled, using GCC 10.3 on Ubuntu 20.04 (Win WSL2). To boost performance I did a large amount of optimization work, such as batched sending of reply packets (sendmmsg/GSO).

ktprime commented 8 months ago

After years of continuous, aggressive performance optimization of gquic (based on the 2023.05 version of quiche), with some features trimmed out,

on Win10 + WSL2 with a 12700 CPU, the ping-pong test now reaches 1 GB+/s of bandwidth, with a single core processing 800K+ QUIC packets (receive + send) per second. Application-layer CPU dropped from 45% to 25%; next I plan to optimize the system CPU share. With DPDK, CPU efficiency could probably more than double again.

Statics:1118 [C=10001]  1 ep_fd/last_udp = 1/3, active_udps, udps = 1, 1 tid = 19: client send_batch:1
Statics:1121 [C=10001]  2 ep_conns/all_conns/ep_streams/all_entry = 1|1|1|1, fail_conns/new_conns = 0|0
Statics:1124 [C=10001]  3 poll_calls = 13860, notify_calls = 0, send_calls = 20929, recv_calls = 13860, time_calls = 92199, event_calls = 13860
Statics:1127 [C=10001]  4 send_packets, recv_packets = 812988, 806831/s send_bytes, recv_bytes = 1091022.23, 1084537.95 KB/s
Statics:1130 [C=10001]  5 timer size = 2, once/runs/schedules = 63/  0/  63 /sec
Statics:1133 [C=10001]  6 thread cpu(user_time, sys_time) = 98.93% (24.05%, 74.88%) process 98.93% mem:11.00 MB
Statics:1136 [C=10001]  7 recv, sent, migrations, all_discons, online_sec = 108865573, 109765566, 0, 0, 0 sec
Statics:1141 [C=10001]  8 retrans, loss, duplicate, error, fail_conn, zero_conn =(0.00%%, 0.00%%, 0.00%%, 0.00%%, 0.00%, 100.00%) online = 0.04 hr

DumpQuicStats:924 [C=10001]     1 send,recv,slow_sent = 109765566, 108865573, 9219057:  (quid = 1)
DumpQuicStats:929 [C=10001]     2 loss_timeout, pto = 0, 1, lost, transmit = 27 25 (sp_trans 0. sp_lost 1)
DumpQuicStats:932 [C=10001]     3 slowstart_packets_lost,tcp_loss_events,packets_reordered = 14, 3, 0
DumpQuicStats:934 [C=10001]     4 min_rtt_us, srtt_us, max_reordering/max_send_packet = 24 100 us, 0/1470
DumpQuicStats:938 [C=10001]     5 bw = 2815380 k/s [trans:0.00%% 127.0.0.1:10060] |online = 135 sec, last_recv/last_send = 1/0 ms
DumpQuicStats:941 [C=10001] status =  packets_sent: 109765566 packets_received: 108865573 stream_bytes_received: 148342862802 bytes_retransmitted: 31375 packets_retransmitted: 25 packets_lost: 27 slowstart_packets_sent: 9219057 slowstart_packets_lost: 14 slowstart_bytes_lost: 13225 pto_count: 1 min_rtt_us: 24 srtt_us: 100 egress_mtu: 1470 max_egress_mtu: 1470 ingress_mtu: 1470 estimated_bandwidth: 22.52 Gbits/s (2.82 Gbytes/s) tcp_loss_events: 3 }

:147   binheap[ 1] size = 2, ups/ups_downs/runs = 50546/99%/122
-----------------------------------------------------------------------TPS: 79778/sec, QPS: 266905/sec, RBW 1042.60, SBW: 1042.60 MB/sec Cons: 1
-----------------------------------------------------------------------TPS: 79544/sec, QPS: 264894/sec, RBW 1034.74, SBW: 1034.74 MB/sec Cons: 1
-----------------------------------------------------------------------TPS: 80098/sec, QPS: 266174/sec, RBW 1039.74, SBW: 1039.74 MB/sec Cons: 1
FreeMind-LJ commented 6 months ago

> After years of continuous, aggressive performance optimization of gquic (based on the 2023.05 version of quiche) ... [quotes ktprime's full benchmark comment above]

What optimization ideas did you use?

ktprime commented 6 months ago

Almost all of quiche's core low-level data structures were replaced with high-performance ones (small vector/map/set/list/hash). The absl containers used originally are not well suited to small data sets.

Timekeeping and timers are not cheap either; aggressive optimization removed a large share (about 90%) of the unnecessary high-frequency calls.

Used gcov to measure branch coverage and improved the if statements accordingly (removed roughly 20% of the if statements on the core send/receive path and dropped some rarely used features). Most of the defensive code was deleted as well (which requires extensive functional and stability testing).

The best win was memory optimization: after the changes, the normal packet send/receive path no longer performs any dynamic memory allocation.

Finally, use perf together with LTO to analyze the remaining bottlenecks; in the end it all comes down to code details. Pay close attention to the side effects of every single line on the core path.

ktprime commented 6 months ago

perfs