apache / trafficserver

Apache Traffic Server™ is a fast, scalable and extensible HTTP/1.1 and HTTP/2 compliant caching proxy server.
https://trafficserver.apache.org/
Apache License 2.0

Updated http3 benchmark #11518

Open bryancall opened 2 months ago

bryancall commented 2 months ago
**http2load**
finished in 65.00s, 83685.82 req/s, 94.49MB/s
requests: 5021149 total, 5021149 started, 5021149 done, 5021149 succeeded, 0 failed, 0 errored, 0 timeout
status codes: 5021166 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 5.54GB (5944777841) total, 798.07MB (836837680) headers (space savings 35.31%), 5.19GB (5568962560) data
UDP datagram: 2993583 sent, 6781646 received
                     min         max         mean         sd        +/- sd
time for request:     2.78ms     41.58ms     10.85ms      4.10ms    75.84%
time for connect:        0us         0us         0us         0us     0.00%
time to 1st byte:        0us         0us         0us         0us     0.00%
req/s           :     484.90     1073.51      836.85      186.05    71.00%

**dstat**
You did not select any stats, using -cdngy by default.
----total-usage---- -dsk/total- ---net/lo-- -net/total- ---paging-- ---system--
usr sys idl wai stl| read  writ| recv  send: recv  send|  in   out | int   csw
 64   7  28   0   0|  40k 1313k| 532B  532B:7632k   97M|0.20     0 |3491k  311k
 68   7  24   0   0|   0   506k| 532B  532B:8075k  102M|0.20     0 |3597k  318k
 69   7  24   0   0|   0   146k| 552B  552B:8112k  102M|4.60     0 |3698k  323k
 69   7  23   0   0|   0    35M| 532B  532B:8193k  104M|1.60     0 |3709k  321k
 69   7  23   0   0|   0     0 | 592B  592B:8335k  105M|1.00     0 |3759k  323k
 69   7  23   0   0|   0   267k| 532B  532B:8191k  103M|1.00     0 |3700k  322k
 69   7  24   0   0|   0     0 | 532B  532B:8245k  104M|0.40     0 |3720k  320k
 69   7  24   0   0|   0  1638B| 532B  532B:8128k  101M|0.60     0 |3691k  322k
 69   7  24   0   0|2257k   57M| 532B  532B:8177k  103M|11.6     0 |3694k  324k
 69   7  24   0   0|   0    17M| 532B  532B:8196k  102M|1.60     0 |3723k  321k
 69   7  23   0   0|   0     0 | 592B  592B:8264k  104M|1.40     0 |3746k  321k
 69   7  23   0   0|   0   825k| 532B  532B:8211k  103M|1.00     0 |3711k  321k
 69   7  23   0   0|   0   826k| 532B  532B:8257k  104M|1.60     0 |3728k  318k
**perf stat**
perf: 'stat-p' is not a perf-command. See 'perf --help'.
**perf report**
# Total Lost Samples: 0
#
# Samples: 1M of event 'cycles:P'
# Event count (approx.): 5437933978065
#
#   Overhead  Shared Object         Symbol                                              IPC   [IPC Coverage]
# ..........  ....................  ..................................................  ....................
#
      45.77%  traffic_server        [.] freelist_new(_InkFreeList*)                     -      -
      10.75%  traffic_server        [.] freelist_free(_InkFreeList*, void*)             -      -
       1.54%  traffic_server        [.] ink_freelist_new(_InkFreeList*)                 -      -
       1.03%  libc.so.6             [.] __memmove_avx_unaligned_erms                    -      -
       1.02%  traffic_server        [.] IOBufferBlock::clear()                          -      -
       0.93%  libc.so.6             [.] _int_malloc                                     -      -
       0.72%  libquiche.so          [.] <alloc::string::String as core::fmt::Write>::w  -      -
       0.64%  libquiche.so          [.] core::fmt::write                                -      -
       0.62%  traffic_server        [.] thread_freeup(FreelistAllocator&, ProxyAllocat  -      -
       0.59%  [vdso]                [.] __vdso_clock_gettime                            -      -
       0.51%  traffic_server        [.] QPACK::_encode_header(MIMEField const&, unsign  -      -
       0.49%  libc.so.6             [.] _int_free                                       -      -
       0.46%  [kernel.kallsyms]     [k] __memcpy                                        -      -
       0.37%  libquiche.so          [.] quiche::Connection::send_single                 -      -
bryancall commented 2 months ago

About 23% of the CPU is idle, and a lot of memory allocation is happening: the freelist functions account for over half of the samples in the perf report.

ywkaras commented 2 months ago

It looks like we are not using proxy allocators for H3:

wkaras ~/LOCAL_REPOS/TS
$ grep 'ProxyAllocator.*http[23]' $(findsrc)
./include/iocore/eventsystem/Thread.h:  ProxyAllocator http2ClientSessionAllocator;
./include/iocore/eventsystem/Thread.h:  ProxyAllocator http2ServerSessionAllocator;
./include/iocore/eventsystem/Thread.h:  ProxyAllocator http2StreamAllocator;
wkaras ~/LOCAL_REPOS/TS
$ 

Proxy allocators reduce the number of allocs/frees that reach the class allocators, which require freelist operations. Are we benchmarking with -f, where the freelist new and free just call stdlib malloc and free?
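
To illustrate the idea of a per-thread cache sitting in front of a shared freelist, here is a minimal sketch. It is not the Traffic Server implementation: `SharedFreelist`, `ThreadCache`, and `shared_ops_for` are hypothetical names made up for this example, and the shared list is a plain mutex-guarded vector chosen only to keep the code short. It counts how many operations actually reach the shared list.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <mutex>
#include <new>
#include <vector>

// Hypothetical stand-in for a shared class allocator (illustration only).
// Every alloc/release takes the lock and is counted as a shared operation.
class SharedFreelist {
public:
  explicit SharedFreelist(std::size_t object_size) : object_size_(object_size) {}
  SharedFreelist(const SharedFreelist &) = delete;
  SharedFreelist &operator=(const SharedFreelist &) = delete;
  ~SharedFreelist() {
    for (void *p : free_) std::free(p);
  }

  void *alloc() {
    std::lock_guard<std::mutex> lock(mutex_);
    ++shared_ops_;
    if (free_.empty()) {
      void *fresh = std::malloc(object_size_);
      if (fresh == nullptr) throw std::bad_alloc();
      return fresh;
    }
    void *p = free_.back();
    free_.pop_back();
    return p;
  }

  void release(void *p) {
    std::lock_guard<std::mutex> lock(mutex_);
    ++shared_ops_;
    free_.push_back(p);
  }

  std::size_t shared_ops() {
    std::lock_guard<std::mutex> lock(mutex_);
    return shared_ops_;
  }

private:
  std::size_t object_size_;
  std::size_t shared_ops_ = 0;
  std::vector<void *> free_;
  std::mutex mutex_;
};

// Hypothetical per-thread cache: keeps up to max_cached objects locally and
// only falls back to the shared list when the local cache is empty or full.
class ThreadCache {
public:
  ThreadCache(SharedFreelist &backing, std::size_t max_cached)
    : backing_(backing), max_cached_(max_cached) {}
  ~ThreadCache() {
    for (void *p : cache_) backing_.release(p);
  }

  void *alloc() {
    if (!cache_.empty()) {
      void *p = cache_.back();
      cache_.pop_back();
      return p;
    }
    return backing_.alloc();
  }

  void release(void *p) {
    if (cache_.size() < max_cached_) {
      cache_.push_back(p);
    } else {
      backing_.release(p);
    }
  }

private:
  SharedFreelist &backing_;
  std::size_t max_cached_;
  std::vector<void *> cache_;
};

// Runs `rounds` alloc/release pairs through a ThreadCache and returns how
// many operations reached the shared list while the cache was in use.
std::size_t shared_ops_for(std::size_t rounds) {
  SharedFreelist shared(64);
  ThreadCache cache(shared, 8);
  for (std::size_t i = 0; i < rounds; ++i) {
    void *p = cache.alloc();
    cache.release(p);
  }
  return shared.shared_ops();
}
```

In this toy setup only the very first allocation touches the shared list; every later alloc/release pair is served from the local cache. Whether that still pays off against a general-purpose allocator that already has per-thread arenas is a separate question that only a benchmark can answer.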

maskit commented 2 months ago

There are allocators for H3 and QUIC, although some are not used:

$ git grep ClassAllocator src/proxy/http3/
src/proxy/http3/Http3Frame.cc:ClassAllocator<Http3Frame>         http3FrameAllocator("http3FrameAllocator");
src/proxy/http3/Http3Frame.cc:ClassAllocator<Http3DataFrame>     http3DataFrameAllocator("http3DataFrameAllocator");
src/proxy/http3/Http3Frame.cc:ClassAllocator<Http3HeadersFrame>  http3HeadersFrameAllocator("http3HeadersFrameAllocator");
src/proxy/http3/Http3Frame.cc:ClassAllocator<Http3SettingsFrame> http3SettingsFrameAllocator("http3SettingsFrameAllocator");
$ git grep ClassAllocator src/iocore/net/ | grep -i quic
src/iocore/net/P_QUICNet.h:extern ClassAllocator<QUICPollEvent> quicPollEventAllocator;
src/iocore/net/P_QUICNetVConnection.h:extern ClassAllocator<QUICNetVConnection> quicNetVCAllocator;
src/iocore/net/QUICNet.cc:ClassAllocator<QUICPollEvent> quicPollEventAllocator("quicPollEvent");
src/iocore/net/QUICNetVConnection.cc:ClassAllocator<QUICNetVConnection> quicNetVCAllocator("quicNetVCAllocator");

But I don't think these are the cause.

The heaviest user of the freelist in this benchmark is probably udpPacketAllocator (and ioBlockAllocator, which is used by the UDPPacket class).

ywkaras commented 2 months ago

I see, it looks like there are proxy allocators named quic instead of http3:

wkaras ~/LOCAL_REPOS/TS
$ grep -F ProxyAllocator $(findsrc) | grep -Fi quic
./include/iocore/eventsystem/Thread.h:  ProxyAllocator quicNetVCAllocator;
./include/iocore/eventsystem/Thread.h:  ProxyAllocator quicClientSessionAllocator;
wkaras ~/LOCAL_REPOS/TS
$

ywkaras commented 2 months ago

Maybe we need a corresponding ProxyAllocator for the udpPacketAllocator class allocator. It would be interesting to compare the benchmark hotspots with and without -f.

maskit commented 2 months ago

Yes, that was suggested a few years ago, but nobody has tried it, even though it doesn't require any protocol knowledge.

ywkaras commented 2 months ago

I don't think we know for sure whether proxy and class allocators improve performance when a general-purpose allocator with per-thread arenas is used.

maskit commented 2 months ago

Yup, that's one of the reasons I stopped using them for new code for H3 and QUIC (using them requires extra code, and there was an issue where constructors/destructors are not called).
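
For readers unfamiliar with the constructor/destructor point: an allocator that hands back raw memory does not run any constructor by itself, so the caller has to use placement new and an explicit destructor call. A minimal sketch of that extra code, assuming nothing about the actual Traffic Server allocators (`Session`, `construct_in_raw`, and `destroy_and_release` are hypothetical names, and `malloc`/`free` stand in for the pool):

```cpp
#include <cassert>
#include <cstdlib>
#include <new>
#include <string>
#include <utility>

// Hypothetical object with a non-trivial constructor and destructor.
struct Session {
  static int live; // number of currently constructed Session objects
  std::string name;
  explicit Session(std::string n) : name(std::move(n)) { ++live; }
  ~Session() { --live; }
};
int Session::live = 0;

// Takes raw memory (malloc stands in for a pool here) and runs the
// constructor in place with placement new.
template <typename T, typename... Args>
T *construct_in_raw(Args &&...args) {
  void *raw = std::malloc(sizeof(T));
  if (raw == nullptr) throw std::bad_alloc();
  return new (raw) T(std::forward<Args>(args)...);
}

// Runs the destructor explicitly before giving the memory back.
// Skipping the destructor call would leak whatever the object owns.
template <typename T>
void destroy_and_release(T *obj) {
  obj->~T();
  std::free(obj);
}
```

A caller would pair the two, for example `Session *s = construct_in_raw<Session>("h3");` followed later by `destroy_and_release(s);`. Forgetting either half is the kind of bug that plain `new`/`delete` avoids.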

ksqrtr commented 2 months ago

Can you please share trafficserver's config (records.yaml) used for this benchmark?

ksqrtr commented 2 months ago

And which branch was used for this benchmark? master or 10.0.x?

brbzull0 commented 1 month ago

I did some tests locally on my small NUC. I couldn't get a very similar result, but allocation seems to be a hotspot anyway.

https://gist.github.com/brbzull0/19bcd10f135057d66a9540581c8b54b6