Open bryancall opened 2 months ago
About 23% idle CPU and a lot of memory allocation happening.
It looks like we are not using proxy allocators for H3:
wkaras ~/LOCAL_REPOS/TS
$ grep 'ProxyAllocator.*http[23]' $(findsrc)
./include/iocore/eventsystem/Thread.h: ProxyAllocator http2ClientSessionAllocator;
./include/iocore/eventsystem/Thread.h: ProxyAllocator http2ServerSessionAllocator;
./include/iocore/eventsystem/Thread.h: ProxyAllocator http2StreamAllocator;
wkaras ~/LOCAL_REPOS/TS
$
Proxy allocators reduce the number of allocs/frees that go through the class allocators, which require freelist operations. Are we benchmarking with -f, where the freelist new and free just call stdlib malloc and free?
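To make the idea concrete, here is a rough sketch of a per-thread cache sitting in front of a shared freelist. This is not the Traffic Server implementation: `GlobalFreelist`, `ThreadCache`, and `shared_ops_for_churn` are hypothetical names, and the shared list is simplified to a mutex-protected vector. It only illustrates why a thread-local layer cuts down on operations against the shared structure.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a shared class allocator: a single freelist
// guarded by a mutex. Every alloc/release counts as one shared operation.
class GlobalFreelist {
public:
  explicit GlobalFreelist(std::size_t object_size) : size_(object_size) {}
  ~GlobalFreelist() {
    for (void *p : free_) std::free(p);
  }

  void *alloc() {
    std::lock_guard<std::mutex> lock(mutex_);
    ++shared_ops_;
    if (free_.empty()) return std::malloc(size_);
    void *p = free_.back();
    free_.pop_back();
    return p;
  }

  void release(void *p) {
    std::lock_guard<std::mutex> lock(mutex_);
    ++shared_ops_;
    free_.push_back(p);
  }

  // Read without locking; only meant for single-threaded inspection.
  std::size_t shared_ops() const { return shared_ops_; }

private:
  std::size_t size_;
  std::size_t shared_ops_ = 0;
  std::mutex mutex_;
  std::vector<void *> free_;
};

// Hypothetical per-thread cache: serves allocs from a local list and only
// falls through to the shared freelist when the local list is empty or full.
class ThreadCache {
public:
  ThreadCache(GlobalFreelist &backing, std::size_t max_cached)
    : backing_(backing), max_cached_(max_cached) {}
  ~ThreadCache() {
    for (void *p : cache_) backing_.release(p);
  }

  void *alloc() {
    if (!cache_.empty()) {
      void *p = cache_.back();
      cache_.pop_back();
      return p;
    }
    return backing_.alloc();
  }

  void release(void *p) {
    if (cache_.size() < max_cached_) {
      cache_.push_back(p);
      return;
    }
    backing_.release(p);
  }

private:
  GlobalFreelist &backing_;
  std::size_t max_cached_;
  std::vector<void *> cache_;
};

// Runs an alloc/release loop and reports how many operations reached the
// shared freelist. A cache size of 0 disables the thread-local layer.
std::size_t shared_ops_for_churn(bool use_thread_cache, int iterations) {
  GlobalFreelist global(64);
  std::size_t ops = 0;
  {
    ThreadCache cache(global, use_thread_cache ? 8 : 0);
    for (int i = 0; i < iterations; ++i) {
      void *p = cache.alloc();
      assert(p != nullptr);
      cache.release(p);
    }
    ops = global.shared_ops(); // sampled before the cache drains on destruction
  }
  return ops;
}
```

In this toy setup, a tight alloc/release loop touches the shared list once with the cache enabled and twice per iteration without it. Whether that translates into a measurable win over a general-purpose malloc is exactly the open question here.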
There are allocators for H3 and QUIC, although some are not used:
$ git grep ClassAllocator src/proxy/http3/
src/proxy/http3/Http3Frame.cc:ClassAllocator<Http3Frame> http3FrameAllocator("http3FrameAllocator");
src/proxy/http3/Http3Frame.cc:ClassAllocator<Http3DataFrame> http3DataFrameAllocator("http3DataFrameAllocator");
src/proxy/http3/Http3Frame.cc:ClassAllocator<Http3HeadersFrame> http3HeadersFrameAllocator("http3HeadersFrameAllocator");
src/proxy/http3/Http3Frame.cc:ClassAllocator<Http3SettingsFrame> http3SettingsFrameAllocator("http3SettingsFrameAllocator");
$ git grep ClassAllocator src/iocore/net/ | grep -i quic
src/iocore/net/P_QUICNet.h:extern ClassAllocator<QUICPollEvent> quicPollEventAllocator;
src/iocore/net/P_QUICNetVConnection.h:extern ClassAllocator<QUICNetVConnection> quicNetVCAllocator;
src/iocore/net/QUICNet.cc:ClassAllocator<QUICPollEvent> quicPollEventAllocator("quicPollEvent");
src/iocore/net/QUICNetVConnection.cc:ClassAllocator<QUICNetVConnection> quicNetVCAllocator("quicNetVCAllocator");
But I don't think these are the cause.
The heaviest user of the freelist in the benchmark is probably udpPacketAllocator (and ioBlockAllocator, used by the UDPPacket class).
I see, it looks like there are proxy allocators named quic instead of http3:
wkaras ~/LOCAL_REPOS/TS
$ grep -F ProxyAllocator $(findsrc) | grep -Fi quic
./include/iocore/eventsystem/Thread.h: ProxyAllocator quicNetVCAllocator;
./include/iocore/eventsystem/Thread.h: ProxyAllocator quicClientSessionAllocator;
wkaras ~/LOCAL_REPOS/TS
$
Maybe we need a corresponding ProxyAllocator for the udpPacketAllocator class allocator. It would be interesting to compare the benchmark hotspots with and without -f.
Yes, that was suggested a few years ago, and nobody has tried it, even though that doesn't require any protocol knowledge.
I don't think we know whether proxy and class allocators improve performance when used with a general-purpose allocator that has per-thread arenas.
Yup, that's one of the reasons I stopped using them for new code for H3 and QUIC (using them requires extra code, and there was an issue where constructors/destructors were not called).
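For anyone unfamiliar with the constructor/destructor problem, here is a generic sketch of the extra code a raw-memory pool needs. It is not the ClassAllocator API: `Session`, `raw_alloc`, `raw_free`, `construct_in_pool`, and `destroy_in_pool` are hypothetical, and plain malloc/free stand in for the pool. The point is only that storage from such a pool is uninitialized, so the caller has to run placement new and the destructor explicitly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <string>
#include <utility>

// Hypothetical object with a non-trivial constructor and destructor.
struct Session {
  static int live; // number of currently constructed Session objects
  std::string name;
  explicit Session(std::string n) : name(std::move(n)) { ++live; }
  ~Session() { --live; }
};
int Session::live = 0;

// Stand-ins for a pool that hands out raw, unconstructed storage.
void *raw_alloc(std::size_t n) { return std::malloc(n); }
void raw_free(void *p) { std::free(p); }

// Allocate raw storage, then run the constructor in place.
template <typename T, typename... Args>
T *construct_in_pool(Args &&...args) {
  void *mem = raw_alloc(sizeof(T));
  if (mem == nullptr) throw std::bad_alloc();
  try {
    return new (mem) T(std::forward<Args>(args)...);
  } catch (...) {
    raw_free(mem); // do not leak the storage if the constructor throws
    throw;
  }
}

// Run the destructor explicitly, then hand the storage back.
template <typename T>
void destroy_in_pool(T *obj) {
  assert(obj != nullptr);
  obj->~T();
  raw_free(obj);
}
```

Skipping either step means a member like `std::string name` is never initialized or never released, which is the kind of bug that is easy to introduce when every call site has to remember the pairing.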
Can you please share trafficserver's config (records.yaml) used for this benchmark?
And which branch was used for this benchmark? master or 10.0.x?
I did some tests locally on my small NUC. I couldn't get a very similar result, but allocation seems to be a hotspot anyway:
https://gist.github.com/brbzull0/19bcd10f135057d66a9540581c8b54b6