Open bryancall opened 2 weeks ago
Here is a perf top from 2 hosts in Denver that are getting ~1600 rps. They look much more similar. ATS 10 is doing 107 rps per cpu percentage and ATS 9 is doing 101 rps per cpu percentage.
ATS 10
2.01% libcrypto.so.1.1 [.] bn_mul_mont
1.74% libcrypto.so.1.1 [.] rsaz_1024_sqr_avx2
1.20% [kernel] [k] do_syscall_64
1.04% libjemalloc.so.2 [.] free
1.04% libwaflz.so [.] ns_waflz::ac::find
0.96% libc-2.28.so [.] __memmove_avx_unaligned_erms
0.81% traffic_server [.] huffman_decode
0.69% libc-2.28.so [.] tolower
0.54% libjemalloc.so.2 [.] malloc
0.52% libpcre2-8.so.0.11.2 [.] pcre2_match_8
0.48% libcrypto.so.1.1 [.] rsaz_1024_mul_avx2
0.46% libcrypto.so.1.1 [.] sha256_block_data_order_avx2
0.46% libcrypto.so.1.1 [.] bn_sqr8x_internal
0.45% traffic_server [.] mime_hdr_field_find
0.44% libwaflz.so [.] ns_waflz::waf::process_rule_part
ATS 9
1.68% libcrypto.so.1.1 [.] bn_mul_mont
1.60% libpcre.so.1.2.13 [.] pcre_exec
1.44% libcrypto.so.1.1 [.] rsaz_1024_sqr_avx2
1.01% [kernel] [k] do_syscall_64
0.91% libc-2.28.so [.] __memmove_avx_unaligned_erms
0.91% libjemalloc.so.2 [.] free
0.88% libwaflz.so [.] ns_waflz::ac::find
0.82% libpcre.so.1.2.13 [.] match
0.78% libpcre2-8.so.0.7.1 [.] pcre2_match_8
0.74% traffic_server [.] huffman_decode
0.65% libc-2.28.so [.] __strncasecmp_l_avx
0.58% libc-2.28.so [.] tolower
0.49% traffic_server [.] _mime_hdr_field_list_search_by_string
0.44% libjemalloc.so.2 [.] malloc
0.42% libcrypto.so.1.1 [.] sha256_block_data_order_avx2
Are all the (likely inlined) calls to ink_get_hrtime() in EThread::execute_regular() in Yahoo ATS 9?
It seems iffy that clock_gettime() is avoiding the kernel call/thunk in RedHat 8. Looks like the COARSE clock is needed ( https://docs.redhat.com/en/documentation/red_hat_enterprise_linux_for_real_time/8/html/optimizing_rhel_8_for_real_time_for_low_latency_operation/assembly_application-timestamping_optimizing-rhel8-for-real-time-for-low-latency-operation#assembly_application-timestamping_optimizing-RHEL8-for-real-time-for-low-latency-operation ). But I don't know why this would affect one YCPI proxy and not another.
Looks like you can select the (potentially not in userspace) clock type with a config, proxy.config.system_clock . (Definitely a contender for most dubiously useful ATS config variable.) So maybe double check it's consistent between the two proxies running ATS 10?
Recommend not assigning this to me. Unless you want to be lower priority than binge-watching Fear The Walking Dead.
Here is a recent perf top from Apple:
8.66% libcrypto.so.1.0.0 [.] _aesni_ctr32_ghash_6x
3.80% [kernel] [k] copy_user_enhanced_fast_string
1.85% [kernel] [k] nft_do_chain
1.65% libc.so.6 [.] __memmove_avx_unaligned_erms
1.25% libcrypto.so.1.0.0 [.] rsaz_1024_sqr_avx2
0.98% libcrypto.so.1.0.0 [.] fe_mul_impl.llvm.7949921813875997555
0.90% [kernel] [k] skb_release_data
0.88% libcrypto.so.1.0.0 [.] sha256_block_data_order_avx
0.81% traffic_server [.] _ZN8PluginVC18process_write_sideEv
0.79% [kernel] [k] pskb_expand_head
0.76% [kernel] [k] _raw_spin_lock
0.74% [kernel] [k] __slab_free
0.67% [kernel] [k] tcp_sendmsg_locked
0.60% [kernel] [k] tcp_clean_rtx_queue.constprop.0
0.60% traffic_server [.] _ZN20Http2ConnectionState15restart_streamsEv
0.59% [kernel] [k] __list_del_entry_valid
0.56% libcrypto.so.1.0.0 [.] X25519
0.56% [kernel] [k] ixgbe_clean_rx_irq
0.54% libjemalloc.so.2 [.] free
0.54% [kernel] [k] csum_partial
0.53% [kernel] [k] tcp_write_xmit
0.51% [kernel] [k] __alloc_skb
0.50% [kernel] [k] __netif_receive_skb_core.constprop.0
0.49% traffic_server [.] _Z19mime_hdr_field_findP11MIMEHdrImplPKci
0.48% [kernel] [k] __tcp_transmit_skb
0.47% [kernel] [k] kfree
0.47% [kernel] [k] ___slab_alloc
0.46% traffic_server [.] _ZN11Http2Stream15restart_sendingEv
0.46% traffic_server [.] _ZN9MutexLockD2Ev
I counted calls to Http2ConnectionState::restart_streams
and found that ATS 10 calls this function 20 times more than 9.2. (so that comes out to 6293121 calls in 10 seconds vs 342508)
We think this increase primarily comes from this change in this PR: https://github.com/apache/trafficserver/pull/11094/files#diff-00a3e1dd2853b5969f857e4e598b30d7e5c8cca51277c2446b9a5bdd62dd8f90R2241-R2243
These are numbers from 2 Ireland hosts, not the busiest time of the day. They are both getting around 850-870 rps. ATS 10 is doing 69 rps per cpu percentage and ATS 9 is doing 130 rps per cpu percentage.
ATS 10
ATS 9