apache / trafficserver

Apache Traffic Server™ is a fast, scalable and extensible HTTP/1.1 and HTTP/2 compliant caching proxy server.
https://trafficserver.apache.org/
Apache License 2.0
1.8k stars 796 forks source link

ATS 10 vs 9 perf top in production #11749

Open bryancall opened 2 weeks ago

bryancall commented 2 weeks ago

These are numbers from 2 Ireland hosts, not the busiest time of the day. They are both getting around 850-870 rps. ATS 10 is doing 69 rps per cpu percentage and ATS 9 is doing 130 rps per cpu percentage.

ATS 10

  13.91%  traffic_server                                           [.] EThread::execute_regular
   3.63%  [kernel]                                                 [k] sort
   2.33%  [kernel]                                                 [k] cmppid
   1.98%  [kernel]                                                 [k] __task_pid_nr_ns
   1.70%  [kernel]                                                 [k] __x86_indirect_thunk_r12
   1.54%  [kernel]                                                 [k] u32_swap
   1.53%  libcrypto.so.1.1                                         [.] __ecp_nistz256_mul_montx
   1.07%  [kernel]                                                 [k] css_task_iter_next
   1.04%  libcrypto.so.1.1                                         [.] __ecp_nistz256_sqr_montx
   1.00%  [kernel]                                                 [k] __x86_indirect_thunk_rax
   0.93%  libjose.so.0.0.0                                         [.] jose_b64_dec_buf
   0.76%  libwaflz.so                                              [.] ns_waflz::ac::find
   0.62%  libc-2.28.so                                             [.] tolower
   0.54%  libjemalloc.so.2                                         [.] free
   0.53%  [kernel]                                                 [k] native_queued_spin_lock_slowpath
   0.52%  libc-2.28.so                                             [.] __memmove_avx_unaligned_erms

ATS 9

   2.37%  libcrypto.so.1.1                                         [.] __ecp_nistz256_mul_montx
   1.62%  libcrypto.so.1.1                                         [.] __ecp_nistz256_sqr_montx
   1.39%  libjose.so.0.0.0                                         [.] jose_b64_dec_buf
   1.20%  libwaflz.so                                              [.] ns_waflz::ac::find
   0.95%  libc-2.28.so                                             [.] tolower
   0.93%  libjemalloc.so.2                                         [.] free
   0.91%  libpcre.so.1.2.13                                        [.] pcre_exec
   0.87%  libpcre2-8.so.0.7.1                                      [.] pcre2_match_8
   0.84%  libcrypto.so.1.1                                         [.] bn_mul_mont
   0.82%  libc-2.28.so                                             [.] __memmove_avx_unaligned_erms
   0.65%  libcrypto.so.1.1                                         [.] bn_sqrx8x_internal
   0.62%  [kernel]                                                 [k] cpuidle_enter_state
   0.61%  libpcre.so.1.2.13                                        [.] match
   0.57%  [kernel]                                                 [k] menu_select
   0.53%  libjemalloc.so.2                                         [.] malloc
   0.52%  libc-2.28.so                                             [.] __strncasecmp_l_avx
bryancall commented 2 weeks ago

Here is a perf top from 2 hosts in Denver that are getting ~1600 rps. They look much more similar. ATS 10 is doing 107 rps per cpu percentage and ATS 9 is doing 101 rps per cpu percentage.

ATS 10

   2.01%  libcrypto.so.1.1                                         [.] bn_mul_mont
   1.74%  libcrypto.so.1.1                                         [.] rsaz_1024_sqr_avx2
   1.20%  [kernel]                                                 [k] do_syscall_64
   1.04%  libjemalloc.so.2                                         [.] free
   1.04%  libwaflz.so                                              [.] ns_waflz::ac::find
   0.96%  libc-2.28.so                                             [.] __memmove_avx_unaligned_erms
   0.81%  traffic_server                                           [.] huffman_decode
   0.69%  libc-2.28.so                                             [.] tolower
   0.54%  libjemalloc.so.2                                         [.] malloc
   0.52%  libpcre2-8.so.0.11.2                                     [.] pcre2_match_8
   0.48%  libcrypto.so.1.1                                         [.] rsaz_1024_mul_avx2
   0.46%  libcrypto.so.1.1                                         [.] sha256_block_data_order_avx2
   0.46%  libcrypto.so.1.1                                         [.] bn_sqr8x_internal
   0.45%  traffic_server                                           [.] mime_hdr_field_find
   0.44%  libwaflz.so                                              [.] ns_waflz::waf::process_rule_part

ATS 9

   1.68%  libcrypto.so.1.1                                         [.] bn_mul_mont
   1.60%  libpcre.so.1.2.13                                        [.] pcre_exec
   1.44%  libcrypto.so.1.1                                         [.] rsaz_1024_sqr_avx2
   1.01%  [kernel]                                                 [k] do_syscall_64
   0.91%  libc-2.28.so                                             [.] __memmove_avx_unaligned_erms
   0.91%  libjemalloc.so.2                                         [.] free
   0.88%  libwaflz.so                                              [.] ns_waflz::ac::find
   0.82%  libpcre.so.1.2.13                                        [.] match
   0.78%  libpcre2-8.so.0.7.1                                      [.] pcre2_match_8
   0.74%  traffic_server                                           [.] huffman_decode
   0.65%  libc-2.28.so                                             [.] __strncasecmp_l_avx
   0.58%  libc-2.28.so                                             [.] tolower
   0.49%  traffic_server                                           [.] _mime_hdr_field_list_search_by_string
   0.44%  libjemalloc.so.2                                         [.] malloc
   0.42%  libcrypto.so.1.1                                         [.] sha256_block_data_order_avx2
ywkaras commented 1 week ago

Are all the (likely inlined) calls to ink_get_hrtime() in EThread::execute_regular() in Yahoo ATS 9?

It seems iffy that clock_gettime() is avoiding the kernel call/thunk in RedHat 8. Looks like the COARSE clock is needed ( https://docs.redhat.com/en/documentation/red_hat_enterprise_linux_for_real_time/8/html/optimizing_rhel_8_for_real_time_for_low_latency_operation/assembly_application-timestamping_optimizing-rhel8-for-real-time-for-low-latency-operation#assembly_application-timestamping_optimizing-RHEL8-for-real-time-for-low-latency-operation ). But I don't know why this would affect one YCPI proxy and not another.

Looks like you can select the (potentially not in userspace) clock type with a config, proxy.config.system_clock . (Definitely a contender for most dubiously useful ATS config variable.) So maybe double check it's consistent between the two proxies running ATS 10?

ywkaras commented 1 week ago

Recommend not assigning this to me. Unless you want to be lower priority than binge-watching Fear The Walking Dead.

cmcfarlen commented 1 week ago

Here is a recent perf top from Apple:

   8.66%  libcrypto.so.1.0.0                               [.] _aesni_ctr32_ghash_6x
   3.80%  [kernel]                                         [k] copy_user_enhanced_fast_string
   1.85%  [kernel]                                         [k] nft_do_chain
   1.65%  libc.so.6                                        [.] __memmove_avx_unaligned_erms
   1.25%  libcrypto.so.1.0.0                               [.] rsaz_1024_sqr_avx2
   0.98%  libcrypto.so.1.0.0                               [.] fe_mul_impl.llvm.7949921813875997555
   0.90%  [kernel]                                         [k] skb_release_data
   0.88%  libcrypto.so.1.0.0                               [.] sha256_block_data_order_avx
   0.81%  traffic_server                                   [.] _ZN8PluginVC18process_write_sideEv
   0.79%  [kernel]                                         [k] pskb_expand_head
   0.76%  [kernel]                                         [k] _raw_spin_lock
   0.74%  [kernel]                                         [k] __slab_free
   0.67%  [kernel]                                         [k] tcp_sendmsg_locked
   0.60%  [kernel]                                         [k] tcp_clean_rtx_queue.constprop.0
   0.60%  traffic_server                                   [.] _ZN20Http2ConnectionState15restart_streamsEv
   0.59%  [kernel]                                         [k] __list_del_entry_valid
   0.56%  libcrypto.so.1.0.0                               [.] X25519
   0.56%  [kernel]                                         [k] ixgbe_clean_rx_irq
   0.54%  libjemalloc.so.2                                 [.] free
   0.54%  [kernel]                                         [k] csum_partial
   0.53%  [kernel]                                         [k] tcp_write_xmit
   0.51%  [kernel]                                         [k] __alloc_skb
   0.50%  [kernel]                                         [k] __netif_receive_skb_core.constprop.0
   0.49%  traffic_server                                   [.] _Z19mime_hdr_field_findP11MIMEHdrImplPKci
   0.48%  [kernel]                                         [k] __tcp_transmit_skb
   0.47%  [kernel]                                         [k] kfree
   0.47%  [kernel]                                         [k] ___slab_alloc
   0.46%  traffic_server                                   [.] _ZN11Http2Stream15restart_sendingEv
   0.46%  traffic_server                                   [.] _ZN9MutexLockD2Ev

I counted calls to Http2ConnectionState::restart_streams and found that ATS 10 calls this function 20 times more than 9.2. (so that comes out to 6293121 calls in 10 seconds vs 342508)

We think this increase primarily comes from this change in this PR: https://github.com/apache/trafficserver/pull/11094/files#diff-00a3e1dd2853b5969f857e4e598b30d7e5c8cca51277c2446b9a5bdd62dd8f90R2241-R2243