Closed — patrick7 closed this issue 5 years ago
SNMP enabled (-M snmp): bgpd uses ~7-50% CPU.
SNMP enabled (-M snmp) + agentx: bgpd uses 100% CPU.
I believe the SNMP code is misbehaving when it receives an SNMP request.
Hello, I'm using FRR 4.0, and sometimes one of the virtual CPUs is utilized at 100%. I managed to capture some top and perf top output.
top
top - 10:37:22 up 21 days, 52 min, 2 users, load average: 0.99, 1.00, 0.64
Tasks: 286 total, 2 running, 284 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.8 us, 4.2 sy, 0.0 ni, 94.7 id, 0.0 wa, 0.0 hi, 0.3 si, 0.0 st
KiB Mem : 32737652 total, 27340504 free, 546144 used, 4851004 buff/cache
KiB Swap: 52428796 total, 52428796 free, 0 used. 29956864 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
906 frr 15 -5 215372 7756 1772 R 100.3 0.0 160:46.59 bgpd
91 root 20 0 0 0 0 S 1.3 0.0 491:13.29 ksoftirqd/16
9099 master54 20 0 164204 2528 1612 R 0.7 0.0 0:00.08 top
10 root 20 0 0 0 0 S 0.3 0.0 33:22.54 rcu_sched
15 root 20 0 0 0 0 S 0.3 0.0 452:26.95 ksoftirqd/1
71 root 20 0 0 0 0 S 0.3 0.0 685:20.83 ksoftirqd/12
76 root 20 0 0 0 0 S 0.3 0.0 366:02.75 ksoftirqd/13
86 root 20 0 0 0 0 S 0.3 0.0 1919:36 ksoftirqd/15
96 root 20 0 0 0 0 S 0.3 0.0 600:32.82 ksoftirqd/17
481 root 20 0 0 0 0 S 0.3 0.0 24:21.97 xfsaild/sda3
973 frr 15 -5 65384 5664 1400 S 0.3 0.0 23:40.58 ospfd
1726 root 20 0 224120 11640 6728 S 0.3 0.0 75:34.06 snmpd
3101 root 20 0 0 0 0 S 0.3 0.0 0:00.53 kworker/6:2
8343 root 20 0 0 0 0 S 0.3 0.0 0:00.92 kworker/17:0
47427 root 20 0 0 0 0 S 0.3 0.0 0:47.31 kworker/u128:0
perf top
Samples: 20K of event 'cycles:pp', Event count (approx.): 134699051237035
Overhead Shared Object Symbol
99.99% [kernel] [k] native_write_msr_safe
0.00% [kernel] [k] fget_light
0.00% [kernel] [k] _raw_spin_lock_irqsave
0.00% [kernel] [k] system_call_after_swapgs
0.00% [kernel] [k] fput
0.00% [kernel] [k] __pollwait
0.00% [kernel] [k] unix_poll
0.00% [kernel] [k] sysret_check
0.00% [kernel] [k] _raw_spin_unlock_irqrestore
0.00% [kernel] [k] sock_poll
0.00% [kernel] [k] tcp_poll
0.00% [kernel] [k] do_sys_poll
0.00% [kernel] [k] __x86_indirect_thunk_rax
0.00% [kernel] [k] read_tsc
0.00% libfrr.so.0.0.0 [.] thread_fetch
0.00% [vdso] [.] __vdso_clock_gettime
0.00% [kernel] [k] copy_user_generic_string
0.00% [kernel] [k] select_estimate_accuracy
0.00% [kernel] [k] sha1_transform_avx
0.00% [kernel] [k] add_wait_queue
0.00% libpthread-2.17.so [.] pthread_mutex_lock
0.00% libpthread-2.17.so [.] pthread_mutex_unlock
0.00% [kernel] [k] __audit_syscall_entry
0.00% libpthread-2.17.so [.] pthread_cond_broadcast@@GLIBC_2.3.2
0.00% [kernel] [k] __audit_syscall_exit
0.00% [kernel] [k] remove_wait_queue
0.00% [kernel] [k] sys_poll
0.00% [kernel] [k] __list_add
0.00% [kernel] [k] ktime_get_ts64
0.00% libc-2.17.so [.] 0x00000000000f3f0b
0.00% [kernel] [k] pipe_poll
0.00% [kernel] [k] poll_freewait
0.00% [kernel] [k] __list_del_entry
0.00% [kernel] [k] _aesni_enc1
0.00% [kernel] [k] list_del
0.00% libc-2.17.so [.] __libc_enable_asynccancel
0.00% libc-2.17.so [.] __libc_disable_asynccancel
0.00% [kernel] [k] __x86_indirect_thunk_rdx
0.00% [kernel] [k] get_page_from_freelist
0.00% [kernel] [k] vsnprintf
0.00% [kernel] [k] cpuidle_enter_state
0.00% [kernel] [k] module_get_kallsym
0.00% libc-2.17.so [.] __clock_gettime
I tried stopping and restarting the snmpd service. The problem still exists.
I was hoping the perf top output would give us some helpful pointers, but unfortunately it does not. Can we get perf reports with stack traces? https://github.com/FRRouting/frr/wiki/Perf-Recording
No perf reports were provided, and 5.0 is nearing effective EOL anyway. Reopen if this is still an issue.
I'm seeing this behaviour with bgpd versions 7.0.1 and 6.0.2, both running with -M snmp. When snmpd is restarted, bgpd's CPU utilization drops to ~3%. See below for htop and perf top output; flamegraphs were produced as outlined in https://github.com/FRRouting/frr/wiki/Perf-Recording.
If more information is needed, please ask.
htop:
PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command
1466 frr 15 -5 3037M 2861M 6912 R 96.7 8.9 88h26:47 /usr/lib/frr/bgpd -d -A 127.0.0.1 -M snmp
18317 Debian-sn 20 0 78412 51928 6568 S 1.3 0.2 1h00:26 /usr/sbin/snmpd -LS6d -Lf /dev/null -u Debian-snmp -g Debian-snmp -I -smux mteTrigger mteTriggerConf -f
1458 frr 15 -5 1063M 571M 5424 S 0.6 1.8 5h57:35 /usr/lib/frr/zebra -d -A 127.0.0.1 -s 90000000 -M snmp
1467 frr 15 -5 3037M 2861M 6912 S 0.6 8.9 5h24:19 /usr/lib/frr/bgpd -d -A 127.0.0.1 -M snmp
18013 root 20 0 12968 4504 3100 R 0.0 0.0 0:00.05 htop
1470 frr 15 -5 80156 13992 5644 S 0.0 0.0 55:21.22 /usr/lib/frr/ospfd -d -A 127.0.0.1 -M snmp
9503 root 20 0 0 0 0 I 0.0 0.0 0:00.51 kworker/0:1-events_power_efficient
10 root 20 0 0 0 0 I 0.0 0.0 22:41.26 rcu_sched
16165 root 20 0 0 0 0 I 0.0 0.0 0:00.02 kworker/u8:3-bond1
1401 root 15 -5 38296 3248 2064 S 0.0 0.0 4:50.01 /usr/lib/frr/watchfrr -d zebra bgpd ospfd ospf6d staticd
1459 frr 15 -5 1063M 571M 5424 S 0.0 1.8 14:41.71 /usr/lib/frr/zebra -d -A 127.0.0.1 -s 90000000 -M snmp
31468 ntp 20 0 80224 3956 3372 S 0.0 0.0 24:06.20 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 107:111
1 root 20 0 49760 7664 5460 S 0.0 0.0 7:14.45 /sbin/init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.26 kthreadd
3 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 rcu_gp
4 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 rcu_par_gp
...
perf top -K --call-graph=dwarf -p $bgpdPID:
Samples: 555 of event 'cycles:ppp', 4000 Hz, Event count (approx.): 258603915
Children Self Shared Object Symbol
+ 72.71% 30.72% libfrr.so.0.0.0 [.] thread_fetch
+ 17.04% 3.21% libc-2.28.so [.] __clock_gettime
+ 15.39% 9.21% libc-2.28.so [.] __poll
+ 14.78% 11.74% [vdso] [.] __vdso_clock_gettime
+ 10.33% 10.33% libpthread-2.28.so [.] __pthread_mutex_lock
+ 5.00% 5.00% libfrr.so.0.0.0 [.] _init
+ 4.16% 4.16% libc-2.28.so [.] 0x00000000001067c7
+ 3.73% 3.73% libfrr.so.0.0.0 [.] do_thread_cancel
+ 3.49% 3.49% libfrr.so.0.0.0 [.] list_delete_all_node
+ 3.38% 0.00% libfrr.so.0.0.0 [.] thread_process_io (inlined)
+ 2.35% 0.00% libfrr.so.0.0.0 [.] fd_poll (inlined)
+ 2.35% 0.00% libfrr.so.0.0.0 [.] poll (inlined)
+ 2.25% 0.00% libpthread-2.28.so [.] __GI___pthread_mutex_lock (inlined)
+ 2.20% 2.16% libpthread-2.28.so [.] __pthread_mutex_unlock_usercnt
+ 1.87% 0.00% libfrr.so.0.0.0 [.] monotime (inlined)
+ 1.60% 1.60% [vdso] [.] 0x00000000000009f7
+ 1.53% 0.91% libfrr.so.0.0.0 [.] thread_fetch
+ 1.12% 0.00% [unknown] [.] 0xfffffe0000006000
+ 1.02% 1.02% libpthread-2.28.so [.] pthread_cond_broadcast@@GLIBC_2.3.2
0.86% 0.86% libc-2.28.so [.] 0x000000000015c6bd
+ 0.84% 0.84% libfrr.so.0.0.0 [.] thread_process_io_helper
0.83% 0.83% libfrr.so.0.0.0 [.] quagga_sigevent_process
+ 0.83% 0.83% [vdso] [.] 0x00000000000009f4
+ 0.81% 0.81% libc-2.28.so [.] 0x00000000001067b8
0.75% 0.75% libc-2.28.so [.] 0x000000000015c798
0.75% 0.00% libfrr.so.0.0.0 [.] thread_timer_wait (inlined)
0.75% 0.00% libfrr.so.0.0.0 [.] monotime_until (inlined)
+ 0.60% 0.60% [vdso] [.] 0x00000000000009f1
+ 0.43% 0.43% libc-2.28.so [.] 0x0000000000106764
+ 0.43% 0.43% libc-2.28.so [.] 0x00000000001067d4
0.38% 0.00% libpthread-2.28.so [.] __pthread_cond_broadcast (inlined)
0.38% 0.00% libfrr.so.0.0.0 [.] thread_process_timers (inlined)
0.38% 0.00% libfrr.so.0.0.0 [.] listnode_free (inlined)
0.37% 0.00% [unknown] [.] 0x0000000007aa25d2
0.37% 0.00% libc-2.28.so [.] 0x00007f155caa17c7
0.37% 0.00% [vdso] [.] 0x00007fff49acd9f7
0.36% 0.36% libc-2.28.so [.] 0x000000000015c7c3
0.36% 0.00% libfrr.so.0.0.0 [.] memcpy (inlined)
0.36% 0.00% libc-2.28.so [.] 0x00007f155caf77c3
+ 0.36% 0.36% libfrr.so.0.0.0 [.] jhash
+ 0.35% 0.35% libc-2.28.so [.] 0x0000000000106758
+ 0.34% 0.34% libfrr.so.0.0.0 [.] thread_trim_head
0.33% 0.33% libc-2.28.so [.] 0x000000000015c7cf
0.33% 0.33% libfrr.so.0.0.0 [.] access_list_apply
0.32% 0.32% libfrr.so.0.0.0 [.] jhash2
0.29% 0.29% libc-2.28.so [.] 0x000000000015c7b4
+ 0.22% 0.22% bgpd [.] peer_process
0.22% 0.22% libc-2.28.so [.] 0x00000000000811f6
+ 0.22% 0.22% libc-2.28.so [.] 0x000000000015c789
+ 0.22% 0.22% libpthread-2.28.so [.] __libc_read
0.19% 0.00% [kernel] [k] __entry_trampoline_start
0.18% 0.18% libfrr.so.0.0.0 [.] hash_iterate
0.17% 0.17% bgpd [.] subgroup_announce_check
Since upgrading to v5 (at which point I also enabled SNMP), FRR has been using 100% of one CPU core. I also see the following errors in the log:
The last entry matches the time when the load increased.
Also see https://gist.github.com/patrick7/4c47c3afa6815f25451d0fc4e33c0469