FRRouting / frr

The FRRouting Protocol Suite
https://frrouting.org/
Other
3.19k stars 1.23k forks source link

ospf6d uses all available memory after interface event #8711

Open jim-p opened 3 years ago

jim-p commented 3 years ago

Interface events (e.g. when reconfigured) in the operating system in certain cases cause ospf6d to consume all available RAM and swap. Eventually the process is killed when it hits its limit or resources run out. Once ospf6d restarts, it returns to normal.

It doesn't seem to happen with bgpd or ospfd as far as I can tell. Thus far I've only been able to make it happen with ospf6d.

There doesn't appear to be a backtrace since ospf6d itself does not crash, the OS kills the process when it fails to allocate more memory.

[X] Did you check if this is a duplicate issue? -- Yes, I saw a couple other issues for memory leaks such as #1725 but none that looked this extreme or fit the other symptoms [ ] Did you test it on the latest FRRouting/frr master branch? -- No, only 7.5.1

Logs from around one occurrence

May 20 09:14:02 hostname watchfrr[38574]: ospf6d state -> up : connect succeeded
May 20 09:14:03 hostname ospf6d[80507]: SPF: Scheduled in 0 msec
May 20 09:14:03 hostname ospf6d[80507]: SPF processing: # Areas: 1, SPF runtime: 0 sec 86 usec, Reason: R+, L+
May 20 09:14:07 hostname ospf6d[80507]: SPF: Scheduled in 0 msec
May 20 09:14:07 hostname ospf6d[80507]: SPF processing: # Areas: 1, SPF runtime: 0 sec 45 usec, Reason: R+, R-, L+
May 20 09:14:07 hostname ospf6d[80507]: SPF: Scheduled in 50 msec
May 20 09:14:07 hostname ospf6d[80507]: SPF processing: # Areas: 1, SPF runtime: 0 sec 71 usec, Reason: R+, R-
May 20 09:14:12 hostname ospf6d[80507]: SPF: Scheduled in 0 msec
May 20 09:14:12 hostname ospf6d[80507]: SPF processing: # Areas: 1, SPF runtime: 0 sec 121 usec, Reason: R+, R-
May 20 09:14:33 hostname ospf6d[80507]: SPF: Scheduled in 0 msec
May 20 09:14:33 hostname ospf6d[80507]: SPF processing: # Areas: 1, SPF runtime: 0 sec 125 usec, Reason: R+, R-
May 20 09:14:33 hostname ospf6d[80507]: Scheduling ipsec2000 for sso
May 20 09:14:33 hostname ospf6d[80507]: SPF: Scheduled in 43 msec
May 20 09:14:36 hostname ospf6d[80507]: SPF processing: # Areas: 1, SPF runtime: 0 sec 47 usec, Reason: L+, L-
May 20 09:15:02 hostname ospf6d[80507]: [EC 100663313] SLOW THREAD: task ospf6_receive (23c6c0) ran for 25260ms (cpu time 3779ms)
May 20 09:15:07 hostname watchfrr[38574]: [EC 268435457] ospf6d state -> unresponsive : no response yet to ping sent 30 seconds ago
May 20 09:15:36 hostname watchfrr[38574]: [EC 268435457] ospf6d state -> down : read returned EOF
May 20 09:15:40 hostname ospf6d[69265]: area_range: for prefix 2001:db8:1:ee30::/64, flag = 0
May 20 09:15:40 hostname ospf6d[69265]: area_range: for prefix 2001:db8:f15:110::/64, flag = 0
May 20 09:15:40 hostname ospf6d[69265]: Scheduling lagg0.4091 for sso
May 20 09:15:40 hostname ospf6d[69265]: Scheduling ipsec2000 for sso
May 20 09:15:40 hostname watchfrr[38574]: ospf6d state -> up : connect succeeded
May 20 09:15:41 hostname ospf6d[69265]: SPF: Scheduled in 0 msec
May 20 09:15:41 hostname ospf6d[69265]: SPF processing: # Areas: 1, SPF runtime: 0 sec 86 usec, Reason: R+, L+
May 20 09:15:47 hostname ospf6d[69265]: SPF: Scheduled in 0 msec
May 20 09:15:47 hostname ospf6d[69265]: SPF processing: # Areas: 1, SPF runtime: 0 sec 44 usec, Reason: R+, R-, L+
May 20 09:15:47 hostname ospf6d[69265]: SPF: Scheduled in 50 msec
May 20 09:15:47 hostname ospf6d[69265]: SPF processing: # Areas: 1, SPF runtime: 0 sec 54 usec, Reason: R+, R-
May 20 09:15:52 hostname ospf6d[69265]: SPF: Scheduled in 0 msec
May 20 09:15:52 hostname ospf6d[69265]: SPF processing: # Areas: 1, SPF runtime: 0 sec 80 usec, Reason: R+, R-

FRR configuration

frr defaults traditional
hostname xxxxxxxxx
password yyyyyyyyy
log syslog
service integrated-vtysh-config
!
ip router-id 10.15.0.1
!
interface lagg0.4091
 ip ospf area 0.0.0.15
 ipv6 ospf6 passive
interface ipsec2000
 ip ospf mtu-ignore
 ip ospf area 0.0.0.0
 ipv6 ospf6 cost 10
 ipv6 ospf6 mtu-ignore
!
router ospf
 ospf router-id 10.15.0.1
 log-adjacency-changes detail
 passive-interface lagg0.4091
 area 0.0.0.0 shortcut default
 area 0.0.0.15 stub
 area 0.0.0.15 shortcut default
 area 0.0.0.15 range 10.15.0.0/16
!
router ospf6
 ospf6 router-id 10.15.0.1
 area 0.0.0.0 range 2001:db8:1:ee30::/64 cost 1
 area 0.0.0.0 range 2001:db8:f15:110::/64 cost 10
 interface lagg0.4091 area 0.0.0.0
 interface ipsec2000 area 0.0.0.0
!
line vty
!
end

In this instance it happens when ipsec2000 has an event (e.g. the IPsec daemon is stopped and started), but the same condition occurs without IPsec when using ospf6d on regular interfaces.

To Reproduce

Configure OSPF6 on an interface, then reconfigure that interface in the operating system in various ways. It doesn't seem to happen for everything, e.g. a simple down/up doesn't trigger it, but restarting IPsec (if bound to an IPsec VTI interface) or reconfiguring the interface addresses seems to do it.

Expected behavior

Expected behavior is for it to recover from the condition without consuming all available memory, as ospfd and bgpd do.

Versions

FreeBSD 12.2-STABLE. Specifically current pfSense development snapshots of either CE 2.6.0 or Plus 21.05. It is not specific to those, but that's the easiest way I've found to reproduce it.

FreeBSD xxxxxxx 12.2-STABLE FreeBSD 12.2-STABLE plus-RELENG_21_05-n202546-902df674b682 pfSense amd64

See above, FreeBSD 12.2-STABLE

FRR version 7.5.1

Additional context

During the event, watching top -a -o res shows ospf6d consuming all the memory and swap on the system, even on a system with 8G RAM and 2G swap. At the start of the event it briefly uses 100% of a CPU core but then the CPU usage settles down and the RAM usage increases. This was copied from the console just before it ran out of memory:

last pid: 47737;  load averages:  0.61,  0.59,  0.47                                                                             up 0+00:26:48  09:15:22
72 processes:  2 running, 70 sleeping
CPU:  0.3% user,  0.0% nice,  8.8% system,  1.7% interrupt, 89.3% idle
Mem: 5551M Active, 644K Inact, 1214M Laundry, 878M Wired, 348K Buf, 216M Free
ARC: 545M Total, 182M MFU, 349M MRU, 118K Anon, 2732K Header, 12M Other
     299M Compressed, 635M Uncompressed, 2.13:1 Ratio
Swap: 2048M Total, 2024M Used, 24M Free, 98% Inuse, 52M In, 61M Out

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
80507 root          1  23    0  8222M  5227M CPU1     1   0:08   9.60% /usr/local/sbin/ospf6d -d

On systems with less RAM, the process dies much sooner since it doesn't take as long for it to consume all of the available resources.

So far I've been able to reproduce the issue on several different systems of varying size, on amd64 and ARM, hardware of various vintages, as well as VMs (ESX and KVM). It doesn't seem specific to anything in the OS or platform that I can tell.

qlyoung commented 3 years ago

@pjdruddy is working on code that might fix this.

Waffle2 commented 2 years ago

I can confirm that this is still an issue. It is happening regularly on my system, with frr-7.5.1