FRRouting / frr

The FRRouting Protocol Suite
https://frrouting.org/
Other
3.31k stars 1.25k forks source link

Backtrace in bgpd/rpki #5440

Closed dteach-rv closed 4 years ago

dteach-rv commented 4 years ago

Backtrace:

Nov 24 12:22:38 frr bgpd[20257]: Received signal 11 at 1574616158 (si_addr 0x2, PC 0x46cdc3); aborting...
Nov 24 12:22:38 frr bgpd[20257]: Backtrace for 11 stack frames:
Nov 24 12:22:38 frr bgpd[20257]: /lib64/libfrr.so.0(zlog_backtrace_sigsafe+0x67) [0x7fd1ad445957]
Nov 24 12:22:38 frr bgpd[20257]: /lib64/libfrr.so.0(zlog_signal+0x113) [0x7fd1ad445db3]1ad445957]
Nov 24 12:22:38 frr bgpd[20257]: /lib64/libfrr.so.0(+0x70e65) [0x7fd1ad465e65]ad445db3]1ad445957]
Nov 24 12:22:38 frr bgpd[20257]: /lib64/libpthread.so.0(+0xf5f0) [0x7fd1abd605f0]45db3]1ad445957]
Nov 24 12:22:38 frr bgpd[20257]: /usr/lib/frr/bgpd(bgp_table_range_lookup+0x63) [0x46cdc3]445957]
Nov 24 12:22:38 frr bgpd[20257]: /usr/lib64/frr/modules/bgpd_rpki.so(+0x4f0d) [0x7fd1a934ff0d]57]
Nov 24 12:22:38 frr bgpd[20257]: /lib64/libfrr.so.0(thread_call+0x60) [0x7fd1ad4736e0]934ff0d]57]
Nov 24 12:22:38 frr bgpd[20257]: /lib64/libfrr.so.0(frr_run+0x128) [0x7fd1ad443ab8]e0]934ff0d]57]
Nov 24 12:22:38 frr bgpd[20257]: /usr/lib/frr/bgpd(main+0x2e3) [0x41c043]1ad443ab8]e0]934ff0d]57]
Nov 24 12:22:38 frr bgpd[20257]: /lib64/libc.so.6(__libc_start_main+0xf5) [0x7fd1ab9a5505]f0d]57]
Nov 24 12:22:38 frr bgpd[20257]: /usr/lib/frr/bgpd() [0x41d9bb]main+0xf5) [0x7fd1ab9a5505]f0d]57]
Nov 24 12:22:38 frr bgpd[20257]: in thread bgpd_sync_callback scheduled from bgpd/bgp_rpki.c:351#012; aborting...
Nov 24 12:22:38 frr watchfrr[6779]: [EC 268435457] bgpd state -> down : read returned EOF
Nov 24 12:22:38 frr zebra[5952]: [EC 4043309116] Client 'bgp' encountered an error and is shutting down.
Nov 24 12:22:38 frr zebra[5952]: zebra/zebra_ptm.c:1345 failed to find process pid registration
Nov 24 12:22:38 frr zebra[5952]: client 15 disconnected. 0 bgp routes removed from the rib

Daemons config:

bgpd=yes
ospfd=no
ospf6d=no
ripd=no
ripngd=no
isisd=no
pimd=no
ldpd=no
nhrpd=no
eigrpd=no
babeld=no
sharpd=no
pbrd=no
bfdd=no
fabricd=no
vrrpd=no

#
# If this option is set the /etc/init.d/frr script automatically loads
# the config via "vtysh -b" when the servers are started.
# Check /etc/pam.d/frr if you intend to use "vtysh"!
#
vtysh_enable=yes
zebra_options="  -A 127.0.0.1 -s 90000000"
bgpd_options="   -P 23 -Z -M bmp -M rpki"
ospfd_options="  -A 127.0.0.1"
ospf6d_options=" -A ::1"
ripd_options="   -A 127.0.0.1"
ripngd_options=" -A ::1"
isisd_options="  -A 127.0.0.1"
pimd_options="   -A 127.0.0.1"
ldpd_options="   -A 127.0.0.1"
nhrpd_options="  -A 127.0.0.1"
eigrpd_options=" -A 127.0.0.1"
babeld_options=" -A 127.0.0.1"
sharpd_options=" -A 127.0.0.1"
pbrd_options="   -A 127.0.0.1"
staticd_options="-A 127.0.0.1"
bfdd_options="   -A 127.0.0.1"
fabricd_options="-A 127.0.0.1"
vrrpd_options="  -A 127.0.0.1"

Config:

Building configuration...

Current configuration:
!
frr version 7.2-rv
frr defaults traditional
hostname frr.routeviews.org
log syslog
log facility local5
service password-encryption
no ip forwarding
no ipv6 forwarding
log commands
service password-encryption
dump bgp updates-et /mnt/storage/bgpdata/%Y.%m/UPDATES/updates.%Y%m%d.%H%M 15m
dump bgp routes-mrt /mnt/storage/bgpdata/%Y.%m/RIBS/rib.%Y%m%d.%H%M 2h
rpki
  rpki polling_period 3600
  rpki cache 184.171.101.187 3323 preference 1
  rpki cache 184.171.101.188 3323 preference 2
  exit
service password-encryption
!
!
ip route 128.223.253.10/32 128.223.51.1 eno2
ip route 128.223.253.9/32 128.223.51.1 eno2
ip route 128.223.51.106/32 128.223.51.1 eno2
ip route 128.223.51.74/32 128.223.51.1 eno2
ip route 128.223.51.75/32 128.223.51.1 eno2
ip route 128.223.51.76/32 128.223.51.1 eno2
ip route 128.223.51.78/32 128.223.51.1 eno2
ipv6 route 2001:468:d01:33::80df:106/128 2001:468:d01:33::1 eno2
ipv6 route 2001:468:d01:33::80df:74/128 2001:468:d01:33::1 eno2
ipv6 route 2001:468:d01:33::80df:75/128 2001:468:d01:33::1 eno2
ipv6 route 2001:468:d01:33::80df:76/128 2001:468:d01:33::1 eno2
ipv6 route 2001:468:d01:33::80df:78/128 2001:468:d01:33::1 eno2
ipv6 route 2001:468:d01:fd::9/128 2001:468:d01:33::1 eno2
ipv6 route 2001:468:d01:fd::a/128 2001:468:d01:33::1 eno2
!
router bgp 65123
 bgp router-id 128.223.51.23
 bgp log-neighbor-changes
 no bgp default ipv4-unicast
 bgp disable-ebgp-connected-route-check
 neighbor 128.223.253.9 remote-as 3582
 neighbor 128.223.253.9 description uonet9-gw
 neighbor 128.223.253.9 ebgp-multihop 10
 neighbor 128.223.253.10 remote-as 3582
 neighbor 128.223.253.10 description uonet10-gw
 neighbor 128.223.253.10 ebgp-multihop 10
 neighbor 2001:468:d01:fd::9 remote-as 3582
 neighbor 2001:468:d01:fd::9 description uonet9-gw
 neighbor 2001:468:d01:fd::9 ebgp-multihop 10
 neighbor 2001:468:d01:fd::a remote-as 3582
 neighbor 2001:468:d01:fd::a description uonet10-gw
 neighbor 2001:468:d01:fd::a ebgp-multihop 10
 !
 address-family ipv4 unicast
  neighbor 128.223.253.9 activate
  neighbor 128.223.253.9 route-map nothing out
  neighbor 128.223.253.10 activate
  neighbor 128.223.253.10 route-map nothing out
  table-map nothing
 exit-address-family
 !
 address-family ipv4 multicast
  table-map nothing
 exit-address-family
 !
 address-family ipv6 unicast
  neighbor 2001:468:d01:fd::9 activate
  neighbor 2001:468:d01:fd::9 route-map nothing out
  neighbor 2001:468:d01:fd::a activate
  neighbor 2001:468:d01:fd::a route-map nothing out
  table-map nothing
 exit-address-family
 !
 bmp targets rvbmp
  bmp stats interval 900000
  bmp monitor ipv4 unicast post-policy
  bmp monitor ipv6 unicast post-policy
  bmp connect bmp1.routeviews.org port 5000 min-retry 5000 max-retry 15000
  bmp connect bmp2.routeviews.org port 5000 min-retry 5000 max-retry 15000
!
route-map invalid permit 1
 match rpki invalid
!
route-map notfound permit 1
 match rpki notfound
!
route-map nothing deny 1
!
route-map valid permit 1
 match rpki valid
!
line vty
 no login
!
end

Issue: FRR backtraces when configured with RPKI validators. I haven't been able to identify a specific time period or condition that causes it.


(put "x" in "[ ]" if you already tried following) [x] Did you check if this is a duplicate issue? [ ] Did you test it on the latest FRRouting/frr master branch?

To Reproduce Steps to reproduce the behavior: I haven't been able to narrow down the specific steps to reproduce. We are running two different versions of RPKI validators: NLnet Labs routinator https://github.com/NLnetLabs/routinator and RIPE-NCC https://github.com/RIPE-NCC/rpki-validator-3/wiki

Expected behavior No backtrace on RPKI deployment.

Screenshots If applicable, add screenshots to help explain your problem.

Versions

Additional context Add any other context about the problem here.

mroethke commented 4 years ago

The fault happened in the revalidation codepath which is triggered by updates from the rpki side, can you check if that is reproducible? The easiest way to do that is to take gortr and point it to a custom json file. Something like this:

{
  "roas": [
    {
      "prefix": "10.0.0.0/16",
      "maxLength": 24,
      "asn": "AS64512"
    },
    {
      "prefix": "10.1.0.0/16",
      "maxLength": 24,
      "asn": "AS64512"
    }
  ]
}

Start gortr with these arguments -cache $PATH_TO_JSON_FILE -verify=false -refresh 1 -checktime=false. Now configure frr to use this cache server and add/remove roas to the json file. You can verify with the vtysh command show rpki prefix-table that the changes were received.

I furthermore noticed some oddities in the backtrace. It seems like the fault happened because somewhere in bgp_table_range_lookup it tried to deref a pointer with the value 0x2. And the libraries referenced in the backtrace seem to be all over the place. I would expect at least the frr libraries and modules to be in the same lib folder. Is it possible that you are mixing frr binaries from different compilation runs or sources?

crami commented 4 years ago

Probably hit the same issue:

Dec  2 13:01:57 bluepill bgpd[26680]: vty[??]@> enable
Dec  2 13:01:57 bluepill bgpd[26680]: vty[??]@# show bgp ipv4 large-community-list LCL-RPKI-Invalid json
Dec  2 13:01:59 bluepill zebra[6917]: vty[??]@> echo PING
Dec  2 13:01:59 bluepill zebra[6917]: vty[??]@> enable
Dec  2 13:01:59 bluepill bgpd[26680]: vty[??]@> enable
Dec  2 13:01:59 bluepill bgpd[26680]: vty[??]@# show bgp ipv6 large-community-list LCL-RPKI-Invalid json
Dec  2 13:01:59 bluepill zebra[6917]: vty[??]@> enable
Dec  2 13:01:59 bluepill bgpd[26680]: vty[??]@> enable
Dec  2 13:01:59 bluepill bgpd[26680]: vty[??]@# show bgp ipv4 large-community-list LCL-RPKI-Valid json
Dec  2 13:02:04 bluepill zebra[6917]: vty[??]@> echo PING
Dec  2 13:02:09 bluepill zebra[6917]: vty[??]@> echo PING
Dec  2 13:02:11 bluepill bgpd[26680]: [EC 100663313] SLOW COMMAND: command took 11474ms (cpu time 11450ms): show bgp ipv4 large-community-list LCL-RPKI-Valid json
Dec  2 13:02:11 bluepill bgpd[26680]: [EC 100663313] SLOW THREAD: task vtysh_read (7fea23434ab0) ran for 11474ms (cpu time 11454ms)
Dec  2 13:02:11 bluepill bgpd[26680]: vty[??]@> echo PING
Dec  2 13:02:11 bluepill bgpd[26680]: Received signal 11 at 1575288131 (si_addr 0x2, PC 0x56017b5c2e63); aborting...#012<FE>^?
Dec  2 13:02:11 bluepill bgpd[26680]: Backtrace for 11 stack frames:
Dec  2 13:02:11 bluepill bgpd[26680]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(zlog_backtrace_sigsafe+0x60) [0x7fea233fe460]
Dec  2 13:02:11 bluepill bgpd[26680]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(zlog_signal+0x10c) [0x7fea233fe8dc]a233fe460]
Dec  2 13:02:11 bluepill bgpd[26680]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(+0x724a4) [0x7fea2341f4a4]233fe8dc]a233fe460]
Dec  2 13:02:11 bluepill bgpd[26680]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730) [0x7fea23107730]4a4]233fe8dc]a233fe460]
Dec  2 13:02:11 bluepill bgpd[26680]: /usr/lib/frr/bgpd(bgp_table_range_lookup+0x63) [0x56017b5c2e63]]4a4]233fe8dc]a233fe460]
Dec  2 13:02:11 bluepill bgpd[26680]: /usr/lib/x86_64-linux-gnu/frr/modules/bgpd_rpki.so(+0x58e8) [0x7fea22bfd8e8]]a233fe460]
Dec  2 13:02:11 bluepill bgpd[26680]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(thread_call+0x56) [0x7fea2342cba6]]a233fe460]
Dec  2 13:02:11 bluepill bgpd[26680]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(frr_run+0xd8) [0x7fea233fc668]ba6]]a233fe460]
Dec  2 13:02:11 bluepill bgpd[26680]: /usr/lib/frr/bgpd(main+0x335) [0x56017b567ba5]run+0xd8) [0x7fea233fc668]ba6]]a233fe460]
Dec  2 13:02:11 bluepill bgpd[26680]: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7fea22f5809b]ba6]]a233fe460]
Dec  2 13:02:11 bluepill bgpd[26680]: /usr/lib/frr/bgpd(_start+0x2a) [0x56017b5693ea]in+0xeb) [0x7fea22f5809b]ba6]]a233fe460]
Dec  2 13:02:11 bluepill bgpd[26680]: in thread bgpd_sync_callback scheduled from bgpd/bgp_rpki.c:351#012c2e63); aborting...#012<FE>^?
Dec  2 13:02:11 bluepill watchfrr[6900]: [EC 268435457] bgpd state -> down : read returned EOF
Dec  2 13:02:11 bluepill zebra[6917]: [EC 4043309120] Client 'bgp' encountered an error and is shutting down.
Dec  2 13:02:11 bluepill zebra[6917]: [EC 4043309120] Client 'vnc' encountered an error and is shutting down.
Dec  2 13:02:11 bluepill zebra[6917]: zebra/zebra_ptm.c:1345 failed to find process pid registration
Dec  2 13:02:12 bluepill zebra[6917]: client 19 disconnected. 865213 bgp routes removed from the rib
Dec  2 13:02:12 bluepill zebra[6917]: client 32 disconnected. 0 vnc routes removed from the rib
Dec  2 13:02:14 bluepill zebra[6917]: vty[??]@> echo PING
Dec  2 13:02:16 bluepill watchfrr[6900]: [EC 100663303] Forked background command [pid 28171]: /usr/lib/frr/watchfrr.sh restart bgpd
dteach-rv commented 4 years ago

Hey,

Thanks for the quick response! Will this fix get back-ported to the stable branches? I'm happy to test if we can get it into stable/7.2.

donaldsharp commented 4 years ago

I am not sure yet that the problem has been fixed experiments are still going on.

dteach-rv commented 4 years ago

okay, I'll keep an eye out here. Thanks.

donaldsharp commented 4 years ago

@dteach-rv I went and looked at the test bed I was using and decided that this was good enough and pushed a PR for 7.2 into #5454 Take a look at that.

dteach-rv commented 4 years ago

great. I'll get this built tomorrow and see how it goes.