dteach-rv closed this issue 4 years ago
The fault happened in the revalidation code path, which is triggered by updates from the RPKI side. Can you check whether it is reproducible? The easiest way to do that is to point gortr at a custom JSON file, something like this:
{
  "roas": [
    {
      "prefix": "10.0.0.0/16",
      "maxLength": 24,
      "asn": "AS64512"
    },
    {
      "prefix": "10.1.0.0/16",
      "maxLength": 24,
      "asn": "AS64512"
    }
  ]
}
Start gortr with these arguments: -cache $PATH_TO_JSON_FILE -verify=false -refresh 1 -checktime=false
Now configure FRR to use this cache server, then add and remove ROAs in the JSON file. You can verify that the changes were received with the vtysh command show rpki prefix-table.
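The add/remove step can be scripted so that the revalidation path is exercised repeatedly. The following is only a minimal sketch: the helper names (load_roas, save_roas, add_roa, remove_roa) are hypothetical and not part of gortr or FRR, and it assumes the file has the same top-level "roas" layout as the example above.

```python
import json
import os
from pathlib import Path


def load_roas(path):
    """Return the list stored under the top-level "roas" key (empty if the file is missing)."""
    p = Path(path)
    if not p.exists():
        return []
    return json.loads(p.read_text())["roas"]


def save_roas(path, roas):
    """Write the ROA list back in the layout shown in the example file."""
    p = Path(path)
    # Write to a sibling temp file and rename it into place, so a reader
    # that polls the file does not see a half-written document.
    tmp = p.parent / (p.name + ".tmp")
    tmp.write_text(json.dumps({"roas": roas}, indent=2) + "\n")
    os.replace(tmp, p)


def add_roa(path, prefix, max_length, asn):
    """Append a ROA unless an identical entry is already present."""
    roas = load_roas(path)
    entry = {"prefix": prefix, "maxLength": max_length, "asn": asn}
    if entry not in roas:
        roas.append(entry)
    save_roas(path, roas)
    return roas


def remove_roa(path, prefix, asn):
    """Drop every ROA that matches the given prefix and ASN."""
    roas = [
        r for r in load_roas(path)
        if not (r["prefix"] == prefix and r["asn"] == asn)
    ]
    save_roas(path, roas)
    return roas
```

For example, calling add_roa("roas.json", "10.2.0.0/16", 24, "AS64512") and then remove_roa("roas.json", "10.2.0.0/16", "AS64512") in a loop, with a pause longer than the gortr refresh interval, should produce a steady stream of updates to check with show rpki prefix-table.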
I also noticed some oddities in the backtrace. The fault seems to have happened because somewhere in bgp_table_range_lookup a pointer with the value 0x2 was dereferenced. The libraries referenced in the backtrace also seem to be all over the place; I would expect at least the FRR libraries and modules to be in the same lib folder. Is it possible that you are mixing FRR binaries from different compilation runs or sources?
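One rough way to look at the library-path question is to group the backtrace frames by the directory of the object they come from. The sketch below is an illustration only: the function name group_frames_by_dir and the regular expression are assumptions based on the "path(symbol+offset) [address]" frame format that appears in these logs, not an FRR tool.

```python
import re
from collections import defaultdict

# Matches frames such as:
#   /usr/lib/frr/bgpd(bgp_table_range_lookup+0x63) [0x56017b5c2e63]
FRAME_RE = re.compile(r"(/[^\s(]+)\(([^)]*)\)\s+\[(0x[0-9a-fA-F]+)\]")


def group_frames_by_dir(log_text):
    """Map each directory seen in the backtrace to the sorted object names loaded from it."""
    dirs = defaultdict(set)
    for match in FRAME_RE.finditer(log_text):
        directory, _, name = match.group(1).rpartition("/")
        dirs[directory].add(name)
    return {d: sorted(names) for d, names in dirs.items()}
```

Several directories in the result are not proof of a mixed installation, since a packaged build may legitimately place the daemon, libfrr and the modules in different folders. The output is only a starting point for comparing against what the installed package is expected to ship.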
Probably hit the same issue:
Dec 2 13:01:57 bluepill bgpd[26680]: vty[??]@> enable
Dec 2 13:01:57 bluepill bgpd[26680]: vty[??]@# show bgp ipv4 large-community-list LCL-RPKI-Invalid json
Dec 2 13:01:59 bluepill zebra[6917]: vty[??]@> echo PING
Dec 2 13:01:59 bluepill zebra[6917]: vty[??]@> enable
Dec 2 13:01:59 bluepill bgpd[26680]: vty[??]@> enable
Dec 2 13:01:59 bluepill bgpd[26680]: vty[??]@# show bgp ipv6 large-community-list LCL-RPKI-Invalid json
Dec 2 13:01:59 bluepill zebra[6917]: vty[??]@> enable
Dec 2 13:01:59 bluepill bgpd[26680]: vty[??]@> enable
Dec 2 13:01:59 bluepill bgpd[26680]: vty[??]@# show bgp ipv4 large-community-list LCL-RPKI-Valid json
Dec 2 13:02:04 bluepill zebra[6917]: vty[??]@> echo PING
Dec 2 13:02:09 bluepill zebra[6917]: vty[??]@> echo PING
Dec 2 13:02:11 bluepill bgpd[26680]: [EC 100663313] SLOW COMMAND: command took 11474ms (cpu time 11450ms): show bgp ipv4 large-community-list LCL-RPKI-Valid json
Dec 2 13:02:11 bluepill bgpd[26680]: [EC 100663313] SLOW THREAD: task vtysh_read (7fea23434ab0) ran for 11474ms (cpu time 11454ms)
Dec 2 13:02:11 bluepill bgpd[26680]: vty[??]@> echo PING
Dec 2 13:02:11 bluepill bgpd[26680]: Received signal 11 at 1575288131 (si_addr 0x2, PC 0x56017b5c2e63); aborting...
Dec 2 13:02:11 bluepill bgpd[26680]: Backtrace for 11 stack frames:
Dec 2 13:02:11 bluepill bgpd[26680]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(zlog_backtrace_sigsafe+0x60) [0x7fea233fe460]
Dec 2 13:02:11 bluepill bgpd[26680]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(zlog_signal+0x10c) [0x7fea233fe8dc]
Dec 2 13:02:11 bluepill bgpd[26680]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(+0x724a4) [0x7fea2341f4a4]
Dec 2 13:02:11 bluepill bgpd[26680]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730) [0x7fea23107730]
Dec 2 13:02:11 bluepill bgpd[26680]: /usr/lib/frr/bgpd(bgp_table_range_lookup+0x63) [0x56017b5c2e63]
Dec 2 13:02:11 bluepill bgpd[26680]: /usr/lib/x86_64-linux-gnu/frr/modules/bgpd_rpki.so(+0x58e8) [0x7fea22bfd8e8]
Dec 2 13:02:11 bluepill bgpd[26680]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(thread_call+0x56) [0x7fea2342cba6]
Dec 2 13:02:11 bluepill bgpd[26680]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(frr_run+0xd8) [0x7fea233fc668]
Dec 2 13:02:11 bluepill bgpd[26680]: /usr/lib/frr/bgpd(main+0x335) [0x56017b567ba5]
Dec 2 13:02:11 bluepill bgpd[26680]: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7fea22f5809b]
Dec 2 13:02:11 bluepill bgpd[26680]: /usr/lib/frr/bgpd(_start+0x2a) [0x56017b5693ea]
Dec 2 13:02:11 bluepill bgpd[26680]: in thread bgpd_sync_callback scheduled from bgpd/bgp_rpki.c:351
Dec 2 13:02:11 bluepill watchfrr[6900]: [EC 268435457] bgpd state -> down : read returned EOF
Dec 2 13:02:11 bluepill zebra[6917]: [EC 4043309120] Client 'bgp' encountered an error and is shutting down.
Dec 2 13:02:11 bluepill zebra[6917]: [EC 4043309120] Client 'vnc' encountered an error and is shutting down.
Dec 2 13:02:11 bluepill zebra[6917]: zebra/zebra_ptm.c:1345 failed to find process pid registration
Dec 2 13:02:12 bluepill zebra[6917]: client 19 disconnected. 865213 bgp routes removed from the rib
Dec 2 13:02:12 bluepill zebra[6917]: client 32 disconnected. 0 vnc routes removed from the rib
Dec 2 13:02:14 bluepill zebra[6917]: vty[??]@> echo PING
Dec 2 13:02:16 bluepill watchfrr[6900]: [EC 100663303] Forked background command [pid 28171]: /usr/lib/frr/watchfrr.sh restart bgpd
Hey,
Thanks for the quick response! Will this fix get back-ported to the stable branches? I'm happy to test if we can get it into stable/7.2.
I am not sure yet that the problem has been fixed; experiments are still going on.
okay, I'll keep an eye out here. Thanks.
@dteach-rv I went and looked at the test bed I was using, decided that this was good enough, and pushed a PR for 7.2 as #5454. Take a look at that.
great. I'll get this built tomorrow and see how it goes.
Backtrace:
Daemons config:
Config:
Issue: FRR backtraces when configured with RPKI validators. I haven't been able to identify a specific time period or condition that causes it.
(put "x" in "[ ]" if you already tried following)
[x] Did you check if this is a duplicate issue?
[ ] Did you test it on the latest FRRouting/frr master branch?
To Reproduce: I haven't been able to narrow down specific steps to reproduce. We are running two different RPKI validators: NLnet Labs routinator (https://github.com/NLnetLabs/routinator) and the RIPE-NCC validator (https://github.com/RIPE-NCC/rpki-validator-3/wiki).
Expected behavior: No backtrace on RPKI deployment.
Versions