EIGRP: Assertion `successors' failed in file eigrp_fsm.c, line 361, function eigrp_fsm_event_nq_fcn

lunn commented 7 years ago

I have a simple setup with a chain of six nodes running EIGRP, using the CORE network simulator. I reliably get an assertion failure when i start the simulation and the nodes start talking to each other.

screenshot_2017-08-09_12-15-32

n4 reliably fails thus:

2017/08/09 11:53:19 EIGRP: Processing Update size[160] int(eth0) nbr(10.0.2.1) seq [38] flags [8] 2017/08/09 11:53:19 EIGRP: EIGRP AS: 64512 State: 0 Event: 0 Network: 10.0.0.0

2017/08/09 11:53:19 EIGRP: Assertion `successors' failed in file eigrp_fsm.c, line 361, function eigrp_fsm_event_nq_fcn 2017/08/09 11:53:19 EIGRP: Backtrace for 11 stack frames: 2017/08/09 11:53:19 EIGRP: [bt 0] /usr/local/lib/libfrr.so.0(zlog_backtrace+0x2b) [0x7fd7c928a5b4] 2017/08/09 11:53:19 EIGRP: [bt 1] /usr/local/lib/libfrr.so.0(_zlog_assert_failed+0x9d) [0x7fd7c928aa27] 2017/08/09 11:53:19 EIGRP: [bt 2] /usr/local/sbin/eigrpd(eigrp_fsm_event_nq_fcn+0x42) [0x55b3b12e65bf] 2017/08/09 11:53:19 EIGRP: [bt 3] /usr/local/sbin/eigrpd(eigrp_fsm_event+0x55) [0x55b3b12e6a32] 2017/08/09 11:53:19 EIGRP: [bt 4] /usr/local/sbin/eigrpd(eigrp_update_receive+0x2dd) [0x55b3b12e845a] 2017/08/09 11:53:19 EIGRP: [bt 5] /usr/local/sbin/eigrpd(eigrp_read+0x6ad) [0x55b3b12e4d1e] 2017/08/09 11:53:19 EIGRP: [bt 6] /usr/local/lib/libfrr.so.0(thread_call+0x46) [0x7fd7c92a256a] 2017/08/09 11:53:19 EIGRP: [bt 7] /usr/local/lib/libfrr.so.0(frr_run+0xab) [0x7fd7c92890b5] 2017/08/09 11:53:19 EIGRP: [bt 8] /usr/local/sbin/eigrpd(main+0x11d) [0x55b3b12ddf16] 2017/08/09 11:53:19 EIGRP: [bt 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1) [0x7fd7c7f642e1] 2017/08/09 11:53:19 EIGRP: [bt 10] /usr/local/sbin/eigrpd(_start+0x2a) [0x55b3b12de2ca] 2017/08/09 11:53:19 EIGRP: Current thread function eigrp_read, scheduled from file eigrp_packet.c, line 478

Attached is a zipball with logfiles and configuration files for all nodes in the network. Take a look at n4.conf/var.log/eigrpd.log

pycore.43981.zip

lunn commented 7 years ago

As is often the case, i think there are multiple issues here. I've not got to the bottom of this yet, but....

There is an entry in the table node eigrp_topology_get_successor() is searching, but the EIGRP_NEIGHBOR_ENTRY_SUCCESSOR_FLAG flag is not set. The flag was sent when the entry was created in eigrp_update_receive(), however the metric is getting set to EIGRP_MAX_METRIC. This appears to mean that when eigrp_topology_update_node_flags() runs, it clears the EIGRP_NEIGHBOR_ENTRY_SUCCESSOR_FLAG.

There appears to be an assumption in the code that if there is a table, there must be a successor. This is not true when the filtering logic decides to filter out a prefix by setting its metric to EIGRP_MAX_METRIC when adding it to the table. Maybe filtered prefixes should not be added to the list at all?

I'm still digging into why the metric is being set to EIGRP_MAX_METRIC in my case. I don't have any prefix filters set...

donaldsharp commented 7 years ago

@riw777 to take a look it

lunn commented 7 years ago

Here is the patch i ended up with. But see the comment after the --- marker. Ah, this also depends on another patch which did a globe: s/EIGRP_MAX_METRIC/EIGRP_METRIC_UNREACHABLE/g

From 9392fa1e0d59d81acb577d5bbb30ffb65949b08d Mon Sep 17 00:00:00 2001
From: Andrew Lunn <andrew@lunn.ch>
Date: Fri, 11 Aug 2017 14:48:56 -0500
Subject: [PATCH] eigrpd: Don't transition out of passive for unreachable

A neighbour can send an update message with a route of unreachable
metric. Such a route should not take the state machine out of passive,
since such a router is not usable.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
---

I'm still stumbling around in the dark with this code base. Somebody
needs to review this change.

I'm not sure this is the correct change. Should we even be adding a
neighbour entry for a route which is unreachable? Maybe it would be
better to simply not add the entry, rather than add it, do nothing,
and then let it be removed the next time
eigrp_update_topology_table_prefix() is called?
---
 eigrpd/eigrp_fsm.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/eigrpd/eigrp_fsm.c b/eigrpd/eigrp_fsm.c
index 852362e1921c..0dc95c5abbf5 100644
--- a/eigrpd/eigrp_fsm.c
+++ b/eigrpd/eigrp_fsm.c
@@ -207,6 +207,10 @@ int eigrp_get_fsm_event(struct eigrp_fsm_action_message *msg)
                // zlog_info ("flag: %d rdist: %u dist: %u pfdist: %u pdist:
                // %u", head->flags, head->reported_distance, head->distance,
                // prefix->fdistance, prefix->distance);
+               /* An unreachable route does us no good. Don't leave passive */
+               if (head->reported_distance == EIGRP_METRIC_UNREACHABLE)
+                       return EIGRP_FSM_KEEP_STATE;
+
                if (head->reported_distance < prefix->fdistance) {
                        return EIGRP_FSM_KEEP_STATE;
                }
-- 
2.14.1

donaldsharp commented 7 years ago

@lunn take a look at this branch: https://github.com/donaldsharp/frr/tree/eigrp_cleanup

I've started cleaning up the code so it's more readable. I would think this does not fix anything.

I'll talk w/ @dslicenc in the morning about this. He was more familiar than me with this part of the EIGRP state machine

donaldsharp commented 7 years ago

@lunn This could be a sender issue, can you attach a pcap file for debugging? In any event would you like to join us on slack for easier discussion? If so send an email to sharpd at cumulusnetworks.com and I'll get you an invite.

lunn commented 7 years ago

The RFC suggests such an update can be sent, it indicates the route is unreachable.

I traced this back to what was actually sending the update. It was part of the startup process, indicating to a neighbour it should clear out old state. This can also happen in eigrp_update_receive(), look at the code by / Filtering /

If i remember correctly, Martin Winter did some work to allow you to build topologies of VM nodes running FRR so you can debug these sorts of issues. I would suggest you do that. The topology i used is above, and it was 100% reproducable. I also have the same issue with a pentagon.

Unfortunately, i need to move on. After a week of debugging and playing with EIGRP, it is too unstable to do what i need, and i don't have the time to fix it up. I'm happy to test fixes, but i don't have time to do more than that.

donaldsharp commented 7 years ago

@lunn - https://github.com/FRRouting/frr/pull/1000 This should improve behavior somewhat. Things are still wrong though. The FIFO is actually being used as a FILO in some places which makes it wrong still. I'll work on that next ( hopefully this evening )

donaldsharp commented 7 years ago

@lunn -> check out #1010 it improves the sending of reliable packets

donaldsharp commented 7 years ago

I have just built a 8 node line r1 - r2 - r3 - r4 - r5 - r6 - r7 - r8 and am not seeing this crash robot# show ip route Codes: K - kernel route, C - connected, S - static, R - RIP, O - OSPF, I - IS-IS, B - BGP, P - PIM, E - EIGRP, N - NHRP, T - Table, v - VNC, V - VNC-Direct, A - Babel,

selected route, * - FIB route

E> 192.168.3.0/24 [0/0] via 193.1.7.2, r8-eth1 E> 193.1.1.0/26 [0/0] via 193.1.7.2, r8-eth1 E> 193.1.2.0/24 [0/0] via 193.1.7.2, r8-eth1 E> 193.1.4.0/24 [0/0] via 193.1.7.2, r8-eth1 E> 193.1.5.0/24 [0/0] via 193.1.7.2, r8-eth1 E> 193.1.6.0/24 [0/0] via 193.1.7.2, r8-eth1 E 193.1.7.0/24 [0/0] is directly connected, r8-eth1 C>* 193.1.7.0/24 is directly connected, r8-eth1 robot#

Closing this issue.

FRRouting / frr

EIGRP: Assertion `successors' failed in file eigrp_fsm.c, line 361, function eigrp_fsm_event_nq_fcn #943