Open gromit1811 opened 3 months ago
I analyzed one failure (with Git 82dcb1d6, slightly newer than the one mentioned above) and I think I know what's happening, but not yet why...
In this case, R5 originated the expected Inter-Router LSA for R7, but R6 didn't (ultimately). Full logs attached:
ospf6d-startup-r5.log ospf6d-startup-r6.log
Events related to the R7 Inter-Router LSA, both from R5 (left) and R6 (right), commented:
# R6 originates the LSA in area 0:
2024/06/12 12:37:51.296041 OSPF6: [T0FMF-6KKCP] ospf6_abr_originate_summary: route 10.254.254.7
2024/06/12 12:37:51.296043 OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 0, route 10.254.254.7
2024/06/12 12:37:51.296045 OSPF6: [JJ7N3-VVYV2] Originating summary in area 0 for ASBR 10.254.254.7
2024/06/12 12:37:51.296131 OSPF6: [TQB4M-JQ72Z] LSA Originate:
2024/06/12 12:37:51.296133 OSPF6: [GNRNM-HX7W3] [Inter-Router Id:10.254.254.7 Adv:10.254.254.6]
2024/06/12 12:37:51.296135 OSPF6: [HA4BY-46K3Y] Age: 0 SeqNum: 0x80000001 Cksum: 0fe1 Len: 32
2024/06/12 12:37:51.296215 OSPF6: [XKC5G-YW93A] ospf6_abr_originate_summary_to_area : finish area 0
2024/06/12 12:37:51.296216 OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 1, route 10.254.254.7
2024/06/12 12:37:51.296218 OSPF6: [WVWH0-3KHW3] ospf6_abr_originate_summary_to_area: The route is in the area itself, ignore
# R5 receives this and originates it in area 1 (right now, it doesn't see the direct connection to R7 in area 1 yet):
2024/06/12 12:37:51.299562 OSPF6: [T0FMF-6KKCP] ospf6_abr_originate_summary: route 10.254.254.7
2024/06/12 12:37:51.299563 OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 0, route 10.254.254.7
2024/06/12 12:37:51.299565 OSPF6: [WVWH0-3KHW3] ospf6_abr_originate_summary_to_area: The route is in the area itself, ignore
2024/06/12 12:37:51.299566 OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 1, route 10.254.254.7
2024/06/12 12:37:51.299568 OSPF6: [JJ7N3-VVYV2] Originating summary in area 1 for ASBR 10.254.254.7
2024/06/12 12:37:51.299588 OSPF6: [TQB4M-JQ72Z] LSA Originate:
2024/06/12 12:37:51.299591 OSPF6: [GNRNM-HX7W3] [Inter-Router Id:10.254.254.7 Adv:10.254.254.5]
2024/06/12 12:37:51.299593 OSPF6: [HA4BY-46K3Y] Age: 0 SeqNum: 0x80000001 Cksum: a623 Len: 32
2024/06/12 12:37:51.299631 OSPF6: [XKC5G-YW93A] ospf6_abr_originate_summary_to_area : finish area 1
# R5 now sees the direct connection to R7 and originates the LSA in area 0:
2024/06/12 12:37:51.302592 OSPF6: [T0FMF-6KKCP] ospf6_abr_originate_summary: route 10.254.254.7
2024/06/12 12:37:51.302594 OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 0, route 10.254.254.7
2024/06/12 12:37:51.302596 OSPF6: [JJ7N3-VVYV2] Originating summary in area 0 for ASBR 10.254.254.7
2024/06/12 12:37:51.302675 OSPF6: [TQB4M-JQ72Z] LSA Originate:
2024/06/12 12:37:51.302677 OSPF6: [GNRNM-HX7W3] [Inter-Router Id:10.254.254.7 Adv:10.254.254.5]
2024/06/12 12:37:51.302679 OSPF6: [HA4BY-46K3Y] Age: 0 SeqNum: 0x80000001 Cksum: 15dc Len: 32
2024/06/12 12:37:51.302739 OSPF6: [XKC5G-YW93A] ospf6_abr_originate_summary_to_area : finish area 0
2024/06/12 12:37:51.302741 OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 1, route 10.254.254.7
2024/06/12 12:37:51.302743 OSPF6: [WVWH0-3KHW3] ospf6_abr_originate_summary_to_area: The route is in the area itself, ignore
# R5 expires its original LSA in area 1 because the path to R7 via area 0 and R6 is no longer the best:
2024/06/12 12:37:51.304010 OSPF6: [T0FMF-6KKCP] ospf6_abr_originate_summary: route 10.254.254.7
2024/06/12 12:37:51.304012 OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 0, route 10.254.254.7
2024/06/12 12:37:51.304014 OSPF6: [WVWH0-3KHW3] ospf6_abr_originate_summary_to_area: The route is in the area itself, ignore
2024/06/12 12:37:51.304015 OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 1, route 10.254.254.7
2024/06/12 12:37:51.304018 OSPF6: [JJ7N3-VVYV2] Originating summary in area 1 for ASBR 10.254.254.7
2024/06/12 12:37:51.304073 OSPF6: [XQWSV-5GJQ6] This is the secondary path to the ASBR, ignore
2024/06/12 12:37:51.304081 OSPF6: [WA749-NPMHH] LSA: Premature aging: [Inter-Router Id:10.254.254.7 Adv:10.254.254.5]
2024/06/12 12:37:51.304087 OSPF6: [SYY2T-BK418] LSA Expire:
2024/06/12 12:37:51.304090 OSPF6: [GNRNM-HX7W3] [Inter-Router Id:10.254.254.7 Adv:10.254.254.5]
2024/06/12 12:37:51.304091 OSPF6: [HA4BY-46K3Y] Age: 3600 SeqNum: 0x80000001 Cksum: a623 Len: 32
# At this time, everything would be as it should.
# But R6 now thinks its LSA in area 0 no longer describes the best path (even though it still does) and expires it. This causes the missing nexthop at R1:
2024/06/12 12:37:51.346949 OSPF6: [T0FMF-6KKCP] ospf6_abr_originate_summary: route 10.254.254.7
2024/06/12 12:37:51.346950 OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 0, route 10.254.254.7
2024/06/12 12:37:51.346953 OSPF6: [JJ7N3-VVYV2] Originating summary in area 0 for ASBR 10.254.254.7
2024/06/12 12:37:51.346954 OSPF6: [XQWSV-5GJQ6] This is the secondary path to the ASBR, ignore
2024/06/12 12:37:51.346963 OSPF6: [WA749-NPMHH] LSA: Premature aging: [Inter-Router Id:10.254.254.7 Adv:10.254.254.6]
2024/06/12 12:37:51.346978 OSPF6: [SYY2T-BK418] LSA Expire:
2024/06/12 12:37:51.346980 OSPF6: [GNRNM-HX7W3] [Inter-Router Id:10.254.254.7 Adv:10.254.254.6]
2024/06/12 12:37:51.346982 OSPF6: [HA4BY-46K3Y] Age: 3600 SeqNum: 0x80000001 Cksum: 0fe1 Len: 32
2024/06/12 12:37:51.347061 OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 1, route 10.254.254.7
2024/06/12 12:37:51.347063 OSPF6: [WVWH0-3KHW3] ospf6_abr_originate_summary_to_area: The route is in the area itself, ignore
# R6 originates its LSA again, just to expire it again immediately afterwards. This cycle repeats a few times:
2024/06/12 12:37:51.347608 OSPF6: [T0FMF-6KKCP] ospf6_abr_originate_summary: route 10.254.254.7
2024/06/12 12:37:51.347610 OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 0, route 10.254.254.7
2024/06/12 12:37:51.347612 OSPF6: [JJ7N3-VVYV2] Originating summary in area 0 for ASBR 10.254.254.7
2024/06/12 12:37:51.347634 OSPF6: [TQB4M-JQ72Z] LSA Originate:
2024/06/12 12:37:51.347636 OSPF6: [GNRNM-HX7W3] [Inter-Router Id:10.254.254.7 Adv:10.254.254.6]
2024/06/12 12:37:51.347638 OSPF6: [HA4BY-46K3Y] Age: 0 SeqNum: 0x80000002 Cksum: 0de2 Len: 32
2024/06/12 12:37:51.347742 OSPF6: [XKC5G-YW93A] ospf6_abr_originate_summary_to_area : finish area 0
2024/06/12 12:37:51.347743 OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 1, route 10.254.254.7
2024/06/12 12:37:51.347798 OSPF6: [WVWH0-3KHW3] ospf6_abr_originate_summary_to_area: The route is in the area itself, ignore
2024/06/12 12:37:51.347827 OSPF6: [T0FMF-6KKCP] ospf6_abr_originate_summary: route 10.254.254.7
2024/06/12 12:37:51.347829 OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 0, route 10.254.254.7
2024/06/12 12:37:51.347831 OSPF6: [JJ7N3-VVYV2] Originating summary in area 0 for ASBR 10.254.254.7
2024/06/12 12:37:51.347833 OSPF6: [XQWSV-5GJQ6] This is the secondary path to the ASBR, ignore
2024/06/12 12:37:51.347841 OSPF6: [WA749-NPMHH] LSA: Premature aging: [Inter-Router Id:10.254.254.7 Adv:10.254.254.6]
2024/06/12 12:37:51.347860 OSPF6: [SYY2T-BK418] LSA Expire:
2024/06/12 12:37:51.347863 OSPF6: [GNRNM-HX7W3] [Inter-Router Id:10.254.254.7 Adv:10.254.254.6]
2024/06/12 12:37:51.347865 OSPF6: [HA4BY-46K3Y] Age: 3600 SeqNum: 0x80000002 Cksum: 0de2 Len: 32
2024/06/12 12:37:51.347939 OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 1, route 10.254.254.7
2024/06/12 12:37:51.347940 OSPF6: [WVWH0-3KHW3] ospf6_abr_originate_summary_to_area: The route is in the area itself, ignore
2024/06/12 12:37:52.126621 OSPF6: [T0FMF-6KKCP] ospf6_abr_originate_summary: route 10.254.254.7
2024/06/12 12:37:52.126622 OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 0, route 10.254.254.7
2024/06/12 12:37:52.126624 OSPF6: [JJ7N3-VVYV2] Originating summary in area 0 for ASBR 10.254.254.7
2024/06/12 12:37:52.126647 OSPF6: [TQB4M-JQ72Z] LSA Originate:
2024/06/12 12:37:52.126649 OSPF6: [GNRNM-HX7W3] [Inter-Router Id:10.254.254.7 Adv:10.254.254.6]
2024/06/12 12:37:52.126651 OSPF6: [HA4BY-46K3Y] Age: 0 SeqNum: 0x80000003 Cksum: 0be3 Len: 32
2024/06/12 12:37:52.126811 OSPF6: [XKC5G-YW93A] ospf6_abr_originate_summary_to_area : finish area 0
2024/06/12 12:37:52.126812 OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 1, route 10.254.254.7
2024/06/12 12:37:52.126814 OSPF6: [WVWH0-3KHW3] ospf6_abr_originate_summary_to_area: The route is in the area itself, ignore
2024/06/12 12:37:52.126837 OSPF6: [T0FMF-6KKCP] ospf6_abr_originate_summary: route 10.254.254.7
2024/06/12 12:37:52.126838 OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 0, route 10.254.254.7
2024/06/12 12:37:52.126841 OSPF6: [JJ7N3-VVYV2] Originating summary in area 0 for ASBR 10.254.254.7
2024/06/12 12:37:52.126843 OSPF6: [XQWSV-5GJQ6] This is the secondary path to the ASBR, ignore
2024/06/12 12:37:52.126850 OSPF6: [WA749-NPMHH] LSA: Premature aging: [Inter-Router Id:10.254.254.7 Adv:10.254.254.6]
2024/06/12 12:37:52.126870 OSPF6: [SYY2T-BK418] LSA Expire:
2024/06/12 12:37:52.126873 OSPF6: [GNRNM-HX7W3] [Inter-Router Id:10.254.254.7 Adv:10.254.254.6]
2024/06/12 12:37:52.126874 OSPF6: [HA4BY-46K3Y] Age: 3600 SeqNum: 0x80000003 Cksum: 0be3 Len: 32
2024/06/12 12:37:52.126948 OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 1, route 10.254.254.7
2024/06/12 12:37:52.126950 OSPF6: [WVWH0-3KHW3] ospf6_abr_originate_summary_to_area: The route is in the area itself, ignore
2024/06/12 12:37:52.176949 OSPF6: [T0FMF-6KKCP] ospf6_abr_originate_summary: route 10.254.254.7
2024/06/12 12:37:52.176951 OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 0, route 10.254.254.7
2024/06/12 12:37:52.176953 OSPF6: [JJ7N3-VVYV2] Originating summary in area 0 for ASBR 10.254.254.7
2024/06/12 12:37:52.176977 OSPF6: [TQB4M-JQ72Z] LSA Originate:
2024/06/12 12:37:52.176979 OSPF6: [GNRNM-HX7W3] [Inter-Router Id:10.254.254.7 Adv:10.254.254.6]
2024/06/12 12:37:52.176981 OSPF6: [HA4BY-46K3Y] Age: 0 SeqNum: 0x80000004 Cksum: 09e4 Len: 32
2024/06/12 12:37:52.177146 OSPF6: [XKC5G-YW93A] ospf6_abr_originate_summary_to_area : finish area 0
2024/06/12 12:37:52.177147 OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 1, route 10.254.254.7
2024/06/12 12:37:52.177149 OSPF6: [WVWH0-3KHW3] ospf6_abr_originate_summary_to_area: The route is in the area itself, ignore
2024/06/12 12:37:52.177172 OSPF6: [T0FMF-6KKCP] ospf6_abr_originate_summary: route 10.254.254.7
2024/06/12 12:37:52.177173 OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 0, route 10.254.254.7
2024/06/12 12:37:52.177176 OSPF6: [JJ7N3-VVYV2] Originating summary in area 0 for ASBR 10.254.254.7
2024/06/12 12:37:52.177177 OSPF6: [XQWSV-5GJQ6] This is the secondary path to the ASBR, ignore
2024/06/12 12:37:52.177185 OSPF6: [WA749-NPMHH] LSA: Premature aging: [Inter-Router Id:10.254.254.7 Adv:10.254.254.6]
2024/06/12 12:37:52.177205 OSPF6: [SYY2T-BK418] LSA Expire:
2024/06/12 12:37:52.177207 OSPF6: [GNRNM-HX7W3] [Inter-Router Id:10.254.254.7 Adv:10.254.254.6]
2024/06/12 12:37:52.177209 OSPF6: [HA4BY-46K3Y] Age: 3600 SeqNum: 0x80000004 Cksum: 09e4 Len: 32
2024/06/12 12:37:52.177282 OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 1, route 10.254.254.7
2024/06/12 12:37:52.177284 OSPF6: [WVWH0-3KHW3] ospf6_abr_originate_summary_to_area: The route is in the area itself, ignore
2024/06/12 12:37:54.629330 OSPF6: [T0FMF-6KKCP] ospf6_abr_originate_summary: route 10.254.254.7
2024/06/12 12:37:54.629331 OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 0, route 10.254.254.7
2024/06/12 12:37:54.629333 OSPF6: [JJ7N3-VVYV2] Originating summary in area 0 for ASBR 10.254.254.7
2024/06/12 12:37:54.629356 OSPF6: [TQB4M-JQ72Z] LSA Originate:
2024/06/12 12:37:54.629359 OSPF6: [GNRNM-HX7W3] [Inter-Router Id:10.254.254.7 Adv:10.254.254.6]
2024/06/12 12:37:54.629361 OSPF6: [HA4BY-46K3Y] Age: 0 SeqNum: 0x80000005 Cksum: 07e5 Len: 32
2024/06/12 12:37:54.629522 OSPF6: [XKC5G-YW93A] ospf6_abr_originate_summary_to_area : finish area 0
2024/06/12 12:37:54.629523 OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 1, route 10.254.254.7
2024/06/12 12:37:54.629525 OSPF6: [WVWH0-3KHW3] ospf6_abr_originate_summary_to_area: The route is in the area itself, ignore
2024/06/12 12:37:54.629548 OSPF6: [T0FMF-6KKCP] ospf6_abr_originate_summary: route 10.254.254.7
2024/06/12 12:37:54.629550 OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 0, route 10.254.254.7
2024/06/12 12:37:54.629553 OSPF6: [JJ7N3-VVYV2] Originating summary in area 0 for ASBR 10.254.254.7
2024/06/12 12:37:54.629554 OSPF6: [XQWSV-5GJQ6] This is the secondary path to the ASBR, ignore
2024/06/12 12:37:54.629562 OSPF6: [WA749-NPMHH] LSA: Premature aging: [Inter-Router Id:10.254.254.7 Adv:10.254.254.6]
2024/06/12 12:37:54.629582 OSPF6: [SYY2T-BK418] LSA Expire:
2024/06/12 12:37:54.629585 OSPF6: [GNRNM-HX7W3] [Inter-Router Id:10.254.254.7 Adv:10.254.254.6]
2024/06/12 12:37:54.629587 OSPF6: [HA4BY-46K3Y] Age: 3600 SeqNum: 0x80000005 Cksum: 07e5 Len: 32
2024/06/12 12:37:54.629661 OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 1, route 10.254.254.7
2024/06/12 12:37:54.629663 OSPF6: [WVWH0-3KHW3] ospf6_abr_originate_summary_to_area: The route is in the area itself, ignore
Next step is to determine what triggers the false "This is the secondary path to the ASBR, ignore" at 12:37:51.346954.
For reference, this is the topo diagram from the test case:
.
Area 0 . Area 1
.
-- R2 ------ R5 -----
/ .\ \
/ . | \
R1 --- R3 ------ R6 ------ R7
\ / |. |
\ / |. |
-- R4 ---- |. |
/ ./
R8 --
.
Humble apologies, @gromit1811. I know why the test was consistently failing on my machine - it was operator error.
I had multiple versions of FRR installed due to poor config management. I changed the configure flags at some point and neglected to specify --sbindir=/usr/lib/frr
, which resulted in binaries in /usr/sbin/
as well. Definitely a trap there for new players.
Now that I've cleaned up my environment and know what I'm running, I ran the ospf6 tests a number of times and noted that this test is mostly passing. I did get an unexpected failure on the third run of the full ospf6d set, and then decided to run this test a few more times. It passed 20 consecutive times - 1/23 failures.
Thanks for the update & no problem - this just means that there isn't an easier way to reproduce this. And since I wasn't able to reproduce your failure rates, I went back to my original environment in the meantime anyway.
Update for everybody else watching: No progress, unfortunately :frowning_face: I can't dedicate any time to this ATM, so this is on hold until further notice.
Back from holidays and other higher prio tasks :wink:
I can now answer my question from above:
Next step is to determine what triggers the false "This is the secondary path to the ASBR, ignore" at 12:37:51.346954.
If the problem occcurs, we seem to have 2 entries for R7 in the brouter_table on the ABR (R6 above, but I've also seen cases where it happened on R5). One is the best path, so the ABR orginates an Inter-Router LSA. But the 2nd one isn't the best, so it expires the LSA again, leaving us with no Inter-Router LSA for R7 on the ABR.
Now, should we have 2 entries in the brouter_table? If this is OK, ospf6_intra_brouter_calculation should call ospf6_abr_originate_summary only for the best one. If this is not OK, we should make sure that only the best route is in brouter_table.
I'll continue investigations, unless somebody can already answer the question above.
Are you seeing two router LSA's (type 1's) for R7, or two externals (type 5's)? These are standard areas, there's no NSSA, etc., here, right? We shouldn't see two router LSA's for a single router in our database. Is there any way to grab these two when they exist to see if they have different serial/sequence numbers, etc.?
I wonder if we're receiving a new LSA and just waiting too long to kill the old one?
One Router-LSA for R7 and IIRC one AS-External-LSA. Standard areas, yes. I've got pretty verbose logs from R5 and R6 when the issue occurs (and for reference also when it doesn't), so I could look up additional things. But: I'm one step further in the meantime and I think I know what's happening, I'm just not 100% sure yet which part of the code is wrong. Sorry for not updating the issue in time, I had the impression I'm mostly talking to myself anyway :wink:
My current understanding: The issue occurs while establishing neighborships. There's a time window of a few milliseconds where not all neighborships are fully established yet and if an ABR receives the wrong LSA exactly at that time, it gets into a weird situation that persists even after all neighborships are OK. If the LSA reception doesn't hit this time window, everything works as expected. That explains why we see this only rarely. And the "wrong LSA" is actually correct at the given time with partial neighborships, but it prevents a "correct LSA" from being originated once neighborships are fully established. "LSA" in this context means Inter-Router-LSAs (type 4), all other types are looking OK.
The gory details: Let's assume the problem occurs on ABR R5 (it could happen as well on ABR R6). In this case, on R1 we only see 2 ECMP paths to R7's loopback address (i.e AS-external routes) instead of the expected 3. That happens because R5 doesn't originate its Inter-Router LSA for R7 towards area 0. R6 does that correctly, so R1 sees only 2 paths: R1-R3-R6-R7 and R1-R4-R6-R7. R1-R2-R5-R7 is missing. Checking the logs a bit closer, we actually see R5 originating its Inter-Router-LSA from time to time, but it expires it immediately afterwards again. This is after neighborships are stable.
What triggers the problem is the following sequence of events:
IMO there are 2 problems: In step 8, the worse LSA from R6 shouldn't "shadow" the one from R5 and shouldn't cause R5 to expire its own one. And in step 11, R5 should re-examine its LSDB and border router table and decide to originate its own LSA again.
The first problem seems to be caused by R5 having 2 routes towards R7 (prefix 10.254.254.7) in its brouter table. Its own one with path type Intra-Area and and the one from R6 with path type Inter-Area. Because they have different types, they use different entries in the brouter table. Now when ospf6_intra_brouter_calculation on R5 iterates the brouter table, it first finds R5's own entry (marked "best"), originates an LSA and in the next iteration, finds the R6 one (not marked "best") and immediately expires its LSA again. I don't know which one of the following variants would be correct:
I don't know yet what's causing the 2nd problem in step 11. Strictly speaking, we have hooks notifying us about brouter table and LSDB updates, but for some reason, they don't seem to be sufficient.
OSPFv2 seems to handle at least the first problem correctly, so I could probably check how it's implemented there. But the implementation seems to be sufficiently different so that they're probably not directly comparable :frowning:
New test case that reproduces this reliably in #16797
Topology:
.
Area 0 . Area 1
.
-- R2 ------ R5 -----
/ . \
/ . \
R1 --- R3 ------ R6 ------ R7
\ / .\
\ / . \
-- R4 ---- . ------R8
.
Link R6-R7 is down initially, so R1 sees a route to R8 only via R6 (with 2 nexthops, because there are 2 ECMP paths to R6). After link up, we expect additional nexthops to R7, but we don't expect nexthops to R8 to change, because even though it is now reachable via R5 as well, the best path is still only via R6.
What actually happens is that after link up, R1 routes to R8 only via R5 and ignores R6, reducing the number of nexthops to 1. This is because R6 expires its Inter-Router LSA towards area 0 after link up. It does that for the same reason that caused the intermittent failure of the original test case: Before link up, R5 creates an Inter-Router LSA for area 1 based on the one received from R6 via area 0. After link up, R6 receives this via area 1, creates a border_router table entry based on that and that shadows the brouter table entry R6 used initially to originate its Inter-Router LSA towards area 0. So R6 expires that LSA and thus R1 no longer can use this path. Note that after R6 has expired its LSA, R5 receives the LS update and expires its LSA for area 1 as well. But even that doesn't resolve the situation on R6.
BTW, we're only looking at AS-external routes here which every router originates (based on loopback addresses redistributed into OSPFv3).
Logs from a test run with #16797 (note that they contain additional log messages, mostly marked with "XXX" - feel free to ignore): r5_ospf6d.log r6_ospf6d.log
Commented LSA events extracted from these logs:
2024/09/11 14:04:46.928182 OSPF6: [TQB4M-JQ72Z] LSA Originate:
2024/09/11 14:04:46.928193 OSPF6: [GNRNM-HX7W3] [Inter-Router Id:10.254.254.8 Adv:10.254.254.6]
2024/09/11 14:04:46.928200 OSPF6: [HA4BY-46K3Y] Age: 0 SeqNum: 0x80000001 Cksum: 13db Len: 32 (-> area 0)
# ^^^ r6_ospf6d.log
2024/09/11 14:04:46.940890 OSPF6: [M3K4C-7HG7V] LSA Receive from 10.254.254.2%r5-eth0
2024/09/11 14:04:46.940901 OSPF6: [GNRNM-HX7W3] [Inter-Router Id:10.254.254.8 Adv:10.254.254.6]
2024/09/11 14:04:46.940907 OSPF6: [HA4BY-46K3Y] Age: 4 SeqNum: 0x80000001 Cksum: 13db Len: 32
# ^^^ r5_ospf6d.log
2024/09/11 14:04:46.950329 OSPF6: [M3K4C-7HG7V] LSA Receive from 10.254.254.4%r6-eth1
2024/09/11 14:04:46.950339 OSPF6: [GNRNM-HX7W3] [Inter-Router Id:10.254.254.8 Adv:10.254.254.6]
2024/09/11 14:04:46.950345 OSPF6: [HA4BY-46K3Y] Age: 4 SeqNum: 0x80000001 Cksum: 13db Len: 32
# ^^^ r6_ospf6d.log
2024/09/11 14:04:50.352100 OSPF6: [TQB4M-JQ72Z] LSA Originate:
2024/09/11 14:04:50.352104 OSPF6: [GNRNM-HX7W3] [Inter-Router Id:10.254.254.8 Adv:10.254.254.5]
2024/09/11 14:04:50.352107 OSPF6: [HA4BY-46K3Y] Age: 0 SeqNum: 0x80000001 Cksum: aa1d Len: 32 (-> area 1)
# ^^^ r5_ospf6d.log
# Link R6-R7 up:
2024/09/11 14:04:53.611915 OSPF6: [TB8WK-36NCK] Interface Event r6-eth2: [InterfaceUp]
2024/09/11 14:05:07.485588 OSPF6: [M3K4C-7HG7V] LSA Receive from 10.254.254.7%r6-eth2
2024/09/11 14:05:07.485594 OSPF6: [GNRNM-HX7W3] [Inter-Router Id:10.254.254.8 Adv:10.254.254.5]
2024/09/11 14:05:07.485598 OSPF6: [HA4BY-46K3Y] Age: 20 SeqNum: 0x80000001 Cksum: aa1d Len: 32
# ^^^ r6_ospf6d.log
# R6 receives R5's area 1 LSA based on its own LSA originated at 14:04:46.928182
2024/09/11 14:05:12.485356 OSPF6: [SYY2T-BK418] LSA Expire:
2024/09/11 14:05:12.485366 OSPF6: [GNRNM-HX7W3] [Inter-Router Id:10.254.254.8 Adv:10.254.254.6]
2024/09/11 14:05:12.485372 OSPF6: [HA4BY-46K3Y] Age: 3600 SeqNum: 0x80000001 Cksum: 13db Len: 32 (-> area 0)
# ^^^ r6_ospf6d.log
# ...and that causes it to erroneously expire its own LSA
2024/09/11 14:05:12.489660 OSPF6: [TQB4M-JQ72Z] LSA Originate:
2024/09/11 14:05:12.489669 OSPF6: [GNRNM-HX7W3] [Inter-Router Id:10.254.254.8 Adv:10.254.254.6]
2024/09/11 14:05:12.489676 OSPF6: [HA4BY-46K3Y] Age: 0 SeqNum: 0x80000002 Cksum: 11dc Len: 32 (-> area 0)
# ^^^ r6_ospf6d.log
# ...just to originate it again 4ms later
2024/09/11 14:05:12.490348 OSPF6: [SYY2T-BK418] LSA Expire:
2024/09/11 14:05:12.490358 OSPF6: [GNRNM-HX7W3] [Inter-Router Id:10.254.254.8 Adv:10.254.254.6]
2024/09/11 14:05:12.490363 OSPF6: [HA4BY-46K3Y] Age: 3600 SeqNum: 0x80000002 Cksum: 11dc Len: 32 (-> area 0)
# ^^^ r6_ospf6d.log
# ...and then to expire it one more time immediately afterwards
2024/09/11 14:05:12.509459 OSPF6: [M3K4C-7HG7V] LSA Receive from 10.254.254.2%r5-eth0
2024/09/11 14:05:12.509464 OSPF6: [GNRNM-HX7W3] [Inter-Router Id:10.254.254.8 Adv:10.254.254.6]
2024/09/11 14:05:12.509467 OSPF6: [HA4BY-46K3Y] Age: 3600 SeqNum: 0x80000002 Cksum: 11dc Len: 32
# ^^^ r5_ospf6d.log
2024/09/11 14:05:12.509703 OSPF6: [SYY2T-BK418] LSA Expire:
2024/09/11 14:05:12.509830 OSPF6: [GNRNM-HX7W3] [Inter-Router Id:10.254.254.8 Adv:10.254.254.5]
2024/09/11 14:05:12.509834 OSPF6: [HA4BY-46K3Y] Age: 3600 SeqNum: 0x80000001 Cksum: aa1d Len: 32 (-> area 1)
# ^^^ r5_ospf6d.log
# R5 expires its LSA based on the (just expired) LSA from R6
2024/09/11 14:05:12.522146 OSPF6: [M3K4C-7HG7V] LSA Receive from 10.254.254.4%r6-eth1
2024/09/11 14:05:12.522149 OSPF6: [GNRNM-HX7W3] [Inter-Router Id:10.254.254.8 Adv:10.254.254.6]
2024/09/11 14:05:12.522151 OSPF6: [HA4BY-46K3Y] Age: 3600 SeqNum: 0x80000002 Cksum: 11dc Len: 32
# ^^^ r6_ospf6d.log
2024/09/11 14:05:12.522193 OSPF6: [M3K4C-7HG7V] LSA Receive from 10.254.254.7%r6-eth2
2024/09/11 14:05:12.522196 OSPF6: [GNRNM-HX7W3] [Inter-Router Id:10.254.254.8 Adv:10.254.254.5]
2024/09/11 14:05:12.522198 OSPF6: [HA4BY-46K3Y] Age: 3600 SeqNum: 0x80000001 Cksum: aa1d Len: 32
# ^^^ r6_ospf6d.log
2024/09/11 14:05:12.529361 OSPF6: [TQB4M-JQ72Z] LSA Originate:
2024/09/11 14:05:12.529364 OSPF6: [GNRNM-HX7W3] [Inter-Router Id:10.254.254.8 Adv:10.254.254.5]
2024/09/11 14:05:12.529365 OSPF6: [HA4BY-46K3Y] Age: 0 SeqNum: 0x80000001 Cksum: e1f9 Len: 32 (-> area 0)
# ^^^ r5_ospf6d.log
# ...and then creates a proper one for area 0 based on the path to R8 via area 1
2024/09/11 14:05:12.533631 OSPF6: [M3K4C-7HG7V] LSA Receive from 10.254.254.3%r6-eth0
2024/09/11 14:05:12.533633 OSPF6: [GNRNM-HX7W3] [Inter-Router Id:10.254.254.8 Adv:10.254.254.5]
2024/09/11 14:05:12.533634 OSPF6: [HA4BY-46K3Y] Age: 4 SeqNum: 0x80000001 Cksum: e1f9 Len: 32
# ^^^ r6_ospf6d.log
2024/09/11 14:05:12.533781 OSPF6: [M3K4C-7HG7V] LSA Receive from 10.254.254.4%r6-eth1
2024/09/11 14:05:12.533783 OSPF6: [GNRNM-HX7W3] [Inter-Router Id:10.254.254.8 Adv:10.254.254.5]
2024/09/11 14:05:12.533784 OSPF6: [HA4BY-46K3Y] Age: 4 SeqNum: 0x80000001 Cksum: e1f9 Len: 32
# ^^^ r6_ospf6d.log
The expire/originate/expire cycle is typical for this case: Expire first on LSA reception, then iterate the brouter table and find the intra-area path first (-> originate), followed by the inter-area path from the received LSA (-> expire).
I now also analyzed the 2nd problem mentioned in https://github.com/FRRouting/frr/issues/16197#issuecomment-2293383832 (failure to re-originate the Inter-Router LSA on R6 towards area 0 after the temporary Inter-Router LSA from R5 towards area 1 has been expired):
The expired LSA seems to get handled properly and triggers removal of the corresponding brouter table entry on R6. All this processing is performed based on the route derived from the expired LSA in ospf6_abr_examin_summary()
. But after removal, nobody checks whether there might be another brouter table route for the same ASBR and area (with different path type) left which should now be used to originate an Inter-Router LSA towards area 0 instead. If I hack to code to call ospf6_intra_brouter_calculation()
after handling the expired LSA, R6 will originate a proper Inter-Router LSA towards area 0, fixing the 2nd problem. Or rather, I don't consider this to be a real fix because a full ospf6_intra_brouter_calculation()
is overkill if only a single brouter changed...
Description
Since 217e505a6, topotest ospf6_ecmp_inter_area intermittently fails due to a wrong number of nexthops for certain routes. See comments to #16055 where this was mentioned initially and further discussion in comments to #15899.
Failure rates are ~10% in my tests but vary wildly (I also saw 200 successful runs in a row). @acooks-at-bda reported a 100% failure rate in his tests, but I've been unable to get anywhere near when I tried to reproduce his environment.
When the error occurs, the initial pre-condition nexthop check in https://github.com/FRRouting/frr/blob/d5b0c76eddc8c30b8306ed7f53f6c8d94f4b0c37/tests/topotests/ospf6_ecmp_inter_area/test_ospf6_ecmp_inter_area.py#L195 fails.
Note: This report is mostly a placeholder to record the fact that I'm investigating this. I can't dedicate too much time to it, so if somebody wants to help or is faster than me, be my guest :wink:
Version
How to reproduce
pytest tests/topotests/ospf6_ecmp_inter_area
Expected behavior
Test succeeds
Actual behavior
Test sometimes fails with errors like this (nexthop pattern is not always exactly the same):
Additional context
The actual issue seems to be that sometimes one of the 2 ABRs (R5 and R6) doesn't originate an Inter-Router LSAs (type 4) when it should, causing a path to the destination (R7) to be lost. I don't know yet why that happens.
Note: The problem is most likely neither caused by 217e505a6 nor by b925570 (the bugfix which the topotest update is trying to verify) but existed before. It was noticed only now just because there was no testcase for inter-area ECMP routes before.
Checklist