FRRouting / frr

The FRRouting Protocol Suite
https://frrouting.org/
Other
3.21k stars 1.24k forks source link

Missing ECMP nexthops for OSPFv3 inter-area routes #16197

Open gromit1811 opened 3 months ago

gromit1811 commented 3 months ago

Description

Since 217e505a6, topotest ospf6_ecmp_inter_area intermittently fails due to a wrong number of nexthops for certain routes. See comments to #16055 where this was mentioned initially and further discussion in comments to #15899.

Failure rates are ~10% in my tests but vary wildly (I also saw 200 successful runs in a row). @acooks-at-bda reported a 100% failure rate in his tests, but I've been unable to get anywhere near when I tried to reproduce his environment.

When the error occurs, the initial pre-condition nexthop check in https://github.com/FRRouting/frr/blob/d5b0c76eddc8c30b8306ed7f53f6c8d94f4b0c37/tests/topotests/ospf6_ecmp_inter_area/test_ospf6_ecmp_inter_area.py#L195 fails.

Note: This report is mostly a placeholder to record the fact that I'm investigating this. I can't dedicate too much time to it, so if somebody wants to help or is faster than me, be my guest :wink:

Version

Git master 217e505a67df1ac03483f7c9a97cf4947dd40707

How to reproduce

pytest tests/topotests/ospf6_ecmp_inter_area

Expected behavior

Test succeeds

Actual behavior

Test sometimes fails with errors like this (nexthop pattern is not always exactly the same):

======================================= test session starts ========================================
platform linux -- Python 3.12.3, pytest-7.4.3, pluggy-1.3.0
rootdir: /home/frr/frr/tests/topotests
configfile: pytest.ini
collected 3 items                                                                                  

tests/topotests/ospf6_ecmp_inter_area/test_ospf6_ecmp_inter_area.py .Fs                      [100%]

============================================= FAILURES =============================================
_______________________________________ test_ecmp_inter_area _______________________________________

    def test_ecmp_inter_area():
        "Test whether OSPFv3 ECMP nexthops are properly updated for inter-area routes after link down"
        tgen = get_topogen()
        if tgen.routers_have_failure():
            pytest.skip(tgen.errors)

        def num_nexthops(router):
            # Careful: "show ipv6 ospf6 route json" doesn't work here. It will
            # only list one route type per prefix and that might not necessarily
            # be the best/selected route. "show ipv6 route ospf6 json" only
            # lists selected routes, so that's more useful in this case.
            routes = tgen.gears[router].vtysh_cmd("show ipv6 route ospf6 json", isjson=True)
            route_prefixes_infos = sorted(routes.items())
            # Note: ri may contain one entry per routing protocol, but since
            # we've explicitly requested only ospf6 above, we can count on ri[0]
            # being the entry we're looking for.
            return [ri[0]["internalNextHopActiveNum"] for rp, ri in route_prefixes_infos]

        def expect_num_nexthops(router, expected_num_nexthops, count):
            "Wait until number of nexthops for routes matches expectation"
            logger.info(
                "waiting for OSPFv3 router '{}' nexthops {}".format(
                    router, expected_num_nexthops
                )
            )
            test_func = partial(num_nexthops, router)
            _, result = topotest.run_and_expect(
                test_func, expected_num_nexthops, count=count, wait=3
            )
            assert (
                result == expected_num_nexthops
            ), "'{}' wrong number of route nexthops".format(router)

        # Check nexthops pre link-down
        # tgen.mininet_cli()
>       expect_num_nexthops("r1", [1, 1, 1, 1, 2, 3, 3, 3, 3], 4)

/home/frr/frr/tests/topotests/ospf6_ecmp_inter_area/test_ospf6_ecmp_inter_area.py:195: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

router = 'r1', expected_num_nexthops = [1, 1, 1, 1, 2, 3, ...], count = 4

    def expect_num_nexthops(router, expected_num_nexthops, count):
        "Wait until number of nexthops for routes matches expectation"
        logger.info(
            "waiting for OSPFv3 router '{}' nexthops {}".format(
                router, expected_num_nexthops
            )
        )
        test_func = partial(num_nexthops, router)
        _, result = topotest.run_and_expect(
            test_func, expected_num_nexthops, count=count, wait=3
        )
>       assert (
            result == expected_num_nexthops
        ), "'{}' wrong number of route nexthops".format(router)
E       AssertionError: 'r1' wrong number of route nexthops
E       assert [1, 1, 1, 1, 2, 1, ...] == [1, 1, 1, 1, 2, 3, ...]
E         At index 5 diff: 1 != 3
E         Use -v to get more diff

/home/frr/frr/tests/topotests/ospf6_ecmp_inter_area/test_ospf6_ecmp_inter_area.py:189: AssertionError
---------------------------------------- Captured log call -----------------------------------------
2024-06-11 18:11:15,253 ERROR: topo: 'num_nexthops' failed after 12.30 seconds
------------------------- generated xml file: /tmp/topotests/topotests.xml -------------------------
===================================== short test summary info ======================================
FAILED tests/topotests/ospf6_ecmp_inter_area/test_ospf6_ecmp_inter_area.py::test_ecmp_inter_area - AssertionError: 'r1' wrong number of route nexthops
============================= 1 failed, 1 passed, 1 skipped in 34.32s ==============================

Additional context

The actual issue seems to be that sometimes one of the 2 ABRs (R5 and R6) doesn't originate an Inter-Router LSAs (type 4) when it should, causing a path to the destination (R7) to be lost. I don't know yet why that happens.

Note: The problem is most likely neither caused by 217e505a6 nor by b925570 (the bugfix which the topotest update is trying to verify) but existed before. It was noticed only now just because there was no testcase for inter-area ECMP routes before.

Checklist

gromit1811 commented 3 months ago

I analyzed one failure (with Git 82dcb1d6, slightly newer than the one mentioned above) and I think I know what's happening, but not yet why...

In this case, R5 originated the expected Inter-Router LSA for R7, but R6 didn't (ultimately). Full logs attached:

ospf6d-startup-r5.log ospf6d-startup-r6.log

Events related to the R7 Inter-Router LSA, both from R5 (left) and R6 (right), commented:

# R6 originates the LSA in area 0:
2024/06/12 12:37:51.296041         OSPF6: [T0FMF-6KKCP] ospf6_abr_originate_summary: route 10.254.254.7
2024/06/12 12:37:51.296043         OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 0, route 10.254.254.7
2024/06/12 12:37:51.296045         OSPF6: [JJ7N3-VVYV2] Originating summary in area 0 for ASBR 10.254.254.7
2024/06/12 12:37:51.296131         OSPF6: [TQB4M-JQ72Z] LSA Originate:
2024/06/12 12:37:51.296133         OSPF6: [GNRNM-HX7W3]     [Inter-Router Id:10.254.254.7 Adv:10.254.254.6]
2024/06/12 12:37:51.296135         OSPF6: [HA4BY-46K3Y]     Age:    0 SeqNum: 0x80000001 Cksum: 0fe1 Len: 32
2024/06/12 12:37:51.296215         OSPF6: [XKC5G-YW93A] ospf6_abr_originate_summary_to_area : finish area 0
2024/06/12 12:37:51.296216         OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 1, route 10.254.254.7
2024/06/12 12:37:51.296218         OSPF6: [WVWH0-3KHW3] ospf6_abr_originate_summary_to_area: The route is in the area itself, ignore

# R5 receives this and originates it in area 1 (right now, it doesn't see the direct connection to R7 in area 1 yet):
2024/06/12 12:37:51.299562 OSPF6: [T0FMF-6KKCP] ospf6_abr_originate_summary: route 10.254.254.7
2024/06/12 12:37:51.299563 OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 0, route 10.254.254.7
2024/06/12 12:37:51.299565 OSPF6: [WVWH0-3KHW3] ospf6_abr_originate_summary_to_area: The route is in the area itself, ignore
2024/06/12 12:37:51.299566 OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 1, route 10.254.254.7
2024/06/12 12:37:51.299568 OSPF6: [JJ7N3-VVYV2] Originating summary in area 1 for ASBR 10.254.254.7
2024/06/12 12:37:51.299588 OSPF6: [TQB4M-JQ72Z] LSA Originate:
2024/06/12 12:37:51.299591 OSPF6: [GNRNM-HX7W3]     [Inter-Router Id:10.254.254.7 Adv:10.254.254.5]
2024/06/12 12:37:51.299593 OSPF6: [HA4BY-46K3Y]     Age:    0 SeqNum: 0x80000001 Cksum: a623 Len: 32
2024/06/12 12:37:51.299631 OSPF6: [XKC5G-YW93A] ospf6_abr_originate_summary_to_area : finish area 1

# R5 now sees the direct connection to R7 and originates the LSA in area 0:
2024/06/12 12:37:51.302592 OSPF6: [T0FMF-6KKCP] ospf6_abr_originate_summary: route 10.254.254.7
2024/06/12 12:37:51.302594 OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 0, route 10.254.254.7
2024/06/12 12:37:51.302596 OSPF6: [JJ7N3-VVYV2] Originating summary in area 0 for ASBR 10.254.254.7
2024/06/12 12:37:51.302675 OSPF6: [TQB4M-JQ72Z] LSA Originate:
2024/06/12 12:37:51.302677 OSPF6: [GNRNM-HX7W3]     [Inter-Router Id:10.254.254.7 Adv:10.254.254.5]
2024/06/12 12:37:51.302679 OSPF6: [HA4BY-46K3Y]     Age:    0 SeqNum: 0x80000001 Cksum: 15dc Len: 32
2024/06/12 12:37:51.302739 OSPF6: [XKC5G-YW93A] ospf6_abr_originate_summary_to_area : finish area 0
2024/06/12 12:37:51.302741 OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 1, route 10.254.254.7
2024/06/12 12:37:51.302743 OSPF6: [WVWH0-3KHW3] ospf6_abr_originate_summary_to_area: The route is in the area itself, ignore

# R5 expires its original LSA in area 1 because the path to R7 via area 0 and R6 is no longer the best:
2024/06/12 12:37:51.304010 OSPF6: [T0FMF-6KKCP] ospf6_abr_originate_summary: route 10.254.254.7
2024/06/12 12:37:51.304012 OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 0, route 10.254.254.7
2024/06/12 12:37:51.304014 OSPF6: [WVWH0-3KHW3] ospf6_abr_originate_summary_to_area: The route is in the area itself, ignore
2024/06/12 12:37:51.304015 OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 1, route 10.254.254.7
2024/06/12 12:37:51.304018 OSPF6: [JJ7N3-VVYV2] Originating summary in area 1 for ASBR 10.254.254.7
2024/06/12 12:37:51.304073 OSPF6: [XQWSV-5GJQ6] This is the secondary path to the ASBR, ignore
2024/06/12 12:37:51.304081 OSPF6: [WA749-NPMHH] LSA: Premature aging: [Inter-Router Id:10.254.254.7 Adv:10.254.254.5]
2024/06/12 12:37:51.304087 OSPF6: [SYY2T-BK418] LSA Expire:
2024/06/12 12:37:51.304090 OSPF6: [GNRNM-HX7W3]     [Inter-Router Id:10.254.254.7 Adv:10.254.254.5]
2024/06/12 12:37:51.304091 OSPF6: [HA4BY-46K3Y]     Age: 3600 SeqNum: 0x80000001 Cksum: a623 Len: 32

# At this time, everything would be as it should.
# But R6 now thinks its LSA in area 0 no longer describes the best path (even though it still does) and expires it. This causes the missing nexthop at R1:
2024/06/12 12:37:51.346949         OSPF6: [T0FMF-6KKCP] ospf6_abr_originate_summary: route 10.254.254.7
2024/06/12 12:37:51.346950         OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 0, route 10.254.254.7
2024/06/12 12:37:51.346953         OSPF6: [JJ7N3-VVYV2] Originating summary in area 0 for ASBR 10.254.254.7
2024/06/12 12:37:51.346954         OSPF6: [XQWSV-5GJQ6] This is the secondary path to the ASBR, ignore
2024/06/12 12:37:51.346963         OSPF6: [WA749-NPMHH] LSA: Premature aging: [Inter-Router Id:10.254.254.7 Adv:10.254.254.6]
2024/06/12 12:37:51.346978         OSPF6: [SYY2T-BK418] LSA Expire:
2024/06/12 12:37:51.346980         OSPF6: [GNRNM-HX7W3]     [Inter-Router Id:10.254.254.7 Adv:10.254.254.6]
2024/06/12 12:37:51.346982         OSPF6: [HA4BY-46K3Y]     Age: 3600 SeqNum: 0x80000001 Cksum: 0fe1 Len: 32
2024/06/12 12:37:51.347061         OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 1, route 10.254.254.7
2024/06/12 12:37:51.347063         OSPF6: [WVWH0-3KHW3] ospf6_abr_originate_summary_to_area: The route is in the area itself, ignore

# R6 originates its LSA again, just to expire it again immediately afterwards. This cycle repeats a few times:
2024/06/12 12:37:51.347608         OSPF6: [T0FMF-6KKCP] ospf6_abr_originate_summary: route 10.254.254.7
2024/06/12 12:37:51.347610         OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 0, route 10.254.254.7
2024/06/12 12:37:51.347612         OSPF6: [JJ7N3-VVYV2] Originating summary in area 0 for ASBR 10.254.254.7
2024/06/12 12:37:51.347634         OSPF6: [TQB4M-JQ72Z] LSA Originate:
2024/06/12 12:37:51.347636         OSPF6: [GNRNM-HX7W3]     [Inter-Router Id:10.254.254.7 Adv:10.254.254.6]
2024/06/12 12:37:51.347638         OSPF6: [HA4BY-46K3Y]     Age:    0 SeqNum: 0x80000002 Cksum: 0de2 Len: 32
2024/06/12 12:37:51.347742         OSPF6: [XKC5G-YW93A] ospf6_abr_originate_summary_to_area : finish area 0
2024/06/12 12:37:51.347743         OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 1, route 10.254.254.7
2024/06/12 12:37:51.347798         OSPF6: [WVWH0-3KHW3] ospf6_abr_originate_summary_to_area: The route is in the area itself, ignore

2024/06/12 12:37:51.347827         OSPF6: [T0FMF-6KKCP] ospf6_abr_originate_summary: route 10.254.254.7
2024/06/12 12:37:51.347829         OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 0, route 10.254.254.7
2024/06/12 12:37:51.347831         OSPF6: [JJ7N3-VVYV2] Originating summary in area 0 for ASBR 10.254.254.7
2024/06/12 12:37:51.347833         OSPF6: [XQWSV-5GJQ6] This is the secondary path to the ASBR, ignore
2024/06/12 12:37:51.347841         OSPF6: [WA749-NPMHH] LSA: Premature aging: [Inter-Router Id:10.254.254.7 Adv:10.254.254.6]
2024/06/12 12:37:51.347860         OSPF6: [SYY2T-BK418] LSA Expire:
2024/06/12 12:37:51.347863         OSPF6: [GNRNM-HX7W3]     [Inter-Router Id:10.254.254.7 Adv:10.254.254.6]
2024/06/12 12:37:51.347865         OSPF6: [HA4BY-46K3Y]     Age: 3600 SeqNum: 0x80000002 Cksum: 0de2 Len: 32
2024/06/12 12:37:51.347939         OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 1, route 10.254.254.7
2024/06/12 12:37:51.347940         OSPF6: [WVWH0-3KHW3] ospf6_abr_originate_summary_to_area: The route is in the area itself, ignore

2024/06/12 12:37:52.126621         OSPF6: [T0FMF-6KKCP] ospf6_abr_originate_summary: route 10.254.254.7
2024/06/12 12:37:52.126622         OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 0, route 10.254.254.7
2024/06/12 12:37:52.126624         OSPF6: [JJ7N3-VVYV2] Originating summary in area 0 for ASBR 10.254.254.7
2024/06/12 12:37:52.126647         OSPF6: [TQB4M-JQ72Z] LSA Originate:
2024/06/12 12:37:52.126649         OSPF6: [GNRNM-HX7W3]     [Inter-Router Id:10.254.254.7 Adv:10.254.254.6]
2024/06/12 12:37:52.126651         OSPF6: [HA4BY-46K3Y]     Age:    0 SeqNum: 0x80000003 Cksum: 0be3 Len: 32
2024/06/12 12:37:52.126811         OSPF6: [XKC5G-YW93A] ospf6_abr_originate_summary_to_area : finish area 0
2024/06/12 12:37:52.126812         OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 1, route 10.254.254.7
2024/06/12 12:37:52.126814         OSPF6: [WVWH0-3KHW3] ospf6_abr_originate_summary_to_area: The route is in the area itself, ignore

2024/06/12 12:37:52.126837         OSPF6: [T0FMF-6KKCP] ospf6_abr_originate_summary: route 10.254.254.7
2024/06/12 12:37:52.126838         OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 0, route 10.254.254.7
2024/06/12 12:37:52.126841         OSPF6: [JJ7N3-VVYV2] Originating summary in area 0 for ASBR 10.254.254.7
2024/06/12 12:37:52.126843         OSPF6: [XQWSV-5GJQ6] This is the secondary path to the ASBR, ignore
2024/06/12 12:37:52.126850         OSPF6: [WA749-NPMHH] LSA: Premature aging: [Inter-Router Id:10.254.254.7 Adv:10.254.254.6]
2024/06/12 12:37:52.126870         OSPF6: [SYY2T-BK418] LSA Expire:
2024/06/12 12:37:52.126873         OSPF6: [GNRNM-HX7W3]     [Inter-Router Id:10.254.254.7 Adv:10.254.254.6]
2024/06/12 12:37:52.126874         OSPF6: [HA4BY-46K3Y]     Age: 3600 SeqNum: 0x80000003 Cksum: 0be3 Len: 32
2024/06/12 12:37:52.126948         OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 1, route 10.254.254.7
2024/06/12 12:37:52.126950         OSPF6: [WVWH0-3KHW3] ospf6_abr_originate_summary_to_area: The route is in the area itself, ignore

2024/06/12 12:37:52.176949         OSPF6: [T0FMF-6KKCP] ospf6_abr_originate_summary: route 10.254.254.7
2024/06/12 12:37:52.176951         OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 0, route 10.254.254.7
2024/06/12 12:37:52.176953         OSPF6: [JJ7N3-VVYV2] Originating summary in area 0 for ASBR 10.254.254.7
2024/06/12 12:37:52.176977         OSPF6: [TQB4M-JQ72Z] LSA Originate:
2024/06/12 12:37:52.176979         OSPF6: [GNRNM-HX7W3]     [Inter-Router Id:10.254.254.7 Adv:10.254.254.6]
2024/06/12 12:37:52.176981         OSPF6: [HA4BY-46K3Y]     Age:    0 SeqNum: 0x80000004 Cksum: 09e4 Len: 32
2024/06/12 12:37:52.177146         OSPF6: [XKC5G-YW93A] ospf6_abr_originate_summary_to_area : finish area 0
2024/06/12 12:37:52.177147         OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 1, route 10.254.254.7
2024/06/12 12:37:52.177149         OSPF6: [WVWH0-3KHW3] ospf6_abr_originate_summary_to_area: The route is in the area itself, ignore

2024/06/12 12:37:52.177172         OSPF6: [T0FMF-6KKCP] ospf6_abr_originate_summary: route 10.254.254.7
2024/06/12 12:37:52.177173         OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 0, route 10.254.254.7
2024/06/12 12:37:52.177176         OSPF6: [JJ7N3-VVYV2] Originating summary in area 0 for ASBR 10.254.254.7
2024/06/12 12:37:52.177177         OSPF6: [XQWSV-5GJQ6] This is the secondary path to the ASBR, ignore
2024/06/12 12:37:52.177185         OSPF6: [WA749-NPMHH] LSA: Premature aging: [Inter-Router Id:10.254.254.7 Adv:10.254.254.6]
2024/06/12 12:37:52.177205         OSPF6: [SYY2T-BK418] LSA Expire:
2024/06/12 12:37:52.177207         OSPF6: [GNRNM-HX7W3]     [Inter-Router Id:10.254.254.7 Adv:10.254.254.6]
2024/06/12 12:37:52.177209         OSPF6: [HA4BY-46K3Y]     Age: 3600 SeqNum: 0x80000004 Cksum: 09e4 Len: 32
2024/06/12 12:37:52.177282         OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 1, route 10.254.254.7
2024/06/12 12:37:52.177284         OSPF6: [WVWH0-3KHW3] ospf6_abr_originate_summary_to_area: The route is in the area itself, ignore

2024/06/12 12:37:54.629330         OSPF6: [T0FMF-6KKCP] ospf6_abr_originate_summary: route 10.254.254.7
2024/06/12 12:37:54.629331         OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 0, route 10.254.254.7
2024/06/12 12:37:54.629333         OSPF6: [JJ7N3-VVYV2] Originating summary in area 0 for ASBR 10.254.254.7
2024/06/12 12:37:54.629356         OSPF6: [TQB4M-JQ72Z] LSA Originate:
2024/06/12 12:37:54.629359         OSPF6: [GNRNM-HX7W3]     [Inter-Router Id:10.254.254.7 Adv:10.254.254.6]
2024/06/12 12:37:54.629361         OSPF6: [HA4BY-46K3Y]     Age:    0 SeqNum: 0x80000005 Cksum: 07e5 Len: 32
2024/06/12 12:37:54.629522         OSPF6: [XKC5G-YW93A] ospf6_abr_originate_summary_to_area : finish area 0
2024/06/12 12:37:54.629523         OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 1, route 10.254.254.7
2024/06/12 12:37:54.629525         OSPF6: [WVWH0-3KHW3] ospf6_abr_originate_summary_to_area: The route is in the area itself, ignore

2024/06/12 12:37:54.629548         OSPF6: [T0FMF-6KKCP] ospf6_abr_originate_summary: route 10.254.254.7
2024/06/12 12:37:54.629550         OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 0, route 10.254.254.7
2024/06/12 12:37:54.629553         OSPF6: [JJ7N3-VVYV2] Originating summary in area 0 for ASBR 10.254.254.7
2024/06/12 12:37:54.629554         OSPF6: [XQWSV-5GJQ6] This is the secondary path to the ASBR, ignore
2024/06/12 12:37:54.629562         OSPF6: [WA749-NPMHH] LSA: Premature aging: [Inter-Router Id:10.254.254.7 Adv:10.254.254.6]
2024/06/12 12:37:54.629582         OSPF6: [SYY2T-BK418] LSA Expire:
2024/06/12 12:37:54.629585         OSPF6: [GNRNM-HX7W3]     [Inter-Router Id:10.254.254.7 Adv:10.254.254.6]
2024/06/12 12:37:54.629587         OSPF6: [HA4BY-46K3Y]     Age: 3600 SeqNum: 0x80000005 Cksum: 07e5 Len: 32
2024/06/12 12:37:54.629661         OSPF6: [R7CVJ-7KRKN] ospf6_abr_originate_summary_to_area : start area 1, route 10.254.254.7
2024/06/12 12:37:54.629663         OSPF6: [WVWH0-3KHW3] ospf6_abr_originate_summary_to_area: The route is in the area itself, ignore

Next step is to determine what triggers the false "This is the secondary path to the ASBR, ignore" at 12:37:51.346954.

gromit1811 commented 3 months ago

For reference, this is the topo diagram from the test case:

                  .
           Area 0 . Area 1
                  .
    -- R2 ------ R5 -----
   /              .\     \
  /               . |     \
R1 --- R3 ------ R6 ------ R7
  \            / |. |
   \          /  |. |
    -- R4 ----   |. |
                / ./
              R8 --
                  .
acooks-at-bda commented 2 months ago

Humble apologies, @gromit1811. I know why the test was consistently failing on my machine - it was operator error.

I had multiple versions of FRR installed due to poor config management. I changed the configure flags at some point and neglected to specify --sbindir=/usr/lib/frr, which resulted in binaries in /usr/sbin/ as well. Definitely a trap there for new players.

Now that I've cleaned up my environment and know what I'm running, I ran the ospf6 tests a number of times and noted that this test is mostly passing. I did get an unexpected failure on the third run of the full ospf6d set, and then decided to run this test a few more times. It passed 20 consecutive times - 1/23 failures.

gromit1811 commented 2 months ago

Thanks for the update & no problem - this just means that there isn't an easier way to reproduce this. And since I wasn't able to reproduce your failure rates, I went back to my original environment in the meantime anyway.

Update for everybody else watching: No progress, unfortunately :frowning_face: I can't dedicate any time to this ATM, so this is on hold until further notice.

gromit1811 commented 1 month ago

Back from holidays and other higher prio tasks :wink:

I can now answer my question from above:

Next step is to determine what triggers the false "This is the secondary path to the ASBR, ignore" at 12:37:51.346954.

If the problem occcurs, we seem to have 2 entries for R7 in the brouter_table on the ABR (R6 above, but I've also seen cases where it happened on R5). One is the best path, so the ABR orginates an Inter-Router LSA. But the 2nd one isn't the best, so it expires the LSA again, leaving us with no Inter-Router LSA for R7 on the ABR.

Now, should we have 2 entries in the brouter_table? If this is OK, ospf6_intra_brouter_calculation should call ospf6_abr_originate_summary only for the best one. If this is not OK, we should make sure that only the best route is in brouter_table.

I'll continue investigations, unless somebody can already answer the question above.

riw777 commented 4 weeks ago

Are you seeing two router LSA's (type 1's) for R7, or two externals (type 5's)? These are standard areas, there's no NSSA, etc., here, right? We shouldn't see two router LSA's for a single router in our database. Is there any way to grab these two when they exist to see if they have different serial/sequence numbers, etc.?

I wonder if we're receiving a new LSA and just waiting too long to kill the old one?

gromit1811 commented 4 weeks ago

One Router-LSA for R7 and IIRC one AS-External-LSA. Standard areas, yes. I've got pretty verbose logs from R5 and R6 when the issue occurs (and for reference also when it doesn't), so I could look up additional things. But: I'm one step further in the meantime and I think I know what's happening, I'm just not 100% sure yet which part of the code is wrong. Sorry for not updating the issue in time, I had the impression I'm mostly talking to myself anyway :wink:

My current understanding: The issue occurs while establishing neighborships. There's a time window of a few milliseconds where not all neighborships are fully established yet and if an ABR receives the wrong LSA exactly at that time, it gets into a weird situation that persists even after all neighborships are OK. If the LSA reception doesn't hit this time window, everything works as expected. That explains why we see this only rarely. And the "wrong LSA" is actually correct at the given time with partial neighborships, but it prevents a "correct LSA" from being originated once neighborships are fully established. "LSA" in this context means Inter-Router-LSAs (type 4), all other types are looking OK.

The gory details: Let's assume the problem occurs on ABR R5 (it could happen as well on ABR R6). In this case, on R1 we only see 2 ECMP paths to R7's loopback address (i.e AS-external routes) instead of the expected 3. That happens because R5 doesn't originate its Inter-Router LSA for R7 towards area 0. R6 does that correctly, so R1 sees only 2 paths: R1-R3-R6-R7 and R1-R4-R6-R7. R1-R2-R5-R7 is missing. Checking the logs a bit closer, we actually see R5 originating its Inter-Router-LSA from time to time, but it expires it immediately afterwards again. This is after neighborships are stable.

What triggers the problem is the following sequence of events:

  1. All neighborships established except for R5-R8, R6-R8 (area 0), R6-R7 (area 1)
  2. R5 sees R7 in area 1, metric 10 and originates its Inter-Router-LSA towards area 0
  3. R5's LSA gets flooded in area 0
  4. R6 now establishes neighborship with R7 and schedules SPF, but hasn't performed that yet
  5. R6 receives R5's LSA from R3 or R4
  6. Because R6 hasn't performed SPF for area 1 yet, it doesn't know that R7 is only one metric 10 hop away. So it originates an Inter-Router LSA for R7 towards area 1 based on R5's LSA received from area 0. It uses metric 50 for that because the path to R5 has metric 40 and R5-R7 is metric 10.
  7. R7 has just become a neighbor of R6, so it will forward R6's LSA to R5 via area 1
  8. Now R5 has 2 Inter-Router LSAs for R7. The one it originated itself and the one from R6. Obviously, its own is better (metric 10 instead of 50+20=70), but the one from R6 nevertheless causes it to expire its own one.
  9. R6 now performs SPF and sees R7 via one metric 10 hop, so it expires its LSA towards area 1 and originates a new one towards area 0.
  10. The previous and now expired LSA from R6 towards area 1 was originated only a few milliseconds ago (less than MinLSArrival), so the expiration gets handled and forwarded by R7 only 5s later after the next retransmission
  11. The LSA expiration from R6 finally arrives at R5, but that doesn't seem to cause it to check whether it should originate its own LSA again, leaving us with only the one from R6

IMO there are 2 problems: In step 8, the worse LSA from R6 shouldn't "shadow" the one from R5 and shouldn't cause R5 to expire its own one. And in step 11, R5 should re-examine its LSDB and border router table and decide to originate its own LSA again.

The first problem seems to be caused by R5 having 2 routes towards R7 (prefix 10.254.254.7) in its brouter table. Its own one with path type Intra-Area and and the one from R6 with path type Inter-Area. Because they have different types, they use different entries in the brouter table. Now when ospf6_intra_brouter_calculation on R5 iterates the brouter table, it first finds R5's own entry (marked "best"), originates an LSA and in the next iteration, finds the R6 one (not marked "best") and immediately expires its LSA again. I don't know which one of the following variants would be correct:

  1. The R6 brouter entry sholudn't be in the brouter table at all
  2. The R6 brouter entry should be there but just as an additional path for R5's entry
  3. The R6 brouter entry should be there as a separate entry as it is and ospf6_intra_brouter_calculation needs to handle this while iterating (i.e. originating the LSA if there is one best entry, ignoring all other non-best ones for the same prefix but with different type)

I don't know yet what's causing the 2nd problem in step 11. Strictly speaking, we have hooks notifying us about brouter table and LSDB updates, but for some reason, they don't seem to be sufficient.

OSPFv2 seems to handle at least the first problem correctly, so I could probably check how it's implemented there. But the implementation seems to be sufficiently different so that they're probably not directly comparable :frowning:

gromit1811 commented 2 days ago

New test case that reproduces this reliably in #16797

Topology:

                  .
           Area 0 . Area 1
                  .
    -- R2 ------ R5 -----
   /              .      \
  /               .       \
R1 --- R3 ------ R6 ------ R7
  \            /  .\
   \          /   . \
    -- R4 ----    .  ------R8
                  .

Link R6-R7 is down initially, so R1 sees a route to R8 only via R6 (with 2 nexthops, because there are 2 ECMP paths to R6). After link up, we expect additional nexthops to R7, but we don't expect nexthops to R8 to change, because even though it is now reachable via R5 as well, the best path is still only via R6.

What actually happens is that after link up, R1 routes to R8 only via R5 and ignores R6, reducing the number of nexthops to 1. This is because R6 expires its Inter-Router LSA towards area 0 after link up. It does that for the same reason that caused the intermittent failure of the original test case: Before link up, R5 creates an Inter-Router LSA for area 1 based on the one received from R6 via area 0. After link up, R6 receives this via area 1, creates a border_router table entry based on that and that shadows the brouter table entry R6 used initially to originate its Inter-Router LSA towards area 0. So R6 expires that LSA and thus R1 no longer can use this path. Note that after R6 has expired its LSA, R5 receives the LS update and expires its LSA for area 1 as well. But even that doesn't resolve the situation on R6.

BTW, we're only looking at AS-external routes here which every router originates (based on loopback addresses redistributed into OSPFv3).

gromit1811 commented 2 days ago

Logs from a test run with #16797 (note that they contain additional log messages, mostly marked with "XXX" - feel free to ignore): r5_ospf6d.log r6_ospf6d.log

Commented LSA events extracted from these logs:

2024/09/11 14:04:46.928182 OSPF6: [TQB4M-JQ72Z] LSA Originate:
2024/09/11 14:04:46.928193 OSPF6: [GNRNM-HX7W3]     [Inter-Router Id:10.254.254.8 Adv:10.254.254.6]
2024/09/11 14:04:46.928200 OSPF6: [HA4BY-46K3Y]     Age:    0 SeqNum: 0x80000001 Cksum: 13db Len: 32 (-> area 0)
# ^^^ r6_ospf6d.log

2024/09/11 14:04:46.940890 OSPF6: [M3K4C-7HG7V] LSA Receive from 10.254.254.2%r5-eth0
2024/09/11 14:04:46.940901 OSPF6: [GNRNM-HX7W3]     [Inter-Router Id:10.254.254.8 Adv:10.254.254.6]
2024/09/11 14:04:46.940907 OSPF6: [HA4BY-46K3Y]     Age:    4 SeqNum: 0x80000001 Cksum: 13db Len: 32
# ^^^ r5_ospf6d.log

2024/09/11 14:04:46.950329 OSPF6: [M3K4C-7HG7V] LSA Receive from 10.254.254.4%r6-eth1
2024/09/11 14:04:46.950339 OSPF6: [GNRNM-HX7W3]     [Inter-Router Id:10.254.254.8 Adv:10.254.254.6]
2024/09/11 14:04:46.950345 OSPF6: [HA4BY-46K3Y]     Age:    4 SeqNum: 0x80000001 Cksum: 13db Len: 32
# ^^^ r6_ospf6d.log

2024/09/11 14:04:50.352100 OSPF6: [TQB4M-JQ72Z] LSA Originate:
2024/09/11 14:04:50.352104 OSPF6: [GNRNM-HX7W3]     [Inter-Router Id:10.254.254.8 Adv:10.254.254.5]
2024/09/11 14:04:50.352107 OSPF6: [HA4BY-46K3Y]     Age:    0 SeqNum: 0x80000001 Cksum: aa1d Len: 32 (-> area 1)
# ^^^ r5_ospf6d.log

# Link R6-R7 up:
2024/09/11 14:04:53.611915 OSPF6: [TB8WK-36NCK] Interface Event r6-eth2: [InterfaceUp]

2024/09/11 14:05:07.485588 OSPF6: [M3K4C-7HG7V] LSA Receive from 10.254.254.7%r6-eth2
2024/09/11 14:05:07.485594 OSPF6: [GNRNM-HX7W3]     [Inter-Router Id:10.254.254.8 Adv:10.254.254.5]
2024/09/11 14:05:07.485598 OSPF6: [HA4BY-46K3Y]     Age:   20 SeqNum: 0x80000001 Cksum: aa1d Len: 32
# ^^^ r6_ospf6d.log
# R6 receives R5's area 1 LSA based on its own LSA originated at 14:04:46.928182

2024/09/11 14:05:12.485356 OSPF6: [SYY2T-BK418] LSA Expire:
2024/09/11 14:05:12.485366 OSPF6: [GNRNM-HX7W3]     [Inter-Router Id:10.254.254.8 Adv:10.254.254.6]
2024/09/11 14:05:12.485372 OSPF6: [HA4BY-46K3Y]     Age: 3600 SeqNum: 0x80000001 Cksum: 13db Len: 32 (-> area 0)
# ^^^ r6_ospf6d.log
# ...and that causes it to erroneously expire its own LSA

2024/09/11 14:05:12.489660 OSPF6: [TQB4M-JQ72Z] LSA Originate:
2024/09/11 14:05:12.489669 OSPF6: [GNRNM-HX7W3]     [Inter-Router Id:10.254.254.8 Adv:10.254.254.6]
2024/09/11 14:05:12.489676 OSPF6: [HA4BY-46K3Y]     Age:    0 SeqNum: 0x80000002 Cksum: 11dc Len: 32 (-> area 0)
# ^^^ r6_ospf6d.log
# ...just to originate it again 4ms later

2024/09/11 14:05:12.490348 OSPF6: [SYY2T-BK418] LSA Expire:
2024/09/11 14:05:12.490358 OSPF6: [GNRNM-HX7W3]     [Inter-Router Id:10.254.254.8 Adv:10.254.254.6]
2024/09/11 14:05:12.490363 OSPF6: [HA4BY-46K3Y]     Age: 3600 SeqNum: 0x80000002 Cksum: 11dc Len: 32 (-> area 0)
# ^^^ r6_ospf6d.log
# ...and then to expire it one more time immediately afterwards

2024/09/11 14:05:12.509459 OSPF6: [M3K4C-7HG7V] LSA Receive from 10.254.254.2%r5-eth0
2024/09/11 14:05:12.509464 OSPF6: [GNRNM-HX7W3]     [Inter-Router Id:10.254.254.8 Adv:10.254.254.6]
2024/09/11 14:05:12.509467 OSPF6: [HA4BY-46K3Y]     Age: 3600 SeqNum: 0x80000002 Cksum: 11dc Len: 32
# ^^^ r5_ospf6d.log

2024/09/11 14:05:12.509703 OSPF6: [SYY2T-BK418] LSA Expire:
2024/09/11 14:05:12.509830 OSPF6: [GNRNM-HX7W3]     [Inter-Router Id:10.254.254.8 Adv:10.254.254.5]
2024/09/11 14:05:12.509834 OSPF6: [HA4BY-46K3Y]     Age: 3600 SeqNum: 0x80000001 Cksum: aa1d Len: 32 (-> area 1)
# ^^^ r5_ospf6d.log
# R5 expires its LSA based on the (just expired) LSA from R6

2024/09/11 14:05:12.522146 OSPF6: [M3K4C-7HG7V] LSA Receive from 10.254.254.4%r6-eth1
2024/09/11 14:05:12.522149 OSPF6: [GNRNM-HX7W3]     [Inter-Router Id:10.254.254.8 Adv:10.254.254.6]
2024/09/11 14:05:12.522151 OSPF6: [HA4BY-46K3Y]     Age: 3600 SeqNum: 0x80000002 Cksum: 11dc Len: 32
# ^^^ r6_ospf6d.log

2024/09/11 14:05:12.522193 OSPF6: [M3K4C-7HG7V] LSA Receive from 10.254.254.7%r6-eth2
2024/09/11 14:05:12.522196 OSPF6: [GNRNM-HX7W3]     [Inter-Router Id:10.254.254.8 Adv:10.254.254.5]
2024/09/11 14:05:12.522198 OSPF6: [HA4BY-46K3Y]     Age: 3600 SeqNum: 0x80000001 Cksum: aa1d Len: 32
# ^^^ r6_ospf6d.log

2024/09/11 14:05:12.529361 OSPF6: [TQB4M-JQ72Z] LSA Originate:
2024/09/11 14:05:12.529364 OSPF6: [GNRNM-HX7W3]     [Inter-Router Id:10.254.254.8 Adv:10.254.254.5]
2024/09/11 14:05:12.529365 OSPF6: [HA4BY-46K3Y]     Age:    0 SeqNum: 0x80000001 Cksum: e1f9 Len: 32 (-> area 0)
# ^^^ r5_ospf6d.log
# ...and then creates a proper one for area 0 based on the path to R8 via area 1

2024/09/11 14:05:12.533631 OSPF6: [M3K4C-7HG7V] LSA Receive from 10.254.254.3%r6-eth0
2024/09/11 14:05:12.533633 OSPF6: [GNRNM-HX7W3]     [Inter-Router Id:10.254.254.8 Adv:10.254.254.5]
2024/09/11 14:05:12.533634 OSPF6: [HA4BY-46K3Y]     Age:    4 SeqNum: 0x80000001 Cksum: e1f9 Len: 32
# ^^^ r6_ospf6d.log

2024/09/11 14:05:12.533781 OSPF6: [M3K4C-7HG7V] LSA Receive from 10.254.254.4%r6-eth1
2024/09/11 14:05:12.533783 OSPF6: [GNRNM-HX7W3]     [Inter-Router Id:10.254.254.8 Adv:10.254.254.5]
2024/09/11 14:05:12.533784 OSPF6: [HA4BY-46K3Y]     Age:    4 SeqNum: 0x80000001 Cksum: e1f9 Len: 32
# ^^^ r6_ospf6d.log

The expire/originate/expire cycle is typical for this case: Expire first on LSA reception, then iterate the brouter table and find the intra-area path first (-> originate), followed by the inter-area path from the received LSA (-> expire).

gromit1811 commented 1 day ago

I now also analyzed the 2nd problem mentioned in https://github.com/FRRouting/frr/issues/16197#issuecomment-2293383832 (failure to re-originate the Inter-Router LSA on R6 towards area 0 after the temporary Inter-Router LSA from R5 towards area 1 has been expired): The expired LSA seems to get handled properly and triggers removal of the corresponding brouter table entry on R6. All this processing is performed based on the route derived from the expired LSA in ospf6_abr_examin_summary(). But after removal, nobody checks whether there might be another brouter table route for the same ASBR and area (with different path type) left which should now be used to originate an Inter-Router LSA towards area 0 instead. If I hack to code to call ospf6_intra_brouter_calculation() after handling the expired LSA, R6 will originate a proper Inter-Router LSA towards area 0, fixing the 2nd problem. Or rather, I don't consider this to be a real fix because a full ospf6_intra_brouter_calculation() is overkill if only a single brouter changed...