OLSR / olsrd

OLSR.org main repository - olsrd v1 - maintained by Freifunk Berlin

AREDN network storms: What they look like, how they happen, and how to prevent them #106

Open aanon4 opened 3 years ago

aanon4 commented 3 years ago

Network storms are the result of the OLSR daemon restarting on nodes. A restart randomly resets the sequence number used for new messages. Under certain circumstances, messages carrying the new sequence numbers can interact with old messages from the same source that are still in the network and create a message storm. The combination of old and new messages confuses the deduplication code so that messages always appear new, regardless of how many times a node has received them, and are always forwarded again to its neighbors. This continues until the time-to-live of every copy of the messages has expired on all nodes.

Full explanation and proposed solutions can be found here: https://docs.google.com/document/d/1OgURb2O36lWF518dKydJLBEEUTYt7khRTVzX8QChwqw/edit?usp=sharing

storchi commented 3 years ago

great work. thanks

HRogge commented 3 years ago

Maybe you want to take a look at the deduplication code in OLSRv2 too... I think it has a few more heuristics than the old one. https://github.com/OLSR/OONF/blob/master/src/base/oonf_duplicate_set.c

bittorf commented 3 years ago

wow, that reminds me of old headaches - thanks a lot for finding the underlying issue!

HRogge commented 3 years ago

And it's not only the sequence number, it can happen with the ANSN (Advertised Neighbor Sequence Number) too...

aanon4 commented 3 years ago

> Maybe you want to take a look at the deduplication code in OLSRv2 too... I think it has a few more heuristics than the old one. https://github.com/OLSR/OONF/blob/master/src/base/oonf_duplicate_set.c

I had a quick look at the code and the IETF OLSRv2 draft, and they both still seem to use 16-bit sequence numbers with similar wraparound comparison logic. Is there something specific?
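For readers unfamiliar with the wraparound comparison mentioned here, the rule from RFC 3626 (section 19) can be sketched as follows. This is only an illustration of the comparison rule, not code taken from olsrd or OONF; the function name `seqno_greater` is made up for this example.

```python
MAXVALUE = 0xFFFF  # largest value of a 16-bit sequence number


def seqno_greater(s1: int, s2: int) -> bool:
    """Return True if s1 counts as "greater than" (newer than) s2.

    Follows the rule in RFC 3626 section 19: a number is newer if it is
    ahead of the other by at most half of the number space, taking
    wraparound into account.
    """
    half = MAXVALUE // 2  # 32767; equivalent to MAXVALUE/2 for integers
    return (s1 > s2 and s1 - s2 <= half) or (s2 > s1 and s2 - s1 > half)


# Normal operation: the counter wrapped from 65535 to 1, so 1 is newer.
print(seqno_greater(1, 65535))

# After a restart with a random new number, the outcome depends on where
# the new number lands relative to the last one seen (here 40000).
print(seqno_greater(100, 40000))    # lands in the "newer" half
print(seqno_greater(20000, 40000))  # lands in the "older" half
```

Because a random restart value is equally likely to land in either half of the number space, roughly half of all restarts produce messages that look old to nodes that still remember the previous sequence number.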

HRogge commented 3 years ago

The sequence number handling code of (my) OLSRv2 implementation handles a jump in the sequence number by counting continuous sequences of "very old" numbers without any new ones... and triggers a "reset" in the duplication code after a while. Unfortunately this doesn't help with OLSRv1 and OLSRv2 ANSN, because they don't necessarily increase all the time (but in practice they do).
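The heuristic described in this comment could look roughly like the sketch below. It is an assumption-laden illustration, not the OONF implementation: the names `OriginatorState`, `VERY_OLD` and `RESET_AFTER`, and the threshold values, are all invented for the example.

```python
from dataclasses import dataclass

MAXVALUE = 0xFFFF
HALF = MAXVALUE // 2
VERY_OLD = 32     # hypothetical: how far behind counts as "very old"
RESET_AFTER = 8   # hypothetical: very old messages in a row before a reset


def seqno_diff(new: int, last: int) -> int:
    """Signed wraparound distance from last to new, in [-32768, 32767]."""
    d = (new - last) & MAXVALUE
    return d - (MAXVALUE + 1) if d > HALF else d


@dataclass
class OriginatorState:
    """Per-originator duplicate state (simplified to a single number)."""
    last_seqno: int
    stale_run: int = 0  # consecutive "very old" messages seen

    def accept(self, seqno: int) -> bool:
        d = seqno_diff(seqno, self.last_seqno)
        if d > 0:
            # Newer message: accept it and end any run of very old ones.
            self.last_seqno = seqno
            self.stale_run = 0
            return True
        if -d <= VERY_OLD:
            # Duplicate or slightly late message: drop it, counter unchanged.
            return False
        # Very old message: count it, and after a long enough unbroken run
        # assume the originator restarted and resynchronise to its numbers.
        self.stale_run += 1
        if self.stale_run >= RESET_AFTER:
            self.last_seqno = seqno
            self.stale_run = 0
            return True
        return False


# An originator last seen at 40000 restarts and begins counting at 20000.
state = OriginatorState(last_seqno=40000)
results = [state.accept(s) for s in range(20000, 20008)]
print(results)
```

With these made-up thresholds, the first seven post-restart messages are dropped and the eighth triggers the reset, after which the new numbering is followed normally. As the comment notes, this only works for counters that keep increasing, which is not guaranteed for the ANSN.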

mathisono commented 3 years ago

Henning, is "The sequence number handling code of (my) OLSRv2" in the current release code? Would it be safe to say that if AREDN stopped using the old OLSRv1 and used OLSRv2, we would stop having paralyzing storms on our networks?

HRogge commented 3 years ago

Yes, it's in the current released code...

We (Fraunhofer FKIE) have issues with restarting OLSRv2 nodes, but I have never seen these "storms" you were talking about. What I have seen is that routers ignore changes in the attached network that happen together with a router restart (we use OLSRv2 with a dynamic attached network source)...

This "ignore attached network change" is because of the ANSN issue, which is similar to the SEQNO issue but more difficult to solve because ANSN can remain constant in theory (especially in emulated networks).

The easiest way might be to store the ANSN as well as the SEQNO numbers (to make router restarts generally faster), but I have yet to find the time to write code that stores the message sequence numbers, the packet sequence numbers (per interface!) and the ANSN and tries to reload them on a router restart.
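A minimal sketch of the store-and-reload idea, assuming the counters are kept in a small JSON file. Nothing like this exists in the released code according to this comment; the file layout, the counter names, the `SAFETY_MARGIN` value and both function names are hypothetical.

```python
import json
import os
import random
import tempfile
from pathlib import Path

MAXVALUE = 0xFFFF
SAFETY_MARGIN = 64  # hypothetical: skip past numbers used since the last save


def save_counters(path: Path, counters: dict) -> None:
    """Write the counters to a temporary file, then rename it into place."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(counters))
    os.replace(tmp, path)  # rename, so a crash cannot leave a half-written file


def load_counters(path: Path, names: list) -> dict:
    """Reload counters after a restart, advanced by a safety margin.

    Counters missing from the file (or all of them, if the file is absent
    or unreadable) fall back to a random start value.
    """
    try:
        stored = json.loads(path.read_text())
    except (FileNotFoundError, ValueError):
        stored = {}
    result = {}
    for name in names:
        if name in stored:
            result[name] = (int(stored[name]) + SAFETY_MARGIN) & MAXVALUE
        else:
            result[name] = random.randrange(MAXVALUE + 1)
    return result


# Example: one message sequence number close to wrapping, plus the ANSN.
# A real daemon would also need one packet sequence number per interface.
with tempfile.TemporaryDirectory() as tmpdir:
    state_file = Path(tmpdir) / "seqno_state.json"
    save_counters(state_file, {"msg_seqno": 65530, "ansn": 10})
    restored = load_counters(state_file, ["msg_seqno", "ansn"])
    print(restored)
```

The margin is there because the daemon may have sent messages after the last save; how large it needs to be depends on how often the state is written, which this sketch does not address.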

aanon4 commented 3 years ago

I've submitted a downstream pull request for AREDN only (https://github.com/aredn/aredn_packages/pull/5)

PolynomialDivision commented 2 years ago

@aanon4 Could you submit a PR based on the patch, and include everything you have written in the GitHub message in the commit message as well?

mathiashro commented 4 weeks ago

Hi @PolynomialDivision, should we take over some patches here?