Network-wide OLSR error on the Berlin Freifunk Network

pmelange commented 5 years ago

This is a copy of issue https://github.com/freifunk-berlin/firmware/issues/628

On 22.11.2018 all of the routers on the entire Berlin Backbone started printing the following error messages, repeating every second.

Thu Nov 22 13:53:08 2018 daemon.info olsrd[7015]: Received netlink error code Invalid argument (-22)
Thu Nov 22 13:53:08 2018 daemon.err olsrd[7015]: . error: del route to 171.159.48.121/254.0.0.0 via 0.0.0.0 dev void onlink (Resource temporarily unavailable 11)
Thu Nov 22 13:53:08 2018 daemon.err olsrd[7015]: Delete route 171.159.48.121/7 via 0.0.0.0: Resource temporarily unavailable

The only known ways to stop the error messages were to restart the OLSR4 service or to reboot the router.

This also affected every router attached to the BBB-VPN.

Every version of the OLSR daemon was hit by this problem, from 0.6.x to the latest 0.9.6.2.

The cause of this message is unknown.

HRogge commented 5 years ago

"via 0.0.0.0" looks strange...

Is there a node that pretends to be 0.0.0.0?

Henning


pmelange commented 5 years ago

Yes, very strange. No, I don't think that any node on the mesh network would announce 0.0.0.0. Also the 171.159.48.121 address (with a huge netmask) is strange. All the nodes on our mesh network are in the 10.0.0.0/8 address space. And no nodes should state that they have a /8 netmask either.

Take a look at the email thread on the freifunk-berlin mailing list. https://lists.berlin.freifunk.net/pipermail/berlin/2018-November/038406.html
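The oddities described above can be checked mechanically. The following is a minimal Python sketch using only the values printed in the log lines; it is not olsrd code. It shows that the netmask 254.0.0.0 corresponds to a /7 prefix, that 171.159.48.121 has host bits set beyond that prefix, and that the enclosing network lies outside 10.0.0.0/8. Whether the host bits are what made the kernel answer with "Invalid argument" is an assumption and was not confirmed in this thread.

```python
import ipaddress

# Values copied from the log lines in this thread.
addr = "171.159.48.121"
mask = "254.0.0.0"

# A netmask of 254.0.0.0 has seven leading one-bits, i.e. a /7 prefix.
prefix_len = ipaddress.ip_network(f"0.0.0.0/{mask}").prefixlen

# strict=False masks off the host bits and yields the enclosing network.
enclosing = ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)

# strict=True refuses a destination that has bits set beyond the prefix.
try:
    ipaddress.ip_network(f"{addr}/{prefix_len}", strict=True)
    has_host_bits = False
except ValueError:
    has_host_bits = True

# The mesh is expected to live inside 10.0.0.0/8 (as stated above).
mesh = ipaddress.ip_network("10.0.0.0/8")
inside_mesh = enclosing.subnet_of(mesh)

print(prefix_len, enclosing, has_host_bits, inside_mesh)
# prints: 7 170.0.0.0/7 True False
```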

HRogge commented 5 years ago

It sounds like one of the nodes introduces wrong/bad data into your network... olsrd does not make any consistency checks.

pmelange commented 5 years ago

I suppose it could be user error on some node. But the result is the same. Unfortunately I didn't check the routing tables to see if there is also a "via 0.0.0.0" entry.

If it is not possible for OLSR to delete a route, shouldn't OLSR handle it differently than repeatedly retrying to delete it?
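As an illustration of what "handling it differently" could mean, here is a rough Python sketch of a bounded retry policy. It is not how olsrd works today: `delete_route` is a hypothetical callable standing in for the real netlink request, and the choice of which error codes count as permanent is an assumption made for this example.

```python
import errno

# Assumption: these errors will not go away by retrying the same request.
PERMANENT = {errno.EINVAL, errno.ESRCH, errno.ENOENT}


def delete_with_limit(delete_route, route, max_attempts=3):
    """Try to delete a route, giving up instead of looping forever.

    delete_route is a hypothetical callable that returns 0 on success
    or a positive errno value on failure. It is not an olsrd function.
    """
    for _ in range(max_attempts):
        err = delete_route(route)
        if err == 0:
            return "deleted"
        if err in PERMANENT:
            # Retrying cannot help: forget the route and log once.
            return "dropped"
    # Transient errors (for example EAGAIN) that never cleared up.
    return "gave-up"


# Usage with stand-in callables instead of a real kernel interface.
always_invalid = lambda route: errno.EINVAL
always_busy = lambda route: errno.EAGAIN
print(delete_with_limit(always_invalid, "171.159.48.121/7"))
print(delete_with_limit(always_busy, "171.159.48.121/7"))
```

With a policy like this, a route the kernel refuses to delete would produce a bounded number of log lines rather than one per second.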

HRogge commented 5 years ago

Olsrd could not have set the route in the first place because it would throw an error.

Maybe it's just a case of memory corruption, e.g. caused by a plugin. We had something similar with a special version of the mdns plugin years ago.

I don't know.


pmelange commented 5 years ago

I guess it was some kind of mass hysteria memory corruption. I don't know how to repeat it.

Well, if it happens again (it might take a few years) shall I open another ticket?

HRogge commented 5 years ago

On Thu, Dec 6, 2018 at 11:36 PM pmelange wrote:

> I guess it was some kind of mass hysteria memory corruption. I don't know how to repeat it.

Maybe you could ask if someone has installed a new/experimental plugin for their Olsrd... sometimes it is enough that ONE node in the mesh has installed something new to kill the whole mesh.

I have a small "consistency check" plugin for olsrd2, which could be expanded to filter for "bad addresses/prefixes".

> Well, if it happens again (it might take a few years) shall I open another ticket?

If it happens again and you still remember this thread, please reopen this one.

Please also notice that olsrd(1) is without a maintainer...
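To make the idea of filtering "bad addresses/prefixes" concrete, here is a small Python sketch of such a check. It is not the olsrd2 plugin mentioned above and does not use any olsrd API. The function name `hna_is_plausible` and the minimum prefix length are assumptions for illustration; only the 10.0.0.0/8 range comes from this thread. A real deployment would probably need exceptions, for example for default-route announcements.

```python
import ipaddress

MESH = ipaddress.ip_network("10.0.0.0/8")  # address space named in this thread
MIN_PREFIX_LEN = 9  # assumption: nothing as wide as a /8 should be announced


def hna_is_plausible(addr: str, netmask: str) -> bool:
    """Return True if an announced prefix looks sane for this mesh."""
    try:
        # strict=True rejects destinations with host bits set,
        # such as 171.159.48.121 with netmask 254.0.0.0.
        net = ipaddress.ip_network(f"{addr}/{netmask}", strict=True)
    except ValueError:
        return False
    return net.prefixlen >= MIN_PREFIX_LEN and net.subnet_of(MESH)


# The announcement from the log lines is rejected.
print(hna_is_plausible("171.159.48.121", "254.0.0.0"))
# A hypothetical /24 inside the mesh range is accepted.
print(hna_is_plausible("10.36.0.0", "255.255.255.0"))
```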

pmelange commented 5 years ago

After one week, there was no answer on the freifunk-berlin mailing list. Closing

pktpls commented 2 years ago

This happened again today with a /4 HNA (the errors above were for a /7). When that HNA was withdrawn or expired, its removal from the kernel routing table started looping with the errors mentioned above.

@PolynomialDivision could you reopen?

mathiashro commented 2 weeks ago

Hi @pktpls / @pmelange, is there anything we can do here (as the case has been open for quite some time)?

pmelange commented 2 weeks ago

I haven't seen this happen again and I don't know how to reproduce it.

mathiashro commented 2 weeks ago

Thank you. I'll close it for the moment; we can reopen it once someone catches the error again.

Hope this is fine for you as well.

OLSR / olsrd

Network-wide OLSR error on the Berlin Freifunk Network #66