ffnord / mesh-announce

Discussion at #mesh-announce:irc.hackint.org and (separately) at
https://matrix.to/#/!MjLIHcALOcENXZWQlH:irc.hackint.org/$1547640760901FmKaD:matrix.eclabs.de

Load-Peaks and still not multidomain-usable #57

Open tackin opened 4 years ago

tackin commented 4 years ago

[Screenshot: load peaks] The gateways Erai and Rustig are using our fork (https://github.com/freifunktrier/mesh-announce) of this repo. I have had three problems:

  1. Load peaks
  2. Constantly changing, wrong node data in YANIC
  3. Warnings like:

Mar 20 17:13:49 pegol yanic[9430]: time="2020-03-20T17:13:49.179+01:00" level="warn" msg="override nodeID from 2661965025dc to 266196502501 on MAC address 26:61:96:60:25:05" caller="nodes.go:207 github.com/FreifunkBremen/yanic/runtime.(*Nodes).readIfaces"
Mar 20 17:13:49 pegol yanic[9430]: time="2020-03-20T17:13:49.208+01:00" level="warn" msg="override nodeID from 266196502504 to 266196502505 on MAC address 26:61:96:60:25:04" caller="nodes.go:207 github.com/FreifunkBremen/yanic/runtime.(*Nodes).readIfaces"
Mar 20 17:13:49 pegol yanic[9430]: time="2020-03-20T17:13:49.209+01:00" level="warn" msg="override nodeID from 2661965025dc to 266196502501 on MAC address 26:61:96:60:25:05" caller="nodes.go:207 github.com/FreifunkBremen/yanic/runtime.(*Nodes).readIfaces"
Mar 20 17:13:49 pegol yanic[9430]: time="2020-03-20T17:13:49.211+01:00" level="warn" msg="override nodeID from 266196501003 to 2661965010dc on MAC address 26:61:96:60:10:dc" caller="nodes.go:207 github.com/FreifunkBremen/yanic/runtime.(*Nodes).readIfaces"
Mar 20 17:13:49 pegol yanic[9430]: time="2020-03-20T17:13:49.216+01:00" level="warn" msg="override nodeID from 266196502504 to 266196502505 on MAC address 26:61:96:60:25:04" caller="nodes.go:207 github.com/FreifunkBremen/yanic/runtime.(*Nodes).readIfaces"
Mar 20 17:13:49 pegol yanic[9430]: time="2020-03-20T17:13:49.217+01:00" level="warn" msg="override nodeID from 2661965025dc to 266196502501 on MAC address 26:61:96:60:25:05" caller="nodes.go:207 github.com/FreifunkBremen/yanic/runtime.(*Nodes).readIfaces"
Mar 20 17:13:49 pegol yanic[9430]: time="2020-03-20T17:13:49.218+01:00" level="warn" msg="override nodeID from 266196501003 to 2661965010dc on MAC address 26:61:96:60:10:dc" caller="nodes.go:207 github.com/FreifunkBremen/yanic/runtime.(*Nodes).readIfaces"
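For anyone seeing the same warnings: which (MAC, nodeID) pairs are flapping, and how often, can be tallied straight from the journal output. A minimal sketch, assuming log lines shaped like the ones quoted above (the log file path and the regex are my assumptions, not part of yanic):

```python
# count_flaps.py - tally which (MAC, nodeID) overrides flap in a yanic journal dump.
# Assumption: lines look like the warnings quoted above; adjust the regex if not.
import re
import sys
from collections import Counter

PATTERN = re.compile(
    r'override nodeID from (?P<old>[0-9a-f]+) to (?P<new>[0-9a-f]+) '
    r'on MAC address (?P<mac>[0-9a-f:]+)'
)

def main(path):
    flaps = Counter()
    with open(path, encoding='utf-8', errors='replace') as fh:
        for line in fh:
            m = PATTERN.search(line)
            if m:
                flaps[(m['mac'], m['old'], m['new'])] += 1
    for (mac, old, new), count in flaps.most_common(20):
        print(f'{count:6d}x  {mac}: {old} -> {new}')

if __name__ == '__main__':
    # e.g. journalctl -u yanic > yanic.log && python3 count_flaps.py yanic.log
    main(sys.argv[1])
```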

I switched to my older mesh-announce fork from ffda (multicast on ff02:....) and my problems are gone.

AiyionPrime commented 4 years ago

Your numbers two and three should be resolved by the merge of #58. Can you confirm that, @tackin? About the load peaks I cannot say anything yet.

tackin commented 4 years ago

I need to install/test it again to check 2. and 3. and to see whether 1. is fixed. No. 3 is a YANIC thing (may be solved there). For no. 2 it is not clear to me whether it is a YANIC or a mesh-announce bug.

tackin commented 4 years ago

@AiyionPrime Tested: No. 3 seems to be solved. No. 2 is not solved.

tackin commented 4 years ago

@AiyionPrime Your PR #58 solves problem no. 2 for me.

AiyionPrime commented 4 years ago

The load peaks appear in Hannover as well, but seem to correlate with fastd's CPU usage (likely the context switches) and not with mesh-announce. How did you reproduce the finding that mesh-announce is the culprit?
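To check the suspected context-switch correlation with numbers, one can sample fastd's context-switch counters per interval and compare them with the load curve (a small load logger is sketched a few comments below). A rough sketch; the process name "fastd" comes from this comment, and using pidof and a 15-second interval are my assumptions:

```python
# fastd_ctxt.py - print how many context switches fastd performs per interval,
# to check the suspected correlation with the load peaks.
# Assumption: a single fastd instance; adjust the PID lookup if several run.
import subprocess
import time

INTERVAL = 15  # seconds; arbitrary

def fastd_pid():
    # take the first PID that pidof reports
    return subprocess.check_output(['pidof', 'fastd']).split()[0].decode()

def ctxt_switches(pid):
    counts = {}
    with open(f'/proc/{pid}/status') as fh:
        for line in fh:
            if 'ctxt_switches' in line:
                key, value = line.split(':')
                counts[key.strip()] = int(value)
    return counts

if __name__ == '__main__':
    pid = fastd_pid()
    previous = ctxt_switches(pid)
    while True:
        time.sleep(INTERVAL)
        current = ctxt_switches(pid)
        deltas = {key: current[key] - previous[key] for key in current}
        print(int(time.time()), deltas, flush=True)
        previous = current
```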

tackin commented 4 years ago

By simply disabling the service and seeing what happens.

See the picture above. The load peaks stopped when I stopped the service on rustig and erai.
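For anyone wanting to redo this comparison with numbers rather than by eyeballing a graph, a minimal sketch that logs epoch-stamped 1-minute load averages while mesh-announce is stopped and started by hand (the 15-second interval is arbitrary and not from this thread):

```python
# load_log.py - write "epoch_seconds load1" lines to stdout, so the moment
# mesh-announce is stopped/started by hand can be matched against the curve.
import time

INTERVAL = 15  # seconds; arbitrary

def load1():
    with open('/proc/loadavg') as fh:
        return float(fh.read().split()[0])

if __name__ == '__main__':
    while True:
        print(f'{int(time.time())} {load1():.2f}', flush=True)
        time.sleep(INTERVAL)
```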

AiyionPrime commented 4 years ago

Thanks, I will try to reproduce it tonight.

AiyionPrime commented 4 years ago

First things first, Hannover has the same issue on all four supernodes. The peaks are always about one hour and 45 minutes apart (averaged over the last day).

One thing to note: they don't peak or start to spike at the same time. We watched the load that day and could not find anything but fastd and occasionally mesh-announce in the top 10 of htop.

At 20:30 we stopped the mesh-announce service; the resulting graph is this one.

As you can see, this drastically reduces the load but doesn't prevent the spike altogether. It appears mesh-announce is responsible for part of the load, but not for the triggering event itself. I can therefore confirm the bug; a workaround that reduces the load might be to use the multi-domain feature in a single instance.

Trier was likely hit harder by mesh-announce, as they had more instances running. I'll try that tomorrow; for now our supernodes are being tested intensively in order to exclude causes like our monitoring, our Zabbix, or whatever else.

Looking over to Trier, the load appears to peak at the same frequency: https://draco.freifunk-trier.starletp9.de:3000/d/Gb1_MoJik/freifunk-trier-uberblick?orgId=1

Quite possibly I'm missing the forest for the trees, but I can't figure out what triggers every 105 minutes, independent of when a system booted.
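To put a number on the ~105-minute spacing, the epoch-stamped samples from the load logger sketched earlier can be scanned for peaks. A rough sketch; the threshold and the 10-minute de-bounce are placeholders of mine, not measured values:

```python
# peak_spacing.py - rough spacing between load peaks in a file of
# "epoch_seconds load1" lines (the format written by the load logger above).
# A "peak" here is just a sample above THRESHOLD; crude, but enough to check
# whether the spacing really sits around 105 minutes.
import sys

THRESHOLD = 2.0   # load1 value counted as a peak; site-specific
DEBOUNCE = 600    # seconds; samples closer than this belong to the same peak

if __name__ == '__main__':
    peaks = []
    with open(sys.argv[1]) as fh:
        for line in fh:
            stamp, load = line.split()
            if float(load) >= THRESHOLD and (not peaks or int(stamp) - peaks[-1] > DEBOUNCE):
                peaks.append(int(stamp))
    for earlier, later in zip(peaks, peaks[1:]):
        print(f'{(later - earlier) / 60:.1f} minutes between peaks')
```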

AiyionPrime commented 4 years ago

@moridius just stopped fastd on a supernode; this drastically reduces the spike as well, but not completely if mesh-announce is left running. @tackin, have you already taken dumps of the traffic for two or three period lengths?

tackin commented 4 years ago

@tackin, have you already taken dumps of the traffic for two or three period lengths?

No, sorry, I have no idea where or what to look for in a dump. For us, stopping fastd would also drop all tunnels and traffic; that would not make sense for testing, I guess.
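For reference, if dumps do get taken later: mesh-announce serves respondd queries, so a capture can be narrowed to that exchange and then checked for request bursts around the peaks. A sketch using scapy; the UDP port 1001 and the interface name bat0 are assumptions of mine and need to be checked against the local mesh-announce configuration:

```python
# capture_respondd.py - count respondd-style UDP packets per second so load
# peaks can be correlated with request bursts. Needs root; scapy must be installed.
# Assumptions: mesh-announce answers on UDP port 1001 (check your config),
# and "bat0" is the interface the queries arrive on.
from collections import Counter
from scapy.all import sniff

per_second = Counter()

def tally(pkt):
    per_second[int(pkt.time)] += 1

if __name__ == '__main__':
    # capture for two period lengths (~2 x 105 minutes), as suggested above
    sniff(iface='bat0', filter='udp port 1001', prn=tally, store=False,
          timeout=2 * 105 * 60)
    for second in sorted(per_second):
        print(second, per_second[second])
```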

AiyionPrime commented 4 years ago

Well, then. Yesterday at 20:30 I shut down the first mesh-announce instance, on supernode 09, reducing its load drastically, as seen in the last graph. This did not change in the last 16(?) hours.

Today at 13:00 I shut down the other mesh-announce instances as well. They all showed the same result: a drastic reduction of their load in the peak window.

The second shutdown did not affect the load peak on sn09 at all. My conclusion stands: mesh-announce is responsible for (part of) the load peak, but not for the event triggering it.

Here is the current graph; sn[01,08,09,10] are currently all of our supernodes running mesh-announce. The red dot marks 13:05, when my shutdown of the remaining three instances took effect.

We'll start tcpdumps later this afternoon. I'm now firing up mesh-announce again.

AiyionPrime commented 4 years ago

I got my non-findings about the event and the resulting load peer-reviewed yesterday. It is unlikely that tcpdumps will help at this point. I will determine whether Darmstadt's fork had the issue as well. If not, I'll go back to that fork, confirm the issue wasn't present back then either, and finally bisect to find when things went south. Will do this after lunch.

TobleMiner commented 3 years ago

Does this issue still exist? There have been major changes in mesh-announce and thus additional confirmation on this issue is required. This issue will be closed in a month if there is no further activity.

tackin commented 3 years ago

@TobleMiner Sorry, I haven't found the time to test it yet. It's not a big issue/problem for us at the moment, so I feel no pressure. ;-) I'll come back to it a.s.a.p.