Open tackin opened 4 years ago
Your numbers two and three should be resolved by the merge of #58. Can you confirm that, @tackin? About the load peaks I cannot say anything yet.
I need to install and test it again for no. 2 and no. 3, if no. 1 is fixed. No. 3 is a YANIC thing (it may already be solved). For no. 2 it is not clear to me whether it is a YANIC bug or a mesh-announce bug.
@AiyionPrime Tested: No. 3 seems to be solved. No. 2 is not solved.
@AiyionPrime Your PR#58 solves problem no.2 for me.
The load peaks appear in Hannover as well, but seem to correlate with fastd's CPU usage (likely the context switches) and not mesh-announce. How to reproduce the finding of mesh-announce being the evil one?
By simply disabling the service and seeing what happens.
See the picture above. The load peaks stopped when I stopped the service on rustig and erai.
Thanks, I will try to reproduce it tonight.
First things first, Hannover has the same issue on all four supernodes. The peaks are always about one hour and 45 minutes apart from each other (averaged over the last day).
One thing to note is that they don't peak or start to spike at the same time. We watched the load today and could not find anything but fastd and occasionally mesh-announce in the top 10 of htop.
At 20:30 we stopped the mesh-announce service, the resulting graph is this one.
As you can see, this drastically reduces the load, but doesn't prevent the spike altogether. It appears that mesh-announce is responsible for part of the load, but is not the triggering event itself. Therefore I can confirm the bug. A possible workaround that reduces the load is to use the multi-domain feature in a single instance.
Trier was likely hit harder by mesh-announce, as they had more instances running. I'll try that tomorrow; for now our supernodes are being busily tested in order to exclude causes like our monitoring, our zabbix, our whatsoever.
Looking over to Trier, the load appears to peak at the same frequency: https://draco.freifunk-trier.starletp9.de:3000/d/Gb1_MoJik/freifunk-trier-uberblick?orgId=1
It is quite possible that I miss the forest for the trees, but I can't figure out what's triggering after 105 minutes, independent of when a system booted.
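For anyone who wants to check the 105-minute spacing against their own graphs, here is a minimal sketch of how the interval between peaks could be measured from exported load samples. The sample data, the threshold, and the function names are made up for illustration; they are not taken from the Hannover or Trier measurements.

```python
from statistics import mean


def find_peaks(samples, threshold):
    """Return timestamps (in seconds) of local load maxima at or above threshold.

    samples: list of (timestamp_seconds, load) tuples, sorted by time.
    """
    peaks = []
    for i in range(1, len(samples) - 1):
        prev_load = samples[i - 1][1]
        timestamp, load = samples[i]
        next_load = samples[i + 1][1]
        if load >= threshold and load > prev_load and load >= next_load:
            peaks.append(timestamp)
    return peaks


def mean_interval_minutes(peak_times):
    """Average gap between consecutive peaks in minutes, or None if fewer than two peaks."""
    gaps = [later - earlier for earlier, later in zip(peak_times, peak_times[1:])]
    return mean(gaps) / 60 if gaps else None


# Hypothetical load samples (seconds since start, load average), not real data.
samples = [
    (0, 0.5), (1800, 0.6), (3600, 4.0), (5400, 0.7), (7200, 0.5),
    (9900, 4.2), (11700, 0.6), (16200, 3.9), (18000, 0.5),
]

peak_times = find_peaks(samples, threshold=3.0)
print(peak_times)
print(mean_interval_minutes(peak_times))
```

Running the same calculation per supernode would also show whether the period is identical everywhere even though the peaks do not start at the same time.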
@moridius just stopped fastd on a supernode. It drastically reduces the spike as well, but not completely, if mesh-announce is left running. @tackin have you already taken dumps of the traffic for two or three period lengths?
@tackin have you already taken dumps of the traffic for two or three period lengths?
No, sorry, I have no idea where or what to look for in a dump. For us, stopping fastd would also drop all tunnels and traffic, so that would not make sense as a test, I guess.
Well, then. Yesterday at 20:30 I shut down mesh-announce on the first supernode, sn09, reducing its load drastically, as seen in the last graph. This did not change in the last 16? hours.
Today at 13:00 I shut down the other mesh-announce instances as well. They all showed the same result: a drastic reduction of their load in the peak window.
The second shutdown did not affect the load peak on sn09 at all. My conclusion stands: mesh-announce is responsible for (part of) the load peak, but not for the event triggering it.
Here is the current graph, sn[01,08,09,10] are currently all of our supernodes running mesh-announce. The red dot marks 13:05, when my shutdown of the remaining three instances took effect.
We'll start tcpdumps later this afternoon. I'm now firing up mesh-announce again.
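A capture covering two or three period lengths could be started roughly like this. This is only a sketch: the interface name and output file are placeholders, the 105 minutes are the period observed above, and it assumes `tcpdump` and the coreutils `timeout` command are installed and that the script runs with sufficient privileges.

```python
import subprocess

# Period between load peaks as observed in this thread.
PERIOD_MINUTES = 105


def build_capture_command(interface, periods, outfile):
    """Build a tcpdump command line that timeout(1) stops after `periods` peak periods."""
    seconds = periods * PERIOD_MINUTES * 60
    return ["timeout", str(seconds), "tcpdump", "-i", interface, "-w", outfile]


def run_capture(interface="eth0", periods=3, outfile="loadpeak.pcap"):
    """Run the capture and return the exit code (interface and file name are placeholders)."""
    command = build_capture_command(interface, periods, outfile)
    return subprocess.run(command).returncode


# Show the command that would be executed, without starting a capture.
print(" ".join(build_capture_command("eth0", 3, "loadpeak.pcap")))
```

Writing to a file with `-w` keeps the raw packets, so the dump can be filtered later around the timestamps of the peaks.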
I got my non-findings regarding the event and the resulting load peer reviewed yesterday. It is unlikely that tcpdumps will help at this point. I will determine whether Darmstadt's fork had the issue as well. If not, I will go back to the fork, confirm that it wasn't an issue back then either, and finally bisect to find out when things went south. I will do this after lunch.
Does this issue still exist? There have been major changes in mesh-announce and thus additional confirmation on this issue is required. This issue will be closed in a month if there is no further activity.
@TobleMiner Sorry, didn't find the time yet to test it. It's not a big issue/problem for us at the moment, so I feel no pressure. ;-) I'll come back to it a.s.a.p.
The gateways Erai and Rustig are using our fork (https://github.com/freifunktrier/mesh-announce) of this repo. I have had 3 problems:
I shifted to my older mesh-announce fork from ffda (multicast on ff02:....) and my problems are gone.