OLSR / olsrd

OLSR.org main repository - olsrd v1 - maintained by Freifunk Berlin

v0.9.6 has instable routing #20

Closed SvenRoederer closed 7 years ago

SvenRoederer commented 7 years ago

Around d8ffc6f I was running this code on some Freifunk nodes and found that the routes are not stable. Some routes dropped out of the routing table and came back every few seconds.

A test on my local node, running ip route sh table olsr | wc -l in a loop, gave a changing number of routes even though the network was stable (no changes in the local mesh).

I used exactly the same config as for olsrd 0.9.0.3, where everything ran normally both before and after the check with 0.9.6.

fhuberts commented 7 years ago

Could you bisect?

bittorf commented 7 years ago

i will also run a large test, recording the route count in RRD on all nodes

bittorf commented 7 years ago

what i can also see with 0.9.6:

root@b:~ wget -O /dev/null http://127.0.0.1:2006
Downloading 'http://127.0.0.1:2006'
Connecting to 127.0.0.1:2006
(null)                   0   - stalled -
Connection reset prematurely

(the same for /all or /lin) - IMHO this was already fixed? the correct filename is configured:

root@EG-bastianHQ:~ :) cat /var/etc/olsrd.conf

DebugLevel 0
AllowNoInt yes
ClearScreen no
IpVersion 4
FIBMetric "flat"
Willingness 7
TcRedundancy 2
LinkQualityFishEye 1
LinkQualityAlgorithm "etx_ffeth"
MprCoverage 7
MainIp 10.63.222.1

Hna4
{
                10.63.222.0 255.255.255.192
                100.65.220.0 255.255.255.0
}

LoadPlugin "olsrd_txtinfo.so.1.1"
{
        PlParam "accept" "0.0.0.0"
        PlParam "port" "2006"
}

LoadPlugin "olsrd_nameservice.so.0.4"
{
        PlParam "name" "EG-bastianHQ"
        PlParam "name-change-script" "/etc/udhcpc.user"
}

Interface "eth0.2"
{
        Ip4Broadcast 255.255.255.255
        Mode "ether"
        HelloInterval 3.0
        HelloValidityTime 125.0
        TcInterval 2.0
        TcValidityTime 500.0
        MidInterval 25.0
        MidValidityTime 500.0
        HnaInterval 10.0
        HnaValidityTime 125.0
}

Interface "eth0.1"
{
        Ip4Broadcast 255.255.255.255
        Mode "ether"
        HelloInterval 3.0
        HelloValidityTime 125.0
        TcInterval 2.0
        TcValidityTime 500.0
        MidInterval 25.0
        MidValidityTime 500.0
        HnaInterval 10.0
        HnaValidityTime 125.0
}

Interface "wlan0"
{
        Ip4Broadcast 255.255.255.255
        Mode "mesh"
        HelloInterval 3.0
        HelloValidityTime 125.0
        TcInterval 2.0
        TcValidityTime 500.0
        MidInterval 25.0
        MidValidityTime 500.0
        HnaInterval 10.0
        HnaValidityTime 125.0
}

Interface "wlan1"
{
        Ip4Broadcast 255.255.255.255
        Mode "mesh"
        HelloInterval 3.0
        HelloValidityTime 125.0
        TcInterval 2.0
        TcValidityTime 500.0
        MidInterval 25.0
        MidValidityTime 500.0
        HnaInterval 10.0
        HnaValidityTime 125.0
}
fhuberts commented 7 years ago

running the same version and I see:

# wget -O /dev/null http://127.0.0.1:2006
converted 'http://127.0.0.1:2006' (ANSI_X3.4-1968) -> 'http://127.0.0.1:2006' (UTF-8)
--2017-02-07 11:29:32--  http://127.0.0.1:2006/
Connecting to 127.0.0.1:2006... connected.
HTTP request sent, awaiting response... 200 No headers, assuming HTTP/0.9
Length: unspecified
Saving to: '/dev/null'

/dev/null               [ <=>               ]  22.33K  --.-KB/s   in 0s

2017-02-07 11:29:32 (126 MB/s) - '/dev/null' saved [22868]
fhuberts commented 7 years ago

does that node have neighbours?

bittorf commented 7 years ago

@fhuberts yes, a lot of neighbours - wired and wireless - i understand that it only works with netcat and not wget - i will change my code for this - so: everything is fine - sorry for the noise

fhuberts commented 7 years ago

it works alright with wget, at least it should

if it doesn't work with wget can you send me the packet grab of the request? That would mean the request parsing doesn't work properly, especially if netcat works just fine (both should work).

What doesn't work is a (manual) telnet connection

bittorf commented 7 years ago

it looks like this (captured on the laptop, querying a router with GNU wget - this works - but on the router itself the wget does not seem to know http 0.9)

bastian@X301-II ~ $ sudo tcpdump -nXi wlan0 host 10.63.222.33
[sudo] password for bastian: 
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on wlan0, link-type EN10MB (Ethernet), capture size 65535 bytes
12:45:07.139704 IP 100.66.3.186.34156 > 10.63.222.33.2006: Flags [S], seq 3109579389, win 29200, options [mss 1460,sackOK,TS val 57238513 ecr 0,nop,wscale 7], length 0
    0x0000:  4500 003c 098c 4000 4006 e0d3 6442 03ba  E..<..@.@...dB..
    0x0010:  0a3f de21 856c 07d6 b958 6a7d 0000 0000  .?.!.l...Xj}....
    0x0020:  a002 7210 6d20 0000 0204 05b4 0402 080a  ..r.m...........
    0x0030:  0369 63f1 0000 0000 0103 0307            .ic.........
12:45:07.143447 IP 10.63.222.33.2006 > 100.66.3.186.34156: Flags [S.], seq 590398867, ack 3109579390, win 28960, options [mss 1460,sackOK,TS val 339576 ecr 57238513,nop,wscale 4], length 0
    0x0000:  4500 003c 0000 4000 3f06 eb5f 0a3f de21  E..<..@.?.._.?.!
    0x0010:  6442 03ba 07d6 856c 2330 c593 b958 6a7e  dB.....l#0...Xj~
    0x0020:  a012 7120 56c1 0000 0204 05b4 0402 080a  ..q.V...........
    0x0030:  0005 2e78 0369 63f1 0103 0304            ...x.ic.....
12:45:07.143498 IP 100.66.3.186.34156 > 10.63.222.33.2006: Flags [.], ack 1, win 229, options [nop,nop,TS val 57238514 ecr 339576], length 0
    0x0000:  4500 0034 098d 4000 4006 e0da 6442 03ba  E..4..@.@...dB..
    0x0010:  0a3f de21 856c 07d6 b958 6a7e 2330 c594  .?.!.l...Xj~#0..
    0x0020:  8010 00e5 f5c4 0000 0101 080a 0369 63f2  .............ic.
    0x0030:  0005 2e78                                ...x
12:45:07.143926 IP 100.66.3.186.34156 > 10.63.222.33.2006: Flags [P.], seq 1:119, ack 1, win 229, options [nop,nop,TS val 57238514 ecr 339576], length 118
    0x0000:  4500 00aa 098e 4000 4006 e063 6442 03ba  E.....@.@..cdB..
    0x0010:  0a3f de21 856c 07d6 b958 6a7e 2330 c594  .?.!.l...Xj~#0..
    0x0020:  8018 00e5 7a4a 0000 0101 080a 0369 63f2  ....zJ.......ic.
    0x0030:  0005 2e78 4745 5420 2f6c 696e 2048 5454  ...xGET./lin.HTT
    0x0040:  502f 312e 310d 0a55 7365 722d 4167 656e  P/1.1..User-Agen
    0x0050:  743a 2057 6765 742f 312e 3135 2028 6c69  t:.Wget/1.15.(li
    0x0060:  6e75 782d 676e 7529 0d0a 4163 6365 7074  nux-gnu)..Accept
    0x0070:  3a20 2a2f 2a0d 0a48 6f73 743a 2031 302e  :.*/*..Host:.10.
    0x0080:  3633 2e32 3232 2e33 333a 3230 3036 0d0a  63.222.33:2006..
    0x0090:  436f 6e6e 6563 7469 6f6e 3a20 4b65 6570  Connection:.Keep
    0x00a0:  2d41 6c69 7665 0d0a 0d0a                 -Alive....
12:45:07.149222 IP 10.63.222.33.2006 > 100.66.3.186.34156: Flags [.], ack 119, win 1810, options [nop,nop,TS val 339576 ecr 57238514], length 0
    0x0000:  4500 0034 ebec 4000 3f06 ff7a 0a3f de21  E..4..@.?..z.?.!
    0x0010:  6442 03ba 07d6 856c 2330 c594 b958 6af4  dB.....l#0...Xj.
    0x0020:  8010 0712 ef21 0000 0101 080a 0005 2e78  .....!.........x
    0x0030:  0369 63f2                                .ic.
12:45:07.149250 IP 10.63.222.33.2006 > 100.66.3.186.34156: Flags [P.], seq 1:547, ack 119, win 1810, options [nop,nop,TS val 339576 ecr 57238514], length 546
    0x0000:  4500 0256 ebed 4000 3f06 fd57 0a3f de21  E..V..@.?..W.?.!
    0x0010:  6442 03ba 07d6 856c 2330 c594 b958 6af4  dB.....l#0...Xj.
    0x0020:  8018 0712 0cf5 0000 0101 080a 0005 2e78  ...............x
    0x0030:  0369 63f2 5461 626c 653a 204c 696e 6b73  .ic.Table:.Links
    0x0040:  0a4c 6f63 616c 2049 5009 5265 6d6f 7465  .Local.IP.Remote
    0x0050:  2049 5009 4879 7374 2e09 4c51 094e 4c51  .IP.Hyst..LQ.NLQ
...
bittorf commented 7 years ago

i will also capture a failed variant (give me some time)

bittorf commented 7 years ago

interesting: i get/see the data, but the openwrt-wget ("uclient-fetch") aborts. the error is not on the olsr-side 8-) imho:

root@EG-superbuffi76:~ :) tcpdump -nXi eth1 host 10.63.222.33 and port 2006
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), capture size 262144 bytes
12:50:20.752654 IP 10.63.6.125.35638 > 10.63.222.33.2006: Flags [S], seq 596364484, win 29200, options [mss 1460,sackOK,TS val 37581585 ecr 0,nop,wscale 4], length 0
    0x0000:  4500 003c bfd5 4000 4006 81ca 0a3f 067d  E..<..@.@....?.}
    0x0010:  0a3f de21 8b36 07d6 238b ccc4 0000 0000  .?.!.6..#.......
    0x0020:  a002 7210 e42b 0000 0204 05b4 0402 080a  ..r..+..........
    0x0030:  023d 7311 0000 0000 0103 0304            .=s.........
12:50:20.753548 IP 10.63.222.33.2006 > 10.63.6.125.35638: Flags [S.], seq 4075327443, ack 596364485, win 28960, options [mss 1460,sackOK,TS val 371012 ecr 37581585,nop,wscale 4], length 0
    0x0000:  4500 003c 0000 4000 4006 41a0 0a3f de21  E..<..@.@.A..?.!
    0x0010:  0a3f 067d 07d6 8b36 f2e8 8fd3 238b ccc5  .?.}...6....#...
    0x0020:  a012 7120 b904 0000 0204 05b4 0402 080a  ..q.............
    0x0030:  0005 a944 023d 7311 0103 0304            ...D.=s.....
12:50:20.753785 IP 10.63.6.125.35638 > 10.63.222.33.2006: Flags [.], ack 1, win 1825, options [nop,nop,TS val 37581585 ecr 371012], length 0
    0x0000:  4500 0034 bfd6 4000 4006 81d1 0a3f 067d  E..4..@.@....?.}
    0x0010:  0a3f de21 8b36 07d6 238b ccc5 f2e8 8fd4  .?.!.6..#.......
    0x0020:  8010 0721 51cd 0000 0101 080a 023d 7311  ...!Q........=s.
    0x0030:  0005 a944                                ...D
12:50:20.759783 IP 10.63.6.125.35638 > 10.63.222.33.2006: Flags [P.], seq 1:45, ack 1, win 1825, options [nop,nop,TS val 37581585 ecr 371012], length 44
    0x0000:  4500 0060 bfd7 4000 4006 81a4 0a3f 067d  E..`..@.@....?.}
    0x0010:  0a3f de21 8b36 07d6 238b ccc5 f2e8 8fd4  .?.!.6..#.......
    0x0020:  8018 0721 5087 0000 0101 080a 023d 7311  ...!P........=s.
    0x0030:  0005 a944 4745 5420 2f6c 696e 2048 5454  ...DGET./lin.HTT
    0x0040:  502f 312e 310d 0a48 6f73 743a 2031 302e  P/1.1..Host:.10.
    0x0050:  3633 2e32 3232 2e33 333a 3230 3036 0d0a  63.222.33:2006..
12:50:20.760516 IP 10.63.222.33.2006 > 10.63.6.125.35638: Flags [.], ack 45, win 1810, options [nop,nop,TS val 371013 ecr 37581585], length 0
    0x0000:  4500 0034 2690 4000 4006 1b18 0a3f de21  E..4&.@.@....?.!
    0x0010:  0a3f 067d 07d6 8b36 f2e8 8fd4 238b ccf1  .?.}...6....#...
    0x0020:  8010 0712 51af 0000 0101 080a 0005 a945  ....Q..........E
    0x0030:  023d 7311                                .=s.
12:50:20.762514 IP 10.63.6.125.35638 > 10.63.222.33.2006: Flags [P.], seq 45:72, ack 1, win 1825, options [nop,nop,TS val 37581586 ecr 371013], length 27
    0x0000:  4500 004f bfd8 4000 4006 81b4 0a3f 067d  E..O..@.@....?.}
    0x0010:  0a3f de21 8b36 07d6 238b ccf1 f2e8 8fd4  .?.!.6..#.......
    0x0020:  8018 0721 511c 0000 0101 080a 023d 7312  ...!Q........=s.
    0x0030:  0005 a945 5573 6572 2d41 6765 6e74 3a20  ...EUser-Agent:.
    0x0040:  7563 6c69 656e 742d 6665 7463 680d 0a    uclient-fetch..
12:50:20.762780 IP 10.63.222.33.2006 > 10.63.6.125.35638: Flags [.], ack 72, win 1810, options [nop,nop,TS val 371013 ecr 37581586], length 0
    0x0000:  4500 0034 2691 4000 4006 1b17 0a3f de21  E..4&.@.@....?.!
    0x0010:  0a3f 067d 07d6 8b36 f2e8 8fd4 238b cd0c  .?.}...6....#...
    0x0020:  8010 0712 5193 0000 0101 080a 0005 a945  ....Q..........E
    0x0030:  023d 7312                                .=s.
12:50:20.762991 IP 10.63.6.125.35638 > 10.63.222.33.2006: Flags [P.], seq 72:74, ack 1, win 1825, options [nop,nop,TS val 37581586 ecr 371013], length 2
    0x0000:  4500 0036 bfd9 4000 4006 81cc 0a3f 067d  E..6..@.@....?.}
    0x0010:  0a3f de21 8b36 07d6 238b cd0c f2e8 8fd4  .?.!.6..#.......
    0x0020:  8018 0721 4470 0000 0101 080a 023d 7312  ...!Dp.......=s.
    0x0030:  0005 a945 0d0a                           ...E..
12:50:20.763162 IP 10.63.222.33.2006 > 10.63.6.125.35638: Flags [.], ack 74, win 1810, options [nop,nop,TS val 371013 ecr 37581586], length 0
    0x0000:  4500 0034 2692 4000 4006 1b16 0a3f de21  E..4&.@.@....?.!
    0x0010:  0a3f 067d 07d6 8b36 f2e8 8fd4 238b cd0e  .?.}...6....#...
    0x0020:  8010 0712 5191 0000 0101 080a 0005 a945  ....Q..........E
    0x0030:  023d 7312                                .=s.
12:50:20.798902 IP 10.63.222.33.2006 > 10.63.6.125.35638: Flags [P.], seq 1:547, ack 74, win 1810, options [nop,nop,TS val 371017 ecr 37581586], length 546
    0x0000:  4500 0256 2693 4000 4006 18f3 0a3f de21  E..V&.@.@....?.!
    0x0010:  0a3f 067d 07d6 8b36 f2e8 8fd4 238b cd0e  .?.}...6....#...
    0x0020:  8018 0712 5d59 0000 0101 080a 0005 a949  ....]Y.........I
    0x0030:  023d 7312 5461 626c 653a 204c 696e 6b73  .=s.Table:.Links
    0x0040:  0a4c 6f63 616c 2049 5009 5265 6d6f 7465  .Local.IP.Remote
    0x0050:  2049 5009 4879 7374 2e09 4c51 094e 4c51  .IP.Hyst..LQ.NLQ
...

here the wget-command:

root@EG-superbuffi76:~ :) wget -O - http://10.63.222.33:2006/lin
Downloading 'http://10.63.222.33:2006/lin'
Connecting to 10.63.222.33:2006
(null)                   0   - stalled -
Connection reset prematurely
fhuberts commented 7 years ago

olsrd only supports http 1.0 and 1.1 requests

fhuberts commented 7 years ago

can you show me the dump of uclient-fetch?

bittorf commented 7 years ago

i just wondered, because GNU wget emits the warning:

bastian@X301-II ~ $ LC_ALL=C wget -O /dev/null http://10.63.222.33:2006/lin
--2017-02-07 12:57:39--  http://10.63.222.33:2006/lin
Connecting to 10.63.222.33:2006... connected.
HTTP request sent, awaiting response... 200 No headers, assuming HTTP/0.9
bittorf commented 7 years ago

the uclient-fetch dump is 2 comments above under "give me some time".

fhuberts commented 7 years ago

try wget ..../http/lin

bittorf commented 7 years ago

wow, this works....so...uh! this change I was not aware of...

root@EG-superbuffi76:~ :) uclient-fetch -qO - http://10.63.222.33:2006/http/lin
Table: Links
Local IP        Remote IP       Hyst.   LQ      NLQ     Cost
10.63.222.33    10.63.160.161   0.000   1.000   1.000   0.100
10.63.222.33    10.63.6.125     0.000   1.000   1.000   0.100
10.63.222.33    10.63.42.125    0.000   1.000   1.000   0.100
10.63.222.3     10.63.80.195    0.000   1.000   1.000   1.000
10.63.222.3     10.63.197.131   0.000   0.862   1.000   1.158
10.63.222.1     10.63.2.1       0.000   0.972   0.117   8.739
10.63.222.3     10.63.233.129   0.000   0.976   1.000   1.023
10.63.222.3     10.63.6.67      0.000   1.000   1.000   1.000
10.63.222.3     10.63.156.131   0.000   1.000   1.000   1.000
10.63.222.3     10.63.135.193   0.000   0.972   0.000   INFINITE
fhuberts commented 7 years ago

0.9.6 -------------------------------------------------------------------

* The versions of the following plugins have changed:
  - jsoninfo    :   0.0 -->   1.1
  - nameservice :   0.3 -->   0.4
  - netjson     :   1.0 -->   1.1
  - pud         : 2.0.0 --> 3.0.0 (including its extra libraries)
  - txtinfo     :   0.1 -->   1.1

* All info plugins (jsoninfo, netjson and txtinfo) now support a number of
  request prefixes:
  - /http : forces output WITH    http headers, temporarily overriding the
            configured "httpheaders" value.
  - /plain: forces output WITHOUT http headers, temporarily overriding the
            configured "httpheaders" value.

  These prefixes have to be at the start of the request string, can occur
  only there, and can occur only once.
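
As an illustration of the prefix rules above, here is a minimal shell sketch of how a request string could be split into a header decision and the remaining request. It is not olsrd's actual parser: the function name resolve_headers and the yes/no values standing in for the configured "httpheaders" setting are assumptions made for this example.

```shell
#!/bin/sh
# resolve_headers REQUEST CONFIGURED
#   REQUEST    - request string, e.g. /http/lin
#   CONFIGURED - "yes" or "no"; a stand-in (assumption) for the plugin's
#                configured "httpheaders" value
# Prints "<yes|no> <remaining request>".
resolve_headers() {
    case "$1" in
        # A prefix only counts at the very start of the request string.
        /http|/http/*)   printf '%s %s\n' yes "${1#/http}" ;;
        /plain|/plain/*) printf '%s %s\n' no  "${1#/plain}" ;;
        # No prefix: fall back to the configured value.
        *)               printf '%s %s\n' "$2" "$1" ;;
    esac
}
```

For example, resolve_headers /http/lin no prints "yes /lin", while resolve_headers /lin no falls back to the configured value and prints "no /lin".
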
fhuberts commented 7 years ago

@bittorf I've pushed a commit to automatically detect whether http headers are needed, please try it

info: automatically detect whether the reply should have HTTP headers

This is the case when a HTTP request is done.
The request can still override whether or not HTTP headers are sent
by employing the 'http' and 'plain' request prefixes.
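
One way such detection could work is sketched below in shell, under the assumption that a request counts as HTTP when its first line ends in an HTTP/1.0 or HTTP/1.1 version token (the two versions mentioned earlier in this thread). The helper name is_http_request is hypothetical, and the real check in olsrd may differ.

```shell
#!/bin/sh
# is_http_request LINE
# Succeeds (exit 0) when LINE, the first line of a request, ends in an
# HTTP/1.0 or HTTP/1.1 version token; fails (exit 1) otherwise.
# Assumption: this is an illustrative heuristic, not olsrd's own code.
is_http_request() {
    # Drop the carriage return that HTTP clients send before the newline.
    line=$(printf '%s' "$1" | tr -d '\r')
    case "$line" in
        *" HTTP/1.0"|*" HTTP/1.1") return 0 ;;
        *)                         return 1 ;;
    esac
}
```

With this heuristic, a line such as "GET /lin HTTP/1.1" (as sent by wget or uclient-fetch) is treated as an HTTP request, while a bare "/lin" typed into netcat is not.
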
fhuberts commented 7 years ago

@SvenRoederer Could you please try again with the most recent commit on the release (or master) branch?

SvenRoederer commented 7 years ago

I've been quite busy these days ...

I just built an updated firmware (https://buildbot.berlin.freifunk.net/buildbot/unstable/ar71xx-generic/98/VERSION.txt) with OLSRd v0.9.6.1 (release tag; via https://github.com/SvenRoederer/openwrt-routing/commit/dde4487dac14b902af4deaf7a8d96006a69bb520) and can still see the problem.

this script monitors the routes:

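#!/bin/sh
# Print the total route count and the olsr-table route count every 10 seconds.

printf 'all routes\tolsr-table\n'
while true; do
    all=$(ip route show table all | wc -l)
    olsr=$(ip route show table olsr | wc -l)
    printf '%s\t\t%s\n' "$all" "$olsr"
    sleep 10
done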

outputs:

root@SAm0815-test-glar150:~# /root/bin/check_olsr-routes.sh 
all routes      olsr-table
1168            1094
77              3
1168            1094
1168            1094
1167            1093
77              3
77              3
1009            3
750             3
77              3
1167            1093
1173            1099
77              3
1173            1099
77              3
1166            1092
77              3
77              3
77              3
1167            1093
1167            1093
1167            1093
1167            1093
1164            1090
1164            1090
1166            778
77              3
77              3
77              3
1175            1101
77              3
77              3
1169            1095
1169            1095
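
The collapses visible in the second column above (the olsr table falling from roughly 1090 routes to 3) can be counted automatically from a saved log. A minimal sketch, assuming one route count per line on stdin; the function name count_flaps and the "below half of the previous sample" threshold are arbitrary choices for illustration:

```shell
#!/bin/sh
# count_flaps: read one route count per line on stdin and print how many
# times the count fell below half of the previous sample.
# Assumption: the 50% threshold is only an illustrative choice.
count_flaps() {
    awk '
        NR > 1 && $1 * 2 < prev { flaps++ }   # sudden drop vs. previous sample
        { prev = $1 }                         # remember this sample
        END { print flaps + 0 }               # print 0 when nothing was seen
    '
}
```

Fed the counts 1094, 3, 1094, 1094, 1093, 3, 3 (one per line), the function prints 2: one for each collapse from about 1090 routes down to 3.
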

my olsrd.config is:

root@SAm0815-test-glar150:~# cat /var/etc/olsrd.conf 

DebugLevel 0
AllowNoInt yes
IpVersion 4
FIBMetric "flat"
TcRedundancy 2
NatThreshold 0.75
LinkQualityAlgorithm "etx_ff"
SmartGateway yes
SmartGatewayThreshold 50
Pollrate 0.025
RtTable 111
RtTableDefault 112
RtTableTunnel 113
RtTableTunnelPriority 100000
RtTableDefaultOlsrPriority 20000
SmartGatewaySpeed 1000 3000
SmartGatewayUplink "both"

Hna4
{
                10.230.197.208 255.255.255.240
}

LoadPlugin "olsrd_arprefresh.so.0.1"
{
}

LoadPlugin "olsrd_watchdog.so.0.1"
{
        PlParam "file" "/var/run/olsrd.watchdog"
        PlParam "interval" "30"
}

LoadPlugin "olsrd_dyn_gw.so.0.5"
{
        PlParam "Ping" "85.214.20.141"
        PlParam "Ping" "213.73.91.35"
        PlParam "Ping" "194.150.168.168"
        PlParam "PingCmd" "ping -c 1 -q -I ffvpn %s"
        PlParam "PingInterval" "30"
}

InterfaceDefaults
{
        MidValidityTime 500.0
        TcInterval 2.0
        HnaValidityTime 125.0
        HelloValidityTime 125.0
        TcValidityTime 500.0
        Ip4Broadcast 255.255.255.255
        MidInterval 25.0
        HelloInterval 3.0
        HnaInterval 10.0
}

Interface "wlan0-adhoc-2"
{
}
SvenRoederer commented 7 years ago

Is there any additional info I can provide?

fhuberts commented 7 years ago

yes, please try bisecting it. go back to a release/commit that works ok for you and do a git bisect

bittorf commented 7 years ago

i cannot see these problems in my testnet with 70 nodes and mixed OLSR-versions. @SvenRoederer is it maybe because you have a restarting daemon (false positives in your watchdog?)

fhuberts commented 7 years ago

we don't see it either in our test network, seems very stable

SvenRoederer commented 7 years ago

@bittorf no external watchdog is running (PID still the same); removing all plugins doesn't change anything

@pmelange also reported no problems on his installation. (https://github.com/freifunk-berlin/firmware/issues/418#issuecomment-277550028). I feel it might be related to the BBB-VPN and the lowered link quality

SvenRoederer commented 7 years ago

@fhuberts I started the bisect today and directly hit the issue I had already seen in 0.9.5 (https://github.com/freifunk-berlin/firmware/issues/424). So it crashes even before any routes get installed.

fhuberts commented 7 years ago

yes, so skip that commit

SvenRoederer commented 7 years ago

this one seems to be a completely different problem than this "instable routes" here. The 0.9.6-code is not crashing with the ASSERT. I'd like to make sure that we don't mix up 2 different problems

fhuberts commented 7 years ago

that issue was fixed

fhuberts commented 7 years ago

ah no it wasn't. this is the first time I hear of this issue, why wasn't it reported?

fhuberts commented 7 years ago

2e7f2942bd47fc7f0b4ca0bf4581c3dee3c1f85a probably did fix it

SvenRoederer commented 7 years ago

regarding the assert thing: I had seen it on 0.9.5, but as it was not seen in 0.9.6, I assumed it had been fixed intentionally. The move to 0.9.6 was done because of the "filechange-interval", and I forgot about 0.9.5 ... I bisected to a commit where 2e7f294 was still included and the ASSERT still failed

SvenRoederer commented 7 years ago

this assert thing can be tracked in https://github.com/freifunk-berlin/firmware/issues/424

SvenRoederer commented 7 years ago

bisecting results in: (which seems really unrelated)

8cef7bf8a03420eebab5b23db4ec4d2a203aeec3 is the first bad commit
commit 8cef7bf8a03420eebab5b23db4ec4d2a203aeec3
Author: Ferry Huberts <ferry.huberts@pelagic.nl>
Date:   Mon Nov 9 15:49:11 2015 +0100

    lock_file: add olsr_remove_lock_file function

    And use it in the error paths during creation

    Signed-off-by: Ferry Huberts <ferry.huberts@pelagic.nl>

:040000 040000 501ff49e204e000119558163d19072709ea5c706 848ebe27591e72ec022efa1d5d3f8dbe09e3c8ce M      src

bisect was running from v0.9.0.3 to 2e568eb7264dd9df3fcf68db83 (next commit would introduce the ASSERT-issue again)

git bisect start
# good: [c6fbdafd11ef1d31cbbeab138317c3fdd6673d1a] Release v0.9.0.3
git bisect good c6fbdafd11ef1d31cbbeab138317c3fdd6673d1a
# bad: [2e568eb7264dd9df3fcf68db835b066adee6546f] main: minor update
git bisect bad 2e568eb7264dd9df3fcf68db835b066adee6546f
# good: [e21085a327cbb682ff91f8236800a79d9e9eb301] mdns: update a comment about exit
git bisect good e21085a327cbb682ff91f8236800a79d9e9eb301
# bad: [736d46ec3109f97864c2d35ca35438e3bbcae9ff] main: move loading the config into the loadConfig function
git bisect bad 736d46ec3109f97864c2d35ca35438e3bbcae9ff
# good: [83d40b74acf3fa46e1780cd569f0fc3c412f8d45] quagga: clean up olsr_exit messages
git bisect good 83d40b74acf3fa46e1780cd569f0fc3c412f8d45
# good: [79ce902e56b2ceaf4ba6749187b1cec78fc94bc3] main: always store argv
git bisect good 79ce902e56b2ceaf4ba6749187b1cec78fc94bc3
# good: [5d6a4ce069945b670067ac8a875a16793c2f43fd] lock_file: move olsrd_get_default_lockfile into its own file
git bisect good 5d6a4ce069945b670067ac8a875a16793c2f43fd
# bad: [8cef7bf8a03420eebab5b23db4ec4d2a203aeec3] lock_file: add olsr_remove_lock_file function
git bisect bad 8cef7bf8a03420eebab5b23db4ec4d2a203aeec3
# good: [6fa811140b4dae2af6c8d04e7666dd5a8f714f35] main: move olsr_create_lock_file into its own file
git bisect good 6fa811140b4dae2af6c8d04e7666dd5a8f714f35
# first bad commit: [8cef7bf8a03420eebab5b23db4ec4d2a203aeec3] lock_file: add olsr_remove_lock_file function
fhuberts commented 7 years ago

That commit can not possibly result in unstable routing. Please run the bisect properly between 0.9.0.3 and 0.9.6.1. If you run into the assert that is blocking you then just cherry-pick 97d4916, that should fix that problem for you.
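
A bisect like this can be automated with git bisect run, which calls a script at each step and reads its exit code: 0 marks the commit good, 125 asks git to skip it, and any other code from 1 to 127 marks it bad. A minimal sketch of that mapping follows. The function name and the three outcome labels are assumptions for illustration, the tag names in the comments are assumed to match the releases named in this thread, and how a node's routing is judged stable or flapping is left to the caller.

```shell
#!/bin/sh
# bisect_verdict OUTCOME
# Translate a test outcome into the exit code that `git bisect run` expects.
# The outcome labels are hypothetical names chosen for this sketch.
bisect_verdict() {
    case "$1" in
        stable)   return 0 ;;     # good: routes stayed in the table
        flapping) return 1 ;;     # bad: routes came and went
        *)        return 125 ;;   # skip: e.g. daemon hit the ASSERT first
    esac
}

# Typical use (test-commit.sh is a hypothetical script that builds the
# commit, tests it, and ends by calling bisect_verdict):
#   git bisect start v0.9.6.1 v0.9.0.3    # bad revision first, then good
#   git bisect run ./test-commit.sh
```

Returning 125 for commits that cannot be tested keeps an unrelated crash from being recorded as the routing regression.
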

SvenRoederer commented 7 years ago

yeah, I was also wondering about 8cef7bf8a03420eebab5b23db4ec4d2a203aeec3 "is the first bad commit". Then did the following to double-check:

As these commits seem unrelated, I did the other way around:

fhuberts commented 7 years ago

That is very confusing. Just tell me which tree you bisected and what the results are. Now I have to search between trees and I bet I'm not doing it right because it's too confusing.

SvenRoederer commented 7 years ago

at the end it happens on master between 13aa7f3 and 5af5485

just check: https://github.com/SvenRoederer/olsrd/commits/find_route-problem_simplified (last 3 commits)

fhuberts commented 7 years ago

That doesn't make sense. I don't see how adding some static data - that is totally unused in the routing related code - could have that effect. Are you sure that your bisect is correct?

SvenRoederer commented 7 years ago

I agree with you; I was wondering about it a lot as well. The fact that this commit was supposed to be the cause was the reason I tried different ways to isolate it. But every time I came to the same result: after this commit the routes come and go... Btw., I used the Makefile from the openwrt-routing feed and adjusted PKG_SOURCE_VERSION to my needs. Disabling the addons completely by commenting out SUBDIRS did not change anything either.

Maybe you would like to look at it yourself in a setup like mine, by connecting to the BBB-VPN (http://bbb-vpn.berlin.freifunk.net/cgi-bin-index.html)

pmelange commented 7 years ago

Using OLSR 0.9.6-git_1b51a49-hash_60d038da5dd8ad0e53f8f55729562986 I see no problems with the routing tables as Sven has described them. My Freifunk test node is connected directly to the Berlin network and not over the BBB-VPN. As far as I can tell, that is the only difference.

Some time in the next couple days, I will update the snapshot and reflash the router to see if everything is still working fine. Until then...

fhuberts commented 7 years ago

Ok thanks for the report! This gives me a bit more information...

@SvenRoederer So the only difference is the BBB-VPN. That node seems to be running olsrd as well, and I bet it's not updated to 0.9.6.1 yet... I'm also betting that when you upgrade that node to 0.9.6.1 that your unstable routing problem is gone.

pmelange commented 7 years ago

Just some more information:

The first hop for my test node is over ad-hoc wifi. The first hop node is running 0.9.0.3-git_1b6dc2e-hash_217925b912d7d2155bea6239a46ae95c.

@fhuberts I don't know what version is running on the bbb-vpn server, but I'm meshing fine with an older version.

fhuberts commented 7 years ago

0.9.0.3 has the 'fragmented hellos' problem.

how many neighbors are there?

pmelange commented 7 years ago

Here are the neighbors. Also note, on the test node only olsr4 is running

Test node: OLSR 0.9.6-git_1b51a49-hash_60d038da5dd8ad0e53f8f55729562986


root@perry-test:~# neigh.sh 
Local        Remote         vTime  LQ       NLQ      Cost     Host 
10.31.23.145 10.230.226.194 141908 1.000000 1.000000 1.000000 mid7.scherer8.olsr 
10.31.23.144 10.230.226.193 137889 0.983000 0.886000 1.145508 scherer8.olsr 

nc: can't connect to remote host: Connection refused
Failed to parse message data

First hop: OLSR 0.9.0.3-git_1b6dc2e-hash_217925b912d7d2155bea6239a46ae95c


root@scherer8:~# neigh.sh 
Local          Remote         vTime  LQ       NLQ      Cost 
10.230.226.202 10.31.6.53     140951 1.000000 1.000000 1024 
10.230.226.211 10.230.226.212 141396 1.000000 1.000000 1024 
10.230.226.211 10.230.226.213 141477 1.000000 1.000000 1024 
10.230.226.203 10.31.31.77    138762 0.191000 1.000000 5328 
10.230.226.193 10.31.23.144   136613 0.831000 0.991000 1241 
10.230.226.194 10.31.23.145   141304 1.000000 1.000000 1024 

Local                Remote               vTime  LQ       NLQ      Cost 
2001:bf7:750:2e0b::1 2001:bf7:750:2e1b::1 135906 1.000000 1.000000 102  
2001:bf7:750:2e02::1 2001:bf7:836:a::1    139080 1.000000 1.000000 1024 
2001:bf7:750:2e0b::1 2001:bf7:750:2e2b::1 141641 1.000000 1.000000 102  
2001:bf7:750:2e03::1 2001:bf7:800:103::1  137759 1.000000 1.000000 1024 

Second Hop 1: OLSR 0.9.0.3-git_1b6dc2e-hash_7aaa60310210a745b5b00863c99fae6b


root@scherer8-abb:~# neigh.sh 
Local          Remote         vTime  LQ       NLQ      Cost 
10.230.226.212 10.230.226.211 142362 1.000000 1.000000 1024 
10.230.226.212 10.230.226.213 136814 1.000000 1.000000 1024 

Local                Remote               vTime  LQ       NLQ      Cost 
2001:bf7:750:2e1b::1 2001:bf7:750:2e0b::1 137255 1.000000 1.000000 102 
2001:bf7:750:2e1b::1 2001:bf7:750:2e2b::1 138847 1.000000 1.000000 102 

Second Hop 2: OLSR 0.9.0.3-git_1b6dc2e-hash_217925b912d7d2155bea6239a46ae95c


root@basta:~# neigh.sh 
Local          Remote         vTime  LQ       NLQ      Cost 
10.230.226.213 10.230.226.211 136894 1.000000 1.000000 1024 
10.230.226.213 10.230.226.212 141903 1.000000 1.000000 1024 

Local                Remote               vTime  LQ       NLQ      Cost 
2001:bf7:750:2e2b::1 2001:bf7:750:2e1b::1 139174 1.000000 1.000000 102 
2001:bf7:750:2e2b::1 2001:bf7:750:2e0b::1 136695 1.000000 1.000000 102 

Second Hop 3: OLSR 0.6.7.1-git_cebcd32-hash_e30a1ec38cc6d414bb747f8018021d59


root@tub-core:~# neigh.sh 
Local          Remote         vTime  LQ       NLQ      Cost 
10.31.31.1     10.31.28.161   138053 1.000000 1.000000 1024 
10.31.31.77    10.230.226.203 137186 0.979000 0.195000 5326 
10.31.31.77    10.31.48.6     138409 1.000000 1.000000 1024 
10.230.145.173 10.230.18.181  137343 1.000000 1.000000 1024 
10.230.145.173 10.31.13.49    143084 1.000000 0.948000 1079 
10.31.31.75    10.230.44.100  136743 1.000000 1.000000 1024 
10.230.145.173 10.230.242.133 139771 1.000000 0.972000 1052 
10.230.145.173 10.31.26.61    140113 1.000000 0.956000 1070 
10.31.31.73    10.31.3.1      142879 1.000000 1.000000 1024 
10.31.31.81    10.31.1.40     131301 0.435000 0.757000 3108 

Local               Remote               vTime  LQ       NLQ      Cost 
2001:bf7:800:103::1 2001:bf7:750:2a80::1 137414 1.000000 1.000000 1024 
2001:bf7:800:103::1 2001:bf7:750:2e03::1 136139 1.000000 1.000000 1024 
2001:bf7:800:102::1 2001:bf7:800:2::1    142713 1.000000 1.000000 1024 

Second Hop 4: OLSR 0.6.7.1-git_cebcd32-hash_2e2a1899170f7d0456572b7c247d1f07


root@segen-core:~# neigh.sh 
Local      Remote         vTime  LQ       NLQ      Cost 
10.31.6.33 10.31.12.171   136828 0.897000 1.000000 1140 
10.31.6.1  10.31.6.85     140879 1.000000 1.000000 1024 
10.31.6.1  10.31.6.65     137912 1.000000 1.000000 1024 
10.31.6.37 10.31.2.45     138326 1.000000 1.000000 1024 
10.31.6.45 10.31.27.41    136798 1.000000 0.897000 1140 
10.31.6.53 10.230.226.202 142380 1.000000 1.000000 1024 
10.31.6.1  10.31.6.69     139094 1.000000 1.000000 1024 
10.31.6.1  10.31.6.73     138395 1.000000 1.000000 1024 
10.31.6.33 10.31.55.45    134650 1.000000 1.000000 1024 
10.31.6.33 10.31.33.17    142352 1.000000 0.909000 1125 
10.31.6.1  10.31.6.93     137276 1.000000 1.000000 1024 
10.31.6.1  10.31.6.89     139385 1.000000 1.000000 1024 
10.31.6.37 10.31.4.73     139458 1.000000 1.000000 1024 
10.31.6.1  10.31.6.81     140031 1.000000 1.000000 1024 
10.31.6.33 10.36.40.73    138082 0.940000 1.000000 1088 
10.31.6.41 10.31.11.93    136991 1.000000 1.000000 1024 
10.31.6.49 10.230.23.143  38714  1.000000 1.000000 1024 
10.31.6.33 10.31.13.13    139812 0.956000 0.772000 1385 
10.31.6.1  10.31.6.77     134949 1.000000 1.000000 1024 

Local             Remote               vTime  LQ       NLQ      Cost 
2001:bf7:836::1   2001:bf7:836:71::1   137788 1.000000 1.000000 1024 
2001:bf7:836::1   2001:bf7:836:51::1   139124 1.000000 1.000000 1024 
2001:bf7:836::1   2001:bf7:836:10::1   138285 1.000000 1.000000 1024 
2001:bf7:836::1   2001:bf7:836:31::1   138409 1.000000 1.000000 1024 
2001:bf7:836::1   2001:bf7:836:20::1   138126 1.000000 1.000000 1024 
2001:bf7:836:7::1 fd9c:1d37:4f28:8::1  138129 1.000000 1.000000 1024 
2001:bf7:836:7::1 2001:bf7:830:9::1    138435 1.000000 1.000000 1024 
2001:bf7:836:5::1 2001:bf7:760:805::1  136476 1.000000 1.000000 1024 
2001:bf7:836:5::1 fd8b:6aff:97af:2::1  136701 0.940000 1.000000 1088 
2001:bf7:836::1   2001:bf7:836:80::1   138255 1.000000 1.000000 1024 
2001:bf7:836:a::1 2001:bf7:750:2e02::1 140386 1.000000 1.000000 1024 
2001:bf7:836::1   2001:bf7:836:61::1   140734 1.000000 1.000000 1024 
2001:bf7:836:6::1 2001:bf7:831:3::1    137070 1.000000 1.000000 1024 
2001:bf7:836:9::1 2001:bf7:750:1205::1 37835  1.000000 1.000000 1024 
2001:bf7:836:5::1 2001:bf7:760:8411::1 138209 0.979000 0.772000 1351 
2001:bf7:836::1   2001:bf7:836:41::1   137077 1.000000 1.000000 1024 
fhuberts commented 7 years ago

ok, not enough nodes to cause fragmentation.

@SvenRoederer How's this for you?

pmelange commented 7 years ago

The status of the BBB-VPN which Sven uses can be seen here.

fhuberts commented 7 years ago

Ok, not enough nodes to suffer from the fragmented hellos problem.

SvenRoederer commented 7 years ago

The BBB-VPN node seems to run something like olsr 0.6.x ( @booo gave me this info after a short look) but @sven-ola might know best

SvenRoederer commented 7 years ago

@fhuberts is this "fragmented hellos" problem specific to 0.9.0.3, or does it affect every version up to 0.9.0.3?