Nordix / Meridio

Facilitator of attraction and distribution of external traffic within Kubernetes via secondary networks
https://meridio.nordix.org
Apache License 2.0

Some targets are not receiving traffic after scaling #234

Open LionelJouin opened 2 years ago

LionelJouin commented 2 years ago

Describe the bug

The issue is similar to this one: https://github.com/Nordix/Meridio/issues/55

After scaling the targets, for instance from 4 to 5, some targets do not receive traffic for x seconds even though they are correctly configured in nfqlb, with correct IP rules and IP routes. The failure is random: most of the time it works correctly, but sometimes only 1 target receives traffic, sometimes 2.

I tried with both ctraffic and mconnect; the result is the same.
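
The statement that targets are correctly configured in nfqlb, with matching IP rules and IP routes, can be cross-checked mechanically. The following is a minimal sketch, not part of Meridio: it assumes the first number of each entry in the nfqlb "Active:" line is the fwmark (which agrees with the rules logged in this issue), and that `ip rule` and `ip route show table all` output has the format shown in the Logs section. The function names are hypothetical, and the sample input is copied from those logs.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

var (
	// Matches nfqlb "Active:" entries such as "17(16)".
	activeRe = regexp.MustCompile(`(\d+)\(\d+\)`)
	// Matches ip rule lines such as "93: from all fwmark 0x11 lookup 17".
	ruleRe = regexp.MustCompile(`fwmark (0x[0-9a-fA-F]+) lookup (\d+)`)
	// Matches route lines such as "default via 172.16.0.50 dev x table 17".
	routeRe = regexp.MustCompile(`default via \S+ dev \S+ table (\d+)`)
)

// parseActive returns the first number of every "Active:" entry,
// assumed here to be the fwmark of the target.
func parseActive(line string) []int {
	var marks []int
	for _, m := range activeRe.FindAllStringSubmatch(line, -1) {
		if v, err := strconv.Atoi(m[1]); err == nil {
			marks = append(marks, v)
		}
	}
	return marks
}

// parseRules maps each fwmark to the routing table its rule selects.
func parseRules(out string) map[int]int {
	rules := make(map[int]int)
	for _, m := range ruleRe.FindAllStringSubmatch(out, -1) {
		mark, err := strconv.ParseInt(m[1], 0, 64) // base 0 accepts the 0x prefix
		if err != nil {
			continue
		}
		table, err := strconv.Atoi(m[2])
		if err != nil {
			continue
		}
		rules[int(mark)] = table
	}
	return rules
}

// parseDefaultRouteTables returns the tables that hold a default route.
func parseDefaultRouteTables(out string) map[int]bool {
	tables := make(map[int]bool)
	for _, m := range routeRe.FindAllStringSubmatch(out, -1) {
		if v, err := strconv.Atoi(m[1]); err == nil {
			tables[v] = true
		}
	}
	return tables
}

// findProblems lists active fwmarks without a rule, or whose table
// has no default route.
func findProblems(marks []int, rules map[int]int, tables map[int]bool) []string {
	var problems []string
	for _, mark := range marks {
		table, ok := rules[mark]
		if !ok {
			problems = append(problems, fmt.Sprintf("fwmark %d: no ip rule", mark))
			continue
		}
		if !tables[table] {
			problems = append(problems, fmt.Sprintf("fwmark %d: table %d has no default route", mark, table))
		}
	}
	return problems
}

func main() {
	active := "Active: 17(16) 31(30) 58(57) 69(68) 76(75)"
	rules := `93: from all fwmark 0x11 lookup 17
94: from all fwmark 0x3a lookup 58
95: from all fwmark 0x1f lookup 31
98: from all fwmark 0x45 lookup 69
99: from all fwmark 0x4c lookup 76`
	routes := `default via 172.16.0.50 dev load-balan-861a table 17
default via 172.16.0.36 dev load-balan-861a table 31
default via 172.16.1.44 dev load-balan-5f4e table 58
default via 172.16.0.6 dev load-balan-861a table 69
default via 172.16.1.6 dev load-balan-5f4e table 76`

	problems := findProblems(parseActive(active), parseRules(rules), parseDefaultRouteTables(routes))
	if len(problems) == 0 {
		fmt.Println("every active target has a rule and a default route")
		return
	}
	for _, p := range problems {
		fmt.Println(p)
	}
}
```

For the values logged in this issue the check finds no problems, which is consistent with the observation that the configuration looks correct while traffic is still missing.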

To Reproduce

  1. Scale out targets
  2. Wait x (like 20) seconds
  3. Send traffic

Expected behavior

All targets should receive the traffic.

Context

Running on Kind. I haven't tried any other environment, but the internal CI team has also hit this issue.

Logs

time=2022-06-03T11:08:45+02:00 level=info msg=load-balancer-trench-a-666f6bc986-99kpp: <nil> - Shm: tshm-stream-a
  Fw: own=0
  Maglev: M=9973, N=100
   Lookup: 30 68 68 57 16 16 16 75 30 16 16 16 75 16 16 75 30 16 16 16 75 16 30 75 68...
   Active: 17(16) 31(30) 58(57) 69(68) 76(75)

time=2022-06-03T11:08:45+02:00 level=info msg=load-balancer-trench-a-666f6bc986-99kpp: <nil> - 0:   from all lookup local
93: from all fwmark 0x11 lookup 17
94: from all fwmark 0x3a lookup 58
95: from all fwmark 0x1f lookup 31
98: from all fwmark 0x45 lookup 69
99: from all fwmark 0x4c lookup 76
100:    from 20.0.0.1 lookup 4096
100:    from 40.0.0.0/24 lookup 4096
32766:  from all lookup main
32767:  from all lookup default

time=2022-06-03T11:08:45+02:00 level=info msg=load-balancer-trench-a-666f6bc986-99kpp: <nil> - default via 169.254.100.150 dev ext-vlan.100 table 4096 proto bird metric 32 
default via 172.16.0.50 dev load-balan-861a table 17 
default via 172.16.0.36 dev load-balan-861a table 31 
default via 172.16.1.44 dev load-balan-5f4e table 58 
default via 172.16.0.6 dev load-balan-861a table 69 
default via 172.16.1.6 dev load-balan-5f4e table 76 
default via 10.244.2.1 dev eth0 
10.244.2.0/24 via 10.244.2.1 dev eth0 src 10.244.2.5 
10.244.2.1 dev eth0 scope link src 10.244.2.5 
169.254.100.0/24 dev ext-vlan.100 proto kernel scope link src 169.254.100.2 
172.16.0.0/24 dev load-balan-861a proto kernel scope link src 172.16.0.2 
172.16.1.0/24 dev load-balan-5f4e proto kernel scope link src 172.16.1.4 
local 10.244.2.5 dev eth0 table local proto kernel scope host src 10.244.2.5 
broadcast 10.244.2.255 dev eth0 table local proto kernel scope link src 10.244.2.5 
local 127.0.0.0/8 dev lo table local proto kernel scope host src 127.0.0.1 
local 127.0.0.1 dev lo table local proto kernel scope host src 127.0.0.1 
broadcast 127.255.255.255 dev lo table local proto kernel scope link src 127.0.0.1 
local 169.254.100.2 dev ext-vlan.100 table local proto kernel scope host src 169.254.100.2 
broadcast 169.254.100.255 dev ext-vlan.100 table local proto kernel scope link src 169.254.100.2 
local 172.16.0.2 dev load-balan-861a table local proto kernel scope host src 172.16.0.2 
broadcast 172.16.0.255 dev load-balan-861a table local proto kernel scope link src 172.16.0.2 
local 172.16.1.4 dev load-balan-5f4e table local proto kernel scope host src 172.16.1.4 
broadcast 172.16.1.255 dev load-balan-5f4e table local proto kernel scope link src 172.16.1.4 
default via 100:100::150 dev ext-vlan.100 table 4096 proto bird metric 32 pref medium
default via fd00::32 dev load-balan-861a table 17 metric 1024 pref medium
default via fd00::24 dev load-balan-861a table 31 metric 1024 pref medium
default via fd00:0:0:1::2c dev load-balan-5f4e table 58 metric 1024 pref medium
default via fd00::6 dev load-balan-861a table 69 metric 1024 pref medium
default via fd00:0:0:1::6 dev load-balan-5f4e table 76 metric 1024 pref medium
100:100::/64 dev ext-vlan.100 proto kernel metric 256 pref medium
fd00::/64 dev load-balan-861a proto kernel metric 256 pref medium
fd00:0:0:1::/64 dev load-balan-5f4e proto kernel metric 256 pref medium
fe80::/64 dev eth0 proto kernel metric 256 pref medium
fe80::/64 dev ext-vlan.100 proto kernel metric 256 pref medium
fe80::/64 dev load-balan-861a proto kernel metric 256 pref medium
fe80::/64 dev load-balan-5f4e proto kernel metric 256 pref medium
local ::1 dev lo table local proto kernel metric 0 pref medium
anycast 100:100:: dev ext-vlan.100 table local proto kernel metric 0 pref medium
local 100:100::2 dev ext-vlan.100 table local proto kernel metric 0 pref medium
anycast fd00:: dev load-balan-861a table local proto kernel metric 0 pref medium
local fd00::2 dev load-balan-861a table local proto kernel metric 0 pref medium
anycast fd00:0:0:1:: dev load-balan-5f4e table local proto kernel metric 0 pref medium
local fd00:0:0:1::4 dev load-balan-5f4e table local proto kernel metric 0 pref medium
anycast fe80:: dev eth0 table local proto kernel metric 0 pref medium
anycast fe80:: dev ext-vlan.100 table local proto kernel metric 0 pref medium
anycast fe80:: dev load-balan-5f4e table local proto kernel metric 0 pref medium
anycast fe80:: dev load-balan-861a table local proto kernel metric 0 pref medium
local fe80::42:acff:fe12:3 dev ext-vlan.100 table local proto kernel metric 0 pref medium
local fe80::fe:59ff:febf:9523 dev load-balan-861a table local proto kernel metric 0 pref medium
local fe80::fe:78ff:fef0:9892 dev load-balan-5f4e table local proto kernel metric 0 pref medium
local fe80::54f9:19ff:fe30:e6d1 dev eth0 table local proto kernel metric 0 pref medium
multicast ff00::/8 dev eth0 table local proto kernel metric 256 pref medium
multicast ff00::/8 dev ext-vlan.100 table local proto kernel metric 256 pref medium
multicast ff00::/8 dev load-balan-861a table local proto kernel metric 256 pref medium
multicast ff00::/8 dev load-balan-5f4e table local proto kernel metric 256 pref medium

time=2022-06-03T11:08:45+02:00 level=info msg=load-balancer-trench-a-666f6bc986-qjd4m: <nil> - Shm: tshm-stream-a
  Fw: own=0
  Maglev: M=9973, N=100
   Lookup: 30 68 68 57 16 16 16 75 30 16 16 16 75 16 16 75 30 16 16 16 75 16 30 75 68...
   Active: 17(16) 31(30) 58(57) 69(68) 76(75)

time=2022-06-03T11:08:45+02:00 level=info msg=load-balancer-trench-a-666f6bc986-qjd4m: <nil> - 0:   from all lookup local
93: from all fwmark 0x11 lookup 17
94: from all fwmark 0x3a lookup 58
95: from all fwmark 0x1f lookup 31
98: from all fwmark 0x45 lookup 69
99: from all fwmark 0x4c lookup 76
100:    from 20.0.0.1 lookup 4096
100:    from 40.0.0.0/24 lookup 4096
32766:  from all lookup main
32767:  from all lookup default

time=2022-06-03T11:08:45+02:00 level=info msg=load-balancer-trench-a-666f6bc986-qjd4m: <nil> - default via 169.254.100.150 dev ext-vlan.100 table 4096 proto bird metric 32 
default via 172.16.0.50 dev load-balan-5a6d table 17 
default via 172.16.0.36 dev load-balan-5a6d table 31 
default via 172.16.1.44 dev load-balan-3a3f table 58 
default via 172.16.0.6 dev load-balan-5a6d table 69 
default via 172.16.1.6 dev load-balan-3a3f table 76 
default via 10.244.1.1 dev eth0 
10.244.1.0/24 via 10.244.1.1 dev eth0 src 10.244.1.7 
10.244.1.1 dev eth0 scope link src 10.244.1.7 
169.254.100.0/24 dev ext-vlan.100 proto kernel scope link src 169.254.100.1 
172.16.0.0/24 dev load-balan-5a6d proto kernel scope link src 172.16.0.4 
172.16.1.0/24 dev load-balan-3a3f proto kernel scope link src 172.16.1.2 
local 10.244.1.7 dev eth0 table local proto kernel scope host src 10.244.1.7 
broadcast 10.244.1.255 dev eth0 table local proto kernel scope link src 10.244.1.7 
local 127.0.0.0/8 dev lo table local proto kernel scope host src 127.0.0.1 
local 127.0.0.1 dev lo table local proto kernel scope host src 127.0.0.1 
broadcast 127.255.255.255 dev lo table local proto kernel scope link src 127.0.0.1 
local 169.254.100.1 dev ext-vlan.100 table local proto kernel scope host src 169.254.100.1 
broadcast 169.254.100.255 dev ext-vlan.100 table local proto kernel scope link src 169.254.100.1 
local 172.16.0.4 dev load-balan-5a6d table local proto kernel scope host src 172.16.0.4 
broadcast 172.16.0.255 dev load-balan-5a6d table local proto kernel scope link src 172.16.0.4 
local 172.16.1.2 dev load-balan-3a3f table local proto kernel scope host src 172.16.1.2 
broadcast 172.16.1.255 dev load-balan-3a3f table local proto kernel scope link src 172.16.1.2 
default via 100:100::150 dev ext-vlan.100 table 4096 proto bird metric 32 pref medium
default via fd00::32 dev load-balan-5a6d table 17 metric 1024 pref medium
default via fd00::24 dev load-balan-5a6d table 31 metric 1024 pref medium
default via fd00:0:0:1::2c dev load-balan-3a3f table 58 metric 1024 pref medium
default via fd00::6 dev load-balan-5a6d table 69 metric 1024 pref medium
default via fd00:0:0:1::6 dev load-balan-3a3f table 76 metric 1024 pref medium
100:100::/64 dev ext-vlan.100 proto kernel metric 256 pref medium
fd00::/64 dev load-balan-5a6d proto kernel metric 256 pref medium
fd00:0:0:1::/64 dev load-balan-3a3f proto kernel metric 256 pref medium
fe80::/64 dev eth0 proto kernel metric 256 pref medium
fe80::/64 dev ext-vlan.100 proto kernel metric 256 pref medium
fe80::/64 dev load-balan-5a6d proto kernel metric 256 pref medium
fe80::/64 dev load-balan-3a3f proto kernel metric 256 pref medium
local ::1 dev lo table local proto kernel metric 0 pref medium
anycast 100:100:: dev ext-vlan.100 table local proto kernel metric 0 pref medium
local 100:100::1 dev ext-vlan.100 table local proto kernel metric 0 pref medium
anycast fd00:: dev load-balan-5a6d table local proto kernel metric 0 pref medium
local fd00::4 dev load-balan-5a6d table local proto kernel metric 0 pref medium
anycast fd00:0:0:1:: dev load-balan-3a3f table local proto kernel metric 0 pref medium
local fd00:0:0:1::2 dev load-balan-3a3f table local proto kernel metric 0 pref medium
anycast fe80:: dev eth0 table local proto kernel metric 0 pref medium
anycast fe80:: dev ext-vlan.100 table local proto kernel metric 0 pref medium
anycast fe80:: dev load-balan-5a6d table local proto kernel metric 0 pref medium
anycast fe80:: dev load-balan-3a3f table local proto kernel metric 0 pref medium
local fe80::42:acff:fe12:5 dev ext-vlan.100 table local proto kernel metric 0 pref medium
local fe80::fe:4ff:fe84:48cb dev load-balan-5a6d table local proto kernel metric 0 pref medium
local fe80::fe:54ff:fe43:6876 dev load-balan-3a3f table local proto kernel metric 0 pref medium
local fe80::9ccc:93ff:fedb:e700 dev eth0 table local proto kernel metric 0 pref medium
multicast ff00::/8 dev eth0 table local proto kernel metric 256 pref medium
multicast ff00::/8 dev ext-vlan.100 table local proto kernel metric 256 pref medium
multicast ff00::/8 dev load-balan-5a6d table local proto kernel metric 256 pref medium
multicast ff00::/8 dev load-balan-3a3f table local proto kernel metric 256 pref medium

STEP: Checking if all targets have receive traffic with no traffic interruption (no lost connection) 06/03/22 11:08:55.486
------------------------------
• [FAILED] [32.998 seconds]
Scaling
/home/lionelj/Workspaces/Meridio/test/e2e/scaling_test.go:32
  When trench is with 2 VIP addresses (20.0.0.1:5000, [2000::1]:5000) and 4 target pods running ctraffic
  /home/lionelj/Workspaces/Meridio/test/e2e/scaling_test.go:34
    when scaling targets up to 5
    /home/lionelj/Workspaces/Meridio/test/e2e/scaling_test.go:107
      [It] should receive the traffic correctly
      /home/lionelj/Workspaces/Meridio/test/e2e/scaling_test.go:111

  Begin Captured GinkgoWriter Output >>
    STEP: Waiting for the new targets to be registered 06/03/22 11:08:25.084
    STEP: Checking if all targets have receive traffic with no traffic interruption (no lost connection) 06/03/22 11:08:55.486
  << End Captured GinkgoWriter Output

  Expected
      <int>: 3
  to equal
      <int>: 5
  In [It] at: /home/lionelj/Workspaces/Meridio/test/e2e/scaling_test.go:157

LionelJouin commented 2 years ago

Here are new logs: traffic-disturbance.zip

The tests passed 8 times and failed on attempt 9.

The logs in the network directory were collected after scaling to 5 targets, just before sending the traffic that failed to reach all 5 of them. At that time, the load-balancers could ping every target and send TCP traffic to each of them (not shown in the logs, but I tried it).

Here are targets that have received traffic:

target-a-5cbdfc758-kc8dr
target-a-5cbdfc758-pqdns
target-a-5cbdfc758-x9nhq

And the ones that haven't:

target-a-5cbdfc758-8m6zm
target-a-5cbdfc758-hkcx8
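
One way to narrow this down is to check whether the Maglev lookup table printed by nfqlb references every active target. The sketch below is illustrative only: it assumes the number in parentheses in the "Active:" line is the index stored in the "Lookup:" table (which agrees with the values logged above), and the function names are hypothetical. The logged "Lookup:" line is truncated, so it covers only the first 25 of M=9973 entries; an index missing from that prefix would not prove the target is absent from the full table.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// Matches "Active:" entries such as "17(16)" and captures the
// parenthesized number, assumed here to be the lookup-table index.
var indexRe = regexp.MustCompile(`\d+\((\d+)\)`)

// activeIndexes returns the parenthesized index of every active entry.
func activeIndexes(line string) []int {
	var indexes []int
	for _, m := range indexRe.FindAllStringSubmatch(line, -1) {
		if v, err := strconv.Atoi(m[1]); err == nil {
			indexes = append(indexes, v)
		}
	}
	return indexes
}

// countLookup counts how often each index occurs in a (possibly
// truncated) lookup sample such as "30 68 68 57 16...".
func countLookup(sample string) map[int]int {
	counts := make(map[int]int)
	trimmed := strings.TrimSuffix(strings.TrimSpace(sample), "...")
	for _, field := range strings.Fields(trimmed) {
		if v, err := strconv.Atoi(field); err == nil {
			counts[v]++
		}
	}
	return counts
}

// unreferenced returns the active indexes that never occur in the sample.
func unreferenced(indexes []int, counts map[int]int) []int {
	var missing []int
	for _, idx := range indexes {
		if counts[idx] == 0 {
			missing = append(missing, idx)
		}
	}
	return missing
}

func main() {
	// Values copied from the load-balancer logs in this issue.
	active := "Active: 17(16) 31(30) 58(57) 69(68) 76(75)"
	lookup := "30 68 68 57 16 16 16 75 30 16 16 16 75 16 16 75 30 16 16 16 75 16 30 75 68..."

	indexes := activeIndexes(active)
	counts := countLookup(lookup)
	for _, idx := range indexes {
		fmt.Printf("index %d: %d entries in the sample\n", idx, counts[idx])
	}
	fmt.Printf("active indexes missing from the sample: %v\n", unreferenced(indexes, counts))
}
```

With the logged values, all five active indexes occur in the sample, so at least the visible part of the lookup table references every target, including the two that received no traffic.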