NICMx / Jool

SIIT and NAT64 for Linux
GNU General Public License v2.0

Session Synchronization #385

Closed suffieldacademy closed 2 years ago

suffieldacademy commented 2 years ago

Hello,

We're busy setting up a multi-node multi-instance NAT64 cluster. Most of the pieces are in place, but we're having trouble with the session synchronization. I believe we've followed the instructions, but we are not seeing the session states from one node showing up on the other. Additionally, we're seeing session traffic from multiple instances even though we're only creating sessions in a single instance. I'm wondering if there might be an issue with per-instance sessions.

Quick background on the setup:

Both nodes have the following instances defined:

Both NAT64 instances have joold configured to run session synchronization:

    "ss-enabled": true,
    "ss-flush-asap": false,
    "ss-flush-deadline": 2000,
    "ss-max-payload": 1446
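For context, a minimal sketch of where those keys sit in an atomic configuration file (the instance name, framework, and pool6 below are illustrative placeholders, not necessarily our real values):

```json
{
	"instance": "nat64-wkp-lower",
	"framework": "netfilter",
	"global": {
		"pool6": "64:ff9b::/96",
		"ss-enabled": true,
		"ss-flush-asap": false,
		"ss-flush-deadline": 2000,
		"ss-max-payload": 1446
	}
}
```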

The instances use the same multicast destination, but different ports:

// instance nat64-wkp-lower
 "multicast address": "ff08::db8:64:64",
 "multicast port": "6240",
 "in interface": "eno3",
 "out interface": "eno3",
 "reuseaddr": 1
// instance nat64-wkp-upper
 "multicast address": "ff08::db8:64:64",
 "multicast port": "6241",
 "in interface": "eno3",
 "out interface": "eno3",
 "reuseaddr": 1
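Each instance gets its own joold daemon pointed at its own socket config files. Roughly how we launch them (file paths are illustrative, not our real ones):

```shell
# One joold per instance: each daemon reads its own netsocket config
# (shown above) plus a modsocket file naming the Jool instance it attaches
# to. Paths below are illustrative.
joold /etc/jool/netsocket-lower.json /etc/jool/modsocket-lower.json &
joold /etc/jool/netsocket-upper.json /etc/jool/modsocket-upper.json &
```

If I'm reading the docs right, each modsocket file just names the instance, e.g. `{ "instance": "nat64-wkp-lower" }`.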

The two nodes are directly connected to each other via Ethernet. We have not assigned any addresses manually; only link-local IPv6 addresses are configured automatically.

When we start the instances and generate traffic, the translation is occurring. On the node that is translating the traffic, we see session entries being generated:

Expires in 0:00:58.376
Remote: dns.google#38007    ...#42020
Local: ...#38007    64:ff9b::808:808#42020
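(For anyone following along, we're checking the table with something like the following; the instance flag is assumed from the multi-instance docs:)

```shell
# Poll the session table on either host; --numeric skips reverse DNS
# lookups so addresses print as-is. An empty table on the failover host
# means no sessions were imported.
jool -i nat64-wkp-lower session display --numeric
```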

Additionally, on the failover host we see the multicast packets arriving on the interface with the correct multicast destination and port number.

However, we are not seeing any sessions being created in the failover host (the session table is empty). Is there any debugging or other information we can enable to try to find where the packets might be getting lost?
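For the record, this is how we're confirming the multicast packets arrive on the failover host (interface and ports taken from the config above):

```shell
# Watch for the SS multicast datagrams on the failover node's interface:
tcpdump -i eno3 -n 'ip6 and udp and (port 6240 or port 6241)'
```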

One other oddity that we noticed is that even though our test traffic is only going through a single instance on the primary box, BOTH instances are generating session sync traffic. This happens even if we set ss-enabled=false on one of the instances (traffic is still generated to both ports). I'm wondering if perhaps joold is receiving session updates for all instances and forwarding them, rather than only propagating changes for a particular instance.

However, even if that were the case, I'm not sure why the other instances aren't seeing any sessions arrive (I would instead expect to see too many if all instances were generating duplicate traffic).

suffieldacademy commented 2 years ago

Brief update. We've peeled back as much of the configuration as possible, down to a single netfilter (not iptables) instance named "default", so the setup is as simple as possible.

We are now seeing syslog entries from the primary host:

joold: Received a packet from kernelspace.
joold: Sending 280 bytes to the network...
joold: Sent 280 bytes to the network.

However, the failover machine fails to process them, logging:

joold: Received 280 bytes from the network.
joold: Error receiving packet from kernelspace: Invalid input data or parameter

I'm not much of a C programmer, but searching for that error in the source brings me to usr/joold/modsocket.c; I can't figure out much more from there.

We are isolating the forwarding interfaces, joold, etc all inside a network namespace (netns) as shown in the documentation. Is there any known odd behavior with netns and joold? Otherwise, I'm not sure why the packets aren't being processed.
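For completeness, a sketch of the namespaced launch we're using (the namespace name and file paths are illustrative):

```shell
# joold must share a netns with the translating interfaces; otherwise its
# multicast socket binds in the wrong namespace. "joolns" is illustrative.
ip netns exec joolns joold /etc/jool/netsocket.json /etc/jool/modsocket.json
```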

suffieldacademy commented 2 years ago

Now that I found that more specific error, I see this is referenced in #362. I am running 4.1.5 on Debian stable, so I will try to upgrade to a more recent version and see if I can unravel this further.

suffieldacademy commented 2 years ago

OK, I re-constituted the full multi-instance setup under v4.1.8 and am having much better luck. Apologies for not starting with the most recent release, but I usually try to stick to the Debian repos.

Sorry for the noise, but sometimes typing it all out helps me work through it!