acassen / keepalived

Keepalived
https://www.keepalived.org
GNU General Public License v2.0

vrrp master switchover lead to disconnection of tcp connections #2254

Open maryoho opened 1 year ago

maryoho commented 1 year ago

Describe the bug

In my network experiment, I ran keepalived on two NAT devices to implement high availability. When mastership moves to the other node, a script is executed to commit the NAT sessions to the kernel. The sequence of actions:

  1. set the VIP on the new master's network interface
  2. send gratuitous ARP
  3. execute the script

The third step takes some time, and during that window a connection is reset if the client or server sends a packet before its NAT session has been synced.

Expected behavior

TCP connections should not be reset.

Keepalived version

Keepalived v2.1.5 (07/13,2020)

Configuration file:

node1:
global_defs {
    router_id 172.18.0.2
    vrrp_skip_check_adv_addr
    vrrp_garp_master_refresh 60
    vrrp_garp_master_refresh_repeat 2
    vrrp_garp_master_repeat 5
    vrrp_garp_interval 0.001
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51

    unicast_src_ip 172.18.0.2

    unicast_peer {
        172.18.0.3
    }

    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1111
    }

    virtual_ipaddress {
        172.18.0.254
    }
    notify_master "/etc/conntrackd/primary-backup.sh primary"
    notify_backup "/etc/conntrackd/primary-backup.sh backup"
    notify_fault "/etc/conntrackd/primary-backup.sh fault"
}

node2:
global_defs {
    router_id 172.18.0.3
    vrrp_skip_check_adv_addr
    vrrp_garp_master_refresh 60
    vrrp_garp_master_refresh_repeat 2
    vrrp_garp_master_repeat 5
    vrrp_garp_interval 0.001
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51

    unicast_src_ip 172.18.0.3

    unicast_peer {
        172.18.0.2
    }

    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1111
    }

    virtual_ipaddress {
        172.18.0.254
    }
    notify_master "/etc/conntrackd/primary-backup.sh primary"
    notify_backup "/etc/conntrackd/primary-backup.sh backup"
    notify_fault "/etc/conntrackd/primary-backup.sh fault"
}

Notify and track scripts

https://github.com/vyos/conntrack-tools/blob/current/doc/sync/primary-backup.sh
maryoho commented 1 year ago

I made a PR to solve this issue: https://github.com/acassen/keepalived/pull/2255

pqarmitage commented 1 year ago

This doesn't relate to the issue, but there seem to be a couple of configuration errors in the config above:

  1. The VRRP instance on both nodes is configured to start in state MASTER, which cannot be correct. Simply delete the state MASTER lines, since the setting doesn't really do anything.
  2. Both VRRP instances have priority 100. The instances should have different priorities.
pqarmitage commented 1 year ago

I have been thinking further about this, and it seems to me that keepalived should be able to manage conntrackd since there must be many users who want to use conntrackd with keepalived for exactly the reasons you are using it.

The root of the problem is that if the VIPs are installed before the conntrack entries are installed into the kernel by conntrackd, and a packet is received by the new master before the relevant conntrack entries are installed, the kernel sends an RST. You stated in issue #2254 that executing the primary-backup.sh script "takes some time", and due to this delay there is a sufficient window for RSTs to be sent (I think the best reference for the script is https://git.netfilter.org/conntrack-tools/tree/doc/sync/primary-backup.sh).

primary-backup.sh does 4 things when a VRRP instance becomes master:

  1. conntrackd -c # commit the external cache into the kernel table
  2. conntrackd -f # flush the internal and the external caches
  3. conntrackd -R # resynchronize my internal cache to the kernel table
  4. conntrackd -B # send a bulk update to backups

I presume only the first command needs to complete before packets can be successfully handled by the kernel, and therefore only the first command needs to complete before the VIPs are used.
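If only the first step is blocking, the notify script could run conntrackd -c synchronously and treat the remaining commands as non-critical. A minimal sketch (the function name become_primary is illustrative, not part of the upstream script, and the binary paths are made overridable so the control flow can be exercised without a real conntrackd):

```shell
#!/bin/bash
# Sketch: only `conntrackd -c` (committing the external cache) must finish
# before the VIPs handle traffic; the other three steps can follow.
# Paths are illustrative and overridable; this is not the upstream script.
CONNTRACKD_BIN=${CONNTRACKD_BIN:-/usr/sbin/conntrackd}
CONNTRACKD_CONFIG=${CONNTRACKD_CONFIG:-/etc/conntrackd/conntrackd.conf}

become_primary() {
    # 1. commit the external cache into the kernel table (the blocking step)
    if ! "$CONNTRACKD_BIN" -C "$CONNTRACKD_CONFIG" -c; then
        logger "ERROR: failed to invoke conntrackd -c"
        return 1
    fi
    # 2-4. flush caches, resync from the kernel, bulk-update the backups;
    # these need not delay the VIPs
    "$CONNTRACKD_BIN" -C "$CONNTRACKD_CONFIG" -f
    "$CONNTRACKD_BIN" -C "$CONNTRACKD_CONFIG" -R
    "$CONNTRACKD_BIN" -C "$CONNTRACKD_CONFIG" -B
}
```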

One thought I have had is that you could use nftables to drop packets until the conntrack entries are loaded. One way to do this, based on the configuration you provided above is:

  1. Add startup and shutdown scripts:
    global_defs {
        startup_script /etc/keepalived/startup-nft.sh
        shutdown_script /etc/keepalived/shutdown-nft.sh
    }

startup-nft.sh

#!/bin/bash

TABLE=keepalived-conntrack-VI_1

nft create table ip $TABLE 2>/dev/null
if [[ $? -eq 0 ]]; then
    nft add chain ip $TABLE drop-vips { type filter hook prerouting priority 100\; policy accept\; }
    nft add rule ip $TABLE drop-vips ip daddr { 172.18.0.254 } drop
fi

shutdown-nft.sh

#!/bin/bash

TABLE=keepalived-conntrack-VI_1

nft delete table ip $TABLE

This would require upgrading to at least keepalived v2.2.0 to support startup and shutdown scripts.

  2. Modify primary-backup.sh:
    
    diff --git a/primary-backup.sh b/primary-backup.sh
    index fb74adc..d65cb57 100644
    --- a/primary-backup.sh
    +++ b/primary-backup.sh
    @@ -23,6 +23,9 @@ CONNTRACKD_BIN=/usr/sbin/conntrackd
     CONNTRACKD_LOCK=/var/lock/conntrack.lock
     CONNTRACKD_CONFIG=/etc/conntrackd/conntrackd.conf
    
    +NFT_BIN=/usr/sbin/nft
    +NFT_TABLE=keepalived-conntrack-VI_1
    +
     case "$1" in
       primary)
         #
    @@ -34,6 +37,11 @@ case "$1" in
             logger "ERROR: failed to invoke conntrackd -c"
         fi
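The added lines of the second hunk were lost when this page was extracted. As an illustration only (my assumption of the intent, not the actual diff content), the modified primary) section would presumably delete the nft table once conntrackd -c succeeds, so that traffic to the VIPs stops being dropped. A runnable sketch with overridable binary paths:

```shell
#!/bin/bash
# Sketch (assumption): once `conntrackd -c` has committed the external cache
# into the kernel, remove the nftables table that drops traffic to the VIPs.
NFT_BIN=${NFT_BIN:-/usr/sbin/nft}
NFT_TABLE=${NFT_TABLE:-keepalived-conntrack-VI_1}
CONNTRACKD_BIN=${CONNTRACKD_BIN:-/usr/sbin/conntrackd}
CONNTRACKD_CONFIG=${CONNTRACKD_CONFIG:-/etc/conntrackd/conntrackd.conf}

on_primary() {
    if ! "$CONNTRACKD_BIN" -C "$CONNTRACKD_CONFIG" -c; then
        logger "ERROR: failed to invoke conntrackd -c"
    fi
    # conntrack entries are now in the kernel: stop dropping VIP traffic
    "$NFT_BIN" delete table ip "$NFT_TABLE"
}
```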

If you have more vrrp instances (e.g. for public and internal interfaces), then they should probably be in a sync group, and the notify scripts can be configured against the sync group (in which case more VIPs would need to be added to the list in the drop rule). Alternatively you can add a parameter to primary-backup.sh to indicate which VRRP instance the script is being run for and use a different table for each VRRP instance.

While this can work when the node is being used for NAT (and also for virtual_servers/real_servers), if the node is being used as a router (i.e. the purpose for which VRRP was designed) it may not necessarily work. It would work using VMACs, since the nft rule could drop based on the (virtual) destination MAC address, but it would be rather harder to work out how to configure nftables when not using VMACs, although it may be possible in some circumstances using destination IP addresses.

The above approaches have the benefit that they can be implemented without modifying keepalived, however the specific nftables configurations could be quite difficult to work out. I think the best solution is to modify keepalived as follows:

  1. When keepalived becomes master it does everything as it does now (e.g. starts sending adverts), except adding the VIPs and sending gratuitous ARPs.
  2. Execute a script (which can execute conntrackd -c)
  3. After the script completes, add the VIPs and send gratuitous ARPs.
  4. Execute notify_master_post scripts (a new feature - this can execute the remaining conntrack commands).
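If keepalived sequenced the transition this way, the configuration might look something like the following sketch (notify_master_pre_vip and notify_master_post are hypothetical keyword names for the proposal above, not existing keepalived options):

```
vrrp_instance VI_1 {
    interface eth0
    virtual_router_id 51
    virtual_ipaddress {
        172.18.0.254
    }
    # hypothetical: run, and wait for, this before adding VIPs / sending GARPs
    notify_master_pre_vip "/usr/sbin/conntrackd -c"
    # hypothetical: run after the VIPs are added and GARPs sent
    notify_master_post "/etc/conntrackd/primary-backup.sh primary-post"
}
```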

An alternative is for keepalived to manage the calls to conntrackd. This has the advantage that if there are multiple VRRP instances for which transition to master state requires conntrackd commands to be executed, it can avoid conntrackd being called multiple times when there are simultaneous VRRP instance state transitions that require conntrackd to be invoked.

I will think further about this, and any thoughts you have about the above would be much appreciated. Further, if you are able to test the nftables idea I have outlined above that would be most helpful.

maryoho commented 1 year ago

Following your suggestion, I adjusted the scripts as follows:

startup-nft.sh:

#!/bin/bash

TABLE=keepalived-conntrack-VI_1
iptables -t raw -N $TABLE 2>/dev/null
iptables -t raw -A $TABLE -s 172.19.0.0/24 -j DROP 2>/dev/null

shutdown.sh:

#!/bin/bash

TABLE=keepalived-conntrack-VI_1
iptables -t raw -F
iptables -t raw -X $TABLE

primary-backup.sh:

IPTABLES_BIN=/usr/sbin/iptables
TABLE=keepalived-conntrack-VI_1

case "$1" in
  primary)
    $IPTABLES_BIN -t raw -A PREROUTING -j $TABLE
    $CONNTRACKD_BIN -C $CONNTRACKD_CONFIG -c
    if [ $? -eq 1 ]
    then
        logger "ERROR: failed to invoke conntrackd -c"
    fi
    $IPTABLES_BIN -t raw -F PREROUTING

In my test environment nft is not available, so I used iptables instead. Is there something I've missed?

In my test, TCP disconnections still occurred. It seems this solution does not work well.

pqarmitage commented 1 year ago

In startup-nft.sh you probably need to add: iptables -t raw -A PREROUTING -j $TABLE so that the rule is enabled before the VRRP instance first becomes master.

In the backup section of primary-backup.sh you also need to add: iptables -t raw -A PREROUTING -j $TABLE

It may be that, with the iptables command in the backup) section, the addition to startup-nft.sh isn't needed.
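Combining these corrections, the iptables variant might be sketched as follows (function names are illustrative; the 172.19.0.0/24 match and table name come from the scripts above, and the binaries are overridable so the control flow can be exercised without root):

```shell
#!/bin/bash
# Sketch of the corrected iptables scheme: the jump into the drop chain is
# installed at startup and again on transition to backup, so packets are
# dropped whenever this node is not yet a fully synced master.
IPTABLES_BIN=${IPTABLES_BIN:-/usr/sbin/iptables}
CONNTRACKD_BIN=${CONNTRACKD_BIN:-/usr/sbin/conntrackd}
CONNTRACKD_CONFIG=${CONNTRACKD_CONFIG:-/etc/conntrackd/conntrackd.conf}
TABLE=keepalived-conntrack-VI_1

startup() {                 # startup script: create chain and start dropping
    "$IPTABLES_BIN" -t raw -N "$TABLE" 2>/dev/null
    "$IPTABLES_BIN" -t raw -A "$TABLE" -s 172.19.0.0/24 -j DROP
    "$IPTABLES_BIN" -t raw -A PREROUTING -j "$TABLE"
}

on_primary() {              # primary) branch of primary-backup.sh
    if ! "$CONNTRACKD_BIN" -C "$CONNTRACKD_CONFIG" -c; then
        logger "ERROR: failed to invoke conntrackd -c"
    fi
    # entries committed: stop dropping (flushes all raw PREROUTING rules,
    # as in the original script)
    "$IPTABLES_BIN" -t raw -F PREROUTING
}

on_backup() {               # backup) branch: resume dropping
    "$IPTABLES_BIN" -t raw -A PREROUTING -j "$TABLE"
}
```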