TCP disconnection upon master switchover

Describe the bug
Despite using Keepalived with Conntrackd, the Master switchover leads to TCP disconnection.
See the to reproduce (next) section for the setup and issues.

To Reproduce
Any steps necessary to reproduce the behaviour:

Machine 1: HAProxy 1: 192.168.1.26 (VIP: 192.168.1.210)
Machine 2: HAProxy 2: 192.168.1.27 (VIP: 192.168.1.210)
Machine 3: Nginx 1: 192.168.1.28
Machine 4: Nginx 2: 192.168.1.29
Machine 5: Client: 192.168.1.213

I found two scenarios with two different outcomes:

When sending request with apache benchmark from the client (192.168.1.213) to the VIP (192.168.1.210), if I stop the keepalived service, then the TCP connection is left hanging. From there two options (note that in both scenarios, the TCP connection is never switched over to the new Master instance):
- Scenarii 1.1: If I restart the keepalived service prior to the 30sec, all requests left hanging are served once the initial master goes back online (e.g. connection did not have the time to timeout).
- Scenarii 1.2: If I do not restart the service, after the delay set by the client (apache benchmark, ab, has a default of 30sec), then the timeout is reached apr_pollset_poll: The timeout specified has expired (70007). The remaining node (initially backup, now master) never continues the TCP connection. Upon monitoring the traffic the only traffic I see is a [FIN,ACK] from the client and the node replying a RST after the timeout (30s here, so with a service shutdown at t=5s, I got it at t~=35s).

In both cases (1.1, 1.2), I do not understand why the TCP connection does not continue after the switchover.

When blasting request with apache benchmark, if I shutdown the master host, the TCP connection immediately shuts down: apr_socket_recv: Connection reset by peer (104). The client immediately receives a [RST, ACK].

In all cases, I see nothing alarming in the logs, and the switchover seems to work. If I query the virtual IP address after shutting down the master host or the keepalived service on the master host, request are served successfully via the backup (now master) node. So, the switchover happens, just not at the TCP level. On the conntrackd end, if list connections with sudo conntrackd -i, I can see a lot of TCP connections replicated.

I would have assumed there's something wrong my configuration, except that I have proof that these exact same configurations worked. I have one run where the switchover was successfully made, with no interruption of the TCP connection.
Below, I highlight the exact request at which the takeover between HAP1 and HAP2 happened. Seems like right at the end of a communication ([FIN, ACK]). I guess it's perfect timing?

Expected behavior
TCP connection should not end or be left hanging, they should be served by the remaining node upon failover.

Keepalived version

Keepalived v2.2.8 (04/04,2023), git commit v2.2.7-154-g292b299e+

Distro (please complete the following information):

Name: Ubuntu
Version: 22.04.4 LTS
Architecture: x86_64

Details of any containerization or hosted service
Ran inside VMs, the hypervisor in use is XCP-NG.

Did keepalived coredump?
No.

Additional context
I don't think it's similar to #2254, but I could be wrong?

@EVMBugs There doesn't appear to be anything wrong with your keepalived configs. I do note that you have modified the path to the conntrackd binary from /usr/sbin/conntrackd to /usr/local/sbin/conntrackd. When I install the conntrackd package on Ubuntu 22.04 the binary is installed to /usr/sbin.

From the logs you provide, when the VRRP instance transitions from backup to master keepalived does appear to be running the primary-backup.sh script, and that in turn appears to be executing conntrackd.

The Wireshark extracts you provide are interesting. pkts 2515-6: It appears that neither end of the connection has sent any data (seq=1, ack=1).

pkts 439...: You refer to blasting request with apache benchmark. Only 81 bytes have been received on the server side.

pkts 370...: The connection between ports 80 and 49274 is shown shutting down from the client side, but there is no server side shutdown shown. The connection between ports 80 and 49280 is successfully set up, and the client sends data, but no server side response to that is shown.

You don't state on which system the Wireshark extracts were captured - was it on the client or on hap1 or 2?

I am confused about your reference to machines 3 and 4 (Nginx 1 and 2) since I can see nowhere where they are involved, and nowhere where their IP addresses are shown being used. Given what you have set out, I would expect to see some IPVS configuration so that packets sent to 192.168.1.210 are forwarded to 192.168.1.28 or 192.168.1.29.

You state that conntrackd -i shows a lot of TCP connections. What is important, however, is what is in the kernel conntrack tables, and for this you need to use the conntrack program rather than the conntrackd program.

I think, after hap2 has taken over as master, you need to capture packets on hap1 and hap2 to see what they are doing (you probably need to show the MAC addresses as well since 192.168.1.210 will have different MAC addresses depending whether it is on hap1 or hap2. You also need to look at the conntrack tables on hap2 to ensure that it has the relevant entries for the connections in progress.

I do note that you have modified the path to the conntrackd binary from /usr/sbin/conntrackd to /usr/local/sbin/conntrackd. When I install the conntrackd package on Ubuntu 22.04 the binary is installed to /usr/sbin.

From my experiments, I've noticed that if you install conntrackd using aptitude on U22.04, the binary can be found in /usr/sbin. However, if you compile it from sources, it will be in /usr/local/sbin/conntrackd instead. Here, I compiled the binary from the sources.

pkts 2515-6 is the scenario in which I stop the keepalived service, and let the connection timeout. The dump was captured on the client (192.168.1.213) that is sending the requests using apache benchmark.

pkts 439...: This scenario is the one in which I shut down the master keepalived node (hap1, 192.168.1.26) and the connection breaks immediately upon shut down. This dump was also captured on the client (192.168.1.213). By blasting, I meant that I'm sending thousands of requests. 81 bytes seems to match with the expected TCP payload being sent to the server: GET / HTTP/1.0 Host: 192.168.1.210 User-Agent: ApacheBench/2.3 Accept: */*.

pkts 370...: This scenario is the one in which I shut down the master keepalived node (hap1, 192.168.1.26) and the connection continued. This dump was also captured on the client (192.168.1.213). There is a server side response. I did not include it in the dump as I did not want to bury you with details. I added more detail with respect to that connection below

Machines 3 and 4 (Nginx 1 and 2) and two simple stupid web server that serve the default Nginx page. When a request comes in any of the two HAProxy (machine 1 and 2, aka HAProxy 1 and 2), the request is dispatched to either machine 3 or machine 4. Sorry if that was not clear. Ofc, I made sure the setup with the two haproxy and the two nginx was working prior to setting up keepalived+conntrackd.
So any request from the client will flow in the following way: 192.168.1.213 --> 192.168.1.210 --> 192.168.1.26 (if hap1 is the master) --> 192.168.1.28 (assuming machine 1 (hap1) dispatched the request to machine 3 (nginx1))

Nominal Case (pkts 370...) In this case, everything went well and the connection was switched seamlessly upon the failure of hap1 (simulated by a sudo poweroff without any notice). The goal is that this behavior should always happen.

@EVMBugs You write but I do not know if that's the expected behavior between keepalived+conntrack. This is a misunderstanding; there is no connection between keepalived and conntrack (or conntrackd).

I have never used HAProxy, but I have done some looking into it. I don't think HAProxy supports what you want to achieve. My understanding is that HAProxy receives an incoming TCP connection, creates a corresponding connection to a backend server, and traffic arrives on one TCP connection, passes through HAProxy (which may modify the data) and then sends it out on the other TCP connection. In other words, there is one TCP connection between the client and, say, hap1, and a different TCP connection between hap1 and, say, Nginx1; incoming data on one connection is sent out on the other TCP connection. Now suppose hap1 fails. keepalived can transfer the IP address (192.168.1.210) from hap1 to hap2, and conntrackd can update the kernel connection tracking system, but there is no corresponding TCP connection between hap2 and nginx1, and so the connection from the client will be broken.

In your examples above, my understanding is that the apache benchmark you are running is creating very short duration TCP connections, i.e. TCP open, HTTP GET, HTTP response (1 packet), TCP close. Where you appear to have had successful test results it can only be because at the time of the failure/shutdown on hap1, there happened to be no open TCP connections at the time.

In order to achieve what I understand you are wanting to do, you need to use IPVS, or an equivalent. keepalived supports IPVS with its virtual server/real server configuration, and also has checkers to monitor the availability of the real servers (in your case the nginx systems). IPVS also supports load balancing.

HAProxy has additional functionality over using IPVS, but so far as I can see if you want to be able to survive the failure of one of your load balancing systems and keep TCP connections open, you cannot use it.

acassen / keepalived

TCP disconnection upon master switchover #2414