Here is a bash script for testing:
#!/bin/bash

# Set the API URL
api_url="https://api-url-to-test"

# Function to check API status
check_api_status() {
    # Use curl to make a GET request to the API URL
    response_code=$(curl -s -o /dev/null -w "%{http_code}" "$api_url")

    # Check if the response code is 200 (OK)
    if [ "$response_code" -eq 200 ]; then
        echo "$response_code"
    else
        echo "API is not reachable. Response code: $response_code"
    fi
}

# Infinite loop
while true; do
    start_time=$(date +%s.%N) # Get the start time

    # Call the function to check API status
    check_api_status

    end_time=$(date +%s.%N) # Get the end time
    duration=$(echo "$end_time - $start_time" | bc) # Calculate the duration
    echo "Loop took $duration seconds."

    # Sleep for 0.5 seconds before the next iteration
    sleep 0.5
done
During the restart, one of the requests drops (i.e. active users receive errors):
Loop took 0 seconds.
200
Loop took 1.0 seconds.
API is not reachable. Response code: 503
Loop took 0 seconds.
200
Hi,
Can you provide us with your tunnel ID?
You mention that the traffic is going through WARP, but it was not clear whether you are exposing your origins via Public Hostnames (ingresses) or via private networking.
Also, from your experience, how long does it take for the 502s to clear after the new pods are rolled out?
I am using WARP to connect my Kubernetes cluster to my local system through Zero Trust, then using .local addresses. That is where I see the Cloudflare Gateway image above. That is on tunnel ID: 5e638397-84c6-4a8c-a37f-6e13fc3b31cd
The bash script test runs against a public domain name hosted by Cloudflare and tunnelled inside the Kubernetes cluster to a service, like an ordinary web service. That one is tunnel ID: 60f0a274-c3d2-4f01-9ae9-fc51bfa70252
On the bash ping test it only dropped once in that test. On the local WARP setup it took maybe five to ten seconds to recover, sometimes showing the Cloudflare page from the image above, sometimes with the browser sitting there for a while trying to work out what was going on.
This was all experienced on those tunnels around the time of my posts above.
So, as an initial finding, we can see that multiple cloudflared instances disconnected and connected in both tunnels. Some connections lasted less than 1 minute.
What I think may be happening is that because the instances seem to be churning, our edge caches that connection information for some time (seconds). This includes positive matches for connections and also negative matches (no connections found), which explains why the churning leads to these issues.
I would suggest investigating why there were so many restarts. For instance, for tunnel 5e638397-84c6-4a8c-a37f-6e13fc3b31cd we can see close to 15 cloudflared instances connecting and disconnecting around 2:58 am, before a final set of 3 became stable. All of these restarts are reported as cloudflared "going away".
A second step may be to increase the time the rollout takes, to decrease the probability of this issue. But I would spend some time on the first step first.
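For reference, one way to stretch a Deployment rollout out in time is Kubernetes' minReadySeconds field. The snippet below is only a minimal sketch with assumed names and values (a Deployment called cloudflared, 30 seconds), not this cluster's actual manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cloudflared        # assumed name, for illustration only
spec:
  replicas: 3
  # A newly created pod must stay Ready for this long before the rollout
  # treats it as available and moves on to replacing the next old pod,
  # which spreads the cloudflared restarts out over time.
  minReadySeconds: 30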
The restarts were me trying to replicate the issue. It initially occurred around 1am UTC, which will likely have been the first time it restarted. My nodes began executing updates and restarting themselves one at a time (as planned), which involved cordoning and draining. I was working on the dev environment at the time, started seeing the 502 page, and began manually triggering rolling updates to figure out what was going on.
The rollout currently takes a minimum of ten seconds, as the startup probe has a default of 10s. That means it scales up to 4 or more Cloudflared instances, waits at least 10 seconds before hitting the /ready endpoint to check the new instance is healthy, then gracefully exits the old one. There will be a lot of tunnel connections going up and down within the same 1 minute window while it rolls the updates.
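A startup probe of that shape would look roughly like the sketch below. It assumes cloudflared is run with its metrics server enabled (the /ready endpoint is served on the --metrics address); the address, port, and threshold values are illustrative, not the exact ones in use here:

containers:
  - name: cloudflared
    args:
      - tunnel
      - --metrics
      - 0.0.0.0:2000           # assumed metrics address; /ready is served here
      - run                    # tunnel name/token omitted
    startupProbe:
      httpGet:
        path: /ready
        port: 2000
      initialDelaySeconds: 10  # the ~10 second wait described above
      periodSeconds: 5
      failureThreshold: 6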
I will try again right now, as there shouldn't have been any tunnels going up or down for some time.
there were close to 15 cloudflared instances connecting and disconnecting around that time
There shouldn't ever have been any way for 15 instances to be active; the absolute maximum could only ever be 6. I assume that means at least 9 (most likely 11) of them were already disconnected and inaccessible, as I can't think of a way there could have been more than 6 active at any one time.
A bunch of restarts worked fine without issue. Then I did manage to replicate it again just before 13:29:12 WET (60f0a274-c3d2-4f01-9ae9-fc51bfa70252):
API is not reachable. Response code: 503
I don't fully understand the caching behaviour you mentioned, of course, but it does fit your theory that it happens when there are many restarts. That said, this originally occurred on the first restart attempt, which was part of the automation.
I will leave it for an hour, which hopefully lets the caches expire, then try again.
There are lots of these Serve tunnel error error="failed to accept QUIC stream: timeout: no recent network activity" messages in the logs, but I haven't been able to establish whether they are related:
2024-01-11T13:41:39Z WRN Serve tunnel error error="failed to accept QUIC stream: timeout: no recent network activity" connIndex=2 event=0 ip=198.41.192.27
I managed to replicate it on the first attempt this time (60f0a274-c3d2-4f01-9ae9-fc51bfa70252):
Thu 11 Jan 2024 20:24:26 WET
API is not reachable. Response code: 503
It had been running all day, with no restarts for around 6 hours. I triggered a rolling restart of the 3 Cloudflared containers on the cluster; it scaled up and then started exiting the existing containers.
Three is not that many; it's what I am running for the dev environment. I would expect to be able to restart many more than this without 503 errors (or 502s in the WARP scenario).
Describe the bug
I am running multiple replicas of Cloudflared on a Kubernetes cluster to serve traffic to a Kubernetes service. When I restart the deployment it initiates a rolling update, starting a new instance before terminating the running one. Each of the Cloudflared instances appears to exit gracefully as expected:
I have the probes set up for starting new replicas as follows. I can observe them taking effect as expected:
I have minReplicas: 3 and the following rolling update strategy:
When I observe the rolling update I can see all three instances running, then it scales up and waits 10 seconds as expected for the startup probe before exiting the old instance.
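For illustration, a rolling update that brings a new instance up before taking an old one down is typically expressed like this sketch (the values here are placeholders, not necessarily the exact ones in use):

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1          # start one extra cloudflared pod first...
    maxUnavailable: 0    # ...and never terminate an old pod before a new one is Ready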
From a Kubernetes perspective everything seems to be working as expected. Yet when trying to access the service through the domain name during these restarts I keep encountering HTTP Response Code: 502:
This is through a Cloudflare WARP installation. I can replicate it through an online domain (also with Cloudflare) as well, although instead of the 502 above, one of the refreshes in my browser simply never makes it.
Expected Behaviour
With the graceful shutdowns happening as expected, I expected Cloudflared to inform the Cloudflare edge not to relay any requests to the shut-down or shutting-down Cloudflared instances.
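The settings that govern how long a terminating instance keeps serving are the pod's terminationGracePeriodSeconds and cloudflared's --grace-period flag (which bounds how long it waits for in-flight requests after SIGTERM). The sketch below is illustrative only, with assumed values rather than the configuration actually in use:

spec:
  terminationGracePeriodSeconds: 60   # assumed; should exceed cloudflared's grace period
  containers:
    - name: cloudflared
      args:
        - tunnel
        - --grace-period
        - 30s                         # assumed; upper bound on draining in-flight requests after SIGTERM
        - run                         # tunnel name/token omitted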
Environment and versions
Kubernetes: 1.29
Cloudflared: 2024.1.1