Here is a bash script for testing:
#!/bin/bash

# Set the API URL
api_url="https://api-url-to-test"

# Function to check API status
check_api_status() {
    # Use curl to make a GET request to the API URL
    response_code=$(curl -s -o /dev/null -w "%{http_code}" "$api_url")

    # Check if the response code is 200 (OK)
    if [ "$response_code" -eq 200 ]; then
        echo "$response_code"
    else
        echo "API is not reachable. Response code: $response_code"
    fi
}

# Infinite loop
while true; do
    start_time=$(date +%s.%N) # Get the start time

    # Call the function to check API status
    check_api_status

    end_time=$(date +%s.%N) # Get the end time
    duration=$(echo "$end_time - $start_time" | bc) # Calculate the duration
    echo "Loop took $duration seconds."

    # Sleep for 0.5 seconds before the next iteration
    sleep 0.5
done
During the restart, one of the requests drops (i.e. active users receive errors):
Loop took 0 seconds.
200
Loop took 1.0 seconds.
API is not reachable. Response code: 503
Loop took 0 seconds.
200
Hi,
Can you provide us with your tunnel ID?
You mention that the traffic is going through WARP, but it was not clear whether you are exposing your origins via Public Hostnames (ingresses) or via private networking.
Also, from your experience, how long does it take for the 502s to clear after the new pods are rolled out?
I am using WARP to connect my Kubernetes cluster to my local system through Zero Trust, then using .local addresses. That is where I see the Cloudflare Gateway image above. That is on tunnel ID: 5e638397-84c6-4a8c-a37f-6e13fc3b31cd
The bash script test runs against a public domain name hosted by Cloudflare and tunnelled inside the Kubernetes cluster to a service, like an ordinary web service. That one is tunnel ID: 60f0a274-c3d2-4f01-9ae9-fc51bfa70252
On the bash ping test it only dropped once in that test. On the local WARP setup it took maybe five to ten seconds to recover, sometimes showing the Cloudflare page from the image above, sometimes with the browser sitting there for a while trying to work out what was going on.
This was all experienced on those tunnels around the time of my posts above.
So, as an initial finding, we can see that multiple cloudflared instances disconnected and connected in both tunnels. Some connections lasted less than 1 minute.
What I think may be happening is that because the instances seem to be churning, our edge caches that connection information for some time (seconds). This includes positive matches for connections and also negative matches (no connections found), which explains why the churning leads to these issues.
I would suggest investigating why there were so many restarts. For instance, for tunnel 5e638397-84c6-4a8c-a37f-6e13fc3b31cd we can see close to 15 cloudflared instances connecting and disconnecting around 2:58 am, before a final set of 3 became stable. All of these restarts are reported as cloudflared "going away".
A second step may be to increase the time the rollout takes, to decrease the probability of this issue. But I would spend some time on the first step first.
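For reference, one way to stretch a Deployment rollout out in time is Kubernetes' minReadySeconds field. The snippet below is only a minimal sketch with assumed names and values (a Deployment called cloudflared, 30 seconds), not this cluster's actual manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cloudflared        # assumed name, for illustration only
spec:
  replicas: 3
  # A newly created pod must stay Ready for this long before the rollout
  # treats it as available and moves on to replacing the next old pod,
  # which spreads the cloudflared restarts out over time.
  minReadySeconds: 30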
The restarts were me trying to replicate the issue. It initially occurred around 1am UTC, which will likely have been the first time it restarted. My nodes began executing updates and restarting themselves one at a time (as planned), which involved cordoning and draining. I was working on the dev environment at the time, started seeing the 502 page, and began manually triggering rolling updates to figure out what was going on.
The rollout currently takes a minimum of ten seconds, as the startup probe has a default of 10s. That means it scales up to 4 or more Cloudflared instances, waits at least 10 seconds before hitting the /ready endpoint to check the new instance is healthy, then gracefully exits the old one. There will be a lot of tunnel connections going up and down within the same 1 minute window while it rolls the updates.
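A startup probe of that shape would look roughly like the sketch below. It assumes cloudflared is run with its metrics server enabled (the /ready endpoint is served on the --metrics address); the address, port, and threshold values are illustrative, not the exact ones in use here:

containers:
  - name: cloudflared
    args:
      - tunnel
      - --metrics
      - 0.0.0.0:2000           # assumed metrics address; /ready is served here
      - run                    # tunnel name/token omitted
    startupProbe:
      httpGet:
        path: /ready
        port: 2000
      initialDelaySeconds: 10  # the ~10 second wait described above
      periodSeconds: 5
      failureThreshold: 6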
I will try again right now, as there shouldn't have been any tunnels going up or down for some time.
there were close to 15 cloudflared instances connecting and disconnecting around that time
There shouldn't ever have been any way for 15 instances to be active; the absolute maximum could only ever be 6. I assume that means at least 9 (most likely 11) of them were already disconnected and inaccessible, as I can't think of a way there could have been more than 6 active at any one time.
A bunch of restarts worked fine without issue. Then I did manage to replicate it again just before 13:29:12 WET (60f0a274-c3d2-4f01-9ae9-fc51bfa70252):
API is not reachable. Response code: 503
I don't fully understand the caching behaviour you mentioned, of course, but it does fit your theory that it happens when there are many restarts. That said, this originally occurred on the first restart attempt, which was part of the automation.
I will leave it for an hour, which hopefully lets the caches expire, then try again.
There are lots of these Serve tunnel error error="failed to accept QUIC stream: timeout: no recent network activity" messages in the logs, but I haven't been able to establish whether they are related:
2024-01-11T13:41:39Z WRN Serve tunnel error error="failed to accept QUIC stream: timeout: no recent network activity" connIndex=2 event=0 ip=198.41.192.27
I managed to replicate it on the first attempt this time (60f0a274-c3d2-4f01-9ae9-fc51bfa70252):
Thu 11 Jan 2024 20:24:26 WET
API is not reachable. Response code: 503
It had been running all day, with no restarts for around 6 hours. I triggered a rolling restart of the 3 Cloudflared containers on the cluster; it scaled up and then started exiting the existing containers.
Three is not that many; it's what I am running for the dev environment. I would expect to be able to restart many more than this without 503 errors (or 502s in the WARP scenario).
Describe the bug
I am running multiple replicas of Cloudflared on a Kubernetes cluster to serve traffic to a Kubernetes service. When I restart the deployment it initiates a rolling update, starting a new instance before terminating the running one. Each of the Cloudflared instances appears to exit gracefully as expected:
I have the probes set up for starting new replicas as follows. I can observe them taking effect as expected:
I have minReplicas: 3 and the following rolling update strategy:
When I observe the rolling update I can see all three instances running, then it scales up and waits 10 seconds as expected for the startup probe before exiting the old instance.
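For illustration, a rolling update that brings a new instance up before taking an old one down is typically expressed like this sketch (the values here are placeholders, not necessarily the exact ones in use):

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1          # start one extra cloudflared pod first...
    maxUnavailable: 0    # ...and never terminate an old pod before a new one is Ready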
From a Kubernetes perspective everything seems to be working as expected. Yet when trying to access the service through the domain name during these restarts I keep encountering HTTP Response Code: 502:
This is through a Cloudflare WARP installation. I can replicate it through an online domain (also with Cloudflare) as well, although instead of the 502 above, one of the refreshes in my browser simply never makes it.
Expected Behaviour
With the graceful shutdowns happening as expected, I expected Cloudflared to inform the Cloudflare edge not to relay any requests to the shut-down or shutting-down Cloudflared instances.
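The settings that govern how long a terminating instance keeps serving are the pod's terminationGracePeriodSeconds and cloudflared's --grace-period flag (which bounds how long it waits for in-flight requests after SIGTERM). The sketch below is illustrative only, with assumed values rather than the configuration actually in use:

spec:
  terminationGracePeriodSeconds: 60   # assumed; should exceed cloudflared's grace period
  containers:
    - name: cloudflared
      args:
        - tunnel
        - --grace-period
        - 30s                         # assumed; upper bound on draining in-flight requests after SIGTERM
        - run                         # tunnel name/token omitted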
Environment and versions
Kubernetes: 1.29
Cloudflared: 2024.1.1