@Swahjak Thanks for raising an issue on the Cloud SQL Proxy 😄
Handing this over to @hessjcg, our GKE specialist; he should be able to shed some light on this and hopefully help you get to the bottom of the additional errors you're seeing.
This is surprising.
What region do your GKE nodes live in compared to your instance?
CloudSQL is in europe-west4-b and GKE in europe-west4-a
Downgrading to 1.x.x does seem to reduce the number of error messages, but we are still seeing `couldn't connect to "toppy-***:europe-west4:toppy-***-database": dial tcp 35.2**.2**.2**:3307: connect: connection timed out` every now and then (a different message, but I assume it's the same issue).
Screenshot from error reporting on one of our applications where we returned to 1.23.x without healthchecks, patched at ~12:30
Screenshot from error reporting on one of our applications where we returned to 1.33.x with healthchecks, patched at ~16:30
We're seeing the same results in the cloud-sql-proxy logs, where the number of errors seems to align with the errors reported by the applications.
Normally i/o timeout errors indicate no network path from the Proxy to the instance. Assuming there are no changes in your environment to limit access to your public IP, I see one thing that might be worth trying: the resource limits are smaller than we usually recommend.
Compare:
```yaml
resources:
  requests:
    cpu: 100m
    memory: 100Mi
```
vs (what we recommend as a starting point):
```yaml
resources:
  requests:
    # The proxy's memory use scales linearly with the number of active
    # connections. Fewer open connections will use less memory. Adjust
    # this value based on your application's requirements.
    memory: "2Gi"
    # The proxy's CPU use scales linearly with the amount of IO between
    # the database and the application. Adjust this value based on your
    # application's requirements.
    cpu: "1"
```
I'd suggest starting with these larger values and then adjusting based on observed usage. This could be a case of CPU throttling.
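For context, a rough sketch of where those requests sit in a v2 sidecar container spec (the image tag, port, and instance connection name below are placeholders, not a prescribed configuration):
```yaml
- name: cloud-sql-proxy
  image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.6.0  # example tag only
  args:
    - "--structured-logs"
    - "--port=3306"                             # match your database engine's port
    - "my-project:europe-west4:my-instance"     # placeholder instance connection name
  securityContext:
    runAsNonRoot: true
  resources:
    requests:
      memory: "2Gi"
      cpu: "1"
```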
@enocom I will try, but we haven't had these issues before with the same settings (on 1.x.x). Given that the proxies run as a sidecar with PHP-FPM, they will never have more active connections than the number of PHP-FPM processes, which is currently limited to 10. Using the suggested numbers would actually give the sidecar more resources than the actual application container. It also doesn't quite explain the difference between 1.x.x and 2.x.x(?). In any case, we'll give doubling the resources a try and see if it reduces the errors even further, which would support your case.
All good points. I'm wondering if the healthcheck in v2 is causing the difference.
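For reference, a rough sketch of how the v2 health check is typically wired into Kubernetes probes, assuming the proxy's default HTTP port of 9090 (adjust if you override `--http-port`); the instance connection name is a placeholder:
```yaml
args:
  - "--health-check"
  - "--http-address=0.0.0.0"
  - "my-project:europe-west4:my-instance"   # placeholder
livenessProbe:
  httpGet:
    path: /liveness
    port: 9090
  periodSeconds: 60
  timeoutSeconds: 30
  failureThreshold: 5
readinessProbe:
  httpGet:
    path: /readiness
    port: 9090
  periodSeconds: 10
  timeoutSeconds: 10
  failureThreshold: 6
startupProbe:
  httpGet:
    path: /startup
    port: 9090
  periodSeconds: 1
  timeoutSeconds: 5
  failureThreshold: 20
```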
@Swahjak thanks so much for posting this issue. I've been tearing my hair out trying to figure out what I am doing wrong (and maybe I am still doing something wrong), but I am getting exactly the same message as you. I am setting up a new Kubernetes cluster from scratch and this is my first time hooking up to Cloud SQL, so I have no history to compare against.
I want to test by dropping back to the earlier version; what image are you using for 1.x.x, so I can give it a go?
I have tried:
```yaml
image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:1.33.7
```
and:
```yaml
image: gcr.io/cloud-sql-docker/gce-proxy:1.33.7
```
but in each case I get image pull failures.
Also, just a quick dumb question, but just to be sure: when I connect to the database, the name I pass is the name of a database on the database instance and not the name of the database instance? And I can connect on the standard port using `localhost`? I've tried both `localhost` and `127.0.0.1`; I just want to make sure I've not made a silly snafu.
The Google docs are always just a little light on some of these more basic details.
Many thanks, Ross
Oh, I should also have said, I am using the default sidecar configuration as described in the Google docs here, so I have the recommended resources.
@RossAddinall the image pull issue is because you have a slight typo in your second attempt: the repository is `cloudsql-docker`, not `cloud-sql-docker`.
The v1 image should work with the following:
```yaml
image: gcr.io/cloudsql-docker/gce-proxy:1.33.7
```
> when I connect to the database, the name I pass is the name of a database on the database instance and not the name of the database instance?

Are you referring to the `INSTANCE_CONNECTION_NAME`?

The `INSTANCE_CONNECTION_NAME` for your instance can be found on the Overview page for your instance in the Google Cloud console or by running the following command:
```sh
gcloud sql instances describe <INSTANCE_NAME> --format='value(connectionName)'
```
Where `INSTANCE_NAME` is the name of your Cloud SQL instance.
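Connection names take the form `PROJECT_ID:REGION:INSTANCE_NAME`; for example, the command above might print something like the following for a hypothetical project and instance:
```sh
$ gcloud sql instances describe my-instance --format='value(connectionName)'
my-project:europe-west4:my-instance
```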
If you are referring to `DB_NAME` when connecting, that is the name of the actual database on the instance itself and not the instance name.

> I can connect on the standard port and using localhost, I've tried both localhost and 127.0.0.1?

Yes, both `localhost` and `127.0.0.1` should work for connections, although we normally recommend `127.0.0.1`.
Hope this helps clarify.
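As a concrete illustration, with the proxy listening locally a connection might look like the sketch below; the database name and user are placeholders:
```sh
# Connects through the proxy's local listener on 127.0.0.1:5432 (Postgres);
# the proxy handles TLS to the instance, so the local hop itself is plain TCP.
psql "host=127.0.0.1 port=5432 dbname=<DB_NAME> user=<DB_USER>"
```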
@jackwotherspoon many thanks for your help. In fact it turns out that it was my mistake; if you don't mind, I'll document here what I did.
The Cloud SQL proxy is attempting to connect to the Public IP address of the Cloud SQL instance. I was unable to get to that IP from my container and edited my egress NAT policy following this. Once I added a CIDR for the Cloud SQL Public IP address (x.x.x.0/24), things started working (afaik).
I was actually referring to the DB_NAME and I was using the name of a database running on the database instance, though not initially!
This is what I did after the input from @jackwotherspoon:
Corrected the typo in the image I was pulling for the Cloud SQL proxy. When `kubectl apply`ing the deployment I got CrashLoopBackOff errors; this is because the 1.x.x image appears to look for a `command:` entry instead of an `args:` entry. I spotted this in this video, and this is what I used:
```yaml
image: gcr.io/cloudsql-docker/gce-proxy:1.33.7
command:
  - /cloud_sql_proxy
  - '-instances=<INSTANCE_CONNECTION_NAME>=tcp:5432'
```
When this finally ran, I got an error message:
```
2023/06/15 13:33:20 current FDs rlimit set to 1048576, wanted limit is 8500. Nothing to do here.
2023/06/15 13:33:21 Listening on 127.0.0.1:5432 for xxx
2023/06/15 13:33:21 Ready for new connections
2023/06/15 13:33:21 Generated RSA key in 295.212694ms
2023/06/15 13:33:30 New connection for "xxx"
2023/06/15 13:33:30 refreshing ephemeral certificate for instance xxx
2023/06/15 13:33:30 Scheduling refresh of ephemeral certificate in 54m59s
2023/06/15 13:35:42 couldn't connect to "xxx": dial tcp x.x.x.x:3307: connect: connection timed out
```
This convinced me (more than the 2.x.x output) that this was a connectivity issue. It was also at this point that I realised that the IP address was the public IP of the Cloud SQL instance, which I hadn't grokked previously.
I then switched back to version 2.3.0 and, when I remembered to also switch to the `args:` format, that all worked as well - so far.
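For anyone else switching between major versions: the v2 equivalent of the v1 `command:` above is roughly the sketch below; the instance connection name is a placeholder:
```yaml
image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.3.0
args:
  - "--port=5432"
  - "<INSTANCE_CONNECTION_NAME>"
```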
I wanted to know if I could get to that IP from my container, so I did:
```sh
kubectl exec deployment-7744bdb78b-d74dl -i -t -- bash -il
```
I can't use ping in my container, so I used:
```sh
wget -T 30 x.x.x.x:3307
```
and observed that it timed out.
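If wget or nc aren't available in a container, a plain bash TCP check is a possible alternative; the IP below is a placeholder and this assumes bash and coreutils `timeout` are present in the image:
```sh
# Succeeds only if a TCP connection to the instance's proxy port (3307) can be opened.
timeout 10 bash -c 'cat < /dev/null > /dev/tcp/x.x.x.x/3307' && echo reachable || echo unreachable
```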
I then decided I needed to configure my egress.
I looked at this, which then led me to this.
I created an `egress-nat-policy.yaml` as described in that final document, defined a CIDR for x.x.x.0/24, and applied it to my cluster:
```sh
kubectl apply -f egress-nat-policy.yaml
```
At this point the `wget` failed instantly with `error getting response`, and my container started correctly and connected to the database.
Circling back to @Swahjak's original question. Are you seeing any improvement with increased resource requirements?
Hey @enocom thank you for getting back on this issue. I have been experimenting with some different resource assignments, but haven't seen any significant improvements (besides switching back to 1.x.x). I must say I haven't been fully focused on this issue due to other priorities, so I haven't been monitoring very closely.
Just a thought, but could connecting through a Unix socket make any difference? Is that more or less resource intensive? In any case, I'm never going to be able to match the indicated resources, since that would make cloudsqlproxy a very pricey tool (compared to the application infrastructure).
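For reference, a Unix socket setup with the v2 proxy could look roughly like the sketch below; the mount path and instance connection name are placeholders, and whether it changes resource usage is not something measured here:
```yaml
# Proxy sidecar: create sockets under /cloudsql instead of listening on TCP.
args:
  - "--unix-socket=/cloudsql"
  - "my-project:europe-west4:my-instance"   # placeholder
volumeMounts:
  - name: cloudsql-sockets                  # shared emptyDir volume, also mounted by the app
    mountPath: /cloudsql
# The application container mounts the same volume and connects to the socket the
# proxy creates under /cloudsql/<INSTANCE_CONNECTION_NAME> instead of 127.0.0.1.
```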
So I've been monitoring this more closely and actively changing resources, but it doesn't seem to make any difference. I'm starting to lean more towards a networking issue. Currently thinking about switching to a private IP on Cloud SQL to see if this makes any difference.
It's worth a try. I'm a bit stumped as to why you're seeing connection timeouts; that usually indicates a network problem and, secondarily, a CPU resource problem.
Separately, if you have a support account, it would be worth opening a case, just so we could dig into the backend details some more in case there's something wrong here.
Last week we migrated to private IP and to using Google Cloud DNS with GKE, and we haven't seen these problems any more. So I guess this was somehow related to DNS. We even returned the resources to the 'low' values (compared to the advised values).
That means that from my perspective this can be marked as resolved. Thank you for your time and effort in helping me resolve this issue.
Does migrating to private IP mean you've removed the Proxy from the connectivity path?
I've heard anecdotal data that private IP is faster, but don't have any hard data on that.
@enocom no, the proxy is still there. We started using the `--private-ip` flag now (and a private IP on the SQL instance, obviously).
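Concretely, the relevant sidecar change looks roughly like this; the port and instance connection name below are placeholders:
```yaml
args:
  - "--private-ip"                          # dial the instance's private IP
  - "--port=3306"                           # placeholder; match your engine's port
  - "my-project:europe-west4:my-instance"   # placeholder
```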
Ah got it. In that case I'll close this. Thanks for your patience on this.
Out of curiosity, how does Cloud DNS come into the picture?
@enocom since the problems seemed somehow 'connection' related, I was looking into the GKE DNS settings, which now allow you to use 'Cloud DNS' instead of 'kube-dns'. I'm not sure whether it was the private IP option (not going through egress/ingress NAT) or DNS that in the end resolved the issue, since both were applied at the same time.
That's very interesting and good to know. Kubernetes networking is largely unknown to me, but I have anecdotal evidence like this that suggests it can affect connectivity (even though it seems like it shouldn't).
I know this is closed, but just wanted to add that I ran into this issue as well. Hopefully the extra details help!
I am running Kong inside our GKE clusters, and it was repeatedly crashing and restarting because of failures to connect to its database. We tried increasing memory and CPU on both the Kong pods and the CloudSQL instance, increasing connection timeouts from Kong to postgres, etc. Pretty much anything we could think of to increase we tried. All the metrics we were seeing were showing minimal usage across the board, and no real blocking happening on the CloudSQL instance (both cpu and io) either. We finally tried switching to a private IP on the CloudSQL instance yesterday and the errors stopped. We were getting about 40 error logs every 30 minutes, but since 3pm yesterday we haven't had a single one, which is great! But also confusing as to why that fixed it.
We are running the `us.gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.6.0` image. We never really saw this be an issue on the v1 proxy image, but it's been a while since we ran that.
Thanks for the info @jault3. It's helpful.
FWIW the v1 proxy would try to dial the public IP first and then fall back to private IP. V2 defaults to public and expects users to explicitly choose private IP (`--private-ip`). Did your instance have a public IP when you saw these failures?
Yep, our instance had only a public IP when we saw the issues (private was disabled). When we switched, we enabled the private IP on the instance, then added the `--private-ip` flag to our cloud-sql-proxy command.
Got it. It's unclear why public IP would cause this problem, but in any case private IP is a better path for both performance and security.
I don't want to dig up old issues, but we had this exact issue happen to us during a migration from an old dedicated server to GKE.
We had been using Google Cloud SQL for our production database for a few months now and had no issues connecting to it externally through a Public IP. So, fairly confident everything would go well, we moved our application to GKE.
This was our setup:
```yaml
name: sqlauthproxy
image: gcr.io/cloud-sql-connectors/cloud-sql-proxy
args:
  - '--quiet'
  - '--address=0.0.0.0'
  - '--port=3306'
  - '--credentials-file=/secrets/service_account.json'
  - 'x:europe-west1:x'
```
with `sqlauthproxy` as the database host.

Everything was fine during testing. Then we swapped our DNS to go from the old dedicated server to GKE, which caused all of our users to now go to the GKE cluster, and suddenly the `sqlauthproxy` logs were being spammed with:
```
failed to connect to instance: Dial error: failed to dial (connection name = "x:europe-west1:x"): dial tcp <public ip>:3307: i/o timeout
```
In the application this issue manifested itself as intermittent long running requests that blocked everything. For example, 20 requests would go through just fine, then 1 random request would hang for ~30 seconds, complete (successfully, so DB queries did go through), and then back to normal for the next x requests.
Monitoring showed acceptable load on the database and the containers. We debugged the entire pipeline from our ingress to nginx, php-fpm,... and quickly reverted our DNS back to the old server while we investigated the issue.
In Apache benchmark we could reproduce the issue with some simple load testing. It looked like this:
```
Percentage of the requests served within a certain time (ms)
  50%     74
  66%     80
  75%     86
  80%     92
  90%    108
  95%    174    <-- all good :D
  98%  14200    <-- what the heck? >:(
  99%  30309
 100%  30318 (longest request)
```
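For context, a typical ab invocation for this kind of load test looks roughly like the following; the URL and request counts are placeholders, not the exact test used here:
```sh
# 1000 requests, 20 concurrent, against an application endpoint that hits the database
ab -n 1000 -c 20 https://example.com/some-endpoint
```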
To fix, we enabled private IP on the SQL instance and added the `--private-ip` flag to the proxy container arguments.
Now the benchmarks have a more acceptable distribution (50% 100ms, 100% 3000ms).
We're going to try swapping the DNS back after running some more tests to see if this was the culprit of our botched migration or if we have to dig deeper.
tl;dr: A networking issue between a private GKE cluster and a public SQL instance using the SQL Auth Proxy was causing ~2% of requests to time out when the application was under load, but it could not be reproduced when there was no load. Switching to private IP on the SQL instance fixed the issue in synthetic benchmarks.
Thanks for the detailed write-up @jannes-io. I think private IP is a better option in general, but it's concerning that we're seeing these high tail latencies.
I half wonder if Cloud NAT might be involved here (which would be at play for anyone running inside a private GKE cluster reaching out to a public IP). For example: https://cloud.google.com/nat/docs/troubleshooting#insufficient-ports.
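If anyone wants to rule Cloud NAT port exhaustion out, the NAT port settings can be inspected and raised along these lines; the router and NAT names here are placeholders:
```sh
# Inspect the current NAT configuration (look at the port allocation settings)
gcloud compute routers nats describe my-nat --router=my-router --region=europe-west1

# Raise the minimum ports per VM if exhaustion shows up in the NAT metrics
gcloud compute routers nats update my-nat --router=my-router --region=europe-west1 \
  --min-ports-per-vm=4096
```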
Bug Description
Since we've upgraded from 1.*.* to 2.*.* we have been noticing 'a lot' of "failed to connect to instance: Dial error: failed to dial (connection name = "******"): dial tcp **.***.***.**:3307: i/o timeout" errors in the SQL Proxy container. We have been trying to debug this issue, but there doesn't really seem to be a clear explanation for it. It is happening across multiple (PHP) applications using different frameworks. Downgrading back to 1.*.* seems to resolve the issue, which (I think) rules out that it is network related. The database is not under (very) heavy load and we're only using ~5% of our connections limit.
Example code (or command)
Stacktrace
Steps to reproduce?
Environment
- Proxy version (`./cloud-sql-proxy --version`): 2.2.0
- Proxy invocation (e.g. `./cloud-sql-proxy --port 5432 INSTANCE_CONNECTION_NAME`): `/cloud-sql-proxy --health-check --http-address=0.0.0.0 --credentials-file=/secrets/cloudsql/cloudsqlproxy-credentials.json --max-sigterm-delay=60s --structured-logs --quiet toppy-***:europe-west4:toppy-***-database`
Additional Details
Edit: updated yaml