GoogleCloudPlatform / cloud-sql-proxy

A utility for connecting securely to your Cloud SQL instances
Apache License 2.0

failed to connect to instance: Dial error: failed to dial #1840

Closed: Swahjak closed this issue 1 year ago

Swahjak commented 1 year ago

Bug Description

Since we've upgraded from 1.*.* to 2.*.* we have been noticing 'a lot' of "failed to connect to instance: Dial error: failed to dial (connection name = "******"): dial tcp **.***.***.**:3307: i/o timeout" errors in the SQL Proxy container. We have been trying to debug this issue, but there doesn't really seem to be a clear explanation for it. It is happening across multiple (PHP) applications using different frameworks. Downgrading back to 1.*.* seems to resolve the issue, which (I think) rules out that it is network related. The database is not under (very) heavy load and we're only using ~5% of our connections limit.

Example code (or command)

...
      - command:
        - /cloud-sql-proxy
        - --health-check
        - --http-address=0.0.0.0
        - --credentials-file=/secrets/cloudsql/cloudsqlproxy-credentials.json
        - --max-sigterm-delay=60s
        - --structured-logs
        - --quiet
        - toppy-***:europe-west4:toppy-***-database
        image: eu.gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.2.0
        imagePullPolicy: IfNotPresent
        name: cloudsqlproxy-container
        ports:
        - containerPort: 3306
          protocol: TCP
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
        securityContext:
          allowPrivilegeEscalation: true
          privileged: false
          readOnlyRootFilesystem: false
          runAsGroup: 65532
          runAsNonRoot: true
          runAsUser: 65532
        startupProbe:
          failureThreshold: 20
          httpGet:
            path: /startup
            port: 9090
            scheme: HTTP
          periodSeconds: 1
          successThreshold: 1
          timeoutSeconds: 5
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /secrets/cloudsql
          mountPropagation: None
          name: cloudsqlproxy-service-account
          readOnly: true
...

Stacktrace

n/a

Steps to reproduce?

  1. Not sure, included the sidecar yaml

Environment

  1. OS type and version: GKE 1.25.8-gke.500, Container-Optimized OS with containerd (cos_containerd)
  2. Cloud SQL Proxy version (./cloud-sql-proxy --version): 2.2.0
  3. Proxy invocation command (for example, ./cloud-sql-proxy --port 5432 INSTANCE_CONNECTION_NAME): /cloud-sql-proxy --health-check --http-address=0.0.0.0 --credentials-file=/secrets/cloudsql/cloudsqlproxy-credentials.json --max-sigterm-delay=60s --structured-logs --quiet toppy-***:europe-west4:toppy-***-database

Additional Details

Edit: updated yaml

jackwotherspoon commented 1 year ago

@Swahjak Thanks for raising an issue on the Cloud SQL Proxy 😄

Handing this over to @hessjcg, who is our GKE specialist; he should be able to shed some light on this and hopefully help you get to the bottom of the errors you're seeing.

enocom commented 1 year ago

This is surprising.

What region do your GKE nodes live in compared to your instance?

Swahjak commented 1 year ago

CloudSQL is in europe-west4-b and GKE in europe-west4-a

Swahjak commented 1 year ago

Downgrading to 1.x.x does seem to reduce the number of error messages, but we are still seeing couldn't connect to "toppy-***:europe-west4:toppy-***-database": dial tcp 35.2**.2**.2**:3307: connect: connection timed out every now and then (different message, but I assume it's the same issue).

[Screenshot from 2023-06-14 17-34-06: error reporting on one of our applications where we returned to 1.23.x without health checks, patched at ~12:30]

[Screenshot from 2023-06-14 17-33-46: error reporting on one of our applications where we returned to 1.33.x with health checks, patched at ~16:30]

We're seeing the same results in the cloud-sql-proxy logs, where the amount of errors seems to align with the errors reported by the applications.

enocom commented 1 year ago

Normally i/o timeout errors indicate there is no network path from the Proxy to the instance. Assuming there are no changes in your environment limiting access to your public IP, I see one thing that might be worth trying: the resource requests are smaller than we usually recommend.

Compare:

resources:  
  requests:
    cpu: 100m
    memory: 100Mi

vs (what we recommend as a starting point):

resources:
  requests:
    # The proxy's memory use scales linearly with the number of active
    # connections. Fewer open connections will use less memory. Adjust
    # this value based on your application's requirements.
    memory: "2Gi"
    # The proxy's CPU use scales linearly with the amount of IO between
    # the database and the application. Adjust this value based on your
    # application's requirements.
    cpu:    "1"

I'd suggest starting with these larger values and then adjusting based on observed usage. This could be a case of CPU throttling.
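
If you want a quick read on whether the sidecar is actually being starved for CPU, something like this shows per-container usage (it assumes metrics-server is available in the cluster; POD_NAME is a placeholder):

# POD_NAME is a placeholder for one of the application pods
kubectl top pod <POD_NAME> --containers

Sustained CPU usage at or above the request would point in that direction.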

Swahjak commented 1 year ago

@enocom I will try, but we haven't had these issues before with the same settings (on 1.x.x). Given that the proxies run as a sidecar with PHP-FPM, they will never have more active connections than the number of PHP-FPM processes, which is currently limited to 10. Using the suggested numbers would actually give the sidecar more resources than the actual application container. It also doesn't quite explain the difference between 1.x.x and 2.x.x(?). In any case, we'll give doubling the resources a try and see if it reduces the errors even further, which would support your point.

enocom commented 1 year ago

All good points. I'm wondering if the healthcheck in v2 is causing the difference.

RossAddinall commented 1 year ago

@Swahjak thanks so much for posting this issue. I've been tearing my hair out trying to figure out what I am doing wrong (and maybe I am still doing something wrong), but I am getting exactly the same message as you. I am setting up a new Kubernetes cluster from scratch and this is my first time hooking up to Cloud SQL, so I have no history to compare against.

I want to test by dropping back to the earlier version; what image are you using for 1.x.x, so I can give it a go?

I have tried:

image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:1.33.7

and

image: gcr.io/cloud-sql-docker/gce-proxy:1.33.7

but in each case I get image pull failures

Also, just a quick dumb question to be sure: when I connect to the database, the name I pass is the name of a database on the database instance, and not the name of the database instance itself? And I can connect on the standard port using localhost? I've tried both localhost and 127.0.0.1; I just want to make sure I've not made a silly snafu.

The Google docs are always just a little light on some of these more basic details.

Many thanks, Ross

RossAddinall commented 1 year ago

Oh, I should also have said, I am using the default sidecar configuration as described in the Google docs here, so I have the recommended resources.

jackwotherspoon commented 1 year ago

@RossAddinall the image pull issue is because of a slight typo in your second attempt: the registry is cloudsql-docker, not cloud-sql-docker.

The v1 image should work with the following:

image: gcr.io/cloudsql-docker/gce-proxy:1.33.7

when I connect to the database, the name I pass is the name of a database on the database instance and not the name of the database instance?

Are you referring to the INSTANCE_CONNECTION_NAME?

The INSTANCE_CONNECTION_NAME for your instance can be found on the Overview page for your instance in the Google Cloud console or by running the following command:

gcloud sql instances describe <INSTANCE_NAME> --format='value(connectionName)'

Where INSTANCE_NAME is the name of your Cloud SQL instance.

If you are referring to DB_NAME when connecting that is the name of the actual database on the instance itself and not the instance name.

I can connect on the standard port and using localhost, I've tried both localhost and 127.0.0.1?

Yes, both localhost and 127.0.0.1 should work for connections, although we normally recommend 127.0.0.1.
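
For example, assuming a Postgres instance with the proxy listening on 127.0.0.1:5432, a client would connect with something like the following, where DB_NAME and DB_USER are placeholders for your own database and user:

# DB_NAME and DB_USER are placeholders; the proxy handles TLS and authorization to the instance
psql "host=127.0.0.1 port=5432 dbname=<DB_NAME> user=<DB_USER>"

The client just sees a local endpoint; everything else goes through the proxy.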

Hope this helps clarify.

RossAddinall commented 1 year ago

@jackwotherspoon many thanks for your help. In fact it turns out that it was my mistake; if you don't mind, I'll document here what I did.

TLDR is:

The Cloud SQL Proxy is attempting to connect to the public IP address of the Cloud SQL instance. I was unable to reach that IP from my container, so I edited my egress NAT policy following this. Once I added a CIDR for the Cloud SQL public IP address (x.x.x.0/24), things started working (afaik).

Answers to @jackwotherspoon question:

I was actually referring to the DB_NAME and I was using the name of a database running on the database instance, though not initially!

Longer version:

This is what I did after the input from @jackwotherspoon: I corrected the typo in the image I was pulling for the Cloud SQL Proxy. When kubectl apply-ing the deployment I got CrashLoopBackOff errors; this is because the 1.x.x image appears to expect a command: entry instead of an args: entry. I spotted this in this video, and this is what I used:

image: gcr.io/cloudsql-docker/gce-proxy:1.33.7
command:
  - /cloud_sql_proxy
  - '-instances=<INSTANCE_CONNECTION_NAME>=tcp:5432'

When this finally ran, I got an error message:

2023/06/15 13:33:20 current FDs rlimit set to 1048576, wanted limit is 8500. Nothing to do here.
2023/06/15 13:33:21 Listening on 127.0.0.1:5432 for xxx
2023/06/15 13:33:21 Ready for new connections
2023/06/15 13:33:21 Generated RSA key in 295.212694ms
2023/06/15 13:33:30 New connection for "xxx"
2023/06/15 13:33:30 refreshing ephemeral certificate for instance xxx
2023/06/15 13:33:30 Scheduling refresh of ephemeral certificate in 54m59s
2023/06/15 13:35:42 couldn't connect to "xxx": dial tcp x.x.x.x:3307: connect: connection timed out

This convinced me (more than the 2.x.x output) that this was a connectivity issue. It was also at this point that I realised that the IP address was the public IP of the Cloud SQL instance, which I hadn't grokked previously.
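
(For anyone else double-checking which address the proxy is dialing: the instance's IP addresses are shown on its Overview page in the console, or with something like the following, where INSTANCE_NAME is a placeholder:)

# INSTANCE_NAME is a placeholder for your Cloud SQL instance
gcloud sql instances describe <INSTANCE_NAME> --format='yaml(ipAddresses)'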

I then switched back to version 2.3.0, and once I remembered to also switch to the args format, that all worked as well, so far.

I wanted to know if I could get to that IP from my container, so I did:

kubectl exec deployment-7744bdb78b-d74dl  -i -t -- bash -il

I can't use ping in my container so I used:

wget -T 30 x.x.x.x:3307

and observed that it timed out.
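
For what it's worth, a bare TCP check with bash should work as well if wget isn't available:

# x.x.x.x stands for the instance's public IP, as in the wget check above
timeout 5 bash -c '</dev/tcp/x.x.x.x/3307' && echo reachable || echo timed out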

I then decided I needed to configure my egress.

I looked at this, which then led me to this.

I created an egress-nat-policy.yaml as described in that final document, defined a CIDR for x.x.x.0/24, and applied it to my cluster:

kubectl apply -f egress-nat-policy.yaml

At this point the wget failed instantly with "error getting response".

And my container started correctly and connected to the database.

enocom commented 1 year ago

Circling back to @Swahjak's original question. Are you seeing any improvement with increased resource requirements?

Swahjak commented 1 year ago

Hey @enocom, thank you for getting back on this issue. I have been experimenting with some different resource assignments but haven't seen any significant improvements (besides switching back to 1.x.x). I must say I haven't been fully focused on this issue due to other priorities, so I haven't been monitoring it very closely.

Just a thought, but could connecting through a Unix socket make any difference? Is that more or less resource intensive? In any case I'm never going to be able to match the indicated resources, since that would make cloud-sql-proxy a very pricey tool (compared to the application infrastructure).
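
For reference, what I have in mind would look roughly like this in the sidecar spec; the /cloudsql path and volume name are just illustrative, and the instance name is elided as before:

...
      - command:
        - /cloud-sql-proxy
        - --unix-socket=/cloudsql
        - --credentials-file=/secrets/cloudsql/cloudsqlproxy-credentials.json
        - --structured-logs
        - toppy-***:europe-west4:toppy-***-database
        image: eu.gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.2.0
        name: cloudsqlproxy-container
        volumeMounts:
        - mountPath: /cloudsql
          name: cloudsql-socket   # illustrative name, shared with the app container
...
      volumes:
      - name: cloudsql-socket
        emptyDir: {}
...

The application container would then mount the same volume and connect to the socket at /cloudsql/<INSTANCE_CONNECTION_NAME> instead of 127.0.0.1:3306.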

Swahjak commented 1 year ago

So I've been monitoring this more closely and actively changing resources, but it doesn't seem to make any difference. I'm starting to lean more towards a networking issue. Currently thinking about switching to a private IP on Cloud SQL to see if this makes any difference.

enocom commented 1 year ago

It's worth a try. I'm a bit stumped as to why you're seeing connection timeouts; that usually indicates a network problem first, and a CPU resource problem second.

enocom commented 1 year ago

Separately, if you have a support account, it would be worth opening a case, just so we could dig into the backend details some more in case there's something wrong here.

Swahjak commented 1 year ago

Last week we migrated to private IP and to using Google Cloud DNS with GKE, and we haven't seen these problems any more. So I guess this was somehow related to DNS. We even returned the resources to the 'low' values (compared to the advised values).

That means that from my perspective this can be marked as resolved. Thank you for your time and effort in helping me resolve this issue.

enocom commented 1 year ago

Does migrating to private IP mean you've removed the Proxy from the connectivity path?

I've heard anecdotal reports that private IP is faster, but don't have any hard data on that.

Swahjak commented 1 year ago

@enocom no, the proxy is still there. We are using the --private-ip flag now (and a private IP on the SQL instance, obviously).
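
The relevant part of the sidecar command now looks roughly like this (same flags as before, with --private-ip added; instance name elided as before):

      - command:
        - /cloud-sql-proxy
        - --private-ip
        - --health-check
        - --http-address=0.0.0.0
        - --credentials-file=/secrets/cloudsql/cloudsqlproxy-credentials.json
        - --max-sigterm-delay=60s
        - --structured-logs
        - --quiet
        - toppy-***:europe-west4:toppy-***-database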

enocom commented 1 year ago

Ah got it. In that case I'll close this. Thanks for your patience on this.

Out of curiosity, how does Cloud DNS come into the picture?

Swahjak commented 1 year ago

@enocom since the problems seemed somehow 'connection' related, I was looking into the GKE DNS settings, which now allow you to use 'Cloud DNS' instead of 'kube-dns'. I'm not sure whether it was the private IP option (not going through egress/ingress NAT) or the DNS change that in the end resolved the issue, since both were applied at the same time.

enocom commented 1 year ago

That's very interesting and good to know. Kubernetes networking is largely unknown to me, but I have anecdotal evidence like this that suggests it can affect connectivity (even though it seems like it shouldn't).

jault3 commented 1 year ago

I know this is closed, but just wanted to add that I ran into this issue as well. Hopefully the extra details help!

I am running Kong inside our GKE clusters, and it was repeatedly crashing and restarting because of failures to connect to its database. We tried increasing memory and CPU on both the Kong pods and the CloudSQL instance, increasing connection timeouts from Kong to postgres, etc. Pretty much anything we could think of to increase we tried. All the metrics we were seeing were showing minimal usage across the board, and no real blocking happening on the CloudSQL instance (both cpu and io) either. We finally tried switching to a private IP on the CloudSQL instance yesterday and the errors stopped. We were getting about 40 error logs every 30 minutes, but since 3pm yesterday we haven't had a single one, which is great! But also confusing as to why that fixed it.

We are running the us.gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.6.0 image. We never really saw this be an issue on the v1 proxy image, but it's been a while since we ran that.

enocom commented 1 year ago

Thanks for the info @jault3. It's helpful.

FWIW, the v1 proxy would try to dial the public IP first and then fall back to private IP. v2 defaults to public and expects users to explicitly choose private IP (--private-ip). Did your instance have a public IP when you saw these failures?

jault3 commented 1 year ago

Yep, our instance had a public IP only when we saw the issues (private was disabled). When we switched, we enabled the private IP on the instance, then added the --private-ip flag to our cloud-sql-proxy command.

enocom commented 1 year ago

Got it. It's unclear why public IP would cause this problem, but in any case private IP is a better path for both performance and security.

jannes-io commented 1 month ago

I don't want to dig up old issues, but we had this exact issue happen to us during a migration from an old dedicated server to GKE.

We had been using Google Cloud SQL for our production database for a few months and had no issues connecting to it externally through a public IP. So, fairly confident everything would go well, we moved our application to GKE.

This was our setup:

Everything was fine during testing. Then we swapped our DNS to go from the old dedicated server to GKE, which caused all of our users to now hit the GKE cluster, and suddenly the sqlauthproxy logs were being spammed with:

failed to connect to instance: Dial error: failed to dial (connection name = "x:europe-west1:x"): dial tcp <public ip>:3307: i/o timeout

In the application this issue manifested itself as intermittent long running requests that blocked everything. For example, 20 requests would go through just fine, then 1 random request would hang for ~30 seconds, complete (successfully, so DB queries did go through), and then back to normal for the next x requests.

Monitoring showed acceptable load on the database and the containers. We debugged the entire pipeline from our ingress to nginx, php-fpm,... and quickly reverted our DNS back to the old server while we investigated the issue.

With ApacheBench we could reproduce the issue with some simple load testing. It looked like this:

Percentage of the requests served within a certain time (ms)
  50%     74
  66%     80
  75%     86
  80%     92
  90%    108
  95%    174   <-- all good :D
  98%  14200 <-- what the heck? >:(
  99%  30309
 100%  30318 (longest request)

To fix:

Now the benchmarks have a more acceptable distribution (50% 100ms, 100% 3000ms).

We're going to try swapping the DNS back after running some more tests to see if this was the culprit of our botched migration or if we have to dig deeper.

tl;dr: Some networking issue between a private GKE cluster and a public SQL instance using the SQL Auth Proxy was causing ~2% of requests to time out when the application was under load, but it could not be reproduced when there was no load. Switching to private IP on the SQL instance fixed the issue in synthetic benchmarks.
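
For anyone wanting to do the same: as far as I know, enabling private IP on an existing instance comes down to something like the commands below. The VPC needs private services access set up first, PROJECT and NETWORK are placeholders, and --no-assign-ip additionally removes the public address:

# PROJECT and NETWORK are placeholders; private services access must already be configured
gcloud sql instances patch <INSTANCE_NAME> --network=projects/<PROJECT>/global/networks/<NETWORK>
# optionally drop the public address afterwards
gcloud sql instances patch <INSTANCE_NAME> --no-assign-ip

The proxy side is then just the --private-ip flag mentioned earlier in this thread.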

enocom commented 1 month ago

Thanks for the detailed write-up @jannes-io. I think private IP is a better option in general, but it's concerning that we're seeing these high tail latencies.

I half wonder if Cloud NAT might be involved here (which would be at play for anyone running inside a private GKE cluster reaching out to a public IP). For example: https://cloud.google.com/nat/docs/troubleshooting#insufficient-ports.
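
If anyone wants to rule that out, the NAT gateway's current per-VM port allocation can be inspected, and raised, with something along these lines; NAT_NAME, ROUTER and REGION are placeholders and 4096 is only an example value:

# NAT_NAME, ROUTER and REGION are placeholders
gcloud compute routers nats describe <NAT_NAME> --router=<ROUTER> --region=<REGION>
# raise the per-VM port allocation (4096 is just an example)
gcloud compute routers nats update <NAT_NAME> --router=<ROUTER> --region=<REGION> --min-ports-per-vm=4096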