GoogleCloudPlatform / cloud-sql-proxy

A utility for connecting securely to your Cloud SQL instances
Apache License 2.0
1.27k stars 346 forks source link

30s+ Hang When Using Manual Token Authentication #2224

Closed Cooksauce closed 4 months ago

Cooksauce commented 4 months ago

Bug Description

When using token auth with IAM (e.g. --auto-iam-authn --token --login-token), the proxy takes an unreasonably long time to do the IAM postgres authentication. This can be seen in the logs attached when using --run-connection-test.

This long delay also happens each time a postgres client tries to access the db thru the proxy - including subsequent accesses. E.g. psql ... takes 30s+ to get into the postgres> repl

The same behavior is not observed when authenticating via Application Default Credentials.

Example code (or command)

No response

Stacktrace

No response

Steps to reproduce?

  1. Login via gcloud:gcloud auth login
  2. Run proxy

Environment

  1. OS type and version: MacOS 14.4.1
  2. Cloud SQL Proxy version (./cloud-sql-proxy --version): 2.10.1
  3. Proxy invocation command (for example, ./cloud-sql-proxy --port 5432 INSTANCE_CONNECTION_NAME):
    ./cloud-sql-proxy-v2.10.1 --auto-iam-authn --run-connection-test \
      --token "$(gcloud auth print-access-token)" \
      --login-token "$(gcloud sql generate-login-token)" \
      --debug-logs \
      my-project:us-central1:my-instance

Additional Details

Debug Logs:

Using ADC:

./cloud-sql-proxy-v2.10.1 --auto-iam-authn --run-connection-test --debug-logs \
my-project:us-central1:my-instance
2024/05/16 16:26:30 Authorizing with Application Default Credentials
2024/05/16 16:26:30 [my-project:us-central1:my-instance] Connection info added to cache
2024/05/16 16:26:30 [my-project:us-central1:my-instance] Connection info refresh operation started
2024/05/16 16:26:30 [my-project:us-central1:my-instance] Connection info refresh operation complete
2024/05/16 16:26:30 [my-project:us-central1:my-instance] Current certificate expiration = 2024-05-16T22:26:30Z
2024/05/16 16:26:30 [my-project:us-central1:my-instance] Connection info refresh operation scheduled at 2024-05-16T22:22:30Z (now + 56m0s)
2024/05/16 16:26:30 [my-project:us-central1:my-instance] Listening on 127.0.0.1:5432
2024/05/16 16:26:30 The proxy has started successfully and is ready for new connections!
2024/05/16 16:26:30 Connection test started
2024/05/16 16:26:30 [my-project:us-central1:my-instance] Connection info refresh operation started
2024/05/16 16:26:31 [my-project:us-central1:my-instance] Connection info refresh operation complete
2024/05/16 16:26:31 [my-project:us-central1:my-instance] Current certificate expiration = 2024-05-16T22:26:29Z
2024/05/16 16:26:31 [my-project:us-central1:my-instance] Connection info refresh operation scheduled at 2024-05-16T22:22:29Z (now + 56m0s)
2024/05/16 16:26:31 [my-project:us-central1:my-instance] Now = 2024-05-16T21:26:31Z, Current cert expiration = 2024-05-16T22:26:29Z
2024/05/16 16:26:31 [my-project:us-central1:my-instance] Cert is valid = true
2024/05/16 16:26:31 [my-project:us-central1:my-instance] Dialing xx.xx.xx.xx:3307
2024/05/16 16:26:31 Connection test passed
^C2024/05/16 16:26:33 SIGINT signal received. Shutting down...

Notice the connection test passes immediately


Using Manual Access Tokens

./cloud-sql-proxy-v2.10.1 --auto-iam-authn --run-connection-test \
   --token "$(gcloud auth print-access-token)" \
   --login-token "$(gcloud sql generate-login-token)" \
   --debug-logs \
   my-project:us-central1:my-instance
2024/05/16 16:27:01 Authorizing with OAuth2 token
2024/05/16 16:27:01 [my-project:us-central1:my-instance] Connection info added to cache
2024/05/16 16:27:01 [my-project:us-central1:my-instance] Connection info refresh operation started
2024/05/16 16:27:02 [my-project:us-central1:my-instance] Connection info refresh operation complete
2024/05/16 16:27:02 [my-project:us-central1:my-instance] Current certificate expiration = 2024-05-16T22:27:02Z
2024/05/16 16:27:02 [my-project:us-central1:my-instance] Connection info refresh operation scheduled at 2024-05-16T22:23:02Z (now + 56m0s)
2024/05/16 16:27:02 [my-project:us-central1:my-instance] Listening on 127.0.0.1:5432
2024/05/16 16:27:02 The proxy has started successfully and is ready for new connections!
2024/05/16 16:27:02 Connection test started
2024/05/16 16:27:02 [my-project:us-central1:my-instance] Connection info refresh operation started
2024/05/16 16:27:02 [my-project:us-central1:my-instance] Connection info refresh operation complete
2024/05/16 16:27:02 [my-project:us-central1:my-instance] Current certificate expiration = 0001-01-01T00:00:00Z
2024/05/16 16:27:02 [my-project:us-central1:my-instance] Connection info refresh operation scheduled at 2024-05-16T21:27:02Z (now + 0s)
2024/05/16 16:27:02 [my-project:us-central1:my-instance] Connection info refresh operation started
2024/05/16 16:27:02 [my-project:us-central1:my-instance] Now = 2024-05-16T21:27:02Z, Current cert expiration = 0001-01-01T00:00:00Z
2024/05/16 16:27:02 [my-project:us-central1:my-instance] Cert is valid = false
2024/05/16 16:27:02 [my-project:us-central1:my-instance] Refreshing certificate now
2024/05/16 16:27:32 [my-project:us-central1:my-instance] Connection info refresh operation complete
2024/05/16 16:27:32 [my-project:us-central1:my-instance] Current certificate expiration = 0001-01-01T00:00:00Z
2024/05/16 16:27:32 [my-project:us-central1:my-instance] Connection info refresh operation scheduled at 2024-05-16T21:27:32Z (now + 0s)
2024/05/16 16:27:32 [my-project:us-central1:my-instance] Connection info refresh operation started
2024/05/16 16:27:32 [my-project:us-central1:my-instance] Dialing xx.xx.xx.xx:3307
2024/05/16 16:27:32 Connection test passed
^C2024/05/16 16:27:42 SIGINT signal received. Shutting down...
2024/05/16 16:27:42 [my-project:us-central1:my-instance] Connection info refresh operation failed, err = Dial error: context was canceled or expired before refresh completed (connection name = "my-project:us-central1:my-instance")
2024/05/16 16:27:42 [my-project:us-central1:my-instance] Connection info refresh operation scheduled immediately
2024/05/16 16:27:42 [my-project:us-central1:my-instance] Instance is closed, stopping refresh operations

Notice the connection test takes 30s to pass

enocom commented 4 months ago

This looks like a user-facing issue due to https://github.com/GoogleCloudPlatform/cloud-sql-go-connector/issues/771.

Cooksauce commented 4 months ago

If I understand that linked issue correctly, it's a caching issue?

Does that still explain the 30s hang happening every time a postgres client connects to the proxy? Even subsequent connects of the same client as the same user to the same running cloud-sql-proxy instance hangs for another 30s.

enocom commented 4 months ago

Yes, it has to do with how the internal rate limiter works -- or in this case, not working.

enocom commented 4 months ago

We're seeing this in a separate place too now. I'll get this fixed.

enocom commented 4 months ago

This is definitely caused by https://github.com/GoogleCloudPlatform/cloud-sql-go-connector/issues/771. Closing it as a duplicate. I have a fix that I'll put up today.

enocom commented 4 months ago

Since we'll need to upgrade cloudsqlconn here, let's use https://github.com/GoogleCloudPlatform/cloud-sql-proxy/issues/2224 as the tracking issue.

Cooksauce commented 4 months ago

Confirmed v2.11.3 fixes this issue

ref: https://github.com/GoogleCloudPlatform/cloud-sql-proxy/issues/2212#issuecomment-2135816591