GoogleCloudPlatform / cloud-sql-proxy

A utility for connecting securely to your Cloud SQL instances
Apache License 2.0
1.25k stars 345 forks source link

"Cloud SQL IAM service account authentication failed for user ..." intermittent errors when connecting to Postgres #2212

Closed otto-nordander-yubico closed 1 month ago

otto-nordander-yubico commented 2 months ago

Bug Description

Running cloud-sql-proxy sometimes it gets stuck in a state where we cant connect, returning Cloud SQL IAM service account authentication failed for user errors errors.

When this happens the proxy container must restart before it works again, i.e. retrying doesn't help. It seems to happen if we connect multiple times in a row. E.g. start proxy, connect successfully, disconnect, connect again and it hangs but I'm not sure.

I've managed to reproduce it locally with the same error.

Example code (or command)

I've managed to reproduce with the following (note it's rather tricky to reproduce, and might require multiple tries):

cloud-sql-proxy --port 5432 --auto-iam-authn --impersonate-service-account=<SA> <postgres instance>

# In a different terminal (password prompt doesn't matter)
psql --host 127.0.0.1 --port 5432 --username=<USER>

<Ctrl+D>
# Connect again
psql --host 127.0.0.1 --port 5432 --username=<USER>

> psql: error: connection to server at "127.0.0.1", port 5432 failed: FATAL:  Cloud SQL IAM service account authentication failed for user ...

Stacktrace

psql: error: connection to server at "127.0.0.1", port 5432 failed: FATAL:  Cloud SQL IAM service account authentication failed for user ...

Steps to reproduce?

See example code.

Environment

  1. MacOS/Linux
  2. Cloud SQL Proxy version: v2.9.0 and v2.11.0
  3. Proxy invocation command: cloud-sql-proxy --port 5432 --auto-iam-authn --impersonate-service-account=<SA> <postgres instance>

Additional Details

I swear I saw another issue on this that was resolved in this repo, but I can't find it anylonger.

jackwotherspoon commented 2 months ago

Hi @otto-nordander-yubico thanks for raising an issue on the Cloud SQL Proxy! πŸ˜„

I will happily take a look into this, a couple clarifying questions:

  1. How often are you seeing this issue?
  2. When you disconnect and connect again are you doing so immediately after one another?
  3. Are you only seeing this issue with service account impersonation? (if you haven't tried without that is okay)

I'll look at re-producing this on my end.

otto-nordander-yubico commented 2 months ago

Hi @otto-nordander-yubico thanks for raising an issue on the Cloud SQL Proxy! πŸ˜„

I will happily take a look into this, a couple clarifying questions:

  1. How often are you seeing this issue?
  2. When you disconnect and connect again are you doing so immediately after one another?
  3. Are you only seeing this issue with service account impersonation? (if you haven't tried without that is okay)

I'll look at re-producing this on my end.

  1. Rough estimate would be 1/30, but locally I can reproduce a lot more often than that. I think it's timing based which is why we see it "randomly".
  2. Yes, but I'm not sure if the disconnect is important, I think it's that we open 2 connections, but I can try to pin this down.
  3. No, we're seeing it with both.

Let me know if there's anything else I can help with.

enocom commented 2 months ago

One thing you could try is enable debug logging which would help us understand if the background refresh is behaving as expected. It's possible the token we're sending to the server is expired somehow. The other possibility is that the backend IAM check is failing for some other reason.

otto-nordander-yubico commented 2 months ago

One thing you could try is enable debug logging which would help us understand if the background refresh is behaving as expected. It's possible the token we're sending to the server is expired somehow. The other possibility is that the backend IAM check is failing for some other reason.

Yes that's a good point, here I ran with --debug, --debug-logs. This is when running locally:

cloud-sql-proxy version 2.11.0+darwin.arm64
2024/05/14 20:51:13 Impersonating service account with Application Default Credentials
2024/05/14 20:51:13 [xxxxx] Connection info added to cache
2024/05/14 20:51:13 [xxxxx] Listening on 127.0.0.1:5432
2024/05/14 20:51:13 The proxy has started successfully and is ready for new connections!
2024/05/14 20:51:13 [xxxxx] Connection info refresh operation started
2024/05/14 20:51:13 Enabling pprof endpoints at localhost:9091
2024/05/14 20:51:14 [xxxxx] Accepted connection from 127.0.0.1:55371 <--- me connecting is successfully
2024/05/14 20:51:14 [xxxxx] Connection info refresh operation started
2024/05/14 20:51:15 [xxxxx] Connection info refresh operation complete
2024/05/14 20:51:15 [xxxxx] Current certificate expiration = 2024-05-14T19:51:14Z
2024/05/14 20:51:15 [xxxxx] Connection info refresh operation scheduled at 2024-05-14T19:47:14Z (now + 56m0s)
2024/05/14 20:51:15 [xxxxx] Now = 2024-05-14T18:51:15Z, Current cert expiration = 2024-05-14T19:51:14Z
2024/05/14 20:51:15 [xxxxx] Cert is valid = true
2024/05/14 20:51:15 [xxxxx] Dialing x.x.x.x:3307
2024/05/14 20:51:16 [xxxxx] Connection info refresh operation complete
2024/05/14 20:51:16 [xxxxx] Current certificate expiration = 2024-05-14T19:51:15Z
2024/05/14 20:51:16 [xxxxx] Connection info refresh operation scheduled at 2024-05-14T19:47:15Z (now + 56m0s)
2024/05/14 20:51:16 [xxxxx] client closed the connection <---- me exiting the first connection
2024/05/14 20:51:17 [xxxxx] Accepted connection from 127.0.0.1:55387  <---- me opening new connection
2024/05/14 20:51:17 [xxxxx] Now = 2024-05-14T18:51:17Z, Current cert expiration = 2024-05-14T19:51:15Z
2024/05/14 20:51:17 [xxxxx] Cert is valid = true
2024/05/14 20:51:17 [xxxxx] Dialing x.x.x.x:3307
2024/05/14 20:51:18 [xxxxx] client closed the connection <---- this is because psql failed to connect

After the proxy has reached this broken state, every connection try logs the following (must restart proxy for it to work again):

2024/05/14 20:58:07 [xxxxx] Accepted connection from 127.0.0.1:57119
2024/05/14 20:58:07 [xxxxx] Now = 2024-05-14T18:58:07Z, Current cert expiration = 2024-05-14T19:54:49Z
2024/05/14 20:58:07 [xxxxx] Cert is valid = true
2024/05/14 20:58:07 [xxxxx] Dialing x.x.x.x:3307
2024/05/14 20:58:08 [xxxxx] client closed the connection

Edit: perhaps some helping information, when this happens I can see that the server asks the client for the password and its sent. When it's working as expected the password is never emitted.

enocom commented 2 months ago

when this happens I can see that the server asks the client for the password and its sent.

That's interesting. The way Auto IAM AuthN works is that the Proxy transmits the OAuth2 token to the server through the client certificate. So Postgres shouldn't be asking for a password. That fact that it is implies the token isn't properly being received by the server.

otto-nordander-yubico commented 2 months ago

Perhaps another piece of the puzzle: I ran with MySQL and tried to reproduce locally with the same connect / disconnect / connect flow. And after a few tries it also gets stuck in a bad state with the following error:

ERROR 2059 (HY000): Authentication plugin 'mysql_clear_password' cannot be loaded: plugin not enabled

otto-nordander-yubico commented 2 months ago

Okay, I think I've managed to pin it down. It's within cloudsqlconn

What I see is that there's a race condition when starting the proxy and initiating the first connection. When we start the proxy a refresh operation is scheduled instantly here https://github.com/GoogleCloudPlatform/cloud-sql-go-connector/blob/main/internal/cloudsql/instance.go#L362 When we connect it also schedules an update of the refresh, and sometimes that runs before the initial refresh is complete. So what happens is that the update refresh succeeds and gets a valid certificate back, then the initial refresh runs and overwrites the cert.

The cancel() here fails, and returns false, I believe since the initial refresh started immediately.

Edit: with that in mind it has nothing to do with having two connections, its simply that the first connection gets a valid cert, and then the initial refresh overwrites the cert so next connection fails.

enocom commented 1 month ago

Nice find. I suspect this is due to https://github.com/GoogleCloudPlatform/cloud-sql-go-connector/issues/771.

I'll get this fixed. Thanks for your careful debugging.

otto-nordander-yubico commented 1 month ago

Yes, that would explain why the certs were different in the first place. But also, from this comment it gives the impression that it should handle multiple requests.

Anyway, let me know if you have a patch that you want help testing.

Edit: I'm currently testing --lazy-refresh as a workaround and that seems to work fine so far.

enocom commented 1 month ago

This also manifests as https://github.com/GoogleCloudPlatform/cloud-sql-proxy/issues/2224.

This is definitely caused by https://github.com/GoogleCloudPlatform/cloud-sql-go-connector/issues/771. I'll have a fix ready to try today or tomorrow.

enocom commented 1 month ago

PR to fix this in cloudsqlconn is here: https://github.com/GoogleCloudPlatform/cloud-sql-go-connector/pull/806. We'll get that out in a release soon.

enocom commented 1 month ago

The latest release of the Go Connector has a fix: https://github.com/GoogleCloudPlatform/cloud-sql-go-connector/releases/tag/v1.10.1.

otto-nordander-yubico commented 1 month ago

The latest release of the Go Connector has a fix: https://github.com/GoogleCloudPlatform/cloud-sql-go-connector/releases/tag/v1.10.1.

Nice, I can pull that in and run my tests to verify tomorrow!

otto-nordander-yubico commented 1 month ago

Tested with v1.10.1 and wasn't able to reproduce the issue! Nicely done :) And thanks for the quick fix!

jackwotherspoon commented 1 month ago

@otto-nordander-yubico Glad your testing went successfully πŸ˜„ Thanks so much for raising this issue, we greatly appreciate it!

I will close this out, if the issue re-surfaces feel free to re-open this.

πŸ‘ πŸ‘ @enocom

jackwotherspoon commented 1 month ago

Actually I will wait till we release this officially to close out

jackwotherspoon commented 1 month ago

Release of version v2.11.3 has the new Go Connector version used and should now resolve this issue πŸš€