Closed otto-nordander-yubico closed 1 month ago
Hi @otto-nordander-yubico thanks for raising an issue on the Cloud SQL Proxy! π
I will happily take a look into this, a couple clarifying questions:
I'll look at re-producing this on my end.
Hi @otto-nordander-yubico thanks for raising an issue on the Cloud SQL Proxy! π
I will happily take a look into this, a couple clarifying questions:
- How often are you seeing this issue?
- When you disconnect and connect again are you doing so immediately after one another?
- Are you only seeing this issue with service account impersonation? (if you haven't tried without that is okay)
I'll look at re-producing this on my end.
Let me know if there's anything else I can help with.
One thing you could try is enable debug logging which would help us understand if the background refresh is behaving as expected. It's possible the token we're sending to the server is expired somehow. The other possibility is that the backend IAM check is failing for some other reason.
One thing you could try is enable debug logging which would help us understand if the background refresh is behaving as expected. It's possible the token we're sending to the server is expired somehow. The other possibility is that the backend IAM check is failing for some other reason.
Yes that's a good point, here I ran with --debug, --debug-logs
.
This is when running locally:
cloud-sql-proxy version 2.11.0+darwin.arm64
2024/05/14 20:51:13 Impersonating service account with Application Default Credentials
2024/05/14 20:51:13 [xxxxx] Connection info added to cache
2024/05/14 20:51:13 [xxxxx] Listening on 127.0.0.1:5432
2024/05/14 20:51:13 The proxy has started successfully and is ready for new connections!
2024/05/14 20:51:13 [xxxxx] Connection info refresh operation started
2024/05/14 20:51:13 Enabling pprof endpoints at localhost:9091
2024/05/14 20:51:14 [xxxxx] Accepted connection from 127.0.0.1:55371 <--- me connecting is successfully
2024/05/14 20:51:14 [xxxxx] Connection info refresh operation started
2024/05/14 20:51:15 [xxxxx] Connection info refresh operation complete
2024/05/14 20:51:15 [xxxxx] Current certificate expiration = 2024-05-14T19:51:14Z
2024/05/14 20:51:15 [xxxxx] Connection info refresh operation scheduled at 2024-05-14T19:47:14Z (now + 56m0s)
2024/05/14 20:51:15 [xxxxx] Now = 2024-05-14T18:51:15Z, Current cert expiration = 2024-05-14T19:51:14Z
2024/05/14 20:51:15 [xxxxx] Cert is valid = true
2024/05/14 20:51:15 [xxxxx] Dialing x.x.x.x:3307
2024/05/14 20:51:16 [xxxxx] Connection info refresh operation complete
2024/05/14 20:51:16 [xxxxx] Current certificate expiration = 2024-05-14T19:51:15Z
2024/05/14 20:51:16 [xxxxx] Connection info refresh operation scheduled at 2024-05-14T19:47:15Z (now + 56m0s)
2024/05/14 20:51:16 [xxxxx] client closed the connection <---- me exiting the first connection
2024/05/14 20:51:17 [xxxxx] Accepted connection from 127.0.0.1:55387 <---- me opening new connection
2024/05/14 20:51:17 [xxxxx] Now = 2024-05-14T18:51:17Z, Current cert expiration = 2024-05-14T19:51:15Z
2024/05/14 20:51:17 [xxxxx] Cert is valid = true
2024/05/14 20:51:17 [xxxxx] Dialing x.x.x.x:3307
2024/05/14 20:51:18 [xxxxx] client closed the connection <---- this is because psql failed to connect
After the proxy has reached this broken state, every connection try logs the following (must restart proxy for it to work again):
2024/05/14 20:58:07 [xxxxx] Accepted connection from 127.0.0.1:57119
2024/05/14 20:58:07 [xxxxx] Now = 2024-05-14T18:58:07Z, Current cert expiration = 2024-05-14T19:54:49Z
2024/05/14 20:58:07 [xxxxx] Cert is valid = true
2024/05/14 20:58:07 [xxxxx] Dialing x.x.x.x:3307
2024/05/14 20:58:08 [xxxxx] client closed the connection
Edit: perhaps some helping information, when this happens I can see that the server asks the client for the password and its sent. When it's working as expected the password is never emitted.
when this happens I can see that the server asks the client for the password and its sent.
That's interesting. The way Auto IAM AuthN works is that the Proxy transmits the OAuth2 token to the server through the client certificate. So Postgres shouldn't be asking for a password. That fact that it is implies the token isn't properly being received by the server.
Perhaps another piece of the puzzle: I ran with MySQL and tried to reproduce locally with the same connect / disconnect / connect flow. And after a few tries it also gets stuck in a bad state with the following error:
ERROR 2059 (HY000): Authentication plugin 'mysql_clear_password' cannot be loaded: plugin not enabled
Okay, I think I've managed to pin it down. It's within cloudsqlconn
What I see is that there's a race condition when starting the proxy and initiating the first connection. When we start the proxy a refresh operation is scheduled instantly here https://github.com/GoogleCloudPlatform/cloud-sql-go-connector/blob/main/internal/cloudsql/instance.go#L362 When we connect it also schedules an update of the refresh, and sometimes that runs before the initial refresh is complete. So what happens is that the update refresh succeeds and gets a valid certificate back, then the initial refresh runs and overwrites the cert.
The cancel()
here fails, and returns false
, I believe since the initial refresh started immediately.
Edit: with that in mind it has nothing to do with having two connections, its simply that the first connection gets a valid cert, and then the initial refresh overwrites the cert so next connection fails.
Nice find. I suspect this is due to https://github.com/GoogleCloudPlatform/cloud-sql-go-connector/issues/771.
I'll get this fixed. Thanks for your careful debugging.
Yes, that would explain why the certs were different in the first place. But also, from this comment it gives the impression that it should handle multiple requests.
Anyway, let me know if you have a patch that you want help testing.
Edit: I'm currently testing --lazy-refresh
as a workaround and that seems to work fine so far.
This also manifests as https://github.com/GoogleCloudPlatform/cloud-sql-proxy/issues/2224.
This is definitely caused by https://github.com/GoogleCloudPlatform/cloud-sql-go-connector/issues/771. I'll have a fix ready to try today or tomorrow.
PR to fix this in cloudsqlconn is here: https://github.com/GoogleCloudPlatform/cloud-sql-go-connector/pull/806. We'll get that out in a release soon.
The latest release of the Go Connector has a fix: https://github.com/GoogleCloudPlatform/cloud-sql-go-connector/releases/tag/v1.10.1.
The latest release of the Go Connector has a fix: https://github.com/GoogleCloudPlatform/cloud-sql-go-connector/releases/tag/v1.10.1.
Nice, I can pull that in and run my tests to verify tomorrow!
Tested with v1.10.1
and wasn't able to reproduce the issue! Nicely done :)
And thanks for the quick fix!
@otto-nordander-yubico Glad your testing went successfully π Thanks so much for raising this issue, we greatly appreciate it!
I will close this out, if the issue re-surfaces feel free to re-open this.
π π @enocom
Actually I will wait till we release this officially to close out
Release of version v2.11.3 has the new Go Connector version used and should now resolve this issue π
Bug Description
Running
cloud-sql-proxy
sometimes it gets stuck in a state where we cant connect, returningCloud SQL IAM service account authentication failed for user errors
errors.When this happens the proxy container must restart before it works again, i.e. retrying doesn't help. It seems to happen if we connect multiple times in a row. E.g. start proxy, connect successfully, disconnect, connect again and it hangs but I'm not sure.
I've managed to reproduce it locally with the same error.
Example code (or command)
I've managed to reproduce with the following (note it's rather tricky to reproduce, and might require multiple tries):
Stacktrace
Steps to reproduce?
See example code.
Environment
cloud-sql-proxy --port 5432 --auto-iam-authn --impersonate-service-account=<SA> <postgres instance>
Additional Details
I swear I saw another issue on this that was resolved in this repo, but I can't find it anylonger.