GoogleCloudPlatform / cloud-sql-proxy

A utility for connecting securely to your Cloud SQL instances
Apache License 2.0
1.28k stars 349 forks source link

TLS Handshake failures with Go 1.23 #2296

Open dan-lind opened 2 months ago

dan-lind commented 2 months ago

Bug Description

When attempting to connect through the proxy, I'm getting connection reset by peer

Starting the proxy is fine

2024/09/12 08:58:14 Authorizing with Application Default Credentials
...
2024/09/12 08:58:15 [redacted-prod:europe-west1:redacted-live-773ed67c] Listening on /tmp/redacted-prod:europe-west1:redacted-live-773ed67c
2024/09/12 08:58:15 [redacted-prod:europe-west1:redacted-live-1c3b6da9] Listening on /tmp/redacted-prod:europe-west1:redacted-live-1c3b6da9
2024/09/12 08:58:15 [redacted-prod:europe-west1:redacted-live-6aab2946] Listening on /tmp/redacted-prod:europe-west1:redacted-live-6aab2946
2024/09/12 08:58:15 [redacted-prod:europe-west1:redacted-live-23a55252] Listening on /tmp/redacted-prod:europe-west1:redacted-live-23a55252
2024/09/12 08:58:15 The proxy has started successfully and is ready for new connections!
2024/09/12 08:58:47 [redacted-prod:europe-west1:redacted-live-156c5af9] Accepted connection from

but when I attempt to make the actual connection, I get

2024/09/12 08:58:47 [redacted-up-prod:europe-west1:redacted-live-6aab2946] failed to connect to instance: Dial error: handshake failed (connection name = "redacted-up-prod:europe-west1:redacted-live-6aab2946"): read tcp 172.22.164.170:53244->redacted:3307: read: connection reset by peer

In the client (Datagrip 2024.2.1) I see the error Socket is not connected. On the CloudSQL side, I can also see a warning log in Logs explorer like 1893634 [Warning] [MY-000000] [Server] Failed to connect to mysql: 1893634. Server-side error number: 1159

Stacktrace

Output from tcpdump

tcpdump: data link type PKTAP
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type PKTAP (Apple DLT_PKTAP), snapshot length 524288 bytes
08:58:47.153440 IP 172.22.164.170.53244 > redacted.3307: Flags [S], seq 2094967633, win 65535, options [mss 1360,nop,wscale 6,nop,nop,TS val 1730041868 ecr 0,sackOK,eol], length 0
08:58:47.216068 IP redacted.3307 > 172.22.164.170.53244: Flags [S.], seq 542659392, ack 2094967634, win 32384, options [mss 1332,sackOK,TS val 3082097024 ecr 1730041868,nop,wscale 7], length 0
08:58:47.216124 IP 172.22.164.170.53244 > redacted.3307: Flags [.], ack 1, win 2062, options [nop,nop,TS val 1730041931 ecr 3082097024], length 0
08:58:47.216806 IP 172.22.164.170.53244 > redacted.3307: Flags [.], seq 1:1321, ack 1, win 2062, options [nop,nop,TS val 1730041931 ecr 3082097024], length 1320
08:58:47.216810 IP 172.22.164.170.53244 > redacted.3307: Flags [P.], seq 1321:1486, ack 1, win 2062, options [nop,nop,TS val 1730041931 ecr 3082097024], length 165
08:58:47.264582 IP redacted.3307 > 172.22.164.170.53244: Flags [R.], seq 1, ack 1321, win 2062, length 0

Steps to reproduce?

This happens every time I try connecting though the terminal

Environment

  1. OS type and version: macos 14.6.1
  2. Cloud SQL Proxy version 2.13
  3. Proxy invocation command: cloud-sql-proxy --config-file ~/cloudsql/cloudsql-config.toml

Content of cloudsql-config.toml

#CloudSQL proxy config
unix-socket = "/tmp"
private-ip = true

...
instance-connection-name-30 = "redacted-prod:europe-west1:redacted-live-156c5af9"
instance-connection-name-31 = "redacted-prod:europe-west1:redacted-live-773ed67c"
instance-connection-name-32 = "redacted-prod:europe-west1:redacted-live-1c3b6da9"
instance-connection-name-33 = "redacted-prod:europe-west1:redacted-live-6aab2946"
instance-connection-name-34 = "redacted-prod:europe-west1:redacted-live-23a55252"

Additional Details

This happens only when I setup the proxy using the terminal. If I use Intellij/Datagrip, and use the built in MySQL CloudSql proxy driver, I can connect without issues.

jackwotherspoon commented 2 months ago

Hi @dan-lind thanks for raising an issue on the Cloud SQL Proxy!

The Cloud SQL Proxy forwards local traffic to port 3307 of your Cloud SQL instance. Because of this you will need to update your VPC network firewall rules potentially, they should allow TCP egress connection on port 3307 of your Cloud SQL instance private IP address. Can you see if this is maybe the cause of your error?

From our docs:

While the Cloud SQL Auth Proxy can listen on any port, it creates outgoing or egress connections to your Cloud SQL instance IP address on port 3307. If your client machine has an outbound firewall policy, make sure it allows outgoing connections to port 3307 on your Cloud SQL instance's IP.

Let me know as this is a common cause for the issue you are seeing 😄

enocom commented 2 months ago

Am I reading your config correctly that you're connecting to 34 instances?

dan-lind commented 2 months ago

Hi @dan-lind thanks for raising an issue on the Cloud SQL Proxy!

The Cloud SQL Proxy forwards local traffic to port 3307 of your Cloud SQL instance. Because of this you will need to update your VPC network firewall rules potentially, they should allow TCP egress connection on port 3307 of your Cloud SQL instance private IP address. Can you see if this is maybe the cause of your error?

From our docs:

While the Cloud SQL Auth Proxy can listen on any port, it creates outgoing or egress connections to your Cloud SQL instance IP address on port 3307. If your client machine has an outbound firewall policy, make sure it allows outgoing connections to port 3307 on your Cloud SQL instance's IP.

Let me know as this is a common cause for the issue you are seeing 😄

Hi! Like I wrote in "Additional details", connecting using the MySQL CloudSQL proxy in Datagrip works without any issues. Another tcpdump on that traffic shows that this connection also happens on 3307.

tcpdump: data link type PKTAP
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type PKTAP (Apple DLT_PKTAP), snapshot length 524288 bytes
18:44:01.730197 IP 172.22.164.170.52213 > redacted.3307: Flags [S], seq 308421121, win 65535, options [mss 1360,nop,wscale 6,nop,nop,TS val 1740779703 ecr 0,sackOK,eol], length 0
18:44:01.776055 IP redacted.3307 > 172.22.164.170.52213: Flags [S.], seq 2473669167, ack 308421122, win 32384, options [mss 1332,sackOK,TS val 3117211483 ecr 1740779703,nop,wscale 7], length 0
18:44:01.776115 IP 172.22.164.170.52213 > redacted.3307: Flags [.], ack 1, win 2062, options [nop,nop,TS val 1740779749 ecr 3117211483], length 0
18:44:01.777291 IP 172.22.164.170.52213 > redacted.3307: Flags [P.], seq 1:441, ack 1, win 2062, options [nop,nop,TS val 1740779750 ecr 3117211483], length 440
18:44:01.834037 IP redacted.3307 > 172.22.164.170.52213: Flags [.], ack 441, win 250, options [nop,nop,TS val 3117211536 ecr 1740779750], length 0
18:44:01.834064 IP redacted.3307 > 172.22.164.170.52213: Flags [.], seq 1:1321, ack 441, win 250, options [nop,nop,TS val 3117211538 ecr 1740779750], length 1320
18:44:01.834065 IP redacted.3307 > 172.22.164.170.52213: Flags [P.], seq 1321:1763, ack 441, win 250, options [nop,nop,TS val 3117211538 ecr 1740779750], length 442
18:44:01.834087 IP 172.22.164.170.52213 > redacted.3307: Flags [.], ack 1321, win 2041, options [nop,nop,TS val 1740779807 ecr 3117211538], length 0
18:44:01.834101 IP 172.22.164.170.52213 > redacted.3307: Flags [.], ack 1763, win 2034, options [nop,nop,TS val 1740779807 ecr 3117211538], length 0
18:44:01.835368 IP 172.22.164.170.52213 > redacted.3307: Flags [P.], seq 441:447, ack 1763, win 2034, options [nop,nop,TS val 1740779808 ecr 3117211538], length 6
18:44:01.838264 IP 172.22.164.170.52213 > redacted.3307: Flags [P.], seq 447:1422, ack 1763, win 2048, options [nop,nop,TS val 1740779811 ecr 3117211538], length 975
18:44:01.840919 IP 172.22.164.170.52213 > redacted.3307: Flags [P.], seq 1422:1724, ack 1763, win 2048, options [nop,nop,TS val 1740779814 ecr 3117211538], length 302
18:44:01.841028 IP 172.22.164.170.52213 > redacted.3307: Flags [P.], seq 1724:1798, ack 1763, win 2048, options [nop,nop,TS val 1740779814 ecr 3117211538], length 74
18:44:01.894001 IP redacted.3307 > 172.22.164.170.52213: Flags [.], ack 1724, win 249, options [nop,nop,TS val 3117211600 ecr 1740779808], length 0
18:44:01.894003 IP redacted.3307 > 172.22.164.170.52213: Flags [P.], seq 1763:2836, ack 1724, win 249, options [nop,nop,TS val 3117211601 ecr 1740779808], length 1073
dan-lind commented 2 months ago

Am I reading your config correctly that you're connecting to 34 instances?

This config has 34 entries, yes. However I've tried reducing it down to a single entry as well, it doesn't change anything.

enocom commented 2 months ago

This happens only when I setup the proxy using the terminal. If I use Intellij/Datagrip, and use the built in MySQL CloudSql proxy driver, I can connect without issues.

That's very interesting, because the built-in driver (which uses the Java Connector) does effectively exactly the same thing.

Have you tried connecting through the Proxy using a CLI tool, or something other than Datagrip to determine if Datagrip is involved?

dan-lind commented 2 months ago

Yes I have. I've tried "dumbing things down" as much as possible, for example to rule out any issues with the unix sockets.

$ cloud-sql-proxy --private-ip --port 3306 redacted-prod:europe-west1:redacted-live-156c5af9

2024/09/12 19:53:52 Authorizing with Application Default Credentials
2024/09/12 19:53:52 [redacted-prod:europe-west1:redacted-156c5af9] Listening on 127.0.0.1:3306
2024/09/12 19:53:52 The proxy has started successfully and is ready for new connections!

then using the mysql client (v8.4)

$ mysql -u redacted -p --host 127.0.0.1 --port 3306
ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 2

yields the same result

2024/09/12 19:54:13 [redacted-prod:europe-west1:redacted-156c5af9] Accepted connection from 127.0.0.1:53534
2024/09/12 19:54:13 [redacted-prod:europe-west1:redacted-156c5af9] failed to connect to instance: Dial error: handshake failed (connection name = "redacted-prod:europe-west1:redacted-156c5af9"): read tcp 172.19.50.151:53535->redacted:3307: read: connection reset by peer

tcpdump looks similar

$ sudo tcpdump --interface any -nn port 3307                                                                                                                                               

tcpdump: data link type PKTAP
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type PKTAP (Apple DLT_PKTAP), snapshot length 524288 bytes
19:54:13.730494 IP 172.19.50.151.53535 > redacted.3307: Flags [S], seq 1648980533, win 65535, options [mss 1360,nop,wscale 6,nop,nop,TS val 2312443875 ecr 0,sackOK,eol], length 0
19:54:13.796034 IP redacted.3307 > 172.19.50.151.53535: Flags [S.], seq 1731921490, ack 1648980534, win 32384, options [mss 1332,sackOK,TS val 1671508660 ecr 2312443875,nop,wscale 7], length 0
19:54:13.796091 IP 172.19.50.151.53535 > redacted.3307: Flags [.], ack 1, win 2062, options [nop,nop,TS val 2312443941 ecr 1671508660], length 0
19:54:13.796688 IP 172.19.50.151.53535 > redacted.3307: Flags [.], seq 1:1321, ack 1, win 2062, options [nop,nop,TS val 2312443941 ecr 1671508660], length 1320
19:54:13.796695 IP 172.19.50.151.53535 > redacted.3307: Flags [P.], seq 1321:1486, ack 1, win 2062, options [nop,nop,TS val 2312443941 ecr 1671508660], length 165
19:54:13.843445 IP redacted.3307 > 172.19.50.151.53535: Flags [R.], seq 1, ack 1321, win 2062, length 0
enocom commented 2 months ago

Thanks for the report. In that case I think we need to look into this -- that's wholly unexpected and something we definitely don't see in our automated testing.

Out of curiosity do you see the same behavior across instances?

dan-lind commented 2 months ago

Yes, same across instances

tors 12 sep. 2024 kl. 21:49 skrev Eno Compton @.***>:

Thanks for the report. In that case I think we need to look into this -- that's wholly unexpected and something we definitely don't see in our automated testing.

Out of curiosity do you see the same behavior across instances?

— Reply to this email directly, view it on GitHub https://github.com/GoogleCloudPlatform/cloud-sql-proxy/issues/2296#issuecomment-2347113732, or unsubscribe https://github.com/notifications/unsubscribe-auth/AADJ2OD2XA5EVYLJ5AGVT6DZWHV5NAVCNFSM6AAAAABODAKRY6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGNBXGEYTGNZTGI . You are receiving this because you were mentioned.Message ID: @.***>

jackwotherspoon commented 1 month ago

Hi @dan-lind this looks like an issue on the server side.

Do you happen to have a support contract with Google? If so you could open a case and they will have access to check your server logs.

Let me know, otherwise I can try and figure out a way to take a look deeper for you.

dan-lind commented 1 month ago

We do yes, I've reached out to them referring to this issue.

jackwotherspoon commented 1 month ago

We do yes, I've reached out to them referring to this issue.

Perfect, that is great to hear.

TokieSan commented 1 month ago

I'm experiencing the same issue when trying to connect through the proxy. I receive a "connection reset by peer" error during the handshake process. Here’s a snippet from my logs:

failed to connect to instance: Dial error: handshake failed (connection name = "[redacted]-prod:europe-west1:[redacted]-live-6aab2946"): read tcp [ip]:[port]->[redacted]:3307: read: connection reset by peer

This issue occurs consistently (direct access & via python's connectors), I have tried reinitializing from my main device & my alt but nothing seems to work. Other users (on same project, with the same permissions) are not facing this issue even though they are following the same connection procedure & running the same OS with the latest updates.

UPDATE: Downgrading to v2.11.3 seems to have solved the issue for now as a work around, will try to identify the version where the issue started

UPDATE 2: After some git bisecting the commit that started the issue was fd6bd49242c884508f641c754eb5cec5d28ac665 with the bump to go 1.23, reverting the version solves the issue

jackwotherspoon commented 1 month ago

UPDATE: Downgrading to v2.11.3 seems to have solved the issue for now as a work around, will try to identify the version where the issue started

UPDATE 2: After some git bisecting the commit that started the issue was fd6bd49242c884508f641c754eb5cec5d28ac665 with the bump to go 1.23, reverting the version solves the issue

@TokieSan Thanks for the detailed analysis and extra data points! I would recommend pinning to an older version for now, in the meantime I will investigate this further to see if there is maybe something on the server side that is causing issues with the latest Go 1.23 TLS changes.

dan-lind commented 1 month ago

Confirming that downgrading to v2.11.3 solve the issue for me as well!

jackwotherspoon commented 1 month ago

Confirming that downgrading to v2.11.3 solve the issue for me as well!

Thanks for confirming @dan-lind , appreciate it.

I am just curious if you were able to successfully open up a support case for this issue? If so I can escalate it to get folks to look at the server side and see if we can pinpoint the issue.

dan-lind commented 1 month ago

I am just curious if you were able to successfully open up a support case for this issue? If so I can escalate it to get folks to look at the server side and see if we can pinpoint the issue.

Yes, I have a support case open!

EDIT: @jackwotherspoon I don't know how this usually works, but I've asked in my support case for them to reach out to you.

jackwotherspoon commented 1 month ago

EDIT: @jackwotherspoon I don't know how this usually works, but I've asked in my support case for them to reach out to you.

That is perfect, thanks

jackwotherspoon commented 1 month ago

@dan-lind @TokieSan I am taking a look at the server side to see if there are any issues with the use of Go 1.23

In the meantime, I wanted to point out https://github.com/GoogleCloudPlatform/cloud-sql-proxy/issues/2294 which was also an issue for 2.13.0 when connecting to a Private IP Cloud SQL instance via a VPN...

looking at the changelog of go version 1.22..1.23, I got this change: experimental post-quantum key exchange mechanism X25519Kyber768Draft00 is now enabled by default when Config.CurvePreferences is nil. The default can be reverted by adding tlskyber=0 to the GODEBUG environment variable.

disabling it, using GODEBUG=tlskyber=0 cloud-sql-proxy-2.13.0 ... make cloud-sql-proxy works again inside the vpn.

I wonder if one of you could try starting the v2.13.0 Proxy with GODEBUG=tlskyber=0 cloud-sql-proxy ... and see if maybe this error is also a symptom of the same issue?

dan-lind commented 1 month ago

I wonder if one of you could try starting the v2.13.0 Proxy with GODEBUG=tlskyber=0 cloud-sql-proxy ... and see if maybe this error is also a symptom of the same issue?

Yes! GODEBUG=tlskyber=0 indeed works with 2.13, i.e. the error goes away and I can connect just as when downgrading to v2.11.3.

jackwotherspoon commented 1 month ago

Yes! GODEBUG=tlskyber=0 indeed works with 2.13, i.e. the error goes away and I can connect just as when downgrading to v2.11.3.

@dan-lind Great to hear it!

I am going to look at open Go issue's and see if I need to create one as it seems like the new default preferences implemented in https://github.com/golang/go/issues/67061 are not playing nicely with VPNs

enocom commented 1 month ago

I think we have a similar problem with SOCKS proxies too. https://github.com/GoogleCloudPlatform/alloydb-auth-proxy/issues/714

enocom commented 1 week ago

FYI https://github.com/GoogleCloudPlatform/alloydb-auth-proxy/issues/714#issuecomment-2460486141

We should probably make this a feature request, since the root cause of these TLS handshake failures are buggy TLS servers.