Recover gracefully from laptop sleep

enocom commented 1 year ago

Right now if a laptop goes to sleep with an active connection to a Cloud SQL instance, the socket factory won't recover on its own once the laptop has been restored.

There are limitations in Java to handling this:

https://stackoverflow.com/questions/52876556/how-does-java-calculate-sleep-time-when-pc-goes-into-hibernate-mode

Nonetheless, we should throw out a bad connection and force a refresh, such that the socket factory recovers gracefully from sleep.

See https://github.com/luryus/cloud-sql-jdbc-hang for a reproduction of this issue.

Whatever we do here, we should port to the AlloyDB Java Connector as well.

enocom commented 1 year ago

cc @luryus

This is a problem across Connectors and we do want to improve support for sleep.

luryus commented 11 months ago

Hi! I noticed that this (or a very similar issue) got fixed in the Go connector: https://github.com/GoogleCloudPlatform/cloud-sql-go-connector/pull/686

Any chance this could now get ported to this Java library as well?

annafang-google commented 11 months ago

Ack. We'll port the fix over shortly.

enocom commented 11 months ago

I think this change should match the Go connector -- checking for an invalid cert and blocking until one is available if the cert is in fact expired.

luryus commented 9 months ago

Hi @ttosta-google ,

it seems that this is still not fixed in v1.16.0 and with #1771.

Even though the library now immediately starts a refresh when it notices that the current cert is expired, the force refresh does not succeed. Logs indicate that it gets stuck in the refresh rate limiter somehow. The library logs "Acquiring rate limiter permit.", but not "Rate limiter permit acquired". All subsequent connection retries then just get the good old bad_certificate errors.

This reproduces with both my reproducer project and DBeaver.

Here's sample logs:

Feb 17, 2024 1:33:22 PM com.google.cloud.sql.core.Refresher handleRefreshResult
FINE: [<instance name>] Refresh Operation: Completed refresh with new certificate expiration at 2024-02-17T12:33:21Z.
Feb 17, 2024 1:33:22 PM com.google.cloud.sql.core.Refresher handleRefreshResult
FINE: [<instance name>] Refresh Operation: Next operation scheduled at 2024-02-17T12:29:20Z.
Feb 17, 2024 1:33:37 PM com.lkoskela.googlejdbchang.App main
FINE: Connection is still valid
Feb 17, 2024 1:33:57 PM com.lkoskela.googlejdbchang.App main
FINE: Connection is still valid

...
<sleep for more than an hour>
...

Feb 17, 2024 3:48:29 PM org.postgresql.jdbc.PgConnection isValid
FINE: Validating connection.
org.postgresql.util.PSQLException: This connection has been closed.
        at org.postgresql.jdbc.PgConnection.checkClosed(PgConnection.java:993)
        at org.postgresql.jdbc.PgConnection.setNetworkTimeout(PgConnection.java:1694)
        at org.postgresql.jdbc.PgConnection.isValid(PgConnection.java:1524)
        at com.lkoskela.googlejdbchang.App.main(App.java:56)

Feb 17, 2024 3:48:29 PM com.lkoskela.googlejdbchang.App main
INFO: Connection no longer valid
Feb 17, 2024 3:48:30 PM com.lkoskela.googlejdbchang.App main
INFO: Opening connection...
Feb 17, 2024 3:48:30 PM org.postgresql.Driver connect
FINE: Connecting with URL: jdbc:postgresql:///<redacted>
Feb 17, 2024 3:48:30 PM org.postgresql.jdbc.PgConnection <init>
FINE: PostgreSQL JDBC Driver 42.6.0
Feb 17, 2024 3:48:30 PM org.postgresql.jdbc.PgConnection setDefaultFetchSize
FINE:   setDefaultFetchSize = 0
Feb 17, 2024 3:48:30 PM org.postgresql.jdbc.PgConnection setPrepareThreshold
FINE:   setPrepareThreshold = 5
Feb 17, 2024 3:48:30 PM org.postgresql.core.v3.ConnectionFactoryImpl openConnectionImpl
FINE: Trying to establish a protocol version 3 connection to :5432
Feb 17, 2024 3:48:30 PM com.google.cloud.sql.core.Refresher refreshIfExpired
FINE: [<instance name>] Client certificate has expired. Starting next refresh operation immediately.
Feb 17, 2024 3:48:30 PM com.google.cloud.sql.core.Refresher forceRefresh
FINE: [<instance name>] Force Refresh: the next refresh operation was cancelled. Scheduling new refresh operation immediately.
Feb 17, 2024 3:48:30 PM com.google.cloud.sql.core.Refresher startRefreshAttempt
FINE: [<instance name>] Refresh Operation: Acquiring rate limiter permit.
Feb 17, 2024 3:48:30 PM jdk.internal.event.EventHelper logX509CertificateEvent
<redacted>
Feb 17, 2024 3:48:30 PM jdk.internal.event.EventHelper logTLSHandshakeEvent
FINE:  TLSHandshake: <redacted>
Feb 17, 2024 3:48:30 PM org.postgresql.core.v3.ConnectionFactoryImpl tryConnect
FINE: Receive Buffer Size is 65,536
Feb 17, 2024 3:48:30 PM org.postgresql.core.v3.ConnectionFactoryImpl tryConnect
FINE: Send Buffer Size is 65,536
Feb 17, 2024 3:48:30 PM org.postgresql.Driver connect
FINE: Connection error:
org.postgresql.util.PSQLException: The connection attempt failed.
        at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:354)
        at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:54)
        at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:263)
        at org.postgresql.Driver.makeConnection(Driver.java:443)
        at org.postgresql.Driver.connect(Driver.java:297)
        at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:681)
        at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:190)
        at com.lkoskela.googlejdbchang.App.main(App.java:54)
Caused by: javax.net.ssl.SSLHandshakeException: Received fatal alert: bad_certificate
        at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:131)

ttosta-google commented 9 months ago

Thanks for the info @luryus! I've reopened this issue and will investigate it further.

enocom commented 9 months ago

This is another example where our rate limiting approach to refreshes doesn't work so well.

The rate limiter should be only wait 30 seconds to try again, though.

luryus commented 7 months ago

Hi, any updates on this? With 1.17.1 this still happens. I just retested this with my test program and it seems that after the pc has been asleep, the rate limiter lets the force refresh succeed only after a long time - probably only after the duration for the next scheduled refresh is reached. Until that the connection attempts fail with the errors I pasted in my previous message, even if I wait more than 30 seconds between each attempt.

It looks like a thread pool is starved or something, and that's why the refresh operations get stuck in the rate limiter.

Please take a look at this again - I run to this almost daily while using DBeaver and it's super annoying.

enocom commented 7 months ago

We've recently fixed a similar issue in the Proxy. For now I'd recommend connecting through the Proxy (which handles sleep without issue). And meanwhile, we've moved this up in our list of priorities and will investigate soon.

luryus commented 3 months ago

This issue seems to be fixed by switching to the lazy refresh strategy. Makes sense - Windows sleeping disturbs timers just like idling on serverless platforms does, so avoiding them entirely fixes the problem.

There's a slight delay when the certificates are refreshed after I wake up my laptop and make a query in DBeaver, but that's not a problem at all.

Perhaps the docs could be updated to cover this use case?

jackwotherspoon commented 3 months ago

@luryus I would agree. Probably a good idea for us to update the README's Lazy Refresh section to mention the awaking from sleep use case.

GoogleCloudPlatform / cloud-sql-jdbc-socket-factory

Recover gracefully from laptop sleep #1558