joshbautista opened 1 month ago
Possibly relevant changes between 0.37.0 and 0.38.0:

- `com.google.cloud:libraries-bom` upgraded from 26.44.0 to 26.47.0
- `com.google.cloud:google-cloud-spanner` bumped from 6.72.0 to 6.74.1

Possibly relevant changes between google-cloud-spanner 6.72.0 and 6.74.1:
One interesting difference in the thread dumps:
"grpc-transport-17440" #18151 [18152] daemon prio=5 os_prio=0 cpu=4.50ms elapsed=8555.78s tid=0x00007dd26801e650 nid=18152 waiting on condition [0x00007dd26d5f9000]
java.lang.Thread.State: TIMED_WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@21.0.4/Native Method)
- parking to wait for <0x00000000caf17060> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(java.base@21.0.4/Unknown Source)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(java.base@21.0.4/Unknown Source)
at java.util.concurrent.LinkedBlockingQueue.poll(java.base@21.0.4/Unknown Source)
at com.google.cloud.spanner.GrpcStreamIterator.computeNext(GrpcStreamIterator.java:107)
at com.google.cloud.spanner.GrpcStreamIterator.computeNext(GrpcStreamIterator.java:35)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:145)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:140)
at com.google.cloud.spanner.ResumableStreamIterator.computeNext(ResumableStreamIterator.java:246)
at com.google.cloud.spanner.ResumableStreamIterator.computeNext(ResumableStreamIterator.java:55)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:145)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:140)
at com.google.cloud.spanner.GrpcValueIterator.ensureReady(GrpcValueIterator.java:134)
at com.google.cloud.spanner.GrpcValueIterator.getMetadata(GrpcValueIterator.java:110)
at com.google.cloud.spanner.GrpcResultSet.next(GrpcResultSet.java:81)
at com.google.cloud.spanner.SessionPool$PooledSession.keepAlive(SessionPool.java:1804)
at com.google.cloud.spanner.SessionPool$PooledSession.access$4900(SessionPool.java:1588)
at com.google.cloud.spanner.SessionPool$PoolMaintainer.keepAliveSessions(SessionPool.java:2133)
at com.google.cloud.spanner.SessionPool$PoolMaintainer.maintainPool(SessionPool.java:2066)
at com.google.cloud.spanner.SessionPool$PoolMaintainer$$Lambda/0x00007dd34a31fd40.run(Unknown Source)
at java.util.concurrent.Executors$RunnableAdapter.call(java.base@21.0.4/Unknown Source)
at java.util.concurrent.FutureTask.runAndReset(java.base@21.0.4/Unknown Source)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(java.base@21.0.4/Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@21.0.4/Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@21.0.4/Unknown Source)
at java.lang.Thread.runWith(java.base@21.0.4/Unknown Source)
at java.lang.Thread.run(java.base@21.0.4/Unknown Source)
```
None of the working thread dumps contain this entry. This stack trace comes from the keep-alive queries in the normal session pool. Note that this should be unrelated to the change to the core pool size for maintaining multiplexed sessions: that is a completely separate part of the code, and it only sounds and looks similar because of the names.

The thread dump seems to indicate that the keep-alive query is stuck waiting for Spanner to return the results of a simple `SELECT 1` query.
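For illustration, here is a minimal sketch of the pattern visible in the stack trace, written against the public `DatabaseClient` API rather than the internal `SessionPool$PooledSession` (so the structure is a reconstruction, not the actual implementation):

```java
import com.google.cloud.spanner.DatabaseClient;
import com.google.cloud.spanner.ResultSet;
import com.google.cloud.spanner.Statement;

public class KeepAliveSketch {
  // Runs the same kind of trivial query the pool maintainer uses to keep a
  // session alive, and drains the result set.
  static void keepAlive(DatabaseClient client) {
    try (ResultSet resultSet = client.singleUse().executeQuery(Statement.of("SELECT 1"))) {
      while (resultSet.next()) {
        // The row content is irrelevant; the query only verifies that the
        // session is still usable. next() blocks until Spanner responds.
      }
    }
  }
}
```

The important detail is that `ResultSet.next()` parks in `GrpcStreamIterator.computeNext()` until a gRPC response arrives; if no response is ever delivered, the maintainer thread stays parked exactly as in the dump above.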
The thread dump taken while a `psql` query was stuck contains this stack trace for the stuck query:
"ConnectionHandler-51" #18180 [26025] daemon prio=5 os_prio=0 cpu=76.47ms elapsed=443.52s tid=0x00007dd3080503a0 nid=26025 waiting on condition [0x00007dd26d4f9000]
java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@21.0.4/Native Method)
- parking to wait for <0x00000000b6ab0538> (a com.google.common.util.concurrent.AbstractCatchingFuture$CatchingFuture)
at java.util.concurrent.locks.LockSupport.park(java.base@21.0.4/Unknown Source)
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:564)
at com.google.common.util.concurrent.FluentFuture$TrustedFuture.get(FluentFuture.java:91)
at com.google.common.util.concurrent.ForwardingFuture.get(ForwardingFuture.java:67)
at com.google.cloud.spanner.SpannerApiFutures.getOrNull(SpannerApiFutures.java:31)
at com.google.cloud.spanner.SpannerApiFutures.get(SpannerApiFutures.java:26)
at com.google.cloud.spanner.connection.ConnectionImpl.internalExecuteQuery(ConnectionImpl.java:1753)
at com.google.cloud.spanner.connection.ConnectionImpl.internalExecute(ConnectionImpl.java:1184)
at com.google.cloud.spanner.connection.ConnectionImpl.execute(ConnectionImpl.java:1163)
at com.google.cloud.spanner.pgadapter.statements.BackendConnection$Execute.executeOnSpannerWithLogging(BackendConnection.java:481)
at com.google.cloud.spanner.pgadapter.statements.BackendConnection$Execute.executeOnSpanner(BackendConnection.java:460)
at com.google.cloud.spanner.pgadapter.statements.BackendConnection$Execute.analyzeOrExecute(BackendConnection.java:442)
at com.google.cloud.spanner.pgadapter.statements.BackendConnection$Execute.doExecute(BackendConnection.java:368)
at com.google.cloud.spanner.pgadapter.statements.BackendConnection$BufferedStatement.execute(BackendConnection.java:234)
at com.google.cloud.spanner.pgadapter.statements.BackendConnection.flush(BackendConnection.java:1165)
at com.google.cloud.spanner.pgadapter.statements.BackendConnection.sync(BackendConnection.java:1049)
at com.google.cloud.spanner.pgadapter.statements.ExtendedQueryProtocolHandler.sync(ExtendedQueryProtocolHandler.java:194)
at com.google.cloud.spanner.pgadapter.wireprotocol.SyncMessage.sendPayload(SyncMessage.java:40)
at com.google.cloud.spanner.pgadapter.wireprotocol.WireMessage.send(WireMessage.java:81)
at com.google.cloud.spanner.pgadapter.statements.SimpleQueryStatement.execute(SimpleQueryStatement.java:111)
at com.google.cloud.spanner.pgadapter.wireprotocol.QueryMessage.sendPayload(QueryMessage.java:65)
at com.google.cloud.spanner.pgadapter.wireprotocol.WireMessage.send(WireMessage.java:81)
at com.google.cloud.spanner.pgadapter.ConnectionHandler.handleMessages(ConnectionHandler.java:506)
at com.google.cloud.spanner.pgadapter.ConnectionHandler.runConnection(ConnectionHandler.java:420)
at com.google.cloud.spanner.pgadapter.ConnectionHandler.run(ConnectionHandler.java:341)
at java.lang.Thread.runWith(java.base@21.0.4/Unknown Source)
at java.lang.Thread.run(java.base@21.0.4/Unknown Source)
```
This indicates that PGAdapter is waiting for the Java client to execute and return the results of a query. The corresponding Spanner connection that is executing the query:
"connection-executor-50" #18181 [26026] daemon prio=5 os_prio=0 cpu=30.51ms elapsed=431.87s tid=0x00007dd1d80201e0 nid=26026 waiting on condition [0x00007dd26d3fa000]
java.lang.Thread.State: TIMED_WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@21.0.4/Native Method)
- parking to wait for <0x00000000b6ab0638> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(java.base@21.0.4/Unknown Source)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(java.base@21.0.4/Unknown Source)
at java.util.concurrent.LinkedBlockingQueue.poll(java.base@21.0.4/Unknown Source)
at com.google.cloud.spanner.GrpcStreamIterator.computeNext(GrpcStreamIterator.java:107)
at com.google.cloud.spanner.GrpcStreamIterator.computeNext(GrpcStreamIterator.java:35)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:145)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:140)
at com.google.cloud.spanner.ResumableStreamIterator.computeNext(ResumableStreamIterator.java:246)
at com.google.cloud.spanner.ResumableStreamIterator.computeNext(ResumableStreamIterator.java:55)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:145)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:140)
at com.google.cloud.spanner.GrpcValueIterator.ensureReady(GrpcValueIterator.java:134)
at com.google.cloud.spanner.GrpcValueIterator.getMetadata(GrpcValueIterator.java:110)
at com.google.cloud.spanner.GrpcResultSet.next(GrpcResultSet.java:81)
at com.google.cloud.spanner.connection.DirectExecuteResultSet.<init>(DirectExecuteResultSet.java:68)
at com.google.cloud.spanner.connection.DirectExecuteResultSet.ofResultSet(DirectExecuteResultSet.java:62)
at com.google.cloud.spanner.connection.SingleUseTransaction.lambda$executeQueryAsync$0(SingleUseTransaction.java:267)
at com.google.cloud.spanner.connection.SingleUseTransaction$$Lambda/0x00007dd34a482ca8.call(Unknown Source)
at io.grpc.Context$2.call(Context.java:583)
at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)
at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:76)
at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)
at io.opentelemetry.context.Context.lambda$wrap$1(Context.java:212)
at io.opentelemetry.context.Context$$Lambda/0x00007dd34a487b70.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@21.0.4/Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@21.0.4/Unknown Source)
at java.lang.Thread.runWith(java.base@21.0.4/Unknown Source)
at java.lang.Thread.run(java.base@21.0.4/Unknown Source)
```
This indicates that this query, too, is just stuck waiting for results to come back from Spanner. The session pool keep-alive query does not use a custom timeout; it just uses the default timeouts of the Spanner library. That could be consistent with the observation that `INSERT` and `UPDATE` statements from the application, which also use the default timeouts of the Spanner library, suffer from the same issue.
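For context, this is roughly how an application could override those defaults for streaming queries (`ExecuteStreamingSql` is the RPC underlying `executeQuery`). This is a sketch based on the client library's documented gax `RetrySettings` mechanism, not PGAdapter code; the project id and the 30-second value are placeholders:

```java
import com.google.api.gax.retrying.RetrySettings;
import com.google.cloud.spanner.Spanner;
import com.google.cloud.spanner.SpannerOptions;
import org.threeten.bp.Duration;

public class CustomTimeouts {
  public static Spanner createSpanner() {
    SpannerOptions.Builder builder = SpannerOptions.newBuilder().setProjectId("my-project");
    // Override the default timeout for ExecuteStreamingSql (queries).
    builder.getSpannerStubSettingsBuilder()
        .executeStreamingSqlSettings()
        .setRetrySettings(
            RetrySettings.newBuilder()
                .setInitialRpcTimeout(Duration.ofSeconds(30))
                .setMaxRpcTimeout(Duration.ofSeconds(30))
                .setTotalTimeout(Duration.ofSeconds(30))
                .build());
    return builder.build().getService();
  }
}
```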
Interesting additional fact: in the cases where the queries are stuck, the keep-alive queries that are executed by the session pools are being executed by a `grpc-transport-executor` thread. That is also the only `grpc-transport-executor` thread that exists at the time of those queries, which to me indicates that no results will ever be delivered, because delivering results would normally be done by the `grpc-transport-executor` thread(s).

The `grpc-transport-executor` threads are a dead end. Those threads do not play a role in the gRPC calls involved in this case.

The only way so far that I've been able to reproduce this issue is by dropping all incoming network traffic to the underlying Spanner client. That means that outgoing messages do not cause any errors, but queries and other statements just wait indefinitely for a response from Spanner.
The specific `waiting_for_connection` part of the error message is logged here: https://github.com/grpc/grpc-java/blob/0b2c17d0da84e92fbb104bd46ccb2f61860db04b/core/src/main/java/io/grpc/internal/DelayedStream.java#L111

This is logged when the timeout occurs while the client is waiting for the stream (connection) to be established. That is: the client has started an attempt to execute a streaming RPC, and that attempt then times out while it is still waiting for the stream to be established.
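A self-contained way to see this behavior (a sketch, not the original repro): point a gRPC client at a socket that accepts the TCP connection but never completes the HTTP/2 handshake. The transport never becomes ready, the RPC stays parked in `DelayedStream`, and the deadline fires while the call is still waiting for the stream. This assumes `io.grpc:grpc-core` and `io.grpc:grpc-services` (for the health service stubs) on the classpath:

```java
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import io.grpc.StatusRuntimeException;
import io.grpc.health.v1.HealthCheckRequest;
import io.grpc.health.v1.HealthGrpc;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.TimeUnit;

public class WaitingForConnectionDemo {
  public static void main(String[] args) throws Exception {
    try (ServerSocket blackhole = new ServerSocket(0)) {
      Thread acceptor = new Thread(() -> {
        // Accept the TCP connection, then hold it open without ever sending
        // the HTTP/2 handshake, so the gRPC transport never becomes ready.
        try (Socket ignored = blackhole.accept()) {
          Thread.sleep(Long.MAX_VALUE);
        } catch (Exception e) {
          // Shutting down.
        }
      });
      acceptor.setDaemon(true);
      acceptor.start();

      ManagedChannel channel =
          ManagedChannelBuilder.forAddress("localhost", blackhole.getLocalPort())
              .usePlaintext()
              .build();
      try {
        HealthGrpc.newBlockingStub(channel)
            .withDeadlineAfter(5, TimeUnit.SECONDS)
            .check(HealthCheckRequest.getDefaultInstance());
      } catch (StatusRuntimeException e) {
        // Expected: DEADLINE_EXCEEDED, with stream-establishment details such
        // as waiting_for_connection appended by grpc-java.
        System.out.println(e.getStatus());
      } finally {
        channel.shutdownNow();
      }
    }
  }
}
```

This approximates the dropped-inbound-traffic scenario described above: outgoing packets succeed, but no response ever arrives, so the stream is never established.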
grpc-core versions in the relevant PGAdapter versions are:

- v0.37.0: `io.grpc:grpc-core:jar:1.65.1`
- v0.38.0: `io.grpc:grpc-core:jar:1.66.0`
Current Behavior
Beginning in versions 0.38.0 and 0.39.0, we've randomly observed queries through PGAdapter stalling and timing out with a `waiting_for_connection` error.
The deadline of ~55 seconds is expected, since we run `SET statement_timeout = 55000;` on every read query to prevent Spanner from running queries longer than necessary.

When a PGAdapter instance enters this unstable state, some connections in the pool seem to be tainted while others work just fine. We can run `psql` against a faulted instance and execute simple queries (e.g. `select * from table limit 1;`) successfully a few times, but a query will stall randomly and eventually time out with the `waiting_for_connection` error. The stall occurs without a `statement_timeout` set as well; we just won't see the error come through, as the query then attempts to run indefinitely.

We have also seen this on write statements (e.g. `UPDATE`, `INSERT`), albeit with a lower timeout of 30 seconds, which does seem to match the Spanner gRPC configuration defaults.
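For reference, a minimal sketch of this per-query timeout pattern over the PostgreSQL wire protocol (the connection string, database, and table names are placeholders, not our actual setup; PGAdapter speaks the PostgreSQL wire protocol, so the standard PostgreSQL JDBC driver works):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class StatementTimeoutExample {
  public static void main(String[] args) throws Exception {
    try (Connection connection =
            DriverManager.getConnection("jdbc:postgresql://localhost:5432/my-database");
        Statement statement = connection.createStatement()) {
      // 55000 ms: matches the ~55 second deadline observed in the error.
      statement.execute("SET statement_timeout = 55000");
      try (ResultSet resultSet = statement.executeQuery("select * from my_table limit 1")) {
        while (resultSet.next()) {
          System.out.println(resultSet.getObject(1));
        }
      }
    }
  }
}
```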
Reproducing the error state is quite difficult since we get all sorts of traffic to each pod and it only seems to manifest itself against one or two pods out of 120 at a time. It will generally occur 24 hours after a new Deployment, but sometimes will show up in as little as a few hours.
Reverting to version 0.37.0 fixes this behavior (so far at least).
Context (Environment)
Environment Variables: JDK_JAVA_OPTIONS = "-Xmx1600M -Xms1600M -XshowSettings:vm"