GoogleCloudPlatform / pubsub

This repository contains open-source projects managed by the owners of Google Cloud Pub/Sub.
Apache License 2.0
245 stars 146 forks source link

PubSub publisher keeps getting DEADLINE_EXCEEDED after a link switchover #206

Open piter75 opened 5 years ago

piter75 commented 5 years ago

We are using PubSub connector for Kafka Connect from the following commit: https://github.com/GoogleCloudPlatform/pubsub/commit/7d9d4b6f47bf3b87d12b51b6e8319f74a3e829be

The other day we had a problem with one of our network links. This particular link is used to route data between one of our data centres and GCP.

The problem with the link started at 10:18PM and and after tens of seconds the routing has been changed with BGP to use another link and the traffic between our data centre and GCP has been restored. Nonetheless PubSub connector in our Kafka cluster kept reporting DEADLINE_EXCEEDED errors for another half an hour until 10:48PM when it suddenly started working properly again.

May 16 22:48:24 connect-distributed.sh[3366]: [2019-05-16 22:48:24,221] ERROR WorkerSinkTask{id=ohsumi_page_view_raw-0} Commit of offsets threw an unexpected exception for sequence number 40: null (org.apache.kafka.connect.runti
May 16 22:48:24 connect-distributed.sh[3366]: java.lang.RuntimeException: java.util.concurrent.ExecutionException: com.google.api.gax.rpc.DeadlineExceededException: io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exc
May 16 22:48:24 connect-distributed.sh[3366]:         at com.google.pubsub.kafka.sink.CloudPubSubSinkTask.flush(CloudPubSubSinkTask.java:296)
May 16 22:48:24 connect-distributed.sh[3366]:         at org.apache.kafka.connect.sink.SinkTask.preCommit(SinkTask.java:125)
May 16 22:48:24 connect-distributed.sh[3366]:         at org.apache.kafka.connect.runtime.WorkerSinkTask.commitOffsets(WorkerSinkTask.java:379)
May 16 22:48:24 connect-distributed.sh[3366]:         at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:209)
May 16 22:48:24 connect-distributed.sh[3366]:         at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:193)
May 16 22:48:24 connect-distributed.sh[3366]:         at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:175)
May 16 22:48:24 connect-distributed.sh[3366]:         at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:219)
May 16 22:48:24 connect-distributed.sh[3366]:         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
May 16 22:48:24 connect-distributed.sh[3366]:         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
May 16 22:48:24 connect-distributed.sh[3366]:         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
May 16 22:48:24 connect-distributed.sh[3366]:         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
May 16 22:48:24 connect-distributed.sh[3366]:         at java.lang.Thread.run(Thread.java:748)
May 16 22:48:24 connect-distributed.sh[3366]: Caused by: java.util.concurrent.ExecutionException: com.google.api.gax.rpc.DeadlineExceededException: io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 99999
May 16 22:48:24 connect-distributed.sh[3366]:         at com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:531)
May 16 22:48:24 connect-distributed.sh[3366]:         at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:492)
May 16 22:48:24 connect-distributed.sh[3366]:         at com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:83)
May 16 22:48:24 connect-distributed.sh[3366]:         at com.google.common.util.concurrent.ForwardingFuture.get(ForwardingFuture.java:62)
May 16 22:48:24 connect-distributed.sh[3366]:         at com.google.pubsub.kafka.sink.CloudPubSubSinkTask.flush(CloudPubSubSinkTask.java:294)
May 16 22:48:24 connect-distributed.sh[3366]:         ... 11 more
May 16 22:48:24 connect-distributed.sh[3366]: Caused by: com.google.api.gax.rpc.DeadlineExceededException: io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 9999958496ns
May 16 22:48:24 connect-distributed.sh[3366]:         at com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:51)
May 16 22:48:24 connect-distributed.sh[3366]:         at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:72)
May 16 22:48:24 connect-distributed.sh[3366]:         at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:60)
May 16 22:48:24 connect-distributed.sh[3366]:         at com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97)
May 16 22:48:24 connect-distributed.sh[3366]:         at com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:68)
May 16 22:48:24 connect-distributed.sh[3366]:         at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1052)
May 16 22:48:24 connect-distributed.sh[3366]:         at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:398)
May 16 22:48:24 connect-distributed.sh[3366]:         at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1030)
May 16 22:48:24 connect-distributed.sh[3366]:         at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:871)
May 16 22:48:24 connect-distributed.sh[3366]:         at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:716)
May 16 22:48:24 connect-distributed.sh[3366]:         at io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:507)
May 16 22:48:24 connect-distributed.sh[3366]:         at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:482)
May 16 22:48:24 connect-distributed.sh[3366]:         at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
May 16 22:48:24 connect-distributed.sh[3366]:         at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
May 16 22:48:24 connect-distributed.sh[3366]:         at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
May 16 22:48:24 connect-distributed.sh[3366]:         at io.grpc.internal.CensusStatsModule$StatsClientInterceptor$1$1.onClose(CensusStatsModule.java:678)
May 16 22:48:24 connect-distributed.sh[3366]:         at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
May 16 22:48:24 connect-distributed.sh[3366]:         at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
May 16 22:48:24 connect-distributed.sh[3366]:         at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
May 16 22:48:24 connect-distributed.sh[3366]:         at io.grpc.internal.CensusTracingModule$TracingClientInterceptor$1$1.onClose(CensusTracingModule.java:397)
May 16 22:48:24 connect-distributed.sh[3366]:         at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:459)
May 16 22:48:24 connect-distributed.sh[3366]:         at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:63)
May 16 22:48:24 connect-distributed.sh[3366]:         at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:546)
May 16 22:48:24 connect-distributed.sh[3366]:         at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$600(ClientCallImpl.java:467)
May 16 22:48:24 connect-distributed.sh[3366]:         at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:584)
May 16 22:48:24 connect-distributed.sh[3366]:         at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
May 16 22:48:24 connect-distributed.sh[3366]:         at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
May 16 22:48:24 connect-distributed.sh[3366]:         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
May 16 22:48:24 connect-distributed.sh[3366]:         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
May 16 22:48:24 connect-distributed.sh[3366]:         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
May 16 22:48:24 connect-distributed.sh[3366]:         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
May 16 22:48:24 connect-distributed.sh[3366]:         ... 3 more
May 16 22:48:24 connect-distributed.sh[3366]: Caused by: io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 9999958496ns
May 16 22:48:24 connect-distributed.sh[3366]:         at io.grpc.Status.asRuntimeException(Status.java:526)
May 16 22:48:24 connect-distributed.sh[3366]:         ... 23 more

During that time we observed increased number of message publish operations in Stackdriver monitoring for published topics (presumably retries). I have checked with our network team and the original link was not restored until later that night.

What is mysterious to us is why this error kept happening after the link switchover and why it resolved itself half an hour later although not much else has changed - at least to our knowledge. We have a hypothesis that after the link switchover the connection to PubSub still kept the old TCP session open and publisher did not receive publish confirmation. Then maybe at 10:48PM the session was forcefully closed at one of the network gateways in between (30 mins seems like a pretty often used session close time) and published has to reconnect, opened a new session and it was publishing uninterrupted since then.

I know it is a difficult issue to reproduce but maybe it also happened to you and there are some hidden flags which we could use but even do not know about to make reconnections in such cases smooth.

callosin commented 5 years ago

Hi,

I am experimenting the same problem with the v0.3 tag revision. For some reason, it started failling like 2 weeks ago without changing anything.

jul. 01, 2019 9:51:37 A. M. com.google.common.util.concurrent.AggregateFuture$RunningState handleException GRAVE: Got more than one input Future failure. Logging failures after the first com.google.api.gax.rpc.DeadlineExceededException: io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 9999419276ns at com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:51) at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:72) at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:60) at com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97) at com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:68) at com.google.common.util.concurrent.Futures$4.run(Futures.java:1123) at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:435) at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:900) at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:811) at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:675) at io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:507) at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:482) at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39) at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23) at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40) at io.grpc.internal.CensusStatsModule$StatsClientInterceptor$1$1.onClose(CensusStatsModule.java:678) at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39) at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23) at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40) at io.grpc.internal.CensusTracingModule$TracingClientInterceptor$1$1.onClose(CensusTracingModule.java:397) at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:459) at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:63) at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:546) at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$600(ClientCallImpl.java:467) at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:584) at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123) at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:834) Caused by: io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 9999419276ns at io.grpc.Status.asRuntimeException(Status.java:526) ... 22 more

any idea?

mkuthan commented 5 years ago

Any updates?

vthornheart-bng commented 4 years ago

I'm getting a similar error - the data is making it to Google Pub/Sub successfully, but I keep seeing it repeat this error over and over. I can send one message to the Kafka topic, watch it make it to Google Pub/Sub, and then watch it churn on this same error over and over for ~20 minutes.

I also see the following as it's churning through repeating this error, if it helps for the bug:

[2020-05-04 22:58:09,966] ERROR WorkerSinkTask{id=X} Offset commit failed, rewinding to last committed offsets (org.apache.kafka.connect.runtime.WorkerSinkTask) java.lang.RuntimeException: java.util.concurrent.ExecutionException: com.google.api.gax.rpc.DeadlineExceededException: io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 9999389985ns at com.google.pubsub.kafka.sink.CloudPubSubSinkTask.flush(CloudPubSubSinkTask.java:324) at org.apache.kafka.connect.sink.SinkTask.preCommit(SinkTask.java:125)

vthornheart-bng commented 4 years ago

Hmm, interesting side effect of this is that it actually ends up sending the same message multiple times to GCP. Which I suppose makes sense: from the nature of this error message, it sounds like it's repeating sending the message because it never got a response from Google Pub/Sub.