
[helm] Pod creation fails with timeout #44443

Open · jnatten opened this issue 3 months ago

jnatten commented 3 months ago

Helm Chart Version

0.445.3

What step the error happened?

During the Sync

Relevant information

When running a sync job, creating a new destination, or doing anything else that spawns a new pod, the frontend complains about an unknown error (HTTP 504) and the log below appears.

I have a similar test cluster with the exact same configuration that works just fine, and I have also attempted a completely fresh Airbyte install in a new namespace.

Running on AWS EKS, if it matters.

Any suggestions on how to fix it or how I should continue debugging would be greatly appreciated!
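A quick way to rule out basic networking is to probe the in-cluster API endpoint that the workload launcher also talks to. A minimal sketch, assuming the default airbyte namespace (the pod name and image here are illustrative, not part of the chart):

```sh
# Sketch only: launches a throwaway curl pod and probes the in-cluster
# API server endpoint. Even an HTTP error code proves the network path
# is open; a hang would reproduce the timeout seen in the trace below.
kubectl -n airbyte run apiserver-check --rm -i --restart=Never \
  --image=curlimages/curl --command -- \
  curl -sk --max-time 10 https://kubernetes.default.svc/version
```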

Relevant log output

2024-08-20 09:24:38 ERROR i.a.w.l.p.h.FailureHandler(apply):39 - Pipeline Error
io.airbyte.workload.launcher.pipeline.stages.model.StageError: io.airbyte.workload.launcher.pods.KubeClientException: Failed to create pod source-file-check-b43bf659-7773-4cf5-b204-8c37bd657c20-0-izuis.
  at io.airbyte.workload.launcher.pipeline.stages.model.Stage.apply(Stage.kt:46) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pipeline.stages.LaunchPodStage.apply(LaunchPodStage.kt:38) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pipeline.stages.$LaunchPodStage$Definition$Intercepted.$$access$$apply(Unknown Source) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pipeline.stages.$LaunchPodStage$Definition$Exec.dispatch(Unknown Source) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.micronaut.context.AbstractExecutableMethodsDefinition$DispatchedExecutableMethod.invoke(AbstractExecutableMethodsDefinition.java:456) ~[micronaut-inject-4.5.4.jar:4.5.4]
  at io.micronaut.aop.chain.MethodInterceptorChain.proceed(MethodInterceptorChain.java:129) ~[micronaut-aop-4.5.4.jar:4.5.4]
  at io.airbyte.metrics.interceptors.InstrumentInterceptorBase.doIntercept(InstrumentInterceptorBase.kt:61) ~[io.airbyte.airbyte-metrics-metrics-lib-0.63.18.jar:?]
  at io.airbyte.metrics.interceptors.InstrumentInterceptorBase.intercept(InstrumentInterceptorBase.kt:44) ~[io.airbyte.airbyte-metrics-metrics-lib-0.63.18.jar:?]
  at io.micronaut.aop.chain.MethodInterceptorChain.proceed(MethodInterceptorChain.java:138) ~[micronaut-aop-4.5.4.jar:4.5.4]
  at io.airbyte.workload.launcher.pipeline.stages.$LaunchPodStage$Definition$Intercepted.apply(Unknown Source) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pipeline.stages.LaunchPodStage.apply(LaunchPodStage.kt:24) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:132) ~[reactor-core-3.6.8.jar:3.6.8]
  at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:158) ~[reactor-core-3.6.8.jar:3.6.8]
  at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:158) ~[reactor-core-3.6.8.jar:3.6.8]
  at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:158) ~[reactor-core-3.6.8.jar:3.6.8]
  at reactor.core.publisher.Operators$ScalarSubscription.request(Operators.java:2571) ~[reactor-core-3.6.8.jar:3.6.8]
  at reactor.core.publisher.MonoFlatMap$FlatMapMain.request(MonoFlatMap.java:194) ~[reactor-core-3.6.8.jar:3.6.8]
  at reactor.core.publisher.MonoFlatMap$FlatMapMain.request(MonoFlatMap.java:194) ~[reactor-core-3.6.8.jar:3.6.8]
  at reactor.core.publisher.MonoFlatMap$FlatMapMain.request(MonoFlatMap.java:194) ~[reactor-core-3.6.8.jar:3.6.8]
  at reactor.core.publisher.MonoFlatMap$FlatMapMain.request(MonoFlatMap.java:194) ~[reactor-core-3.6.8.jar:3.6.8]
  at reactor.core.publisher.Operators$MultiSubscriptionSubscriber.set(Operators.java:2367) ~[reactor-core-3.6.8.jar:3.6.8]
  at reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber.onSubscribe(FluxOnErrorResume.java:74) ~[reactor-core-3.6.8.jar:3.6.8]
  at reactor.core.publisher.MonoFlatMap$FlatMapMain.onSubscribe(MonoFlatMap.java:117) ~[reactor-core-3.6.8.jar:3.6.8]
  at reactor.core.publisher.MonoFlatMap$FlatMapMain.onSubscribe(MonoFlatMap.java:117) ~[reactor-core-3.6.8.jar:3.6.8]
  at reactor.core.publisher.MonoFlatMap$FlatMapMain.onSubscribe(MonoFlatMap.java:117) ~[reactor-core-3.6.8.jar:3.6.8]
  at reactor.core.publisher.MonoFlatMap$FlatMapMain.onSubscribe(MonoFlatMap.java:117) ~[reactor-core-3.6.8.jar:3.6.8]
  at reactor.core.publisher.FluxFlatMap.trySubscribeScalarMap(FluxFlatMap.java:193) ~[reactor-core-3.6.8.jar:3.6.8]
  at reactor.core.publisher.MonoFlatMap.subscribeOrReturn(MonoFlatMap.java:53) ~[reactor-core-3.6.8.jar:3.6.8]
  at reactor.core.publisher.Mono.subscribe(Mono.java:4552) ~[reactor-core-3.6.8.jar:3.6.8]
  at reactor.core.publisher.MonoSubscribeOn$SubscribeOnSubscriber.run(MonoSubscribeOn.java:126) ~[reactor-core-3.6.8.jar:3.6.8]
  at reactor.core.scheduler.ImmediateScheduler$ImmediateSchedulerWorker.schedule(ImmediateScheduler.java:84) ~[reactor-core-3.6.8.jar:3.6.8]
  at reactor.core.publisher.MonoSubscribeOn.subscribeOrReturn(MonoSubscribeOn.java:55) ~[reactor-core-3.6.8.jar:3.6.8]
  at reactor.core.publisher.Mono.subscribe(Mono.java:4552) ~[reactor-core-3.6.8.jar:3.6.8]
  at reactor.core.publisher.Mono.subscribeWith(Mono.java:4634) ~[reactor-core-3.6.8.jar:3.6.8]
  at reactor.core.publisher.Mono.subscribe(Mono.java:4395) ~[reactor-core-3.6.8.jar:3.6.8]
  at io.airbyte.workload.launcher.pipeline.LaunchPipeline.accept(LaunchPipeline.kt:50) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pipeline.consumer.LauncherMessageConsumer.consume(LauncherMessageConsumer.kt:28) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pipeline.consumer.LauncherMessageConsumer.consume(LauncherMessageConsumer.kt:12) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.commons.temporal.queue.QueueActivityImpl.consume(Internal.kt:87) ~[io.airbyte-airbyte-commons-temporal-core-0.63.18.jar:?]
  at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103) ~[?:?]
  at java.base/java.lang.reflect.Method.invoke(Method.java:580) ~[?:?]
  at io.temporal.internal.activity.RootActivityInboundCallsInterceptor$POJOActivityInboundCallsInterceptor.executeActivity(RootActivityInboundCallsInterceptor.java:64) ~[temporal-sdk-1.22.3.jar:?]
  at io.temporal.internal.activity.RootActivityInboundCallsInterceptor.execute(RootActivityInboundCallsInterceptor.java:43) ~[temporal-sdk-1.22.3.jar:?]
  at io.temporal.common.interceptors.ActivityInboundCallsInterceptorBase.execute(ActivityInboundCallsInterceptorBase.java:39) ~[temporal-sdk-1.22.3.jar:?]
  at io.temporal.opentracing.internal.OpenTracingActivityInboundCallsInterceptor.execute(OpenTracingActivityInboundCallsInterceptor.java:78) ~[temporal-opentracing-1.22.3.jar:?]
  at io.temporal.internal.activity.ActivityTaskExecutors$BaseActivityTaskExecutor.execute(ActivityTaskExecutors.java:107) ~[temporal-sdk-1.22.3.jar:?]
  at io.temporal.internal.activity.ActivityTaskHandlerImpl.handle(ActivityTaskHandlerImpl.java:124) ~[temporal-sdk-1.22.3.jar:?]
  at io.temporal.internal.worker.ActivityWorker$TaskHandlerImpl.handleActivity(ActivityWorker.java:278) ~[temporal-sdk-1.22.3.jar:?]
  at io.temporal.internal.worker.ActivityWorker$TaskHandlerImpl.handle(ActivityWorker.java:243) ~[temporal-sdk-1.22.3.jar:?]
  at io.temporal.internal.worker.ActivityWorker$TaskHandlerImpl.handle(ActivityWorker.java:216) ~[temporal-sdk-1.22.3.jar:?]
  at io.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:105) ~[temporal-sdk-1.22.3.jar:?]
  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
  at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
Caused by: io.airbyte.workload.launcher.pods.KubeClientException: Failed to create pod source-file-check-b43bf659-7773-4cf5-b204-8c37bd657c20-0-izuis.
  at io.airbyte.workload.launcher.pods.KubePodClient.launchConnectorWithSidecar(KubePodClient.kt:287) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pods.KubePodClient.launchCheck(KubePodClient.kt:214) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pipeline.stages.LaunchPodStage.applyStage(LaunchPodStage.kt:44) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pipeline.stages.LaunchPodStage.applyStage(LaunchPodStage.kt:24) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pipeline.stages.model.Stage.apply(Stage.kt:42) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  ... 53 more
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Operation: [patch]  for kind: [Pod]  with name: [source-file-check-b43bf659-7773-4cf5-b204-8c37bd657c20-0-izuis]  in namespace: [airbyte]  failed.
  at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:159) ~[kubernetes-client-api-6.12.1.jar:?]
  at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.lambda$patch$2(HasMetadataOperation.java:233) ~[kubernetes-client-6.12.1.jar:?]
  at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.patch(HasMetadataOperation.java:236) ~[kubernetes-client-6.12.1.jar:?]
  at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.patch(HasMetadataOperation.java:251) ~[kubernetes-client-6.12.1.jar:?]
  at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.serverSideApply(BaseOperation.java:1179) ~[kubernetes-client-6.12.1.jar:?]
  at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.serverSideApply(BaseOperation.java:98) ~[kubernetes-client-6.12.1.jar:?]
  at io.airbyte.workload.launcher.pods.KubePodLauncher$create$1.invoke(KubePodLauncher.kt:57) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pods.KubePodLauncher$create$1.invoke(KubePodLauncher.kt:52) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pods.KubePodLauncher.runKubeCommand$lambda$0(KubePodLauncher.kt:307) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at dev.failsafe.Functions.lambda$toCtxSupplier$11(Functions.java:243) ~[failsafe-3.3.2.jar:3.3.2]
  at dev.failsafe.Functions.lambda$get$0(Functions.java:46) ~[failsafe-3.3.2.jar:3.3.2]
  at dev.failsafe.internal.RetryPolicyExecutor.lambda$apply$0(RetryPolicyExecutor.java:74) ~[failsafe-3.3.2.jar:3.3.2]
  at dev.failsafe.SyncExecutionImpl.executeSync(SyncExecutionImpl.java:187) ~[failsafe-3.3.2.jar:3.3.2]
  at dev.failsafe.FailsafeExecutor.call(FailsafeExecutor.java:376) ~[failsafe-3.3.2.jar:3.3.2]
  at dev.failsafe.FailsafeExecutor.get(FailsafeExecutor.java:112) ~[failsafe-3.3.2.jar:3.3.2]
  at io.airbyte.workload.launcher.pods.KubePodLauncher.runKubeCommand(KubePodLauncher.kt:307) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pods.KubePodLauncher.create(KubePodLauncher.kt:52) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pods.KubePodClient.launchConnectorWithSidecar(KubePodClient.kt:284) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pods.KubePodClient.launchCheck(KubePodClient.kt:214) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pipeline.stages.LaunchPodStage.applyStage(LaunchPodStage.kt:44) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pipeline.stages.LaunchPodStage.applyStage(LaunchPodStage.kt:24) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pipeline.stages.model.Stage.apply(Stage.kt:42) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  ... 53 more
Caused by: java.io.IOException: timeout
  at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:504) ~[kubernetes-client-6.12.1.jar:?]
  at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleResponse(OperationSupport.java:524) ~[kubernetes-client-6.12.1.jar:?]
  at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handlePatch(OperationSupport.java:419) ~[kubernetes-client-6.12.1.jar:?]
  at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handlePatch(OperationSupport.java:397) ~[kubernetes-client-6.12.1.jar:?]
  at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handlePatch(BaseOperation.java:764) ~[kubernetes-client-6.12.1.jar:?]
  at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.lambda$patch$2(HasMetadataOperation.java:231) ~[kubernetes-client-6.12.1.jar:?]
  at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.patch(HasMetadataOperation.java:236) ~[kubernetes-client-6.12.1.jar:?]
  at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.patch(HasMetadataOperation.java:251) ~[kubernetes-client-6.12.1.jar:?]
  at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.serverSideApply(BaseOperation.java:1179) ~[kubernetes-client-6.12.1.jar:?]
  at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.serverSideApply(BaseOperation.java:98) ~[kubernetes-client-6.12.1.jar:?]
  at io.airbyte.workload.launcher.pods.KubePodLauncher$create$1.invoke(KubePodLauncher.kt:57) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pods.KubePodLauncher$create$1.invoke(KubePodLauncher.kt:52) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pods.KubePodLauncher.runKubeCommand$lambda$0(KubePodLauncher.kt:307) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at dev.failsafe.Functions.lambda$toCtxSupplier$11(Functions.java:243) ~[failsafe-3.3.2.jar:3.3.2]
  at dev.failsafe.Functions.lambda$get$0(Functions.java:46) ~[failsafe-3.3.2.jar:3.3.2]
  at dev.failsafe.internal.RetryPolicyExecutor.lambda$apply$0(RetryPolicyExecutor.java:74) ~[failsafe-3.3.2.jar:3.3.2]
  at dev.failsafe.SyncExecutionImpl.executeSync(SyncExecutionImpl.java:187) ~[failsafe-3.3.2.jar:3.3.2]
  at dev.failsafe.FailsafeExecutor.call(FailsafeExecutor.java:376) ~[failsafe-3.3.2.jar:3.3.2]
  at dev.failsafe.FailsafeExecutor.get(FailsafeExecutor.java:112) ~[failsafe-3.3.2.jar:3.3.2]
  at io.airbyte.workload.launcher.pods.KubePodLauncher.runKubeCommand(KubePodLauncher.kt:307) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pods.KubePodLauncher.create(KubePodLauncher.kt:52) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pods.KubePodClient.launchConnectorWithSidecar(KubePodClient.kt:284) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pods.KubePodClient.launchCheck(KubePodClient.kt:214) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pipeline.stages.LaunchPodStage.applyStage(LaunchPodStage.kt:44) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pipeline.stages.LaunchPodStage.applyStage(LaunchPodStage.kt:24) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pipeline.stages.model.Stage.apply(Stage.kt:42) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  ... 53 more
Caused by: java.io.InterruptedIOException: timeout
  at okhttp3.internal.connection.RealCall.timeoutExit(RealCall.kt:398) ~[okhttp-4.12.0.jar:?]
  at okhttp3.internal.connection.RealCall.callDone(RealCall.kt:360) ~[okhttp-4.12.0.jar:?]
  at okhttp3.internal.connection.RealCall.noMoreExchanges$okhttp(RealCall.kt:325) ~[okhttp-4.12.0.jar:?]
  at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:209) ~[okhttp-4.12.0.jar:?]
  at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:517) ~[okhttp-4.12.0.jar:?]
  ... 3 more
Caused by: java.io.IOException: Canceled
  at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.kt:72) ~[okhttp-4.12.0.jar:?]
  at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109) ~[okhttp-4.12.0.jar:?]
  at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:201) ~[okhttp-4.12.0.jar:?]
  at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:517) ~[okhttp-4.12.0.jar:?]
  ... 3 more
2024-08-20 09:24:38 INFO i.a.w.l.c.WorkloadApiClient(updateStatusToFailed):54 - Attempting to update workload: 778daa7c-feaf-4db6-96f3-70fd645acc77_b43bf659-7773-4cf5-b204-8c37bd657c20_0_check to FAILED.
2024-08-20 09:24:38 INFO i.a.w.l.p.h.FailureHandler(apply):62 - Pipeline aborted after error for workload: 778daa7c-feaf-4db6-96f3-70fd645acc77_b43bf659-7773-4cf5-b204-8c37bd657c20_0_check.
jnatten commented 3 months ago

After some investigation, I figured out that the problem goes away if I add a rule to our security group that allows all TCP traffic from the control plane to the worker nodes.

I'm not sure why it is needed or why it worked without it previously, but this seems to solve the issue consistently for us for now. Is there a specific port that is needed?
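For reference, the rule described above might look like this with the AWS CLI; a sketch only, with hypothetical security group IDs:

```sh
# Hypothetical IDs: sg-CONTROLPLANE is the EKS cluster security group,
# sg-WORKERS the worker-node security group. This opens all TCP ports
# from the control plane to the workers, mirroring the workaround above.
aws ec2 authorize-security-group-ingress \
  --group-id sg-WORKERS \
  --protocol tcp \
  --port 0-65535 \
  --source-group sg-CONTROLPLANE
```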

marcosmarxm commented 3 months ago

@davinchia can you take a look at this issue?

davinchia commented 3 months ago

@jnatten Strange. Does your cluster have special security rules set up? We run Airbyte Cloud on EKS and have never seen this issue.

jnatten commented 3 months ago

I'm not sure if they are special, but the previous security group setup was something like this:

Worker node -> Cluster: 443
Worker node -> Worker node: 53, 1025-65535
Cluster -> Worker node: 443, 4443, 6443, 8443, 9443, 10250
Worker node -> outside world: all open

I think all of it comes from the Terraform EKS module, but I could be wrong about that.

After allowing all ports from the cluster to the worker nodes, it started working. I'm not sure if we need all of them or just some, but I don't think it's an issue for us to keep them open.
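If you later want to narrow the rule back down, one common reason the control plane dials into worker nodes on non-standard ports is admission webhooks; a hedged way to check which target ports are in play:

```sh
# Debugging pointer only, not a confirmed root cause for this issue:
# the API server must reach every admission webhook's service port on
# the worker nodes, which default node security groups may not allow.
kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations
# Then inspect each referenced Service to find its targetPort:
kubectl describe svc <webhook-service> -n <namespace>
```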

Elsayed91 commented 2 months ago

This happened to me after trying to upgrade a cluster. I had to helm uninstall and re-install, and then it worked fine.
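A sketch of that cycle, assuming the release and namespace are both named airbyte and the Airbyte chart repo is already added:

```sh
# Sketch only -- release/namespace/chart names are assumptions.
# Caution: if you rely on the chart's bundled database, uninstalling
# can delete your configuration; check persistence settings first.
helm -n airbyte uninstall airbyte
helm -n airbyte install airbyte airbyte/airbyte --version 0.445.3
```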