airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com

Airbyte is not stable in k8s env #38853

Open sivankumar86 opened 1 month ago

sivankumar86 commented 1 month ago

Topic

Airbyte syncs stop after some time

Relevant information

Hi team, I am using the Helm chart to deploy Airbyte on EKS. Sync jobs stop running after some time, and the problem goes away once we restart the worker pods. It looks like the worker is not releasing some resource. Let me know if you need more details.

Versions:

Helm: 0.94.x
Airbyte: 0.61.x

Env:

io.airbyte.workers.exception.WorkerException: Failed to create pod for check step
    at io.airbyte.workers.process.KubeProcessFactory.create(KubeProcessFactory.java:197) ~[io.airbyte-airbyte-commons-worker-0.60.1.jar:?]
    at io.airbyte.workers.process.AirbyteIntegrationLauncher.check(AirbyteIntegrationLauncher.java:149) ~[io.airbyte-airbyte-commons-worker-0.60.1.jar:?]
    at io.airbyte.workers.general.DefaultCheckConnectionWorker.run(DefaultCheckConnectionWorker.java:71) ~[io.airbyte-airbyte-commons-worker-0.60.1.jar:?]
    at io.airbyte.workers.general.DefaultCheckConnectionWorker.run(DefaultCheckConnectionWorker.java:44) ~[io.airbyte-airbyte-commons-worker-0.60.1.jar:?]
    at io.airbyte.workers.temporal.TemporalAttemptExecution.get(TemporalAttemptExecution.java:142) ~[io.airbyte-airbyte-workers-0.60.1.jar:?]
    at io.airbyte.workers.temporal.check.connection.CheckConnectionActivityImpl.lambda$runWithJobOutput$1(CheckConnectionActivityImpl.java:226) ~[io.airbyte-airbyte-workers-0.60.1.jar:?]
    at io.airbyte.commons.temporal.HeartbeatUtils.withBackgroundHeartbeat(HeartbeatUtils.java:57) ~[io.airbyte-airbyte-commons-temporal-core-0.60.1.jar:?]
    at io.airbyte.workers.temporal.check.connection.CheckConnectionActivityImpl.runWithJobOutput(CheckConnectionActivityImpl.java:211) ~[io.airbyte-airbyte-workers-0.60.1.jar:?]
    at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103) ~[?:?]
    at java.base/java.lang.reflect.Method.invoke(Method.java:580) ~[?:?]
    at io.temporal.internal.activity.RootActivityInboundCallsInterceptor$POJOActivityInboundCallsInterceptor.executeActivity(RootActivityInboundCallsInterceptor.java:64) ~[temporal-sdk-1.22.3.jar:?]
    at io.temporal.internal.activity.RootActivityInboundCallsInterceptor.execute(RootActivityInboundCallsInterceptor.java:43) ~[temporal-sdk-1.22.3.jar:?]
    at io.temporal.internal.activity.ActivityTaskExecutors$BaseActivityTaskExecutor.execute(ActivityTaskExecutors.java:107) ~[temporal-sdk-1.22.3.jar:?]
    at io.temporal.internal.activity.ActivityTaskHandlerImpl.handle(ActivityTaskHandlerImpl.java:124) ~[temporal-sdk-1.22.3.jar:?]
    at io.temporal.internal.worker.ActivityWorker$TaskHandlerImpl.handleActivity(ActivityWorker.java:278) ~[temporal-sdk-1.22.3.jar:?]
    at io.temporal.internal.worker.ActivityWorker$TaskHandlerImpl.handle(ActivityWorker.java:243) ~[temporal-sdk-1.22.3.jar:?]
    at io.temporal.internal.worker.ActivityWorker$TaskHandlerImpl.handle(ActivityWorker.java:216) ~[temporal-sdk-1.22.3.jar:?]
    at io.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:105) ~[temporal-sdk-1.22.3.jar:?]
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
    at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
Caused by: java.lang.NullPointerException: Cannot invoke "java.lang.Integer.intValue()" because the return value of "io.airbyte.workers.process.KubePortManagerSingleton.take()" is null
    at io.airbyte.workers.process.KubeProcessFactory.create(KubeProcessFactory.java:139) ~[io.airbyte-airbyte-commons-worker-0.60.1.jar:?]

c810ba10_3e93_4c4c_976f_8605746e4520_job_639176_attempt_1_txt.log
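
The `Caused by` line in the trace above says `KubePortManagerSingleton.take()` returned null, and unboxing that null `Integer` raised the `NullPointerException`. One way this could happen is a fixed pool of ports that runs dry when failed attempts never hand their port back. The sketch below only illustrates that idea: `PortPoolSketch`, `PortPool`, `runFailingAttempt`, the port range 9000 to 9004, and the non-blocking `poll()` are all assumptions made for this example, not Airbyte's actual implementation.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical illustration only; these classes are NOT Airbyte's code.
class PortPoolSketch {

    // A fixed-size pool of ports, handed out one at a time.
    static final class PortPool {
        private final BlockingQueue<Integer> ports = new LinkedBlockingQueue<>();

        PortPool(int firstPort, int count) {
            for (int i = 0; i < count; i++) {
                ports.add(firstPort + i);
            }
        }

        // Returns null when the pool is empty (non-blocking poll).
        Integer take() {
            return ports.poll();
        }

        void release(Integer port) {
            ports.offer(port);
        }

        int available() {
            return ports.size();
        }
    }

    // Simulates one attempt that always fails after taking a port.
    // Returns true if a port could be obtained, false if the pool was empty.
    static boolean runFailingAttempt(PortPool pool, boolean releaseOnFailure) {
        Integer port = pool.take();
        if (port == null) {
            // Writing `int p = pool.take();` here instead would throw a
            // NullPointerException while unboxing the null Integer.
            return false;
        }
        try {
            throw new IllegalStateException("simulated sync failure");
        } catch (IllegalStateException e) {
            return true;
        } finally {
            if (releaseOnFailure) {
                pool.release(port); // hand the port back even though the attempt failed
            }
        }
    }

    public static void main(String[] args) {
        PortPool leaky = new PortPool(9000, 5); // assumed pool size of 5
        for (int attempt = 1; attempt <= 10; attempt++) {
            boolean gotPort = runFailingAttempt(leaky, false);
            System.out.println("attempt " + attempt + " got a port: " + gotPort);
        }
        System.out.println("ports left in leaky pool: " + leaky.available());
    }
}
```

If the real cause is similar, returning the port in a `finally` block (the `releaseOnFailure = true` path here) would keep the pool from draining. That would need to be confirmed against the actual worker code.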

sivankumar86 commented 1 month ago

I think failed jobs are not releasing a resource, but I am not sure. Steps to reproduce:

  1. Create a sync that fails.
  2. Set up a k8s environment with only one worker and the default config (max 5).
  3. Run the sync and make sure it fails more than 5 times (~10 times).
  4. Now all syncs fail with an "unable to create a pod" error.
  5. Restart the worker pod; sync jobs then start running again.

marcosmarxm commented 1 month ago

Thanks for reporting the issue, @sivankumar86. I've looped in the platform team for further investigation.