airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com
Other
15.87k stars 4.07k forks source link

[platform] Container Orchestrator Intermittent WorkerException #34590

Open sivankumar86 opened 8 months ago

sivankumar86 commented 8 months ago

Helm Chart Version

0.50.4

What step the error happened?

None

Revelant information

Worker jobs intermittently failed and resolved after couple of hours . c810ba10_3e93_4c4c_976f_8605746e4520_job_509241_attempt_5_txt.log

Relevant log output

2024-01-17 02:25:28 ERROR i.a.w.t.s.a.AppendToAttemptLogActivityImpl(log):54 - Failing job: 509241, reason: Job failed after too many retries for connection b99dbe2c-8e26-4324-9fcb-3bf7e2580b0d 2024-01-17 02:30:27 ERROR i.a.w.g.DefaultCheckConnectionWorker(run):133 - Unexpected error while checking connection: io.airbyte.workers.exception.WorkerException: Failed to create pod for check step at io.airbyte.workers.process.KubeProcessFactory.create(KubeProcessFactory.java:188) ~[io.airbyte-airbyte-commons-worker-0.50.34.jar:?] at io.airbyte.workers.process.AirbyteIntegrationLauncher.check(AirbyteIntegrationLauncher.java:143) ~[io.airbyte-airbyte-commons-worker-0.50.34.jar:?] at io.airbyte.workers.general.DefaultCheckConnectionWorker.run(DefaultCheckConnectionWorker.java:71) ~[io.airbyte-airbyte-commons-worker-0.50.34.jar:?] at io.airbyte.workers.general.DefaultCheckConnectionWorker.run(DefaultCheckConnectionWorker.java:44) ~[io.airbyte-airbyte-commons-worker-0.50.34.jar:?] at io.airbyte.workers.temporal.TemporalAttemptExecution.get(TemporalAttemptExecution.java:135) ~[io.airbyte-airbyte-workers-0.50.34.jar:?] at io.airbyte.workers.temporal.check.connection.CheckConnectionActivityImpl.lambda$runWithJobOutput$1(CheckConnectionActivityImpl.java:133) ~[io.airbyte-airbyte-workers-0.50.34.jar:?] at io.airbyte.commons.temporal.HeartbeatUtils.withBackgroundHeartbeat(HeartbeatUtils.java:57) ~[io.airbyte-airbyte-commons-temporal-core-0.50.34.jar:?] at io.airbyte.workers.temporal.check.connection.CheckConnectionActivityImpl.runWithJobOutput(CheckConnectionActivityImpl.java:118) ~[io.airbyte-airbyte-workers-0.50.34.jar:?] at jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) ~[?:?] at java.lang.reflect.Method.invoke(Method.java:578) ~[?:?] at io.temporal.internal.activity.RootActivityInboundCallsInterceptor$POJOActivityInboundCallsInterceptor.executeActivity(RootActivityInboundCallsInterceptor.java:64) ~[temporal-sdk-1.17.0.jar:?] at io.temporal.internal.activity.RootActivityInboundCallsInterceptor.execute(RootActivityInboundCallsInterceptor.java:43) ~[temporal-sdk-1.17.0.jar:?] at io.temporal.internal.activity.ActivityTaskExecutors$BaseActivityTaskExecutor.execute(ActivityTaskExecutors.java:95) ~[temporal-sdk-1.17.0.jar:?] at io.temporal.internal.activity.ActivityTaskHandlerImpl.handle(ActivityTaskHandlerImpl.java:92) ~[temporal-sdk-1.17.0.jar:?] at io.temporal.internal.worker.ActivityWorker$TaskHandlerImpl.handleActivity(ActivityWorker.java:241) ~[temporal-sdk-1.17.0.jar:?] at io.temporal.internal.worker.ActivityWorker$TaskHandlerImpl.handle(ActivityWorker.java:206) ~[temporal-sdk-1.17.0.jar:?] at io.temporal.internal.worker.ActivityWorker$TaskHandlerImpl.handle(ActivityWorker.java:179) ~[temporal-sdk-1.17.0.jar:?] at io.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:93) ~[temporal-sdk-1.17.0.jar:?] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?] at java.lang.Thread.run(Thread.java:1589) ~[?:?] Caused by: java.lang.NullPointerException: Cannot invoke "java.lang.Integer.intValue()" because the return value of "io.airbyte.workers.process.KubePortManagerSingleton.take()" is null at io.airbyte.workers.process.KubeProcessFactory.create(KubeProcessFactory.java:131) ~[io.airbyte-airbyte-commons-worker-0.50.34.jar:?] ... 20 more

Reproduce steps (not sure):

  1. Run many sync jobs in parallel/seq and make sure some jobs are failing during sync
  2. after sometime, you may face above exception
  3. restart worker would resolve the issue . looking for some insight on this .

airbyte setup :

1 . 0.50.34 version

  1. 15 workers with 40 ports open and 4 GB memory
  2. Running on orchestration mode
  3. 20-30 jobs every hours . mssql to snowflake and gitlab to snowflake
marcosmarxm commented 8 months ago

Please, you need to share more information about what is your deployment. What steps you executed until reached the problem for other people to reproduce.

sivankumar86 commented 8 months ago

@marcosmarxm added as much as possible . let me know if you need more details