airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com

Sync job hangs for days #38345

Open EJSohn opened 4 months ago

EJSohn commented 4 months ago

Topic

Sync job hangs for days

Relevant information

Airbyte version: 0.57.2
Helm chart version: 0.64.151

Sometimes one or more sync jobs hang for days and never restart or fail. This has happened for various connections, including PagerDuty and MySQL. I found the following error logs in the worker pod; no sync or destination pods were created for the problematic connection.

It seems like a bug and happens once every week or two. Would it vanish if I updated Airbyte to the latest version?

Thanks.

2024-05-20 08:38:25 WARN i.t.i.r.ReplayWorkflowTaskHandler(failureToWFTResult):279 - Workflow task processing failure. startedEventId=25, WorkflowId=7087f967-c53a-4f37-9b0d-1ef126398539, RunId=7f3752bd-1562-4ed8-bbcd-51662a6425fb. If seen continuously the workflow might be stuck.
io.temporal.internal.statemachines.InternalWorkflowTaskException: Failure handling event 25 of type 'EVENT_TYPE_WORKFLOW_TASK_STARTED' during execution. {WorkflowTaskStartedEventId=25, CurrentStartedEventId=25}
        at io.temporal.internal.statemachines.WorkflowStateMachines.createEventProcessingException(WorkflowStateMachines.java:373) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.statemachines.WorkflowStateMachines.handleEventsBatch(WorkflowStateMachines.java:297) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.statemachines.WorkflowStateMachines.handleEvent(WorkflowStateMachines.java:260) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.applyServerHistory(ReplayWorkflowRunTaskHandler.java:249) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTaskImpl(ReplayWorkflowRunTaskHandler.java:231) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTask(ReplayWorkflowRunTaskHandler.java:165) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTaskWithQuery(ReplayWorkflowTaskHandler.java:133) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTask(ReplayWorkflowTaskHandler.java:98) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handleTask(WorkflowWorker.java:413) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:320) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:261) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:105) ~[temporal-sdk-1.22.3.jar:?]
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
        at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
Caused by: java.lang.RuntimeException: WorkflowTask: failure executing SCHEDULED->WORKFLOW_TASK_STARTED, transition history is [CREATED->WORKFLOW_TASK_SCHEDULED]
        at io.temporal.internal.statemachines.StateMachine.executeTransition(StateMachine.java:163) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.statemachines.StateMachine.handleHistoryEvent(StateMachine.java:103) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.statemachines.EntityStateMachineBase.handleEvent(EntityStateMachineBase.java:84) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.statemachines.WorkflowStateMachines.handleSingleEvent(WorkflowStateMachines.java:419) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.statemachines.WorkflowStateMachines.handleEventsBatch(WorkflowStateMachines.java:295) ~[temporal-sdk-1.22.3.jar:?]
        ... 13 more
Caused by: java.lang.NullPointerException: Cannot invoke "java.time.Duration.toMinutes()" because the return value of "io.airbyte.workers.temporal.check.connection.CheckConnectionActivity.getCheckConnectionTimeout()" is null
        at io.airbyte.workers.temporal.check.connection.CheckConnectionWorkflowImpl.getFailureReason(CheckConnectionWorkflowImpl.java:106) ~[io.airbyte-airbyte-workers-0.57.2.jar:?]
        at io.airbyte.workers.temporal.check.connection.CheckConnectionWorkflowImpl.run(CheckConnectionWorkflowImpl.java:67) ~[io.airbyte-airbyte-workers-0.57.2.jar:?]
        at CheckConnectionWorkflowImplProxy.run$accessor$Xx4G3cq1(Unknown Source) ~[?:?]
        at CheckConnectionWorkflowImplProxy$auxiliary$YngYNytd.call(Unknown Source) ~[?:?]
        at io.airbyte.micronaut.temporal.TemporalActivityStubInterceptor.execute(TemporalActivityStubInterceptor.java:79) ~[io.airbyte-airbyte-micronaut-temporal-0.57.2.jar:?]
        at CheckConnectionWorkflowImplProxy.run(Unknown Source) ~[?:?]
        at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103) ~[?:?]
        at java.base/java.lang.reflect.Method.invoke(Method.java:580) ~[?:?]
        at io.temporal.internal.sync.POJOWorkflowImplementationFactory$POJOWorkflowImplementation$RootWorkflowInboundCallsInterceptor.execute(POJOWorkflowImplementationFactory.java:339) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.sync.POJOWorkflowImplementationFactory$POJOWorkflowImplementation.execute(POJOWorkflowImplementationFactory.java:314) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.sync.WorkflowExecutionHandler.runWorkflowMethod(WorkflowExecutionHandler.java:70) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.sync.SyncWorkflow.lambda$start$0(SyncWorkflow.java:135) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.sync.CancellationScopeImpl.run(CancellationScopeImpl.java:102) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.sync.WorkflowThreadImpl$RunnableWrapper.run(WorkflowThreadImpl.java:107) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.worker.ActiveThreadReportingExecutor.lambda$submit$0(ActiveThreadReportingExecutor.java:53) ~[temporal-sdk-1.22.3.jar:?]
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572) ~[?:?]
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317) ~[?:?]
        ... 3 more
2024-05-20 08:38:25 ERROR i.t.i.w.PollerOptions$Builder(lambda$build$0):168 - uncaught exception
java.lang.RuntimeException: Failure processing workflow task. WorkflowId=7087f967-c53a-4f37-9b0d-1ef126398539, RunId=7f3752bd-1562-4ed8-bbcd-51662a6425fb, Attempt=1071
        at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.wrapFailure(WorkflowWorker.java:404) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.wrapFailure(WorkflowWorker.java:261) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:110) ~[temporal-sdk-1.22.3.jar:?]
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
        at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
Caused by: io.temporal.internal.statemachines.InternalWorkflowTaskException: Failure handling event 25 of type 'EVENT_TYPE_WORKFLOW_TASK_STARTED' during execution. {WorkflowTaskStartedEventId=25, CurrentStartedEventId=25}
        at io.temporal.internal.statemachines.WorkflowStateMachines.createEventProcessingException(WorkflowStateMachines.java:373) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.statemachines.WorkflowStateMachines.handleEventsBatch(WorkflowStateMachines.java:297) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.statemachines.WorkflowStateMachines.handleEvent(WorkflowStateMachines.java:260) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.applyServerHistory(ReplayWorkflowRunTaskHandler.java:249) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTaskImpl(ReplayWorkflowRunTaskHandler.java:231) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTask(ReplayWorkflowRunTaskHandler.java:165) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTaskWithQuery(ReplayWorkflowTaskHandler.java:133) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTask(ReplayWorkflowTaskHandler.java:98) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handleTask(WorkflowWorker.java:413) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:320) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:261) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:105) ~[temporal-sdk-1.22.3.jar:?]
        ... 3 more
Caused by: java.lang.RuntimeException: WorkflowTask: failure executing SCHEDULED->WORKFLOW_TASK_STARTED, transition history is [CREATED->WORKFLOW_TASK_SCHEDULED]
        at io.temporal.internal.statemachines.StateMachine.executeTransition(StateMachine.java:163) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.statemachines.StateMachine.handleHistoryEvent(StateMachine.java:103) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.statemachines.EntityStateMachineBase.handleEvent(EntityStateMachineBase.java:84) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.statemachines.WorkflowStateMachines.handleSingleEvent(WorkflowStateMachines.java:419) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.statemachines.WorkflowStateMachines.handleEventsBatch(WorkflowStateMachines.java:295) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.statemachines.WorkflowStateMachines.handleEvent(WorkflowStateMachines.java:260) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.applyServerHistory(ReplayWorkflowRunTaskHandler.java:249) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTaskImpl(ReplayWorkflowRunTaskHandler.java:231) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTask(ReplayWorkflowRunTaskHandler.java:165) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTaskWithQuery(ReplayWorkflowTaskHandler.java:133) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTask(ReplayWorkflowTaskHandler.java:98) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handleTask(WorkflowWorker.java:413) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:320) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:261) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:105) ~[temporal-sdk-1.22.3.jar:?]
        ... 3 more
Caused by: java.lang.NullPointerException: Cannot invoke "java.time.Duration.toMinutes()" because the return value of "io.airbyte.workers.temporal.check.connection.CheckConnectionActivity.getCheckConnectionTimeout()" is null
        at io.airbyte.workers.temporal.check.connection.CheckConnectionWorkflowImpl.getFailureReason(CheckConnectionWorkflowImpl.java:106) ~[io.airbyte-airbyte-workers-0.57.2.jar:?]
        at io.airbyte.workers.temporal.check.connection.CheckConnectionWorkflowImpl.run(CheckConnectionWorkflowImpl.java:67) ~[io.airbyte-airbyte-workers-0.57.2.jar:?]
        at CheckConnectionWorkflowImplProxy.run$accessor$Xx4G3cq1(Unknown Source) ~[?:?]
        at CheckConnectionWorkflowImplProxy$auxiliary$YngYNytd.call(Unknown Source) ~[?:?]
        at io.airbyte.micronaut.temporal.TemporalActivityStubInterceptor.execute(TemporalActivityStubInterceptor.java:79) ~[io.airbyte-airbyte-micronaut-temporal-0.57.2.jar:?]
        at CheckConnectionWorkflowImplProxy.run(Unknown Source) ~[?:?]
        at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103) ~[?:?]
        at java.base/java.lang.reflect.Method.invoke(Method.java:580) ~[?:?]
        at io.temporal.internal.sync.POJOWorkflowImplementationFactory$POJOWorkflowImplementation$RootWorkflowInboundCallsInterceptor.execute(POJOWorkflowImplementationFactory.java:339) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.sync.POJOWorkflowImplementationFactory$POJOWorkflowImplementation.execute(POJOWorkflowImplementationFactory.java:314) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.sync.WorkflowExecutionHandler.runWorkflowMethod(WorkflowExecutionHandler.java:70) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.sync.SyncWorkflow.lambda$start$0(SyncWorkflow.java:135) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.sync.CancellationScopeImpl.run(CancellationScopeImpl.java:102) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.internal.sync.WorkflowThreadImpl$RunnableWrapper.run(WorkflowThreadImpl.java:107) ~[temporal-sdk-1.22.3.jar:?]
        at io.temporal.worker.ActiveThreadReportingExecutor.lambda$submit$0(ActiveThreadReportingExecutor.java:53) ~[temporal-sdk-1.22.3.jar:?]
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572) ~[?:?]
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317) ~[?:?]
        ... 3 more
marcosmarxm commented 4 months ago

Would it vanish if I updated Airbyte to the latest version?

The latest version of Airbyte has a heartbeat timeout that should prevent this from happening.

@davinchia can you confirm?
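
For readers following the trace above: the innermost cause is a NullPointerException raised when `Duration.toMinutes()` is called on the null value returned by `getCheckConnectionTimeout()`, and `Attempt=1071` suggests the same workflow task kept failing and being retried rather than the job failing outright. The snippet below is only a minimal sketch of a null-safe way to format such a timeout; it is not Airbyte's code or its actual fix. The class name `CheckTimeoutGuard`, the method `timeoutMessage`, the message text, and the 5-minute default are all hypothetical and chosen for illustration.

```java
import java.time.Duration;
import java.util.Optional;

public class CheckTimeoutGuard {
    // Hypothetical fallback value; not taken from Airbyte's configuration.
    static final Duration DEFAULT_TIMEOUT = Duration.ofMinutes(5);

    // Builds a failure message without dereferencing a null Duration.
    static String timeoutMessage(Duration configuredTimeout) {
        // Fall back to the default when the configured timeout is missing (null).
        Duration effective = Optional.ofNullable(configuredTimeout).orElse(DEFAULT_TIMEOUT);
        return "Check connection timed out after " + effective.toMinutes() + " minutes";
    }

    public static void main(String[] args) {
        // A null timeout no longer throws; the assumed default is used instead.
        System.out.println(timeoutMessage(null));
        System.out.println(timeoutMessage(Duration.ofMinutes(10)));
    }
}
```

With a guard of this kind, a missing timeout would produce a readable message instead of an unchecked exception thrown from inside the workflow method.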