Closed — kzvezdarov closed this issue 5 months ago
Thank you for opening this and all of the details @kzvezdarov. The team is currently digging in, and there's a possibility that the error message being thrown is a red herring.
We're going to release a change in the coming days which at minimum should help identify what the root cause is under the hood. We'll follow up when the newest release is available.
Sounds great @Hesperide, looking forward to it! Happy to provide any additional information that would help with diagnosing this.
@kzvezdarov We have released version 0.50.42, which contains a fix for the NullPointerException. Please test this out at your earliest convenience to see if it fixes the issue or exposes the underlying cause.
Thanks @jdpgrailsdev, that's awesome. I've pushed it to our cluster, I'll update here as soon as I've gathered some data.
Belated update after running 0.50.42 for about a week:
The primary error which was unmasked was an activity timeout:
Activity with activityType='RunWithJobOutput' failed: 'Activity task timed out'. scheduledEventId=12, startedEventId=13, activityId=b1b7b9f0-8c29-3f95-b753-f8a0c090130b, identity='', retryState=RETRY_STATE_NON_RETRYABLE_FAILURE
We had already bumped ACTIVITY_MAX_TIMEOUT_SECOND to 300s (the default appears to be 120s); increasing this to 600s, and later 1800s, resolved the issue, bringing us back to ~1.11 attempts per sync, down from 2.68 pre-upgrade.
Worth noting that increasing the timeout value is not sufficient on its own - running that configuration on 0.50.38 did not improve stability at all.
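For reference, a minimal sketch of how the timeout can be raised via the Helm values file. The `worker.extraEnv` passthrough is an assumption on my part — key names vary across chart versions, so verify against the chart you deploy:

```yaml
# values.yaml (sketch — assumes the airbyte Helm chart exposes an
# extraEnv passthrough on the worker; check your chart version)
worker:
  extraEnv:
    - name: ACTIVITY_MAX_TIMEOUT_SECOND
      value: "1800"  # default is reportedly 120s; 300s was still not enough here
```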
Thanks a ton for the fix!
Closed as it was fixed in version 0.50.42
ACTIVITY_MAX_TIMEOUT_SECOND doesn't seem to be a valid config in 0.64.7 (I am facing similar issues)
Is activityMaxDelayBetweenAttemptsSeconds / ACTIVITY_MAX_DELAY_BETWEEN_ATTEMPTS_SECONDS the replacement for that?
Helm Chart Version
0.50.17
What step the error happened?
During the Sync
Relevant information
After upgrading from Airbyte 0.44.4 to 0.50.33 (and later through the patch releases to 0.50.38) we’ve noticed a significant amount of sync instability. This manifests as sync jobs needing multiple attempts to succeed (previously most syncs succeeded right away) and an elevated failure rate.

We’ve deployed Airbyte to a GKE Autopilot cluster, using manifests rendered from the latest (0.50.17) Helm chart. All of our connections use a standard source connector - e.g. Salesforce, Hubspot - and a custom destination connector that writes to our internal API.

Prior to the upgrade, on 0.44.4:

After the upgrade to 0.50.33:

To put that into a visual perspective, here is a heatmap of each daily sync per connection over the lifetime of our deployment. The cells correspond to individual sync attempts, shaded on a scale of -5 to 5 to represent attempt counts for failed/successful syncs, with 0 corresponding to no sync for that particular connection (e.g. -5 represents a sync that failed on the 5th attempt, whereas 2 represents a sync that succeeded on the 2nd attempt).

Our deployment was updated on Nov. 21st - the change in sync reliability is immediately visible.
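The signed shading described above is straightforward to reproduce; here is a minimal sketch in Python (the function and its signature are my own illustration, not part of the original charting tooling):

```python
def heatmap_score(attempts: int, succeeded: bool) -> int:
    """Encode one connection-day as a signed attempt count:
    positive N = succeeded on the Nth attempt, negative N = failed
    after N attempts, 0 = no sync ran. Clamped to the [-5, 5] scale."""
    if attempts == 0:
        return 0  # no sync for this connection that day
    score = attempts if succeeded else -attempts
    return max(-5, min(5, score))
```

A connection that succeeds first try every day then renders as a row of 1s, so the post-upgrade shift toward 2s and negative values stands out immediately.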
Relevant log output
From the job logs we can see that two errors dominate attempt failures for syncs after the upgrade:
and
Finally, inspecting the failed pods shows that the main, call-heartbeat-server, and most often remote-stdin containers are all in an error state. The following is the entirety of the container logs:

Job attempt log: job_attempt.txt
The Airbyte configuration looks like so (with some private resource names omitted):