
[platform] Significant sync instability after upgrading from 0.44.4 to 0.50.38 #33693

Closed: kzvezdarov closed this issue 5 months ago

kzvezdarov commented 10 months ago

Helm Chart Version

0.50.17

What step the error happened?

During the Sync

Relevant information

After upgrading from Airbyte 0.44.4 to 0.50.33 (and later through the patch releases to 0.50.38) we’ve noticed significant sync instability: sync jobs now frequently need multiple attempts to succeed (previously most syncs succeeded on the first attempt), and the overall failure rate is elevated.

We’ve deployed Airbyte to a GKE Autopilot cluster, using manifests rendered from the latest (0.50.17) Helm chart. All of our connections use a standard source connector (e.g. Salesforce, Hubspot) and a custom destination connector that writes to our internal API.

Prior to the upgrade, on 0.44.4:

After the upgrade to 0.50.33:

To put that into visual perspective, here is a heatmap of each daily sync per connection over the lifetime of our deployment. Each cell is one sync for one connection, shaded on a scale of -5 to 5 by attempt count: negative values are failed syncs, positive values are successful ones, and 0 means no sync ran for that connection (e.g. -5 is a sync that failed on its 5th attempt, whereas 2 is a sync that succeeded on its 2nd attempt):

[Image: sync job status and attempts per connection (all time)]

Our deployment was updated on Nov. 21st - the change in sync reliability is immediately visible.
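For anyone who wants to build a similar view, the signed attempt count behind the heatmap can be pulled from Airbyte's internal jobs database. The sketch below is illustrative only: the jobs/attempts table and column names (scope, config_type, status, job_id) are assumptions based on recent platform schemas and may differ in your version, and the connection URL/credentials are placeholders taken from the DATABASE_URL in the configuration dump further down.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SyncAttemptHeatmap {

    // One row per sync job: attempt count, signed by the job's final status
    // (+n = succeeded on attempt n, -n = failed after n attempts). Table and
    // column names are assumptions based on recent Airbyte schemas.
    private static final String QUERY = """
        SELECT j.scope                          AS connection_id,
               date_trunc('day', j.created_at)  AS sync_day,
               CASE WHEN j.status::text = 'succeeded'
                    THEN  count(a.id)
                    ELSE -count(a.id) END       AS signed_attempts
        FROM jobs j
        JOIN attempts a ON a.job_id = j.id
        WHERE j.config_type::text = 'sync'
        GROUP BY j.id, j.scope, j.status
        ORDER BY sync_day, connection_id
        """;

    public static void main(String[] args) throws Exception {
        // Requires the PostgreSQL JDBC driver on the classpath; the URL mirrors
        // DATABASE_URL from the deployment configuration below, and the
        // credentials here are placeholders.
        String url = "jdbc:postgresql://airbyte-db-svc:5432/db-airbyte";
        try (Connection conn = DriverManager.getConnection(url, "airbyte", System.getenv("DATABASE_PASSWORD"));
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(QUERY)) {
            while (rs.next()) {
                System.out.printf("%s,%s,%d%n",
                        rs.getString("connection_id"),
                        rs.getTimestamp("sync_day"),
                        rs.getLong("signed_attempts"));
            }
        }
    }
}

Each output row is one sync job: a positive value is the attempt count of a job that eventually succeeded, a negative value one that ultimately failed.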

Relevant log output

From the job logs we can see that two errors dominate attempt failures for syncs after the upgrade:

java.lang.RuntimeException: java.lang.NullPointerException
 at io.airbyte.workers.general.ReplicationWorkerHelper.startSource(ReplicationWorkerHelper.kt:135)
 at io.airbyte.workers.general.BufferedReplicationWorker.lambda$run$1(BufferedReplicationWorker.java:168)
 at io.airbyte.workers.general.BufferedReplicationWorker.lambda$runAsync$2(BufferedReplicationWorker.java:236)
 at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1804)
 at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
 at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
 at java.base/java.lang.Thread.run(Thread.java:1589)
Caused by: java.lang.NullPointerException
 at java.base/java.io.Reader.<init>(Reader.java:168)
 at java.base/java.io.InputStreamReader.<init>(InputStreamReader.java:124)
 at io.airbyte.commons.io.IOs.newBufferedReader(IOs.java:74)
 at io.airbyte.commons.io.LineGobbler.<init>(LineGobbler.java:134)
 at io.airbyte.commons.io.LineGobbler.gobble(LineGobbler.java:73)
 at io.airbyte.workers.internal.DefaultAirbyteSource.start(DefaultAirbyteSource.java:92)
 at io.airbyte.workers.general.ReplicationWorkerHelper.startSource(ReplicationWorkerHelper.kt:133)
 ... 6 more

and

java.lang.RuntimeException: java.lang.NullPointerException
 at io.airbyte.workers.general.ReplicationWorkerHelper.startDestination(ReplicationWorkerHelper.kt:117)
 at io.airbyte.workers.general.BufferedReplicationWorker.lambda$run$0(BufferedReplicationWorker.java:167)
 at io.airbyte.workers.general.BufferedReplicationWorker.lambda$runAsync$2(BufferedReplicationWorker.java:236)
 at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1804)
 at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
 at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
 at java.base/java.lang.Thread.run(Thread.java:1589)
Caused by: java.lang.NullPointerException
 at java.base/java.io.Reader.<init>(Reader.java:168)
 at java.base/java.io.InputStreamReader.<init>(InputStreamReader.java:124)
 at io.airbyte.commons.io.IOs.newBufferedReader(IOs.java:74)
 at io.airbyte.commons.io.LineGobbler.<init>(LineGobbler.java:134)
 at io.airbyte.commons.io.LineGobbler.gobble(LineGobbler.java:73)
 at io.airbyte.workers.internal.DefaultAirbyteDestination.start(DefaultAirbyteDestination.java:100)
 at io.airbyte.workers.general.ReplicationWorkerHelper.startDestination(ReplicationWorkerHelper.kt:115)
 ... 6 more
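
Both traces bottom out in the same place: java.io.Reader's constructor null-checks the stream it is given, so a null InputStream from the connector process is enough to produce exactly this NullPointerException. A minimal, standalone illustration (not Airbyte code; the class name and the null stream are purely hypothetical stand-ins for a process whose output was never available):

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;

public class GobblerNpeIllustration {
    public static void main(String[] args) {
        // Stand-in for the stream LineGobbler reads from; null here models a
        // connector process whose stdout was not available at start time.
        InputStream processStdout = null;

        // Throws java.lang.NullPointerException from java.io.Reader.<init>,
        // matching the "Caused by" frames in the job logs above.
        BufferedReader reader = new BufferedReader(new InputStreamReader(processStdout));
        reader.lines().forEach(System.out::println);
    }
}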

Finally, inspecting the failed pods shows that the main, call-heartbeat-server, and (most often) remote-stdin containers are all in an error state. The following is the entirety of the container logs:

2023-12-20 13:32:30.246 EST All files copied successfully, exiting with code 0...
2023-12-20 13:32:31.621 EST Using existing AIRBYTE_ENTRYPOINT: python /airbyte/integration_code/main.py
2023-12-20 13:32:31.622 EST Waiting on CHILD_PID 7
2023-12-20 13:32:31.622 EST PARENT_PID: 1
2023-12-20 13:32:43.705 EST 2023/12/20 18:32:43 socat[1] N listening on AF=2 0.0.0.0:9001
2023-12-20 13:33:32.184 EST 2023/12/20 18:33:32 socat[8] N reading from and writing to stdio
2023-12-20 13:33:32.184 EST 2023/12/20 18:33:32 socat[8] N opening connection to AF=2 192.168.67.227:9879
2023-12-20 13:33:32.284 EST 2023/12/20 18:33:32 socat[8] N successfully connected from local address AF=2 192.168.70.47:35214
2023-12-20 13:33:32.284 EST 2023/12/20 18:33:32 socat[8] N starting data transfer loop with FDs [0,1] and [5,5]
2023-12-20 13:34:19.677 EST 2023/12/20 18:34:19 socat[1] N accepting connection from AF=2 192.168.67.227:48308 on AF=2 192.168.70.47:9001
2023-12-20 13:34:19.677 EST 2023/12/20 18:34:19 socat[1] N using stdout for reading and writing
2023-12-20 13:34:19.677 EST 2023/12/20 18:34:19 socat[1] N starting data transfer loop with FDs [6,6] and [1,1]
2023-12-20 13:34:19.884 EST 2023/12/20 18:34:19 socat[8] N reading from and writing to stdio
2023-12-20 13:34:19.884 EST 2023/12/20 18:34:19 socat[8] N opening connection to AF=2 192.168.67.227:9878
2023-12-20 13:34:19.985 EST 2023/12/20 18:34:19 socat[8] N successfully connected from local address AF=2 192.168.70.47:48784
2023-12-20 13:34:19.985 EST 2023/12/20 18:34:19 socat[8] N starting data transfer loop with FDs [0,1] and [5,5]
2023-12-20 13:34:20.572 EST 2023/12/20 18:34:20 socat[8] N socket 2 (fd 5) is at EOF
2023-12-20 13:34:20.572 EST 2023/12/20 18:34:20 socat[1] N socket 1 (fd 6) is at EOF
2023-12-20 13:34:20.572 EST 2023/12/20 18:34:20 socat[8] N socket 2 (fd 5) is at EOF
2023-12-20 13:34:20.757 EST Heartbeat to worker failed, exiting...
2023-12-20 13:34:20.758 EST received ABRT
2023-12-20 13:34:20.763 EST 2023/12/20 18:34:20 socat[1] E read(1, 0x7f71863c1000, 8192): Bad file descriptor
2023-12-20 13:34:20.763 EST 2023/12/20 18:34:20 socat[1] N exit(1)
2023-12-20 13:34:20.763 EST 2023/12/20 18:34:20 socat[8] N socket 1 (fd 0) is at EOF
2023-12-20 13:34:20.763 EST 2023/12/20 18:34:20 socat[8] N socket 1 (fd 0) is at EOF
2023-12-20 13:34:20.763 EST 2023/12/20 18:34:20 socat[8] N exiting with status 0
2023-12-20 13:34:20.763 EST 2023/12/20 18:34:20 socat[8] N exiting with status 0

Job attempt log: job_attempt.txt

The Airbyte configuration looks like so (with some private resource names omitted):

AIRBYTE_VERSION=0.50.38
AIRBYTE_EDITION=
API_URL=/api/v1/
CONNECTOR_BUILDER_API_URL=/connector-builder-api
CONFIG_API_HOST=airbyte-server-svc:8001
AIRBYTE_API_HOST=airbyte-api-server-svc:80
CONFIG_ROOT=/configs
CONFIGS_DATABASE_MINIMUM_FLYWAY_MIGRATION_VERSION=0.35.15.001
DATA_DOCKER_MOUNT=airbyte_data
DB_DOCKER_MOUNT=airbyte_db
GCS_LOG_BUCKET=<gcs logs bucket>
INTERNAL_API_HOST=airbyte-server-svc:8001
WORKLOAD_API_HOST=airbyte-workload-api:8007
CONNECTOR_BUILDER_API_HOST=airbyte-connector-builder-server-svc:8003

# Docker Resource Limits
JOB_MAIN_CONTAINER_CPU_REQUEST=1
JOB_MAIN_CONTAINER_CPU_LIMIT=1
JOB_MAIN_CONTAINER_MEMORY_REQUEST=4Gi
JOB_MAIN_CONTAINER_MEMORY_LIMIT=4Gi
JOBS_DATABASE_MINIMUM_FLYWAY_MIGRATION_VERSION=0.29.15.001

LOCAL_ROOT=/tmp/airbyte_local
RUN_DATABASE_MIGRATION_ON_STARTUP=true

# S3/Minio Log Configuration
S3_LOG_BUCKET=
S3_LOG_BUCKET_REGION=
S3_MINIO_ENDPOINT=
S3_PATH_STYLE_ACCESS=
STATE_STORAGE_MINIO_BUCKET_NAME=
STATE_STORAGE_MINIO_ENDPOINT=

SEGMENT_WRITE_KEY=

TEMPORAL_HOST=airbyte-temporal-svc:7233
TEMPORAL_WORKER_PORTS=9001,9002,9003,9004,9005,9006,9007,9008,9009,9010,9011,9012,9013,9014,9015,9016,9017,9018,9019,9020,9021,9022,9023,9024,9025,9026,9027,9028,9029,9030,9031,9032,9033,9034,9035,9036,9037,9038,9039,9040
TRACKING_STRATEGY=logging
WEBAPP_URL=airbyte-webapp-svc:80
WORKER_ENVIRONMENT=kubernetes
WORKSPACE_DOCKER_MOUNT=airbyte_workspace
WORKSPACE_ROOT=/workspace

# Open Telemetry Configuration
METRIC_CLIENT=otel
OTEL_COLLECTOR_ENDPOINT=http://otel-collector:4317

# Temporal Activity configuration
ACTIVITY_DISCOVERY_TIMEOUT=60
ACTIVITY_MAX_ATTEMPT=5
ACTIVITY_MAX_DELAY_BETWEEN_ATTEMPTS_SECONDS=600
ACTIVITY_MAX_TIMEOUT_SECOND=300
ACTIVITY_INITIAL_DELAY_BETWEEN_ATTEMPTS_SECONDS=30

WORKFLOW_FAILURE_RESTART_DELAY_SECONDS=600

USE_STREAM_CAPABLE_STATE=true
AUTO_DETECT_SCHEMA=true

CONTAINER_ORCHESTRATOR_IMAGE=

# Worker #
WORKERS_MICRONAUT_ENVIRONMENTS=control-plane

# Cron #
CRON_MICRONAUT_ENVIRONMENTS=control-plane

# Launcher #
LAUNCHER_MICRONAUT_ENVIRONMENTS=control-plane,oss

# Cloud Storage Configuration
WORKER_LOGS_STORAGE_TYPE=gcs
WORKER_STATE_STORAGE_TYPE=gcs

SHOULD_RUN_NOTIFY_WORKFLOWS=false
MAX_NOTIFY_WORKERS=5

# Airbyte Internal Database, see https://docs.airbyte.io/operator-guides/configuring-airbyte-db
DATABASE_HOST=airbyte-db-svc
DATABASE_PORT=5432
DATABASE_DB=db-airbyte
# translate manually DATABASE_URL=jdbc:postgresql://${DATABASE_HOST}:${DATABASE_PORT}/${DATABASE_DB}
DATABASE_URL=jdbc:postgresql://airbyte-db-svc:5432/db-airbyte

LOG_LEVEL=DEBUG

# GCS Log Configuration
STATE_STORAGE_GCS_BUCKET_NAME=<gcs state bucket>

NORMALIZATION_JOB_MAIN_CONTAINER_MEMORY_LIMIT=
NORMALIZATION_JOB_MAIN_CONTAINER_MEMORY_REQUEST=
NORMALIZATION_JOB_MAIN_CONTAINER_CPU_LIMIT=
NORMALIZATION_JOB_MAIN_CONTAINER_CPU_REQUEST=

# Worker pod tolerations, annotations and node selectors
JOB_KUBE_TOLERATIONS=
JOB_KUBE_ANNOTATIONS=
JOB_KUBE_NODE_SELECTORS=

# Job image pull policy
JOB_KUBE_MAIN_CONTAINER_IMAGE_PULL_POLICY=

STRICT_COMPARISON_NORMALIZATION_TAG=
STRICT_COMPARISON_NORMALIZATION_WORKSPACES=

CONNECTOR_SPECIFIC_RESOURCE_DEFAULTS_ENABLED=false

# just a placeholder so that nginx template is valid - shouldn't be used when edition isn't "pro"
KEYCLOAK_INTERNAL_HOST=localhost

# pod sweeper
RUNNING_TTL_MINUTES=60
SUCCEEDED_TTL_MINUTES=1
UNSUCCESSFUL_TTL_MINUTES=30

Hesperide commented 9 months ago

Thank you for opening this issue and for all of the details, @kzvezdarov. The team is currently digging in, and there's a possibility that the error message being thrown is a red herring.

We're going to release a change in the coming days which at minimum should help identify what the root cause is under the hood. We'll follow up when the newest release is available.

kzvezdarov commented 9 months ago

Sounds great @Hesperide, looking forward to it! Happy to provide any additional information that would help with diagnosing this.

jdpgrailsdev commented 9 months ago

@kzvezdarov We have released version 0.50.42 which contains a fix for the NullPointerException. Please test this out at your earliest convenience to see if it fixes the issue or exposes the underlying cause.

kzvezdarov commented 9 months ago

Thanks @jdpgrailsdev, that's awesome. I've pushed it to our cluster, I'll update here as soon as I've gathered some data.

kzvezdarov commented 9 months ago

Belated update after running 0.50.42 for about a week:

The primary error that was unmasked was an activity timeout:

Activity with activityType='RunWithJobOutput' failed: 'Activity task timed out'. scheduledEventId=12, startedEventId=13, activityId=b1b7b9f0-8c29-3f95-b753-f8a0c090130b, identity='', retryState=RETRY_STATE_NON_RETRYABLE_FAILURE

We had already bumped ACTIVITY_MAX_TIMEOUT_SECOND to 300s (the default is 120s, from what I can tell); increasing this to 600, and later 1800, resolved the issue, bringing us back to ~1.11 attempts per sync, down from 2.68 pre-upgrade.

Worth noting that increasing the timeout value is not sufficient on its own: running that configuration on 0.50.38 did not improve stability at all.
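
For reference, the corresponding line in the Temporal activity section of the configuration dump above ended up as:

# Temporal Activity configuration
ACTIVITY_MAX_TIMEOUT_SECOND=1800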

Thanks a ton for the fix!

marcosmarxm commented 5 months ago

Closed as it was fixed in version 0.50.42

dimisjim commented 2 weeks ago

ACTIVITY_MAX_TIMEOUT_SECOND doesn't seem to be a valid config in 0.64.7 (I am facing similar issues)

Is activityMaxDelayBetweenAttemptsSeconds / ACTIVITY_MAX_DELAY_BETWEEN_ATTEMPTS_SECONDS the replacement for that?