airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com

Self-hosted Airbyte in Docker stuck and will not trigger sync #44833

Open · henriquemeloo opened this issue 3 months ago

henriquemeloo commented 3 months ago

Platform Version

0.50.34

What step the error happened?

Other

Relevant information

I'm hosting Airbyte in Docker and no jobs can be run. From the UI, attempting to trigger a sync, I get:

Failed to start sync: Server temporarily unavailable (http.502.u4xu7GZUwGb96wkAhjgxmJ)

If I try to run a test connection job, I get:

Server temporarily unavailable (http.502.998mm4fdPm3Tts4qMqmfRf)

Syncs triggered with the Airflow `AirbyteTriggerSyncOperator` are also failing (output in the logs below).

The webapp seems to be ok, though, as I can see sources, destinations and connections.

From the attached logs, something appears to be wrong with the Temporal service; it looks like it is stuck somehow. Running `docker container stats` shows that the temporal container is consuming a lot of CPU. Also, I can see that no job containers for connection checks or syncs are ever created.

I am not sure if the attached logs are sufficient; they are simply the first errors I could catch from the containers.

Relevant log output

Logs from the temporal container:
"level":"info","ts":"2024-08-27T19:22:28.972Z","msg":"history client encountered error","service":"frontend","error":"service rate limit exceeded","service-error-type":"serviceerror.ResourceExhausted","logging-call-at":"metric_client.go:90"}

Logs from the worker container:
2024-08-27 19:13:33 WARN i.t.i.w.WorkflowWorker$TaskHandlerImpl(logExceptionDuringResultReporting):416 - Failure while reporting workflow progress to the server. If seen continuously the workflow might be stuck. WorkflowId=connection_manager_6b4ace68-3ee4-4c0f-bc35-f2b8ff9e6d80, RunId=b49c49fd-0e6a-4413-9ff1-1214c9d52403, startedEventId=0
io.grpc.StatusRuntimeException: NOT_FOUND: query task not found, or already expired

Response for Airflow `AirbyteTriggerSyncOperator`:
[2024-08-27, 00:40:26 UTC] {http.py:200} ERROR - HTTP error: Bad Gateway
[2024-08-27, 00:40:26 UTC] {http.py:201} ERROR - <!DOCTYPE html>
<html>
<head>
<title>Error</title>
<style>
html { color-scheme: light dark; }
body { width: 35em; margin: 0 auto;
font-family: Tahoma, Verdana, Arial, sans-serif; }
</style>
</head>
<body>
<h1>An error occurred.</h1>
<p>Sorry, the page you are looking for is currently unavailable.<br/>
Please try again later.</p>
<p>If you are the system administrator of this resource then you should check
the error log for details.</p>
<p><em>Faithfully yours, nginx.</em></p>
</body>
</html>
[2024-08-27, 00:40:26 UTC] {taskinstance.py:441} ▼ Post task execution logs
[2024-08-27, 00:40:26 UTC] {taskinstance.py:2905} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/http/hooks/http.py", line 198, in check_response
    response.raise_for_status()
  File "/home/airflow/.local/lib/python3.11/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 502 Server Error: Bad Gateway for url: ...
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/models/taskinstance.py", line 465, in _execute_task
    result = _execute_callable(context=context, **execute_callable_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/models/taskinstance.py", line 432, in _execute_callable
    return execute_callable(context=context, **execute_callable_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/models/baseoperator.py", line 401, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/airbyte/operators/airbyte.py", line 81, in execute
    job_object = hook.submit_sync_connection(connection_id=self.connection_id)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/airbyte/hooks/airbyte.py", line 149, in submit_sync_connection
    return self.run(
           ^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/http/hooks/http.py", line 188, in run
    return self.run_and_check(session, prepped_request, extra_options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/http/hooks/http.py", line 239, in run_and_check
    self.check_response(response)
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/http/hooks/http.py", line 202, in check_response
    raise AirflowException(str(response.status_code) + ":" + response.reason)
airflow.exceptions.AirflowException: 502:Bad Gateway
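
As a stopgap on the Airflow side, the operator can be told to retry instead of failing the task on the first transient 502. A minimal sketch, assuming a recent Airflow 2.x with the Airbyte provider installed; the DAG id, Airflow connection id, and retry settings are placeholders, and retrying does not address the underlying Temporal problem:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.airbyte.operators.airbyte import AirbyteTriggerSyncOperator

# Sketch only: the ids, schedule, and retry settings below are placeholders.
with DAG(
    dag_id="airbyte_sync_with_retries",
    start_date=datetime(2024, 8, 1),
    schedule=None,
    catchup=False,
) as dag:
    trigger_sync = AirbyteTriggerSyncOperator(
        task_id="trigger_airbyte_sync",
        airbyte_conn_id="airbyte_default",  # Airflow connection pointing at the Airbyte server
        connection_id="6b4ace68-3ee4-4c0f-bc35-f2b8ff9e6d80",  # placeholder connection UUID
        asynchronous=False,
        # Standard BaseOperator retry knobs: retry transient 502s from the proxy
        # instead of failing the DAG run immediately.
        retries=3,
        retry_delay=timedelta(minutes=5),
        retry_exponential_backoff=True,
    )
```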
henriquemeloo commented 3 months ago

Might be a duplicate of https://github.com/airbytehq/airbyte/issues/30691

davinchia commented 3 months ago

@henriquemeloo can you give it a shot and let us know?

henriquemeloo commented 3 months ago

@davinchia it seems that the solution mentioned in that issue is to delete the temporal and temporal_visibility databases and restart Airbyte. Is it safe to do that?

davinchia commented 3 months ago

@henriquemeloo oops, I should be clearer. The fix I'm referring to is this, essentially changing the Temporal configs to increase its rate limits.

You should not need to delete the temporal and temporal_visibility databases.

henriquemeloo commented 3 months ago

@davinchia simply restarting the stack brought it back to life for about 24 hours, but then we ran into the same error in Temporal ({"level":"info","ts":"2024-08-29T12:15:41.450Z","msg":"history client encountered error","service":"frontend","error":"service rate limit exceeded","service-error-type":"serviceerror.ResourceExhausted","logging-call-at":"metric_client.go:90"}). We have now set these configuration values in /etc/temporal/config/dynamicconfig/development.yaml:

frontend.namespaceCount:
  - value: 4096
    constraints: {}
frontend.namespaceRPS:
  - value: 76800
    constraints: {}
frontend.namespaceRPS.visibility:
  - value: 100
    constraints: {}
frontend.namespaceBurst.visibility:
  - value: 150
    constraints: {}

Would it also help to increase the number of Temporal replicas in our Docker Compose definition? Does that even work?

In Airbyte, we have increased the following values:

AIRBYTE__MAX_DISCOVER_WORKERS=10
AIRBYTE__MAX_SYNC_WORKERS=10

and we are triggering no more than 200 syncs/second through the Airbyte API, plus a few GET requests to read resources and, occasionally, jobs to refresh connection schemas.
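
For illustration, a minimal sketch of how these trigger-sync calls could be throttled client-side so that only a bounded number are in flight at once; the base URL, port, and the cap of 10 concurrent requests are assumptions, not values Airbyte prescribes:

```python
import concurrent.futures

import requests

# Assumptions: the Configuration API is reachable at this base URL/port and no
# extra auth is required; adjust both to match your deployment.
AIRBYTE_API_URL = "http://localhost:8001/api/v1"
MAX_CONCURRENT_REQUESTS = 10  # keep bursts small so the proxy/Temporal are not flooded


def trigger_sync(connection_id: str) -> dict:
    """Trigger one sync via the Configuration API and return the job summary."""
    response = requests.post(
        f"{AIRBYTE_API_URL}/connections/sync",
        json={"connectionId": connection_id},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()


def trigger_all(connection_ids: list[str]) -> None:
    # A bounded thread pool keeps at most MAX_CONCURRENT_REQUESTS in flight.
    with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_CONCURRENT_REQUESTS) as pool:
        futures = {pool.submit(trigger_sync, cid): cid for cid in connection_ids}
        for future in concurrent.futures.as_completed(futures):
            cid = futures[future]
            try:
                job = future.result()
                # Assumption about the response shape: {"job": {"id": ...}, "attempts": [...]}
                print(f"{cid}: job {job.get('job', {}).get('id')} started")
            except requests.RequestException as exc:
                print(f"{cid}: failed to trigger sync ({exc})")
```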

henriquemeloo commented 3 months ago

We're still running into the same problem with those configurations. It seems to happen after a burst of trigger-sync requests, in which we request around 145 connection syncs.

davinchia commented 2 months ago

@henriquemeloo sorry for the late reply - I'm out this week and wanted to drop a note so you didn't think I was ignoring you.

200/s is a high number! Very cool for me to learn you guys are using us at that level. Can you tell me more about the instance your Airbyte is running on and how long these jobs generally take?

Docker doesn't limit resource usage among containers by default (each container can use the entire instance's resources), so my guess is that your job spikes are overwhelming the whole deployment and leaving Airbyte in a bad state that is only recoverable via a restart. I'd recommend moving to Kubernetes, as it offers better resource guarantees and is more scalable, though there is an operational cost (learning K8s, and K8s itself has more overhead).

henriquemeloo commented 2 months ago

@davinchia thanks for your reply! We have spread our sync job requests across the day and throttled requests to the server to a maximum of 10 simultaneous requests, which seemed to solve it, but we just got the error again:

{"level":"info","ts":"2024-09-13T20:59:13.135Z","msg":"matching client encountered error","service":"frontend","error":"service rate limit exceeded","service-error-type":"serviceerror.ResourceExhausted","logging-call-at":"metric_client.go:218"}

We are hosting Airbyte on an m6a.2xlarge instance. The sync jobs are usually short-running (minutes long), but a few of them (around 5) run daily and may take up to 12 hours to finish. Does it seem like we need a larger instance? What is the main limit here, CPU or memory? We have MAX_DISCOVER_WORKERS and MAX_SYNC_WORKERS both set to 10.

We used to host Airbyte on Kubernetes on EKS, but frequent bugs in the Helm chart made us decide to simplify to a Docker deployment. I see that Airbyte is sunsetting Docker deployments in favor of Kubernetes, so we'll need to migrate next year.

henriquemeloo commented 2 months ago

It looks like the Temporal databases may have grown quite large. The largest table in the temporal database, history_node, has around 1.6M records. I've set TEMPORAL_HISTORY_RETENTION_IN_DAYS=5 to see if it helps.
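
For anyone checking the same thing, a small sketch of how the largest tables in the temporal database can be inspected; the host, port, and credentials are placeholders for a default Docker deployment and need to be adjusted:

```python
import psycopg2  # pip install psycopg2-binary

# Placeholder connection details; point this at the Postgres instance that
# hosts the "temporal" database in your deployment.
conn = psycopg2.connect(
    host="localhost", port=5432, dbname="temporal", user="docker", password="docker"
)
with conn, conn.cursor() as cur:
    # Approximate row counts and on-disk sizes of the ten biggest tables,
    # e.g. history_node, which the retention setting should eventually trim.
    cur.execute(
        """
        SELECT relname,
               n_live_tup,
               pg_size_pretty(pg_total_relation_size(relid)) AS total_size
        FROM pg_stat_user_tables
        ORDER BY pg_total_relation_size(relid) DESC
        LIMIT 10;
        """
    )
    for table, rows, size in cur.fetchall():
        print(f"{table:30s} {rows:>12} rows  {size}")
conn.close()
```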

henriquemeloo commented 2 months ago

The proxy container logs quite a few errors, and they all seem to be one of these two:

[alert] 11#11: *21080 512 worker_connections are not enough while connecting to upstream, client: XXXXXXX, server: , request: "GET /v1/workspaces/c6159056-a275-4e2a-afb5-a9fd1b00ec93 HTTP/1.1", upstream: "http://172.28.0.3:8006/v1/workspaces/c6159056-a275-4e2a-afb5-a9fd1b00ec93", host: "XXXXXXX"

[error] 11#11: *20493 upstream timed out (110: Connection timed out) while reading response header from upstream, client: XXXXXXX, server: , request: "GET /v1/workspaces/c6159056-a275-4e2a-afb5-a9fd1b00ec93 HTTP/1.1", upstream: "http://172.28.0.3:8006/v1/workspaces/c6159056-a275-4e2a-afb5-a9fd1b00ec93", host: "XXXXXXX"

I don't know if this provides any useful information, though.

henriquemeloo commented 1 month ago

We ended up upgrading to Airbyte v1.1.0 deployed via abctl, and we are still running into this problem:

{"level":"info","ts":"2024-10-07T13:29:15.145Z","msg":"matching client encountered error","service":"frontend","error":"service rate limit exceeded","service-error-type":"serviceerror.ResourceExhausted","logging-call-at":"metric_client.go:219"} {"level":"info","ts":"2024-10-07T13:29:15.153Z","msg":"history client encountered error","service":"frontend","error":"service rate limit exceeded","service-error-type":"serviceerror.ResourceExhausted","logging-call-at":"metric_client.go:104"}

Temporal shows a few new warnings as well:

henriquemeloo commented 1 month ago

@davinchia could this be caused or aggravated by our use of the Configuration API? We make frequent requests to `v1/web_backend/connections/get` and `v1/web_backend/connections/update`, and throttling those seems to have improved the situation.