airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com

Self-hosted Airbyte in Docker stuck and will not trigger sync #44833

Open henriquemeloo opened 2 weeks ago

henriquemeloo commented 2 weeks ago

Platform Version

0.50.34

What step the error happened?

Other

Relevant information

I'm hosting Airbyte in Docker and no jobs can be run. When I attempt to trigger a sync from the UI, I get:

Failed to start sync: Server temporarily unavailable (http.502.u4xu7GZUwGb96wkAhjgxmJ)

If I try to run a test connection job, I get:

Server temporarily unavailable (http.502.998mm4fdPm3Tts4qMqmfRf)

Syncs with the Airflow AirbyteTriggerSyncOperator are also not working (output in logs).
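For reference, this is roughly how the syncs are triggered from Airflow; a minimal sketch of the DAG, where the Airbyte connection UUID and the Airflow connection name are placeholders for our real ones:

```python
# Minimal sketch of how we call AirbyteTriggerSyncOperator from a DAG.
# The Airbyte connection UUID and the "airbyte_default" Airflow connection
# are placeholders for our actual configuration.
from datetime import datetime

from airflow import DAG
from airflow.providers.airbyte.operators.airbyte import AirbyteTriggerSyncOperator

with DAG(
    dag_id="trigger_airbyte_sync",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
):
    AirbyteTriggerSyncOperator(
        task_id="airbyte_sync",
        airbyte_conn_id="airbyte_default",          # Airflow connection to the Airbyte server
        connection_id="<airbyte-connection-uuid>",  # the Airbyte connection to sync
        asynchronous=False,  # block until the job finishes
        timeout=3600,
        wait_seconds=30,
    )
```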

The webapp seems to be ok, though, as I can see sources, destinations and connections.

From the attached logs, something seems to be wrong with the Temporal service; it appears to be stuck. Running docker container stats shows that the temporal container is consuming a lot of CPU. Also, no job containers for connection checks or syncs are ever created.

I am not sure if the attached logs are sufficient; they are simply the first errors I could catch from the containers.

Relevant log output

Logs from the temporal container:
{"level":"info","ts":"2024-08-27T19:22:28.972Z","msg":"history client encountered error","service":"frontend","error":"service rate limit exceeded","service-error-type":"serviceerror.ResourceExhausted","logging-call-at":"metric_client.go:90"}

Logs from the worker container:
2024-08-27 19:13:33 WARN i.t.i.w.WorkflowWorker$TaskHandlerImpl(logExceptionDuringResultReporting):416 - Failure while reporting workflow progress to the server. If seen continuously the workflow might be stuck. WorkflowId=connection_manager_6b4ace68-3ee4-4c0f-bc35-f2b8ff9e6d80, RunId=b49c49fd-0e6a-4413-9ff1-1214c9d52403, startedEventId=0
io.grpc.StatusRuntimeException: NOT_FOUND: query task not found, or already expired

Response for Airflow `AirbyteTriggerSyncOperator`:
[2024-08-27, 00:40:26 UTC] {http.py:200} ERROR - HTTP error: Bad Gateway
[2024-08-27, 00:40:26 UTC] {http.py:201} ERROR - <!DOCTYPE html>
<html>
<head>
<title>Error</title>
<style>
html { color-scheme: light dark; }
body { width: 35em; margin: 0 auto;
font-family: Tahoma, Verdana, Arial, sans-serif; }
</style>
</head>
<body>
<h1>An error occurred.</h1>
<p>Sorry, the page you are looking for is currently unavailable.<br/>
Please try again later.</p>
<p>If you are the system administrator of this resource then you should check
the error log for details.</p>
<p><em>Faithfully yours, nginx.</em></p>
</body>
</html>
[2024-08-27, 00:40:26 UTC] {taskinstance.py:441} ▼ Post task execution logs
[2024-08-27, 00:40:26 UTC] {taskinstance.py:2905} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/http/hooks/http.py", line 198, in check_response
    response.raise_for_status()
  File "/home/airflow/.local/lib/python3.11/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 502 Server Error: Bad Gateway for url: ...
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/models/taskinstance.py", line 465, in _execute_task
    result = _execute_callable(context=context, **execute_callable_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/models/taskinstance.py", line 432, in _execute_callable
    return execute_callable(context=context, **execute_callable_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/models/baseoperator.py", line 401, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/airbyte/operators/airbyte.py", line 81, in execute
    job_object = hook.submit_sync_connection(connection_id=self.connection_id)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/airbyte/hooks/airbyte.py", line 149, in submit_sync_connection
    return self.run(
           ^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/http/hooks/http.py", line 188, in run
    return self.run_and_check(session, prepped_request, extra_options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/http/hooks/http.py", line 239, in run_and_check
    self.check_response(response)
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/http/hooks/http.py", line 202, in check_response
    raise AirflowException(str(response.status_code) + ":" + response.reason)
airflow.exceptions.AirflowException: 502:Bad Gateway
henriquemeloo commented 2 weeks ago

Might be a duplicate of https://github.com/airbytehq/airbyte/issues/30691

davinchia commented 2 weeks ago

@henriquemeloo can you give it a shot and let us know?

henriquemeloo commented 2 weeks ago

@davinchia it seems that the solution mentioned in that issue is to delete the temporal and temporal_visibility databases and restart Airbyte. Is it safe to do that?

davinchia commented 2 weeks ago

@henriquemeloo oops, I should be clearer. The fix I'm referring to is this, essentially changing the Temporal configs to increase its rate limits.

You should not need to delete the temporal and temporal_visibility databases.

henriquemeloo commented 2 weeks ago

@davinchia simply restarting the stack brought it back to life for about 24 hours, but then we ran into the same error in Temporal ({"level":"info","ts":"2024-08-29T12:15:41.450Z","msg":"history client encountered error","service":"frontend","error":"service rate limit exceeded","service-error-type":"serviceerror.ResourceExhausted","logging-call-at":"metric_client.go:90"}). We have now set these configuration values in /etc/temporal/config/dynamicconfig/development.yaml:

frontend.namespaceCount:
  - value: 4096
    constraints: {}
frontend.namespaceRPS:
  - value: 76800
    constraints: {}
frontend.namespaceRPS.visibility:
  - value: 100
    constraints: {}
frontend.namespaceBurst.visibility:
  - value: 150
    constraints: {}

Would it also help to increase the number of Temporal replicas in our Docker Compose definition? Does that even work?

We have also increased the following values in Airbyte:

AIRBYTE__MAX_DISCOVER_WORKERS=10
AIRBYTE__MAX_SYNC_WORKERS=10

and we are triggering no more than 200 syncs/second through the Airbyte API, besides a few GET requests to read resources and, occasionally, jobs to update connection schemas.
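For context, each sync is triggered with a plain POST to the Config API (the same endpoint the Airflow hook calls in the traceback above), roughly as in this sketch; the server URL is a placeholder and error handling is omitted:

```python
# Rough sketch of how a single sync is triggered through the Airbyte Config API.
# The server URL is a placeholder for our deployment.
import requests

AIRBYTE_URL = "http://<airbyte-host>:<port>"  # placeholder for our proxy/server URL

def trigger_sync(connection_id: str) -> dict:
    """POST /api/v1/connections/sync for one connection and return the job info."""
    response = requests.post(
        f"{AIRBYTE_URL}/api/v1/connections/sync",
        json={"connectionId": connection_id},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()
```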

henriquemeloo commented 2 weeks ago

We're still running into the same problem with those configurations. It seems to happen after a burst of sync trigger requests, in which we request around 145 connection syncs.

davinchia commented 1 week ago

@henriquemeloo sorry for the late reply - I'm out this week and wanted to drop a note so you didn't think I was ignoring you.

200/s is a high number! Very cool for me to learn you guys are using us at that level. Can you tell me more about the instance your Airbyte is running on and how long these jobs generally take?

Docker doesn't limit resource usage among containers by default (each container can use the entire instance's resources), so my guess is that your job spikes are overwhelming the deployment and leaving Airbyte in a bad state that is only recoverable via a restart. I'd recommend you move to Kubernetes, as there are better resource guarantees and it's more scalable, though there is an operational cost (learning K8s, and K8s itself has more overhead).

henriquemeloo commented 1 day ago

@davinchia thanks for your reply! We have spread our sync job requests across the day and throttled requests to the server to a maximum of 10 simultaneous requests. That seemed to solve it, but we just got the error again:

{"level":"info","ts":"2024-09-13T20:59:13.135Z","msg":"matching client encountered error","service":"frontend","error":"service rate limit exceeded","service-error-type":"serviceerror.ResourceExhausted","logging-call-at":"metric_client.go:218"}
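For reference, the client-side throttling looks roughly like this; the server URL and connection UUIDs are placeholders, and the cap of 10 is the limit described above:

```python
# Sketch of the client-side throttling: at most 10 sync trigger requests are
# in flight at any time. Server URL and connection UUIDs are placeholders.
from concurrent.futures import ThreadPoolExecutor

import requests

AIRBYTE_URL = "http://<airbyte-host>:<port>"  # placeholder
CONNECTION_IDS = ["<connection-uuid-1>", "<connection-uuid-2>"]  # placeholders

def trigger_sync(connection_id: str) -> int:
    """Trigger one sync and return the HTTP status code."""
    response = requests.post(
        f"{AIRBYTE_URL}/api/v1/connections/sync",
        json={"connectionId": connection_id},
        timeout=30,
    )
    return response.status_code

# max_workers caps the number of simultaneous requests to the server at 10.
with ThreadPoolExecutor(max_workers=10) as pool:
    statuses = list(pool.map(trigger_sync, CONNECTION_IDS))
```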

We are hosting Airbyte on an m6a.2xlarge instance. The sync jobs are usually short-running (minutes long), but a few (around 5) run daily and may take up to 12 hours to finish. Does it seem like we need a larger instance? What is the main limit here, CPU or memory? We have MAX_DISCOVER_WORKERS and MAX_SYNC_WORKERS both set to 10.

We used to host Airbyte on Kubernetes on EKS, but frequent bugs in the Helm chart made us decide to simplify to a Docker deployment. I see that Airbyte is sunsetting the Docker deployment in favor of Kubernetes, so we'll need to migrate next year.