henriquemeloo opened 3 months ago
Might be a duplicate of https://github.com/airbytehq/airbyte/issues/30691
@henriquemeloo can you give it a shot and let us know?
@davinchia it seems that the solution mentioned in that issue is to delete the `temporal` and `temporal_visibility` databases and restart Airbyte. Is it safe to do that?
@henriquemeloo oops, I should be clearer. The fix I'm referring to is this: essentially, changing the Temporal configs to increase its rate limits.
You should not need to delete the `temporal` and `temporal_visibility` databases.
@davinchia simply restarting the stack brought it back to life for about 24 hours, but then we ran into the same error in Temporal:

```
{"level":"info","ts":"2024-08-29T12:15:41.450Z","msg":"history client encountered error","service":"frontend","error":"service rate limit exceeded","service-error-type":"serviceerror.ResourceExhausted","logging-call-at":"metric_client.go:90"}
```

We have now set these configuration values in `/etc/temporal/config/dynamicconfig/development.yaml`:
```yaml
frontend.namespaceCount:
  - value: 4096
    constraints: {}
frontend.namespaceRPS:
  - value: 76800
    constraints: {}
frontend.namespaceRPS.visibility:
  - value: 100
    constraints: {}
frontend.namespaceBurst.visibility:
  - value: 150
    constraints: {}
```
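For reference, a minimal sketch of additional per-host rate-limit keys that Temporal's dynamic config also accepts (`frontend.rps`, `history.rps`, `matching.rps`). These were not part of the configuration above and the values are purely illustrative, so treat this as a starting point rather than a recommendation:

```yaml
# Sketch only: per-host RPS limits for the individual Temporal services.
# The keys are standard Temporal dynamic-config settings; the values are
# illustrative and need to be tuned against the actual workload.
frontend.rps:
  - value: 4800
    constraints: {}
history.rps:
  - value: 6000
    constraints: {}
matching.rps:
  - value: 2400
    constraints: {}
```

The later log lines in this thread ("history client encountered error", "matching client encountered error") hint that the history and matching limits may be the ones being exhausted, though that would need to be confirmed against the Temporal version bundled with this Airbyte release.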
Should it also help to increase the number of Temporal replicas in our docker compose definition? Does that even work?
In Airbyte, we have increased the following values:

```
AIRBYTE__MAX_DISCOVER_WORKERS=10
AIRBYTE__MAX_SYNC_WORKERS=10
```

and we are triggering no more than 200 syncs/second through the Airbyte API, besides a few GET requests to read resources and, occasionally, jobs to update connections' schemas.
We're still running into the same problem with those configurations. It seems that this problem happens after a burst of trigger sync requests, where we request around 145 connection syncs.
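As a rough illustration of where those worker-count variables usually live in a Docker deployment, below is a sketch of the relevant `environment` entries on the worker service in the compose file. The service name and exact variable names are assumptions on my part and may differ between Airbyte versions:

```yaml
# Sketch: wiring the worker-count overrides from .env into the Airbyte
# worker container. Service and variable names are assumed and may differ
# per Airbyte version; other settings stay as in the stock compose file.
services:
  worker:
    environment:
      - MAX_SYNC_WORKERS=${MAX_SYNC_WORKERS:-10}
      - MAX_DISCOVER_WORKERS=${MAX_DISCOVER_WORKERS:-10}
```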
@henriquemeloo sorry for the late reply - I'm out this week and wanted to drop a note so you didn't think I was ignoring you.
200/s is a high number! Very cool for me to learn you guys are using us at that level. Can you tell me more about the instance your Airbyte is running on and how long these jobs generally take?
Docker doesn't limit resource usage among containers by default (each container can use the entire instance's resources), so my guess is your job spikes are overwhelming the entire deployment and leaving Airbyte in a bad state only recoverable via a restart. I'd recommend you guys move to Kubernetes, as there are better resource guarantees and it's more scalable, though there is an operational cost (learning K8s, and K8s itself has more overhead).
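For anyone staying on Docker in the meantime, a minimal sketch of one way to cap per-container resources in the compose file so that a spike in one service cannot starve the rest; this is an assumption on my part (not something tested in this thread), and the limits shown are illustrative only:

```yaml
# Sketch: capping CPU/memory for the Temporal container so a burst of jobs
# cannot consume the whole instance. Requires a Compose version that honors
# deploy.resources.limits outside Swarm; values are illustrative.
services:
  temporal:
    # ...image, volumes and environment as in the existing definition...
    deploy:
      resources:
        limits:
          cpus: "2.0"
          memory: 4g
```

Kubernetes resource requests and limits provide the same kind of guarantee in a more first-class way, which is part of why it scales better.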
@davinchia thanks for your reply! We have spread our sync job requests across the day and throttled requests to the server to a maximum of 10 simultaneous requests, which seemed to solve it, but we just got the error again:

```
{"level":"info","ts":"2024-09-13T20:59:13.135Z","msg":"matching client encountered error","service":"frontend","error":"service rate limit exceeded","service-error-type":"serviceerror.ResourceExhausted","logging-call-at":"metric_client.go:218"}
```
We are hosting Airbyte on an m6a.2xlarge instance. The sync jobs are usually short-running (minutes long), but there are a few (around 5) that run daily and may take up to 12 hours to finish. Does it seem like we need a larger instance? What is the main limit here, CPU or memory? We have `MAX_DISCOVER_WORKERS` and `MAX_SYNC_WORKERS` both set to 10.
We used to host Airbyte on Kubernetes on EKS, but frequent bugs in the Helm chart made us decide to simplify to a Docker deployment. I see that Airbyte is sunsetting the Docker deployment in favor of k8s, so we'll need to migrate next year.
It looks like the Temporal databases may have gotten a bit large. The largest table in the `temporal` database, `history_node`, has around 1.6M records. I've tried setting `TEMPORAL_HISTORY_RETENTION_IN_DAYS=5` to see if it helps.
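As a point of reference, a minimal sketch of how that retention setting could be passed through a compose-based deployment; which Airbyte service actually consumes the variable and applies it to the Temporal namespace is an assumption here and may differ by version:

```yaml
# Sketch: passing the retention override from .env into the container
# environment. The consuming service (worker here) is assumed and may
# differ per Airbyte version.
services:
  worker:
    environment:
      - TEMPORAL_HISTORY_RETENTION_IN_DAYS=${TEMPORAL_HISTORY_RETENTION_IN_DAYS:-5}
```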
The proxy container logs quite a few errors, and they all seem to be either

```
[alert] 11#11: *21080 512 worker_connections are not enough while connecting to upstream, client: XXXXXXX, server: , request: "GET /v1/workspaces/c6159056-a275-4e2a-afb5-a9fd1b00ec93 HTTP/1.1", upstream: "http://172.28.0.3:8006/v1/workspaces/c6159056-a275-4e2a-afb5-a9fd1b00ec93", host: "XXXXXXX"
```

or

```
[error] 11#11: *20493 upstream timed out (110: Connection timed out) while reading response header from upstream, client: XXXXXXX, server: , request: "GET /v1/workspaces/c6159056-a275-4e2a-afb5-a9fd1b00ec93 HTTP/1.1", upstream: "http://172.28.0.3:8006/v1/workspaces/c6159056-a275-4e2a-afb5-a9fd1b00ec93", host: "XXXXXXX"
```

I don't know if this provides any useful information, though.
We ended up upgrading to Airbyte v1.1.0 deployed via `abctl`, and we are still running into this problem:

```
{"level":"info","ts":"2024-10-07T13:29:15.145Z","msg":"matching client encountered error","service":"frontend","error":"service rate limit exceeded","service-error-type":"serviceerror.ResourceExhausted","logging-call-at":"metric_client.go:219"}
{"level":"info","ts":"2024-10-07T13:29:15.153Z","msg":"history client encountered error","service":"frontend","error":"service rate limit exceeded","service-error-type":"serviceerror.ResourceExhausted","logging-call-at":"metric_client.go:104"}
```
Temporal shows a few new warnings as well:
@davinchia could this be caused or aggravated by using an endpoint from the Configuration API? We make frequent requests to `v1/web_backend/connections/get` and `v1/web_backend/connections/update`, and throttling those seems to have improved the situation.
Platform Version
0.50.34
What step the error happened?
Other
Relevant information
I'm hosting Airbyte in Docker and no jobs can be run. From the UI, attempting to trigger a sync, I get:
If I try to run a test connection job, I get:
Syncs with the Airflow `AirbyteTriggerSyncOperator` are also not working (output in logs). The webapp seems to be OK, though, as I can see sources, destinations and connections.
From the attached logs, there seems to be something wrong with the Temporal service, which appears to be stuck somehow. Running `docker container stats` shows that the temporal container is consuming a lot of CPU. Also, I can see that no job containers for connection checks or syncs are ever created. I am not sure if the attached logs are sufficient; they are simply the first errors I could catch from the containers.
Relevant log output