airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com

`airbyte-cron`: RESOURCE_EXHAUSTED namespace rate limit exceeded #30691

Open TimothyZhang7 opened 9 months ago

TimothyZhang7 commented 9 months ago

Topic

Temporal issue

Relevant information

Airbyte version: 0.50.21

We are observing an abnormal number of rate limit errors from airbyte-cron. We are not using Airbyte schedulers; only one cron job is set up in the Airbyte UI.

The following error message is emitted every few seconds as soon as we start the Docker Compose stack.


io.grpc.StatusRuntimeException: RESOURCE_EXHAUSTED: namespace rate limit exceeded
    at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:271) ~[grpc-stub-1.54.0.jar:1.54.0]
    at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:252) ~[grpc-stub-1.54.0.jar:1.54.0]
    at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:165) ~[grpc-stub-1.54.0.jar:1.54.0]
    at io.temporal.api.workflowservice.v1.WorkflowServiceGrpc$WorkflowServiceBlockingStub.listClosedWorkflowExecutions(WorkflowServiceGrpc.java:4011) ~[temporal-serviceclient-1.17.0.jar:?]
    at io.airbyte.commons.temporal.TemporalClient.fetchClosedWorkflowsByStatus(TemporalClient.java:127) ~[io.airbyte-airbyte-commons-temporal-0.50.21.jar:?]
    at io.airbyte.commons.temporal.TemporalClient.restartClosedWorkflowByStatus(TemporalClient.java:105) ~[io.airbyte-airbyte-commons-temporal-0.50.21.jar:?]
    at io.airbyte.cron.jobs.SelfHealTemporalWorkflows.cleanTemporal(SelfHealTemporalWorkflows.java:40) ~[io.airbyte-airbyte-cron-0.50.21.jar:?]
    at io.airbyte.cron.jobs.$SelfHealTemporalWorkflows$Definition$Exec.dispatch(Unknown Source) ~[io.airbyte-airbyte-cron-0.50.21.jar:?]
    at io.micronaut.context.AbstractExecutableMethodsDefinition$DispatchedExecutableMethod.invoke(AbstractExecutableMethodsDefinition.java:371) ~[micronaut-inject-3.9.4.jar:3.9.4]
    at io.micronaut.inject.DelegatingExecutableMethod.invoke(DelegatingExecutableMethod.java:76) ~[micronaut-inject-3.9.4.jar:3.9.4]
    at io.micronaut.scheduling.processor.ScheduledMethodProcessor.lambda$process$5(ScheduledMethodProcessor.java:127) ~[micronaut-context-3.9.4.jar:3.9.4]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:577) ~[?:?]
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:358) ~[?:?]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
    at java.lang.Thread.run(Thread.java:1589) ~[?:?]
TimothyZhang7 commented 9 months ago

Somehow this problem goes away after deleting the temporal and temporal_visibility databases in the Postgres instance created by the Airbyte deployment and restarting the instance with the run-ab-platform.sh script. Not sure if it is a definitive fix, but it is worth a try if you run into the same problem.
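
As a hedged sketch of that workaround for a Docker Compose deployment (the container name airbyte-db and the docker Postgres role are assumptions; adjust them to your setup):

    # Drop the Temporal databases in the Airbyte-managed Postgres container
    # (container name and role are assumptions, not confirmed by the thread).
    docker exec -it airbyte-db psql -U docker -d postgres \
      -c "DROP DATABASE temporal;" \
      -c "DROP DATABASE temporal_visibility;"
    # Restart the platform; per the workaround above, Temporal recreates its schemas on startup.
    ./run-ab-platform.sh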

marcosmarxm commented 8 months ago

Similar discussion: https://github.com/airbytehq/airbyte/discussions/30472

joeybenamy commented 4 months ago

We experienced this issue as well with Helm chart version 0.50.20 in multiple environments. Completing these steps resolved it for us (a command-level sketch follows the list):

  1. Helm uninstall Airbyte
  2. Delete the Airbyte namespace
  3. Delete the temporal and temporal_visibility databases (external Postgres)
  4. Reinstall Airbyte
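
A hedged sketch of those steps for a Helm deployment with an external Postgres; the release name, namespace, connection details, and values file are placeholders, not confirmed by the thread:

    # Placeholders throughout; pause all syncs before doing this (see the later comments).
    helm uninstall airbyte -n airbyte
    kubectl delete namespace airbyte
    # Against the external Postgres instance:
    psql -h <postgres-host> -U <admin-user> -d postgres \
      -c "DROP DATABASE temporal;" \
      -c "DROP DATABASE temporal_visibility;"
    helm install airbyte airbyte/airbyte -n airbyte --create-namespace -f values.yaml
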
killthekitten commented 2 months ago

@marcosmarxm we're seeing this ever since we updated from 0.44.0 to 0.57.1, OSS. The Airbyte installation is unstable and I think this is connected:

  • The airbyte-worker service went into a reboot loop last Friday
  • The logs are never rotated and quickly use up all of the disk

What might be the downsides of @joeybenamy's approach of deleting the temporal databases?

TimothyZhang7 commented 2 months ago

> @marcosmarxm we're seeing this ever since we updated from 0.44.0 to 0.57.1, OSS. The Airbyte installation is unstable and I think this is connected:
>
>   • The airbyte-worker service went into a reboot loop last Friday
>   • The logs are never rotated and quickly use up all of the disk
>
> What might be the downsides of @joeybenamy's approach of deleting the temporal databases?

After deleting the temporal databases, there is a chance of some running sync jobs getting stuck; more specifically, they can no longer be run or canceled. AFAIK you will have to reset the connection to fix it.
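
If that happens, one way to trigger the reset (an assumption on my part, not something the thread confirms) is the Configuration API's connections/reset endpoint; the host, port, and connection ID below are placeholders:

    # Placeholder host/port and ID; add auth headers if your deployment requires them.
    curl -X POST http://<airbyte-host>:8001/api/v1/connections/reset \
      -H "Content-Type: application/json" \
      -d '{"connectionId": "<connection-id>"}'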

killthekitten commented 2 months ago

@TimothyZhang7 thanks! Actually, it went mostly ok. I saw a few log entries about a mismatch for some of the running sync job statuses, but it's been running smoothly ever since.

That said, we still have the same problem with log rotation; it didn't go away.

joeybenamy commented 2 months ago

> > @marcosmarxm we're seeing this ever since we updated from 0.44.0 to 0.57.1, OSS. The Airbyte installation is unstable and I think this is connected:
> >
> >   • The airbyte-worker service went into a reboot loop last Friday
> >   • The logs are never rotated and quickly use up all of the disk
> >
> > What might be the downsides of @joeybenamy's approach of deleting the temporal databases?
>
> After deleting the temporal databases, there is a chance of some running sync jobs getting stuck; more specifically, they can no longer be run or canceled. AFAIK you will have to reset the connection to fix it.

Yes, I should have mentioned that we don't do maintenance like this in Airbyte without stopping and pausing all syncs.

marcosmarxm commented 2 months ago

Hello all 👋 I reported this to the eng team. @joeybenamy are you still experiencing the issue?

joeybenamy commented 2 months ago

> Hello all 👋 I reported this to the eng team. @joeybenamy are you still experiencing the issue?

We have not encountered this issue in quite some time. Thanks for checking!

walker-philips commented 1 month ago

@marcosmarxm What was the final recommendation/solution for fixing this issue? Or will an official solution be included in the next release?

sivankumar86 commented 1 month ago

@marcosmarxm I have upgraded to 0.60.0, but I am still facing the rate limit error.

sivankumar86 commented 1 month ago

I increased some Temporal config values, which I got from the Temporal community, and reduced the number of workers (10 → 3). The error disappeared.

https://community.temporal.io/t/resource-exhausted-namespace-rate-limit-exceeded-for-cron-job/7583

    # when modifying, remember to update the docker-compose version of this file in temporal/dynamicconfig/development.yaml
    frontend.namespaceCount:
      - value: 4096
        constraints: {}
    frontend.namespaceRPS.visibility:
      - value: 100
        constraints: {}
    frontend.namespaceBurst.visibility:
      - value: 150
        constraints: {}
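    # frontend.namespaceRPS below is the per-namespace request rate cap at the Temporal
    # frontend; exceeding it yields the RESOURCE_EXHAUSTED "namespace rate limit exceeded"
    # error reported above (based on the Temporal docs, not confirmed in this thread).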
    frontend.namespaceRPS:
      - value: 76800
        constraints: {}
walker-philips commented 1 month ago

@sivankumar86 did you add these values to the ./temporal/dynamicconfig/development.yaml file? When I add these values, Airbyte fails to start correctly, throwing a ton of "Failed to resolve name" errors.

After upgrading to 0.60.0 we still encounter this. If it's related to the number of workers, here is our config:

    MAX_SYNC_WORKERS=10
    MAX_SPEC_WORKERS=10
    MAX_CHECK_WORKERS=10
    MAX_DISCOVER_WORKERS=10
    MAX_NOTIFY_WORKERS=5
    SHOULD_RUN_NOTIFY_WORKFLOWS=true
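
For context, in a Docker Compose deployment the dynamic config is typically mounted into the temporal container along these lines (a rough sketch from memory of the Airbyte compose file, not verified against any particular version; paths and the image tag may differ):

    airbyte-temporal:
      image: airbyte/temporal:${VERSION}
      environment:
        # Path is relative to the Temporal config directory inside the container.
        - DYNAMIC_CONFIG_FILE_PATH=config/dynamicconfig/development.yaml
      volumes:
        - ./temporal/dynamicconfig:/etc/temporal/config/dynamicconfig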

sivankumar86 commented 1 month ago

@walker-philips I meant the replica count. Find my config file below for reference if it helps. Verify using:

kubectl describe cm airbyte-oss-temporal-dynamicconfig # airbyte-oss is the name of the deployment
worker:
  enabled: true
  replicaCount: 3
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ include "common.names.fullname" . }}-dynamicconfig
  labels:
    {{- include "airbyte.labels" . | nindent 4 }}
data:
  "development.yaml": |
    # when modifying, remember to update the docker-compose version of this file in temporal/dynamicconfig/development.yaml
    frontend.namespaceCount:
      - value: 4096
        constraints: {}
    frontend.namespaceRPS.visibility:
      - value: 100
        constraints: {}
    frontend.namespaceBurst.visibility:
      - value: 150
        constraints: {}
    frontend.namespaceRPS:
      - value: 76800
        constraints: {}
    frontend.enableClientVersionCheck:
      - value: true
        constraints: {}
    history.persistenceMaxQPS:
      - value: 3000
        constraints: {}
    frontend.persistenceMaxQPS:
      - value: 5000
        constraints: {}
    frontend.historyMgrNumConns:
      - value: 30
        constraints: {}
    frontend.throttledLogRPS:
      - value: 200
        constraints: {}
    history.historyMgrNumConns:
      - value: 50
        constraints: {}
    system.advancedVisibilityWritingMode:
      - value: "off"
        constraints: {}
    history.defaultActivityRetryPolicy:
      - value:
          InitialIntervalInSeconds: 1
          MaximumIntervalCoefficient: 100.0
          BackoffCoefficient: 2.0
          MaximumAttempts: 0
    history.defaultWorkflowRetryPolicy:
      - value:
          InitialIntervalInSeconds: 1
          MaximumIntervalCoefficient: 100.0
          BackoffCoefficient: 2.0
          MaximumAttempts: 0
    # Limit for responses. This mostly impacts discovery jobs since they have the largest responses.
    limit.blobSize.error:
      - value: 15728640 # 15MB
        constraints: {}
    limit.blobSize.warn:
      - value: 10485760 # 10MB
        constraints: {}
sivankumar86 commented 1 month ago

@walker-philips Could you restart the Temporal pod after applying the changes, if you have not done so yet?
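
A hedged one-liner for that restart; the deployment name follows the chart's usual <release>-temporal pattern and is an assumption:

    # Deployment name and namespace are assumptions; check with `kubectl get deployments`.
    kubectl rollout restart deployment airbyte-oss-temporal -n <namespace>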

msenmurugan commented 1 month ago

@sivankumar86 Could you please explain how to inject new key-value pairs into the Temporal dynamicconfig ConfigMap via the Helm chart? I don't think it is supported by the Helm chart.

sivankumar86 commented 1 month ago

@msenmurugan I download the Helm chart and modify it before deploying it in the CI/CD pipeline.
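
A minimal sketch of that workflow; the chart repo URL and the location of the dynamicconfig template inside the chart are assumptions and depend on the chart version:

    # Pull the chart locally, patch it, then deploy the local copy.
    helm repo add airbyte https://airbytehq.github.io/helm-charts
    helm pull airbyte/airbyte --untar
    # edit the Temporal dynamicconfig ConfigMap template under ./airbyte, then:
    helm upgrade --install airbyte ./airbyte -n airbyte -f values.yaml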

lideke commented 1 week ago

@marcosmarxm any update on this issue? We have a similar issue each time we upgrade the Airbyte version. For now, I have to: