airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com

[helm] worker pod is crashing after upgrading to 0.49.6+ (latest update: missing env variables. see reply for detail) #31988

Open sc-yan opened 8 months ago

sc-yan commented 8 months ago

What method are you using to run Airbyte?

Kubernetes

Platform Version or Helm Chart Version

helm 0.49.9

What step the error happened?

Upgrading the Platform or Helm Chart

Relevant information

When upgrading the Helm chart from 0.49.6 to 0.49.8/0.49.9, the worker pod keeps crashing. But if I revert back to 0.49.6, it's fine.

Relevant log output

2023-10-31 00:57:57 ERROR i.m.r.Micronaut(handleStartupException):338 - Error starting Micronaut server: Error instantiating bean of type  [io.airbyte.workers.orchestrator.KubeOrchestratorHandleFactory]

Path Taken: new ApplicationInitializer() --> ApplicationInitializer.syncActivities --> List.syncActivities([ReplicationActivity replicationActivity],NormalizationActivity normalizationActivity,DbtTransformationActivity dbtTransformationActivity,NormalizationSummaryC
io.micronaut.context.exceptions.BeanInstantiationException: Error instantiating bean of type  [io.airbyte.workers.orchestrator.KubeOrchestratorHandleFactory]

Path Taken: new ApplicationInitializer() --> ApplicationInitializer.syncActivities --> List.syncActivities([ReplicationActivity replicationActivity],NormalizationActivity normalizationActivity,DbtTransformationActivity dbtTransformationActivity,NormalizationSummaryC
    at io.micronaut.context.DefaultBeanContext.resolveByBeanFactory(DefaultBeanContext.java:2367) ~[micronaut-inject-3.10.1.jar:3.10.1]
    at io.micronaut.context.DefaultBeanContext.doCreateBean(DefaultBeanContext.java:2305) ~[micronaut-inject-3.10.1.jar:3.10.1]
    at io.micronaut.context.DefaultBeanContext.doCreateBean(DefaultBeanContext.java:2251) ~[micronaut-inject-3.10.1.jar:3.10.1]
    at io.micronaut.context.DefaultBeanContext.createRegistration(DefaultBeanContext.java:3016) ~[micronaut-inject-3.10.1.jar:3.10.1]
    at io.micronaut.context.SingletonScope.getOrCreate(SingletonScope.java:80) ~[micronaut-inject-3.10.1.jar:3.10.1]
    at io.airbyte.workers.Application.main(Application.java:15) ~[io.airbyte-airbyte-workers-0.50.33.jar:?]
Caused by: java.lang.IllegalArgumentException
    at com.google.common.base.Preconditions.checkArgument(Preconditions.java:131) ~[guava-31.1-jre.jar:?]
    at io.airbyte.config.storage.DefaultS3ClientFactory.validateBase(DefaultS3ClientFactory.java:36) ~[io.airbyte.airbyte-config-config-models-0.50.33.jar:?]
    at io.airbyte.config.storage.DefaultS3ClientFactory.validate(DefaultS3ClientFactory.java:31) ~[io.airbyte.airbyte-config-config-models-0.50.33.jar:?]
    at io.airbyte.config.storage.DefaultS3ClientFactory.<init>(DefaultS3ClientFactory.java:24) ~[io.airbyte.airbyte-config-config-models-0.50.33.jar:?]
    at io.airbyte.workers.storage.S3DocumentStoreClient.s3(S3DocumentStoreClient.java:59) ~[io.airbyte-airbyte-commons-worker-0.50.33.jar:?]
    at io.airbyte.workers.storage.StateClients.create(StateClients.java:27) ~[io.airbyte-airbyte-commons-worker-0.50.33.jar:?]
    at io.airbyte.workers.config.ContainerOrchestratorConfigBeanFactory.kubernetesContainerOrchestratorConfig(ContainerOrchestratorConfigBeanFactory.java:91) ~[io.airbyte-airbyte-workers-0.50.33.jar:?]
    at io.airbyte.workers.config.$ContainerOrchestratorConfigBeanFactory$KubernetesContainerOrchestratorConfig0$Definition.build(Unknown Source) ~[io.airbyte-airbyte-workers-0.50.33.jar:?]
    at io.micronaut.context.DefaultBeanContext.resolveByBeanFactory(DefaultBeanContext.java:2354) ~[micronaut-inject-3.10.1.jar:3.10.1]
    ... 81 more
msaffitz commented 8 months ago

We are seeing this error as well. Downgrading to 0.49.6 fixed the issue for us.

adamstrawson commented 8 months ago

We're also experiencing this within GKE on the latest version 0.49.10

lydialimsetel commented 8 months ago

Experiencing this same issue on the latest version, 0.49.18. Downgraded to helm chart 0.49.5, and it works fine now.

cappadona commented 8 months ago

We're also running into this if we upgrade the chart past 0.49.6

joeybenamy commented 8 months ago

Same issue with all charts after 0.49.6

szemek commented 7 months ago

After providing some environment variables in values.yaml I got it working. Here's what part of my configuration looks like:

  ##  worker.extraEnv [array] Additional env vars for worker pod(s).
  ## Example:
  ##
  ## extraEnv:
  ## - name: JOB_KUBE_TOLERATIONS
  ##   value: "key=airbyte-server,operator=Equals,value=true,effect=NoSchedule"
  extraEnv:
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          key: AWS_ACCESS_KEY_ID
          name: airbyte-airbyte-secrets
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          key: AWS_SECRET_ACCESS_KEY
          name: airbyte-airbyte-secrets
    - name: STATE_STORAGE_S3_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          key: AWS_ACCESS_KEY_ID
          name: airbyte-airbyte-secrets
    - name: STATE_STORAGE_S3_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          key: AWS_SECRET_ACCESS_KEY
          name: airbyte-airbyte-secrets
    - name: STATE_STORAGE_S3_BUCKET_NAME
      value: ${STATE_STORAGE_S3_BUCKET_NAME}
    - name: STATE_STORAGE_S3_REGION
      value: ${STATE_STORAGE_S3_REGION}

Check here for the environment variables you might be missing: https://github.com/airbytehq/airbyte-platform/blob/9ffa4e9f44f06e65fe3b138204367d5da8c98f2c/airbyte-config/config-models/src/main/java/io/airbyte/config/EnvConfigs.java#L133-L142

sc-yan commented 7 months ago

@szemek thank you so much for the info! I followed your approach and it worked! Running 0.49.23 now. Anyone who still has issues, please try the approach above. I'm going to keep the issue open in case someone is looking for an answer, but please feel free to close it if you think no further action is needed.

prafulauto1 commented 7 months ago

For maintaining state with S3, I was able to resolve it by simply adding these two environment variables in the worker section of the values file under extraEnv (roughly as sketched below).

I found this here
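
A minimal sketch of what that could look like, assuming the two variables are the S3 state-storage bucket and region used elsewhere in this thread (a guess; the exact variables and values may differ in your setup):

    worker:
      extraEnv:
        # Hypothetical: assumes the two variables are the bucket name and region,
        # matching szemek's configuration earlier in this thread.
        - name: STATE_STORAGE_S3_BUCKET_NAME
          value: "<your-state-bucket>"
        - name: STATE_STORAGE_S3_REGION
          value: "<your-region>"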

HatemLar commented 7 months ago

Any idea how to fix it on an EC2 deployment?

sc-yan commented 7 months ago

@HatemLar the Helm chart is meant to be used on k8s. I assume you are deploying Airbyte with docker etc. on EC2? Try setting up the same env variables as above, following this guide: https://docs.airbyte.com/deploying-airbyte/on-aws-ec2

HatemLar commented 7 months ago

@sc-yan thank you for your help! Yes, deployed with Docker on EC2, and we did follow that guide. Do you think we should declare these variables on the instance or in the docker-compose file?

sc-yan commented 7 months ago

@HatemLar it really depends on how you want to manage your infra/deployment. Generally, Docker acts like a VM, so the app is not supposed to read values from the host machine (which is EC2 in your case) unless you mount some volume into the container. It's common to set these env variables in the docker-compose file, but if you have special cases, feel free to adjust, for example along the lines sketched below.
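
A minimal sketch of what that could look like in a compose override, assuming the standard Airbyte docker-compose setup with a service named worker and the values defined in the adjacent .env file (adjust names to your deployment):

    # docker-compose.override.yml -- hypothetical sketch, not the official file.
    # Assumes a "worker" service as in the standard Airbyte compose file and that
    # the referenced values are defined in the .env file next to it.
    version: "3.8"
    services:
      worker:
        environment:
          - STATE_STORAGE_S3_BUCKET_NAME=${STATE_STORAGE_S3_BUCKET_NAME}
          - STATE_STORAGE_S3_REGION=${STATE_STORAGE_S3_REGION}
          - STATE_STORAGE_S3_ACCESS_KEY=${AWS_ACCESS_KEY_ID}
          - STATE_STORAGE_S3_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}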

PurseChicken commented 7 months ago

Using the below works (for GCS) since the values should likely already be in your configMap if you specified them in global.gcs:

    extraEnv:
      - name: STATE_STORAGE_GCS_BUCKET_NAME
        valueFrom:
          configMapKeyRef:
            key: GCS_LOG_BUCKET
            name: airbyte-airbyte-env
      - name: STATE_STORAGE_GCS_APPLICATION_CREDENTIALS
        valueFrom:
          configMapKeyRef:
            key: GOOGLE_APPLICATION_CREDENTIALS
            name: airbyte-airbyte-env

That fixes the worker pod issue; however, I then ran into the following with the replication orchestrator. It is not seen until you try to sync a connection.

https://github.com/airbytehq/airbyte/issues/32203

lucasfcnunes commented 7 months ago

> Using the below works (for GCS) since the values should likely already be in your configMap if you specified them in global.gcs: [...]
>
> That fixes the worker pod issue, however then I ran into the following with replication orchestrator. This is not seen until you try to sync a connection. #32203

global.gcs.extraEnv doesn't affect the templates.

PurseChicken commented 7 months ago

> global.gcs.extraEnv doesn't affect the templates.

What I wrote is specific to the worker key in values:

worker.extraEnv
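
That is, the GCS snippet nests under the worker key in values.yaml, roughly like this (same names as in the snippet above; a sketch, not a complete values file):

    worker:
      extraEnv:
        - name: STATE_STORAGE_GCS_BUCKET_NAME
          valueFrom:
            configMapKeyRef:
              key: GCS_LOG_BUCKET
              name: airbyte-airbyte-env
        - name: STATE_STORAGE_GCS_APPLICATION_CREDENTIALS
          valueFrom:
            configMapKeyRef:
              key: GOOGLE_APPLICATION_CREDENTIALS
              name: airbyte-airbyte-env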

ilyasemenov84 commented 7 months ago

Is there a way to make it work with IRSA authentication (service account + IAM role)?

marcosmarxm commented 7 months ago

Hello all 👋 sorry for the missing update here. I shared this with the engineering team and will return here with any update.

booleanbetrayal commented 6 months ago

Just to note that this appears to be the same solution to remediate #18016.

raphaelauv commented 5 months ago

This works with airbyte/worker:0.50.47 and helm chart 0.53.52:


minio:
  enabled: false

worker:
  extraEnv:
    - name: STATE_STORAGE_S3_BUCKET_NAME
      value: "XXYYZZ"
    - name: STATE_STORAGE_S3_REGION
      value: "eu-west-3"
    - name: S3_MINIO_ENDPOINT
      value: ""

global:

  log4jConfig: "log4j2-no-minio.xml"
  state:
    storage:
      type: "S3"
  logs:
    storage:
      type: "S3"
    minio:
      enabled: false
    s3:
      enabled: true
      bucket: "XXYYZZ"
      bucketRegion: "eu-west-3"
    accessKey:
      existingSecret: "airbyte-aws-creds"
      existingSecretKey: "AWS_ACCESS_KEY_ID"
    secretKey:
      existingSecret: "airbyte-aws-creds"
      existingSecretKey: "AWS_SECRET_ACCESS_KEY"
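
For reference, applying a values file like the above would look roughly as follows, assuming the standard Airbyte Helm chart repository (release name and file path are placeholders; adjust to your setup):

    # Hypothetical usage sketch; the release name and values file path are placeholders.
    helm repo add airbyte https://airbytehq.github.io/helm-charts
    helm repo update
    helm upgrade --install airbyte airbyte/airbyte --version 0.53.52 -f values.yaml
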
StefanTUI commented 4 months ago

I can confirm that the settings from @raphaelauv helped me get the worker pods running again.

I'm using helm chart 0.53.120 with airbyte/server:0.50.48

The server pod runs, but had error messages similar to those in the worker logs.

Adding the following to my yml helped mitigate this:

server:
  extraEnv:
    - name: LOG4J_CONFIGURATION_FILE
      valueFrom:
        configMapKeyRef:
          name: airbyte-env
          key: LOG4J_CONFIGURATION_FILE
sg-danl commented 4 months ago

(Duplicate comment as previous issue is closed) I've been pinning version 0.49.6 to get around this for the past month and a half.
(Running Airbyte OSS on AWS EKS cluster, default values.yaml for ease of replication while trying to fix.)

Trying the fix suggested by @marcosmarxm doesn't work for me. I've been attempting to upgrade from 0.49.6 -> latest since mid-January (so 0.50.22+), and it has never fixed the issue.

Running the minio config in bash returns:


helm % kubectl exec -it airbyte-minio-0 bash -n default
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
bash-5.1# mc alias set myminio http://localhost:9000 minio minio123
mc: Configuration written to `/tmp/.mc/config.json`. Please update your access credentials.
mc: Successfully created `/tmp/.mc/share`.
mc: Initialized share uploads `/tmp/.mc/share/uploads.json` file.
mc: Initialized share downloads `/tmp/.mc/share/downloads.json` file.
Added `myminio` successfully.
bash-5.1# mc mb myminio/state-storage
mc: <ERROR> Unable to make bucket `myminio/state-storage`. Your previous request to create the named bucket succeeded and you already own it.

Not an expert in any of this at all, but it looks like the creation of the bucket isn't entirely the issue. Just wanted to provide additional info as this has been a long-open issue!

Edited to add:
Force-removing the bucket (on 0.54.15) seems to show that the bucket is recreated almost instantaneously.


bash-5.1# mc rb myminio/state-storage
mc: <ERROR> `myminio/state-storage` is not empty. Retry this command with ‘--force’ flag if you want to remove `myminio/state-storage` and all its contents 
bash-5.1# mc rb myminio/state-storage --force
Removed `myminio/state-storage` successfully.
bash-5.1# mc mb myminio/state-storage
mc: <ERROR> Unable to make bucket `myminio/state-storage`. Your previous request to create the named bucket succeeded and you already own it.
bash-5.1# mc rb myminio/state-storage --force
Removed `myminio/state-storage` successfully.
bash-5.1# mc rb myminio/state-storage --force
Removed `myminio/state-storage` successfully.

Edit again:
This only occurs with the PostgreSQL source connection. Our S3->S3 jobs can run as expected in versions beyond 0.49.6.