airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com
Other
16.3k stars 4.15k forks

[helm] Job tolerations are ignored #45903

Open joeybenamy opened 2 months ago

joeybenamy commented 2 months ago

Helm Chart Version

1.0.0

What step the error happened?

On deploy

Relevant information

On prior versions of the Helm Chart, tolerations set in Helm values are properly propagated to the job pods. In the new version, the tolerations in Helm values are not added to the job pods. As a result, our jobs cannot be scheduled.

In Helm values:

global:
  jobs:
    kube:
      tolerations:
      - key: "usage"
        operator: "Equal"
        value: "airbyte"
        effect: "NoExecute"
      nodeSelector:
        usage: airbyte

From the job pods:

  nodeSelector:
    usage: airbyte
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
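
For comparison, here is a sketch of the pod spec one would expect if the values above were propagated. It is an illustration built from the two snippets in this report, not output from a cluster. The last two entries are the default tolerations Kubernetes normally adds on its own, which is why they show up even though they are not in the Helm values.

```yaml
# Illustrative only: expected job pod scheduling fields, not captured from a real pod.
nodeSelector:
  usage: airbyte
tolerations:
# Expected from global.jobs.kube.tolerations (this entry is the one missing above)
- key: usage
  operator: Equal
  value: airbyte
  effect: NoExecute
# Default tolerations added by Kubernetes, already present on the actual pod
- effect: NoExecute
  key: node.kubernetes.io/not-ready
  operator: Exists
  tolerationSeconds: 300
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  operator: Exists
  tolerationSeconds: 300
```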

Relevant log output

Pod airbyte/source-postgres-check-16279-1-tefxt can't be scheduled on eks-airbyte-uat-20240307203630429500000001-d4c70d50-dd55-3f1f-3a66-11baf39a636f, predicate checking error: node(s) had untolerated taint {usage: airbyte}; predicateName=TaintToleration; reasons: node(s) had untolerated taint {usage: airbyte}; debugInfo=taints on node: []v1.Taint{v1.Taint{Key:"usage", Value:"airbyte", Effect:"NoExecute", TimeAdded:}}

marcosmarxm commented 2 months ago

@joeybenamy which version were you using before?

joeybenamy commented 2 months ago

> @joeybenamy which version were you using before?

0.344.2

marcosmarxm commented 2 months ago

@airbytehq/platform-deployments fyi

abuchanan-airbyte commented 2 months ago

This may have been fixed by ~https://github.com/airbytehq/airbyte-platform/commit/57319f7ebc8626ca93b600e6c593e78fd24a705d~ (oops, wrong link) https://github.com/airbytehq/airbyte-platform/commit/2ed01e554d576bd60011583ea988aeac8980f2f0

alexremn commented 1 month ago

This seems like a duplicate of https://github.com/airbytehq/airbyte/issues/28389. @abuchanan-airbyte thank you, awaiting the release!

joeybenamy commented 1 month ago

> This seems like a duplicate of #28389. @abuchanan-airbyte thank you, awaiting the release!

Likewise. Thank you!

joeybenamy commented 1 month ago

Testing with Helm Chart 1.1.0 and Airbyte platform 1.1.0. Tolerations are still not present on job pods.

marcosmarxm commented 1 month ago

@abuchanan-airbyte and @tryangul fyi

joeybenamy commented 1 month ago

> @abuchanan-airbyte and @tryangul fyi

Any update on this? Is this a Helm Chart issue or an Airbyte platform issue?

marcosmarxm commented 1 month ago

This is a work in progress, @joeybenamy. We hope to have an update by the end of the week.

talha-naeem1 commented 3 weeks ago

I am also facing an issue with this. Can someone please confirm if it's fixed now?

AcidFlow commented 3 weeks ago

As far as I can see, a fix has been merged to the default branch, but it is not yet available as part of a release.

We internally built an image of workload-launcher from v1.1.0 with the fix cherry-picked. I can see the tolerations being propagated to the pod when using our custom image.

See: https://github.com/airbytehq/airbyte/issues/28389#issuecomment-2446514393

abuchanan-airbyte commented 3 weeks ago

You might try the latest nightly release version 1.1.0-dev-nightly-1730243169-7e1b11aeac (that's a helm chart version)

remisalmon commented 3 weeks ago

> You might try the latest nightly release version 1.1.0-dev-nightly-1730243169-7e1b11aeac (that's a helm chart version)

Has anyone tried this release version with global.jobs.kube.tolerations set on a cluster where all nodes are tainted? I tried on both AWS EKS and a local kind cluster and cannot get a "rce-postgres-check-" (new source) job pod scheduled on either.

joeybenamy commented 3 weeks ago

Just got an update from Airbyte:

> We're working on setting up the 1.2.0 release candidate today. Not sure what the official release date is, but it will be soon. In the meantime, nightly releases are available.

remisalmon commented 3 weeks ago

> You might try the latest nightly release version 1.1.0-dev-nightly-1730243169-7e1b11aeac (that's a helm chart version)

> Has anyone tried this release version with global.jobs.kube.tolerations set on a cluster where all nodes are tainted? I tried on both AWS EKS and a local kind cluster and cannot get a "rce-postgres-check-" (new source) job pod scheduled on either.

I figured out my issue with the job tolerations: the Helm chart values expect the operator to be set explicitly, which should not be necessary. From https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/: "The default value for operator is Equal."

This comes from https://github.com/airbytehq/airbyte-platform/commit/2ca3c4192793b15a1ccc2bfd644dd725c3a2903c#diff-3555dc77946bb010495d4a97b3060553f759452f781e0f5b54b3c8a37394c3b0R227, which was linked in https://github.com/airbytehq/airbyte/issues/28389.
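
A minimal sketch of Helm values that set the operator explicitly, assuming the same taint (key `usage`, value `airbyte`, effect `NoExecute`) as in the original report. Adjust the key, value, and effect to match the taints on your own nodes.

```yaml
global:
  jobs:
    kube:
      tolerations:
      - key: "usage"
        # Set explicitly: as described above, the chart did not apply the
        # toleration for me when this field was left to the Kubernetes default.
        operator: "Equal"
        value: "airbyte"
        effect: "NoExecute"
```

With plain Kubernetes manifests, leaving out `operator` is equivalent to `operator: Equal`, so this workaround should only be needed for the chart and platform versions discussed in this thread.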

joeybenamy commented 2 weeks ago

In Airbyte 1.2.0 and Helm chart 1.2.0, this issue appears to be fixed, but now using S3 for logs and state seems to be broken: https://github.com/airbytehq/airbyte/issues/48407

So I'm still stuck.