airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com

[worker] Possibility to define custom resource requests for `discover` job #38991

Open ivan-sukhomlyn opened 3 weeks ago

ivan-sukhomlyn commented 3 weeks ago

Topic

worker config

Relevant information

Could you extend the discover job configuration on the worker side, in the same manner as for check jobs, so that custom resources can be defined instead of the default ones?

The current behavior leads to overprovisioning of the Kubernetes cluster, because discover jobs fall back to the default resource requests (sized for replication jobs), which are higher than a discover job usually needs.

For example, the check job has this possibility - https://github.com/airbytehq/airbyte-platform/blob/main/airbyte-workers/src/main/resources/application.yml#L151

But there's no equivalent for discover jobs - https://github.com/airbytehq/airbyte-platform/blob/main/airbyte-workers/src/main/resources/application.yml#L154

Proposal

  worker:
    kube-job-configs:
...
      check:
        annotations: ${CHECK_JOB_KUBE_ANNOTATIONS:}
        labels: ${CHECK_JOB_KUBE_LABELS:}
        node-selectors: ${CHECK_JOB_KUBE_NODE_SELECTORS:}
        cpu-limit: ${CHECK_JOB_MAIN_CONTAINER_CPU_LIMIT:}
        cpu-request: ${CHECK_JOB_MAIN_CONTAINER_CPU_REQUEST:}
        memory-limit: ${CHECK_JOB_MAIN_CONTAINER_MEMORY_LIMIT:}
        memory-request: ${CHECK_JOB_MAIN_CONTAINER_MEMORY_REQUEST:}
      discover:
        annotations: ${DISCOVER_JOB_KUBE_ANNOTATIONS:}
        labels: ${DISCOVER_JOB_KUBE_LABELS:}
        node-selectors: ${DISCOVER_JOB_KUBE_NODE_SELECTORS:}
        cpu-limit: ${DISCOVER_JOB_MAIN_CONTAINER_CPU_LIMIT:}
        cpu-request: ${DISCOVER_JOB_MAIN_CONTAINER_CPU_REQUEST:}
        memory-limit: ${DISCOVER_JOB_MAIN_CONTAINER_MEMORY_LIMIT:}
        memory-request: ${DISCOVER_JOB_MAIN_CONTAINER_MEMORY_REQUEST:}
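
For illustration, here is a rough sketch of how the proposed variables could then be set on the worker container. The `DISCOVER_JOB_MAIN_CONTAINER_*` names are the ones proposed above, not existing settings, and the values are placeholders I picked for the example, not recommendations.

```yaml
# Hypothetical env overrides for the worker container.
# The DISCOVER_* variables only take effect if the proposal above is implemented;
# the quantities are illustrative and should be sized for your own catalogs.
env:
  - name: DISCOVER_JOB_MAIN_CONTAINER_CPU_REQUEST
    value: "250m"
  - name: DISCOVER_JOB_MAIN_CONTAINER_CPU_LIMIT
    value: "500m"
  - name: DISCOVER_JOB_MAIN_CONTAINER_MEMORY_REQUEST
    value: "512Mi"
  - name: DISCOVER_JOB_MAIN_CONTAINER_MEMORY_LIMIT
    value: "1Gi"
```

Leaving a variable unset would presumably keep today's behavior, since each placeholder in the proposal has an empty default.
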

marcosmarxm commented 3 weeks ago

Thanks for the request, @ivan-sukhomlyn. I've added it to the platform team backlog.

@davinchia now that there's no limit on reading large catalogs, maybe this is necessary to make it possible to not OOM during discover schema.