airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com

[helm] Excessive cpu and memory limits in job containers #35749

Open niclasgrahm opened 7 months ago

niclasgrahm commented 7 months ago

Helm Chart Version

0.50.48

What step the error happened?

During the Sync

Relevant information

Hello!

I am running the Helm chart on a bare-metal Kubernetes cluster. My problem is that when a destination-<xxx> pod starts, the resource limits on some of the containers in that pod are set too high. I cannot see a way to configure these values in values.yaml, through environment variables, or anywhere similar.

Specifically, when a destination pod starts, it has five containers in it.

The remote-stdin container has a CPU limit of 2! See attached screenshot. For my use case, this is unacceptably high.

As far as I can see, I can set the limits and requests for the main container in values.yaml, but not for the other four containers.
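
For reference, the only resource knobs I can find in values.yaml look something like this (a sketch with illustrative values; exact keys may differ between chart versions), and they only seem to affect the main container:

global:
  jobs:
    resources:
      requests:
        cpu: 250m       # illustrative values, not the chart defaults
        memory: 256Mi
      limits:
        cpu: "1"
        memory: 1Gi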

Relevant log output

No response

rowanmoul commented 7 months ago

I have the same issue. I have Airbyte running on a dedicated 4-core, 16 GB RAM Kubernetes node, but the source and/or destination pods are created with a CPU request of 1.0 each, and one or the other of them can't be scheduled because there aren't enough resources available. Would the whole process fail if it was CPU constrained? This is an ETL tool, not a real-time data streaming tool. I don't need it to be fast; I just need it to be able to run at all. For now, I have worked around this issue by allowing Azure Kubernetes to scale the dedicated node pool up to two nodes during a sync (it will auto-scale back down ~10 minutes after), but this sort of scaling action should be the exception, not the rule, and it only works for us because we sync once a day, not once an hour.

TheStanHo commented 7 months ago

I am also having the same issue, even when setting the global.jobs.resources requests and limits. Those values.yaml settings set the JOB_MAIN_CONTAINER_CPU_REQUEST/JOB_MAIN_CONTAINER_MEMORY_REQUEST environment variables on the job pods, but they seem to have no effect on the job pods, as mentioned.
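
For reference, these are the environment variables I'd expect the chart to render onto the job pods (an illustrative sketch of the rendered pod spec with placeholder values, not verified against the chart):

env:
  - name: JOB_MAIN_CONTAINER_CPU_REQUEST     # from global.jobs.resources.requests.cpu
    value: "250m"
  - name: JOB_MAIN_CONTAINER_MEMORY_REQUEST  # from global.jobs.resources.requests.memory
    value: "256Mi"
  - name: JOB_MAIN_CONTAINER_CPU_LIMIT       # from global.jobs.resources.limits.cpu
    value: "2"
  - name: JOB_MAIN_CONTAINER_MEMORY_LIMIT    # from global.jobs.resources.limits.memory
    value: "2Gi"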

Like @rowanmoul mentioned, we could enable autoscaling on the node pool, but that is not a long-term solution, especially if sync jobs run often.

TheStanHo commented 7 months ago

Hey @rowanmoul / @niclasgrahm, I noticed that when I moved my deployment to another node pool, the values in the Helm chart values.yaml for global.jobs.resources.requests (which set the JOB_MAIN_CONTAINER_CPU_REQUEST/JOB_MAIN_CONTAINER_MEMORY_REQUEST environment variables) were picked up. So I'm guessing you just need to restart the deployment and run the sync jobs again to see if it also works for you. I could finally see those environment variables set on the destination-, source- and orchestrator- pods, and they were also picked up in the resources when I viewed the deployment.yaml for them.
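
Once they're picked up, the main container of those job pods should show a corresponding resources block, roughly like this (illustrative, not an exact dump from my cluster):

containers:
  - name: main              # the job's main container
    resources:
      requests:
        cpu: 250m           # from JOB_MAIN_CONTAINER_CPU_REQUEST
        memory: 256Mi       # from JOB_MAIN_CONTAINER_MEMORY_REQUEST
      limits:
        cpu: "2"            # from JOB_MAIN_CONTAINER_CPU_LIMIT
        memory: 2Gi         # from JOB_MAIN_CONTAINER_MEMORY_LIMIT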

joeybenamy commented 7 months ago

I always helm uninstall airbyte before deploying new values because I've had so many issues with values not taking. Our Airbyte postgres is external so we don't lose anything important by helm uninstalling airbyte each time. I've avoided a lot of headaches by doing this.

tautvydas-v commented 4 months ago

Hey @TheStanHo @rowanmoul - how exactly was this configuration picked up? I'm deploying Airbyte on Kubernetes myself, but the main issue is with the same containers mentioned in this issue. Setting JOB_MAIN_CONTAINER_CPU_REQUEST or JOB_MAIN_CONTAINER_CPU_LIMIT doesn't do much and doesn't override the pods' configuration, which is 2 CPUs each, consuming over 6 CPUs in total, which is overkill. Did you have any luck with any other deployment method? Also, helm uninstall doesn't help - the outcome is the same.

TheStanHo commented 4 months ago

Hey @tautvydas-v

global:
  jobs:
    resources:
      requests:
        memory: 256Mi
        cpu: 250m
      limits:
        memory: 2Gi
        cpu: 2

I think you misunderstood my comment: if you set the above in values.yaml (double-check the indentation) and then uninstall and reinstall, it should set the environment variables you see on the pod, JOB_MAIN_CONTAINER_CPU_REQUEST and JOB_MAIN_CONTAINER_CPU_LIMIT.

tautvydas-v commented 4 months ago

Hey @TheStanHo, thanks for the quick reply!

Thanks for the clarification. This is what I have set up now too:

global:
  jobs:
    resources:
      requests:
        cpu: 100m
        memory: 256Mi
      limits:
        cpu: 100m
        memory: 1Gi

But even then, I can see that socat creates three containers (relay-stderr, relay-stdout and call-heartbeat-server), each with a limit of 2 CPUs. I tried setting the SOCAT_KUBE_CPU_REQUEST and SOCAT_KUBE_CPU_LIMIT env vars, and that worked for relay-stdout, which now has a limit of 0.1 CPU, but relay-stderr and call-heartbeat-server still have a limit of 2 CPUs, which is still a lot. Maybe you had the same issue?
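
For completeness, the way I'm passing those socat variables is through the worker's extra environment variables in values.yaml, roughly like this (assuming the chart exposes an extraEnv list on the worker; that key name is an assumption and may differ by chart version):

worker:
  extraEnv:                          # assumed chart key for injecting extra env vars
    - name: SOCAT_KUBE_CPU_REQUEST
      value: "0.1"
    - name: SOCAT_KUBE_CPU_LIMIT
      value: "0.1"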

TheStanHo commented 4 months ago

Hey @tautvydas-v, not sure I can help much there; I don't think I used socat or had issues with it. The issue I raised here was just the job container limits/requests not being picked up, but once I did the above it worked for me and I didn't have any other issues crop up.

rowanmoul commented 4 months ago

I never tried to set any limits; I just let the auto-scaling do its thing. The anecdote about values not being applied without a complete uninstall rings true for me too. Though I haven't done any direct testing, that could explain some of the issues I have seen in the past that were magically resolved when I replaced the install for unrelated reasons. We also use an external Postgres server, which does make things a lot easier.

RaymondvdW-AB commented 4 months ago

I have the same issue. Both the source and destination pods have too high a CPU request. I am able to change the CPU request of the main containers using the environment variables in values.yaml, but not of the stdin and stdout containers, which each have a CPU request of 500m. If only there were a way to configure these values.

FransDel commented 3 months ago

I had the same issue, which I reported here; I think they are similar: https://github.com/airbytehq/airbyte/issues/34897

rowanmoul commented 1 month ago

The latest update on this issue: now that sync jobs run as a single pod (with three containers) rather than separate pods for source and destination, the total CPU requested by the pod is 4 (4000m), which can't be scheduled even on a new 4-core node (since the amount available to pods is 3860m). Now I'm going to be forced to try to set some limits here...

rowanmoul commented 1 month ago

> I always helm uninstall airbyte before deploying new values because I've had so many issues with values not taking. Our Airbyte postgres is external so we don't lose anything important by helm uninstalling airbyte each time. I've avoided a lot of headaches by doing this.

I figured out why this is needed. The workload resource request values are set in a config map, and then those values are mapped to environment variables on the various Airbyte pods. The problem is that changing the config map doesn't cause Kubernetes to re-create the pods with updated environment variable values. Manually deleting each of the pods and letting Kubernetes re-create them solved the issue without resorting to a full re-install.
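
To illustrate: the pods reference the config map values at creation time, roughly like this (a simplified sketch, not the actual chart output; the config map name is hypothetical), so an updated config map only takes effect on pods created after the change:

containers:
  - name: airbyte-worker
    env:
      - name: JOB_MAIN_CONTAINER_CPU_REQUEST
        valueFrom:
          configMapKeyRef:
            name: airbyte-env        # hypothetical config map name
            key: JOB_MAIN_CONTAINER_CPU_REQUEST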

> I have the same issue. Both the source and destination pods have too high a CPU request. I am able to change the CPU request of the main containers using the environment variables in values.yaml, but not of the stdin and stdout containers, which each have a CPU request of 500m. If only there were a way to configure these values.

It looks like the values set in global.jobs.resources now apply to all containers in the newly combined "replication" pod, which contains containers for orchestration, source, and dest.

I wouldn't consider this issue resolved until the high resource requests in the default state are addressed, though.