GoogleCloudPlatform / k8s-config-connector

GCP Config Connector, a Kubernetes add-on for managing GCP resources
https://cloud.google.com/config-connector/docs/overview
Apache License 2.0

Allow to change resource limits on pods deployed via the operator #240

Open snuggie12 opened 4 years ago

snuggie12 commented 4 years ago

If you install the operator and create a configconnectors.core.cnrm.cloud.google.com resource, you are unable to change the resource requests/limits on the stats recorder and other pods.

It would be nice to have this functionality in the CRD, or, if it already happens to exist as an annotation, to have that documented. Currently my stats recorder is OOM'ing. See #239 for others also having OOM issues.

jcanseco commented 4 years ago

Hi @snuggie12, thanks. We're currently looking into the OOM issue since we understand it's causing problems.

Regarding your request to add the ability to customize resource limits via the operator: we believe that ideally, users should not really have to think about setting these values themselves. We (the KCC team) should really be setting these values correctly so that our users would not have to worry about them at all. This way, we are able to keep the operator somewhat simple to use for our users. That said, if there is enough demand for this functionality, then we are open to revisiting the topic.

snuggie12 commented 4 years ago

@jcanseco I can appreciate the sentiment, but I think it should be considered for a few reasons:

All of these reasons can lead to a customer either having a bitter sentiment towards the product (i.e. "Look at all these extra steps I have to do to get around a design decision from a closed-source product,") or worse not adopting the product at all.

jcanseco commented 4 years ago

Thanks @snuggie12, those are all great points. We've deliberated internally and have come to an agreement that we should work to support the ability to configure resource limits via the operator in the future. I have no details yet, but it's in our backlog. Thanks for elaborating on the details of your use-case!

Muni10 commented 3 years ago

@jcanseco Hi, we are also struggling with OOM in a few of the components created by the operator. Do you have any estimate on when this can be made available? Currently we are also considering removing the operator and configuring the resources ourselves.

jcanseco commented 3 years ago

Hi @Muni10, we probably won't be getting to this for a while, so no estimates yet. Our current approach is to have KCC scale better in general and fix any OOM issues. If you have any OOM issues, please do report them; they will be prioritized, as we consider such issues high priority.

If you cannot afford to wait for such fixes, then installing KCC without the operator is an option, but please note that this is no longer an installation method that we support.

josephhholmes commented 3 years ago

@jcanseco bumping this as we are running into issues with operator version 1.32.0 where the cnrm-webhook-manager is scaled out to max (10) and CPU is running just high enough that it won't scale down. The webhook manager seems to have the CPU request and limit set to 0.04 CPU (40m), which seems very low, especially when the HPA is set to scale at 40% CPU (meaning, on average, 16m, or 0.016 of a CPU). We have 117 resources using the operator in our cluster. It would be great if we had a way to adjust the resources.

jcanseco commented 3 years ago

+@spew for visibility regarding the webhook issue.

Hi @josephhholmes, unfortunately we don't yet have support for overriding resource limits. One thing you could do as a workaround is to disable the operator so that you can then change the resource limits without the operator changing them back. You can disable the operator by setting the StatefulSet configconnector-operator to have 0 replicas.
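A minimal sketch of that workaround, assuming the default install locations (the operator StatefulSet in the configconnector-operator-system namespace and the webhook Deployment in cnrm-system; adjust the names if your install differs):

```sh
# Stop the operator so it no longer reverts manual changes to the KCC components.
kubectl scale statefulset configconnector-operator \
  --namespace configconnector-operator-system --replicas=0

# Then adjust the requests/limits on the component that is struggling, e.g. the webhook.
kubectl edit deployment cnrm-webhook-manager --namespace cnrm-system
```

Keep in mind that with the operator scaled down, upgrades and other reconciliation also stop until you scale it back up.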

spew commented 3 years ago

Hi @josephhholmes, are you running in namespaced mode or cluster mode? If namespaced mode, how many namespaces are under management by Config Connector? (A namespace with a ConfigConnectorContext is under management.)
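If it helps, a quick way to check, assuming the standard CRDs: namespaced mode has one ConfigConnectorContext per managed namespace, while cluster mode has a single cluster-scoped ConfigConnector.

```sh
# Namespaced mode: list the namespaces under management.
kubectl get configconnectorcontexts --all-namespaces

# Cluster mode: a single cluster-scoped ConfigConnector resource exists instead.
kubectl get configconnectors
```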

emarcotte commented 3 years ago

@spew (I work with Joseph) we are in cluster mode.

toumorokoshi commented 3 years ago

Thanks for the info! In the case of the webhook, cluster vs. namespaced mode shouldn't matter much, but it does affect the controller-manager instances (which are spun up one per namespace in namespaced mode, or as a single instance in cluster mode).

The webhook manager seems to have the CPU request and limit set to 0.04 CPU (40m), which seems very low, especially when the HPA is set to scale at 40% CPU (meaning, on average, 16m, or 0.016 of a CPU). We have 117 resources using the operator in our cluster. It would be great if we had a way to adjust the resources.

Agreed that the long-term strategy is to enable better configuration via the operator. But a short-term fix may be to just adjust the hard-coded default values in the operator to allow a higher maximum of 400m of CPU.

I'll investigate both and give an update.

toumorokoshi commented 3 years ago

Hello,

we spoke about the issue and agreed that we'll raise the limits, or remove the CPU limits altogether. I'm working on a change to get this into the next Config Connector operator release.

Unfortunately, the turnaround for the add-on to pick up this fix is 3-4 weeks. So as a quicker workaround, a manual installation of the operator once the next version ships would be best.
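For reference, the manual installation is roughly the following (a sketch based on the public install docs; the bundle URL and file layout are assumptions and may differ by version):

```sh
# Download and unpack the standalone operator release bundle.
gsutil cp gs://configconnector-operator/latest/release-bundle.tar.gz release-bundle.tar.gz
tar zxvf release-bundle.tar.gz

# Install the operator; pick the manifest matching your auth setup
# (the Workload Identity variant is shown here).
kubectl apply -f operator-system/configconnector-operator.yaml
```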

toumorokoshi commented 3 years ago

As I was working through this, I wanted to clarify an assumption: Config Connector currently considers hitting the webhook HPA limit to be a normal part of usage under load, rather than incorrect behavior.

The pod limits should be higher, but if Config Connector hits an HPA limit, it could still be operating normally until we start seeing webhook calls slowing down or failing.

So it's not recommended to set up metrics and alert, for example, when the HPA hits its maximum. Is the underlying issue more that the webhook is not operating correctly, or that you're trying to avoid an alert set on HPA maximums?

emarcotte commented 3 years ago

What classifies as under load? We observe that this thing basically scales to max for days on end with only around 100 resources. Since I pinged the thread, one of our envs has been in this state.

It definitely is working, but it seems like we can't have a generic alert for the cluster about HPAs that need review for tuning if this one always runs at 100% of its limit.

toumorokoshi commented 3 years ago

What classifies as under load? We observe that this thing basically scales to max for days on end with only around 100 resources. Since I pinged the thread, one of our envs has been in this state.

100 resources is certainly expected load, and the webhook shouldn't hit that limit with such a small number. To that end, we will be updating the resource limits, hoping to get that in by the next release.

It definitely is working, but it seems like we can't have a generic alert for the cluster about HPAs that need review for tuning if this one always runs at 100% of its limit.

I think it's reasonable to expect that the HPA doesn't hit its limit immediately after a small amount of usage. That said, our HPA will be keyed off CPU requests, with no CPU limit set. Thus, although a saturated webhook will always be at the HPA maximum, being at the HPA maximum doesn't always imply the webhook is operating improperly.

I just wanted to clarify that although we can make a best effort to only reach the HPA max when we are truly getting close to saturation, it's not a 100% reliable signal of an oversubscribed webhook.

Does that help?

toumorokoshi commented 3 years ago

Hi! As an update:

This missed the last release, but the new version of the operator with increased requests and pod count for the webhook will be in KCC 1.42.0.

In my testing, the pods didn't scale past the minimum of 2 (previously they scaled up to 6 instances instantly).

Custom limits are still being prioritized, but I hope this mitigates the pain.

emarcotte commented 3 years ago

Sorry I apparently didn't reply to prev comment! Yes it does help and thanks for the update!

rwkarg commented 3 years ago

I'm seeing this occur after first enabling Config Connector. No GCP resources have yet been defined and the webhook pods are pegging their CPU limit and immediately scaling to max (10) instances. I would expect the CPU usage to be virtually zero if there are no resources to manage.

GKE version: 1.18.16-gke.302 cnrm.cloud.google.com/version: 1.39.0

toumorokoshi commented 3 years ago

Hi! @rwkarg! Can you upgrade your KCC version to 1.42? As mentioned above, that was the first version that overhauled the requests and limits to ease this significantly.

rwkarg commented 3 years ago

I'm not sure how to upgrade it as it's managed as an addon, though that doesn't seem to address the underlying excessive CPU usage. Is it expected that with zero GCP resources to manage that the cnrm-webhook-manager would be anything other than idle? What are the pods doing pegging their CPU limit for days with nothing to manage? If this is better as a separate issue (since there is no load in my case) then I can open a separate issue.

toumorokoshi commented 3 years ago

I'm not sure how to upgrade it as it's managed as an addon,

With the GKE add-on, you would either have to move to a more aggressive release channel (such as rapid) to pick up updates more quickly, or switch to the manual operator installation.

We're looking at some ways to bring newer versions to the add-on faster, but there's no ETA yet for that work.

Though that doesn't seem to address the underlying excessive CPU usage. Is it expected that with zero GCP resources to manage that the cnrm-webhook-manager would be anything other than idle? What are the pods doing pegging their CPU limit for days with nothing to manage?

Before version 1.43, Config Connector had very conservative limits for the webhook, with 40m of CPU being the limit, and 40% of that (so 16m of CPU) triggering an auto-scale. In other words, it's very easy for the webhook to scale up.
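To make the numbers concrete, here is an illustrative sketch of those defaults (not the exact shipped manifests):

```yaml
# Pre-1.43 webhook defaults as described above: the container's CPU request and
# limit were both 40m, and the HPA scaled out once average utilization passed
# 40% of the request (~16m of CPU).
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: cnrm-webhook
  namespace: cnrm-system
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cnrm-webhook-manager
  maxReplicas: 10
  targetCPUUtilizationPercentage: 40
```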

That said, I haven't been able to reproduce webhook scaling limits being hit with no resources. With Config Connector 1.39 as an add-on on GKE 1.18.16-gke.502, I see near-zero CPU usage:

[screenshot: webhook CPU usage near zero]

With a fresh cluster, Config Connector won't work, since it requires feeding in credentials for a Google Service Account. I added a manifest, which does increase the CPU usage, but it doesn't hit the limit:

[screenshot: webhook CPU usage increased but still below the limit]

toumorokoshi commented 3 years ago

If this is better as a separate issue (since there is no load in my case) then I can open a separate issue.

I think a separate issue would be good! The title here doesn't really describe the issue you're having (CPU load with no usage).

paulrostorp commented 3 years ago

I just installed the manual operator on several small, non-GKE clusters a couple of weeks ago and now noticed my bill nearly doubled 😅.

The resource requests for the deletiondefender and controller-manager are way too high in my experience, especially for memory, where usage never gets above 10%...

Resource config for these components should be configurable on the ConfigConnector CRD. Right now, I'm considering rolling back to v1.42 or finding an alternative for these small clusters I'm running 😞

toumorokoshi commented 3 years ago

I just installed the manual operator on several small, non-GKE clusters a couple of weeks ago and now noticed my bill nearly doubled 😅.

Ouch! I'm sorry to hear about that.

Currently the scaling and resource limits of the controller are rigid in Config Connector. We have been working on some form of vertical scaling but haven't found a good solution yet.

Due to the currently rigid scaling, Config Connector's requirements are set to ensure decent performance on larger clusters. The current resource limits were actually tuned to support some larger installations when users had problems with the v1.42 limits.

Until we figure out a scaling solution, I'd recommend rolling back to an older version as you suggested. Apologies about the inconvenience here.

vitalii-buchyn-exa commented 1 year ago

hello community!

I hope this issue is not dead already :)

+1 to the request to allow changing resource requests/limits via the operator CRD.

We're now experiencing an issue where the defaults for cnrm-deletiondefender are not enough, which is causing it to constantly OOM. We use a manually installed Config Connector, cnrm.cloud.google.com/version: 1.85.0.

diviner524 commented 1 year ago

@vitalii-buchyn-exa Definitely, we are fully aware of this limitation. Actually we are currently working on a solution to allow users to configure resource limits. We should be able to introduce some related features in early Q2.

diviner524 commented 1 year ago

Just to give everyone a heads up, we've started rolling out new features for customizing resource limits starting from v1.106.0. We're also putting together public documentation to help you make the most of these cool additions!
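While the documentation is in progress, here is a rough sketch of what the customization looks like (the API group, kind, and field names are my best understanding of the v1.106.0 feature and should be treated as assumptions until the official docs are published):

```yaml
# Customize the resource requests/limits of the cluster-mode controller manager.
apiVersion: customize.core.cnrm.cloud.google.com/v1beta1
kind: ControllerResource
metadata:
  # The name must match the component being customized.
  name: cnrm-controller-manager
spec:
  containers:
    - name: manager
      resources:
        requests:
          cpu: 200m
          memory: 256Mi
        limits:
          memory: 512Mi
```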

dinvlad commented 3 months ago

Is there any public documentation for customizing resource limits and the number of replicas of cnrm-controller-manager? I tried to adjust it like so, but I still see 10 (!) clones of the cnrm-controller-manager-xxxyyyzzzaaa-0 StatefulSet. Thank you!