gardener / gardener-extension-provider-gcp

Gardener extension controller for the GCP cloud provider (https://cloud.google.com).
https://gardener.cloud
Apache License 2.0

Reduce memory footprint for GCP admission component #143

Closed timuthy closed 2 years ago

timuthy commented 4 years ago

How to categorize this issue?

/area robustness /area cost /priority critical /platform gcp

What would you like to be added: Since the validation of cloud provider secrets has been introduced (#112) the GCP admission plugin's memory footprint increased for a considerable amount, mainly because of added caches for Shoots, SecretBindings and Secrets.

The required memory depends on the K8s cluster, or rather the number/size of the stored resources, but we've observed increases from ~20Mb to ~500Mb. Considering that such an admission component has multiple replicas and is only responsible for one provider, the runtime costs are too high.

Hence, we should try to:

  1. Disable caching for all objects and analyse the consequences, i.e. whether the cost per admission request is still acceptable.
  2. If 1.) is not feasible, use the API Reader to read Secrets, because we expect Secrets to be the most frequently occurring resource kind in the Garden cluster.
ialidzhikov commented 4 years ago

/assign @ialidzhikov

ialidzhikov commented 4 years ago

Thank you @timuthy for opening this issue.

The avg memory usage now seems to be stable at ~80Mb, with spikes to 100Mb. These numbers are from a landscape with ~10000 Secrets and ~7000 SecretBindings.

[screenshot: memory usage graph, 2020-08-28]

I can give a try to

If 1.) is not feasible, use the API Reader to read Secrets, because we expect Secrets to be the most frequently occurring resource kind in the Garden cluster.

and see whether it improves the memory usage.

ialidzhikov commented 3 years ago

I had a look today into this issue and tried to outline future improvements.


My last change on this issue was to disable the cache on Secrets (https://github.com/gardener/gardener-extension-provider-gcp/pull/253#issuecomment-814953825) and it had a positive effect, reducing the memory usage to an avg of ~118Mi. When I checked the memory usage on a busy landscape today, it was again ~224Mi avg, which is surprising to me as I don't think anything has changed code-wise on the admission-gcp side.

[screenshot: memory usage graph, 2021-09-07]

timuthy commented 3 years ago

Another possible option that theoretically should reduce the memory usage is to use the label/field selector on the controller-runtime cache (ref ✨ Add SelectorsByObject option to cache kubernetes-sigs/controller-runtime#1435) and to filter Shoots by .spec.provider.type. This would require gardener-apiserver to support a field selector on .spec.provider.type.

I guess this won't hurt but in the end I don't think that it will save much memory, especially if we use protobuf, right?

My last change on this issue was to disable the cache on Secrets (#253 (comment)) and it had a positive effect, reducing the memory usage to an avg of ~118Mi. When I checked the memory usage on a busy landscape today, it was again ~224Mi avg, which is surprising to me as I don't think anything has changed code-wise on the admission-gcp side.

Does it make sense to use profiling (https://github.com/gardener/gardener/pull/4568) here?

timebertt commented 3 years ago

I guess this won't hurt but in the end I don't think that it will save much memory, especially if we use protobuf, right?

Hmm, why do you think so? The Shoot resource can grow quite large and has to be stored somewhere in memory, which is independent of the transport protocol used. Filtering out all irrelevant shoots (i.e., spec.provider.type!=gcp) will still save quite some memory, I assume. Am I missing something?

timuthy commented 3 years ago

Hmm, why do you think so?

I checked the rough size of shoots on our biggest landscape and yes, it's a notable number, but I cannot deduce from it that this will save memory in the ~100Mi range. Of course, you cannot really compare the size of a JSON representation with the size needed in memory, but it should still give a direction, right?

Please note that I never claimed filtering out the shoots should be disregarded. I was rather wondering whether it will save a considerable part of the memory increase that we have been experiencing for quite a while.

timebertt commented 3 years ago

Got your point. Well, I guess we just have to try and see 😄

ialidzhikov commented 3 years ago

After more thought on how to tackle this issue, I came up with the following proposal: gardenlet maintains a label (provider.shoot.gardener.cloud/<type>=true) on the cloud provider Secret. gardenlet adds this label to the cloud provider Secret when the Secret is used by at least 1 Shoot of the given provider type, and respectively removes the label when there are no more Shoots of the given provider that reference the Secret.

This label allows admissions to specify an objectSelector in their webhooks that matches only Secrets related to Shoots of the given provider. In this way the admission components no longer need to list (and cache, respectively) SecretBindings and Shoots to determine whether the Secret is used by any Shoot. A reference POC/implementation can be found in https://github.com/ialidzhikov/gardener/commit/37b9528da8400e5d4d030d900eedadc50806cb1a and https://github.com/ialidzhikov/gardener-extension-provider-gcp/commit/18c5920afe7de54dff0688e7324dcc78fd3744c9.

This approach has one drawback: an end user can bypass the validation by first removing the label on the Secret and then updating the Secret with invalid data. But keeping in mind that the validation is currently kind of best effort (mainly checking for the required fields and, if possible, for the data itself), I think this approach should be generally fine and would help us in most cases. If we have concerns about this drawback, we can add additional "prerequisite checks" for the cloudprovider Secret itself in the extension controllers and flag missing/invalid fields in the Secret data with the appropriate error code.

Despite the drawback I like the proposal. Of course, second opinions are always welcome. 👀
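For illustration only, a webhook using such a label as objectSelector might look roughly like this. The webhook and configuration names below are hypothetical, and clientConfig etc. are omitted; this is a sketch of the filtering idea, not the POC's actual manifest.

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: validate-cloudprovider-secrets-gcp  # hypothetical name
webhooks:
- name: secrets.gcp.admission.gardener.cloud  # hypothetical name
  # Only Secrets carrying the gardenlet-maintained label ever reach the
  # webhook, so no Shoot/SecretBinding caches are needed in the admission
  # component itself.
  objectSelector:
    matchLabels:
      provider.shoot.gardener.cloud/gcp: "true"
  rules:
  - apiGroups: [""]
    apiVersions: ["v1"]
    resources: ["secrets"]
    operations: ["UPDATE"]
  # clientConfig, failurePolicy, sideEffects, etc. omitted
```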

rfranzke commented 3 years ago

Thanks @ialidzhikov! One comment:

gardenlet to maintain a label (provider.shoot.gardener.cloud/<type>=true) on the cloud provider Secret. gardenlet will add this label to the cloud provider Secret when the Secret is used by at least 1 Shoot from the given provider type. And respectively, gardenlet will remove the label when there are no more Shoots from the given provider that reference the Secret.

This should probably be the GCM since it's in no way seed-specific and can be handled centrally.

ialidzhikov commented 3 years ago

This should probably be the GCM since it's in no way seed-specific and can be handled centrally.

I thought about it but faced the following problem: if it is a new controller in GCM that watches Shoots and maintains the cloud provider Secret label, then while GCM is down it will lose track of the Shoots that are deleted in the meantime (=> it fails to remove the label from the Secret). Of course, this can be handled by adding another finalizer to the Shoot metadata, but I was not quite sure whether we want to do this. I can think again and try to find something better.

rfranzke commented 3 years ago

Good point, indeed. Yeah, the additional finalizer doesn't sound optimal. Still, I can imagine that having such logic in gardenlet might lead to other challenges w.r.t. the seed authorization and multiple gardenlets chasing the same secret, or? I haven't thought about it much, but from a first gut feeling, putting such logic in GCM would feel more natural (if we find good ways to overcome the mentioned problems, of course). Let's see whether there'll be ideas about it.

timebertt commented 3 years ago

+1 for the proposal from @ialidzhikov, I like it :) and +1 for putting the controller into GCM

then when GCM is down, GCM will lose track of the Shoots that were deleted meanwhile

We should be able to construct a controller that does not only rely on watch events (controllers generally should be able to compensate for restarts) even without using finalizers. The controller should build a current state of the world (CSW) triggered by watch events (also received on startup). Then it can calculate the desired state of the world (DSW), i.e. which secret should have what labels. A secret add watch event will trigger a reconciliation for every secret in the system on startup, which will decide based on DSW whether to add or remove which labels.

ialidzhikov commented 3 years ago

When I take one step back, I think the issue boils down to the fact that there is no "free" way (by free I mean without additional requests/watches) to determine the provider type of a cloud provider Secret. Currently it is really nice that the provider type is part of the Shoot and CloudProfile specs. This allows extension admission controllers to pick only the requests that match their provider type. I think the part that contributed to the memory increase in the admission controller is the part that determines the cloud provider Secret type and, in general, whether the Secret is in use by a Shoot of the given provider type (of course, we can try to enable profiling as suggested by @timuthy to confirm/reject this). If the SecretBinding is used only to reference a cloud provider Secret (currently I cannot point to another use case), then in an ideal world we could have the provider type as part of the SecretBinding spec:

```yaml
apiVersion: core.gardener.cloud/v1beta1
kind: SecretBinding
metadata:
  name: my-provider-account
  namespace: garden-dev
provider:
  type: gcp
secretRef:
  name: my-provider-account
```

Pros:

Cons:

ialidzhikov commented 3 years ago

I updated my previous comment https://github.com/gardener/gardener-extension-provider-gcp/issues/143#issuecomment-916010872 with an implementation proposal on how to introduce a provider.type on SecretBinding. The proposal consists of several stages to make sure that the backwards-incompatible API change is rolled out in a graceful and controlled manner. Let me know what you think in general about the idea of making the SecretBinding resource provider-specific, and also what you think about the proposal.

rfranzke commented 3 years ago

I like the proposal @ialidzhikov!

And it is not clear how to tackle potential cases where a single Secret and SecretBinding is used for multiple cloud providers. If the SecretBinding is used by multiple provider types, generate an appropriate event on the SecretBinding and do not set the provider type field. For such cases Gardener operators can contact the end user and resolve the case by splitting the SecretBinding.

Could we use a list of providers instead of only one?

Introduce validation for the newly introduced field to contain only valid values.

What are valid values?

Adapt the existing SecretBinding controller to maintain its provider type (if set) as a label on the referenced Secret.

How would the label be named?

ialidzhikov commented 3 years ago

Could we use a list of providers instead of only one?

My current assumption is that, in general, it is not a good practice to reuse a single Secret. At least I would vote for separation of concerns, and in a personal setup I wouldn't put such sensitive data into a single Secret. Let me know if there are cases where we recommend this approach. I agree with you in general that it is kind of an incompatible change. On the other side, we can use the opportunity to impose the good practice if we really have only exceptional cases of misuse. So, I'd better check how many folks depend on reusing a single cloud provider Secret for multiple provider types.

What are valid values?

Good point. Initially I had in mind to check whether the provider type is a registered one (similar to what we do for Shoot and Seed). But I see that we allow, for example, a CloudProfile to be created with a non-registered provider type. We'd better follow the CloudProfile approach and not introduce validation for this. Another validation that we can introduce is whether the Shoot provider.type matches the SecretBinding provider.type.

How would the label be named?

I had in mind provider.shoot.gardener.cloud/<type>=true, but open for suggestions of course.

rfranzke commented 2 years ago

/component gardener /roadmap internal /milestone 2021-Q4

ialidzhikov commented 2 years ago

With https://github.com/gardener/gardener-extension-provider-gcp/pull/396 on a not that busy landscape the avg memory utilization of admission-gcp dropped from ~40MiB to ~21MiB.

[screenshot: memory usage graph, 2022-03-23]

Let's see what the effect will be on busy landscapes where admission-gcp memory usage is > 200MiB.

ialidzhikov commented 2 years ago

On a large landscape with the rollout of admission-gcp@v1.22.0 the memory usage dropped significantly - from 200-250MiB to 40-60MiB:

[screenshot: memory usage graph, 2022-05-12]

We can say that the issue is fixed for admission-gcp. However, I will keep it open to track the progress for the other admission components that have to be adapted in a similar way.

rfranzke commented 2 years ago

Very nice, well done @ialidzhikov

ialidzhikov commented 2 years ago

The other admission components have been adapted in a similar way. Hence, we can close this issue.

/close