Closed — @timuthy closed this issue 2 years ago.
/assign @ialidzhikov
Thank you @timuthy for opening this issue.
The avg memory usage now seems to be stable at ~80MiB with spikes to 100MiB. These numbers are from a landscape with ~10,000 Secrets and ~7,000 SecretBindings.
I can give a try for

> If 1.) is not feasible, use the API Reader to read Secrets because we expect that Secrets are the most occurring resource kinds in the Garden cluster.

and see whether it improves the memory usage.
I had a look into this issue today and tried to outline possible future improvements.
I checked whether the admission-gcp component is using the `application/vnd.kubernetes.protobuf` content type. According to the protobuf proposal, using protobuf can decrease the CPU, memory and network usage:

> Experiments have demonstrated a 10x reduction in CPU use during serialization and deserialization, a 2x reduction in size in bytes on the wire, and a 6-9x reduction in the amount of objects created on the heap during serialization.

But it seems that admission-gcp is already using the protobuf content type today (I think, by chance?) thanks to the `init()` function in the `github.com/gardener/gardener/pkg/client/kubernetes` package that calls `apiutil.AddToProtobufScheme(protobufSchemeBuilder.AddToScheme)`.
Another possible option that theoretically should reduce the memory usage is to use the label/field selector on the controller-runtime cache (ref https://github.com/kubernetes-sigs/controller-runtime/pull/1435) and to filter Shoots by `.spec.provider.type`. This would require gardener-apiserver to support a field selector on `.spec.provider.type`.
My last change on this issue was to disable the cache for Secrets (https://github.com/gardener/gardener-extension-provider-gcp/pull/253#issuecomment-814953825) and it had a positive effect, reducing the memory usage to ~118MiB avg. When I checked the memory usage on a busy landscape today, it was again ~224MiB avg, which is surprising to me as I don't think anything has changed code-wise on the admission-gcp side.
> Another possible option that theoretically should reduce the memory usage is to use the label/field selector on the controller-runtime cache (ref kubernetes-sigs/controller-runtime#1435) and to filter Shoots by `.spec.provider.type`. This would require gardener-apiserver to support a field selector on `.spec.provider.type`.
I guess this won't hurt but in the end I don't think that it will save much memory, especially if we use protobuf, right?
> My last change on this issue was to disable the cache for Secrets (#253 (comment)) and it had a positive effect, reducing the memory usage to ~118MiB avg. When I checked the memory usage on a busy landscape today, it was again ~224MiB avg, which is surprising to me as I don't think anything has changed code-wise on the admission-gcp side.
Does it make sense to use profiling (https://github.com/gardener/gardener/pull/4568) here?
> I guess this won't hurt but in the end I don't think that it will save much memory, especially if we use protobuf, right?
Hmm, why do you think so? The shoot resource can grow quite large and has to be stored somewhere in memory, which is independent of the used transport protocol.
Filtering out all irrelevant shoots (i.e., `spec.provider.type!=gcp`) will still save quite some memory, I assume.
Am I missing something?
> Hmm, why do you think so?
I checked the rough size of the Shoots on our biggest landscape and yes, it's a notable number, but I cannot deduce from it that this will save memory in the ~100MiB range. Of course, you cannot really compare the size of a JSON representation with the size needed in memory, but it should still give a direction, right?
Please note that I never claimed that filtering out the Shoots can be disregarded. I was rather wondering whether it will save a considerable amount of the memory increase that we have been experiencing for quite a while.
Got your point. Well, I guess we just have to try and see 😄
After more thoughts on how to tackle this issue, I came up with the following proposal:
gardenlet to maintain a label (`provider.shoot.gardener.cloud/<type>=true`) on the cloud provider Secret. gardenlet will add this label to the cloud provider Secret when the Secret is used by at least one Shoot of the given provider type. Respectively, gardenlet will remove the label when there are no more Shoots of the given provider type that reference the Secret. This label will allow admission components to specify an `objectSelector` in their webhooks that matches only Secrets related to Shoots of the given provider. In this way the admission components will no longer need to list (and cache, respectively) SecretBindings and Shoots to determine whether the Secret is used by any Shoot. A reference POC/implementation can be found in https://github.com/ialidzhikov/gardener/commit/37b9528da8400e5d4d030d900eedadc50806cb1a and https://github.com/ialidzhikov/gardener-extension-provider-gcp/commit/18c5920afe7de54dff0688e7324dcc78fd3744c9.
This approach has one drawback. An end user can bypass the webhook validation by first removing the label from the Secret and then updating the Secret with invalid data. But keeping in mind that the validation currently is best effort (mainly checking for the required fields and, if possible, for the data itself), I think this approach should generally be fine and would help us in most cases. If we have concerns about this drawback, we can add additional "prerequisite checks" for the cloudprovider Secret itself in the extension controllers and flag missing/invalid fields in the Secret data with the appropriate error code.
Despite the drawback I like the proposal. Of course, second opinions are always welcome. 👀
Thanks @ialidzhikov! One comment:
> gardenlet to maintain a label (`provider.shoot.gardener.cloud/<type>=true`) on the cloud provider Secret. gardenlet will add this label to the cloud provider Secret when the Secret is used by at least one Shoot of the given provider type. Respectively, gardenlet will remove the label when there are no more Shoots of the given provider type that reference the Secret.
This should probably be the GCM since it's in no way seed-specific and can be handled centrally.
> This should probably be the GCM since it's in no way seed-specific and can be handled centrally.
I thought about it but faced the following problem. If it is a new controller in GCM that watches Shoots and maintains the cloud provider Secret label, then while GCM is down it will lose track of the Shoots that are deleted in the meantime (=> it fails to remove the label from the Secret). Of course, this can be handled by adding another finalizer to the Shoot metadata, but I was not quite sure whether we want to do this. I can think again and try to find something better.
Good point, indeed. Yeah, the additional finalizer does not sound optimal. Still, I can imagine that having such logic in gardenlet might lead to other challenges w.r.t. seed authorization and multiple gardenlets competing for the same Secret, right? I haven't thought about it much, but from a first gut feeling, putting such logic in GCM would feel more natural (if we find good ways to overcome the mentioned problems, of course). Let's see whether there'll be ideas about it.
+1 for the proposal from @ialidzhikov, I like it :) and +1 for putting the controller into GCM
> then when GCM is down, GCM will lose track of the Shoots that were deleted meanwhile
We should be able to construct a controller that does not rely only on watch events (controllers generally should be able to compensate for restarts), even without using finalizers. The controller should build a current state of the world (CSW) triggered by watch events (also received on startup). Then it can calculate the desired state of the world (DSW), i.e. which Secret should have which labels. On startup, a Secret add watch event triggers a reconciliation for every Secret in the system, which decides based on the DSW which labels to add or remove.
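A minimal sketch of the DSW computation described above (plain Go; the `Shoot` type and field names are illustrative stand-ins, not the actual Gardener API types): given the observed Shoots, derive which provider labels each cloud provider Secret should carry, so a reconciler can diff this against a Secret's actual labels.

```go
package main

import "fmt"

// Shoot is a simplified stand-in for the Gardener Shoot resource.
type Shoot struct {
	ProviderType string // e.g. "gcp"
	SecretName   string // cloud provider Secret referenced via the SecretBinding
}

// desiredSecretLabels computes the desired state of the world (DSW):
// for every referenced Secret, the set of provider labels it should carry,
// based on the Shoots that currently reference it.
func desiredSecretLabels(shoots []Shoot) map[string]map[string]string {
	dsw := map[string]map[string]string{}
	for _, s := range shoots {
		if dsw[s.SecretName] == nil {
			dsw[s.SecretName] = map[string]string{}
		}
		dsw[s.SecretName]["provider.shoot.gardener.cloud/"+s.ProviderType] = "true"
	}
	return dsw
}

func main() {
	shoots := []Shoot{
		{ProviderType: "gcp", SecretName: "my-provider-account"},
		{ProviderType: "gcp", SecretName: "my-provider-account"},
	}
	fmt.Println(desiredSecretLabels(shoots))
}
```

A Secret that appears in the DSW with no Shoots referencing it anymore simply gets no entry, so the reconciler would remove any stale provider labels it still finds on the actual object. This recomputation works from listed state, so it survives controller restarts without finalizers.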
When I take one step back, I think the issue boils down to the fact that there is no "free" way (by free I mean without additional requests/watches) to determine the provider type of a cloud provider Secret. Currently it is really nice that the provider type is part of the Shoot and CloudProfile specs. This allows extension admission controllers to pick only the requests that match their provider type. I think the part that contributed to the memory increase in the admission controller is the part that determines the cloud provider Secret type and, in general, whether the Secret is in use by a Shoot of the given provider type (of course, we can try to enable profiling as suggested by @timuthy to confirm/reject this). If the SecretBinding is used only to reference a cloud provider Secret (currently I cannot point to another use case), then in an ideal world we could have the provider type as part of the SecretBinding spec:
```yaml
apiVersion: core.gardener.cloud/v1beta1
kind: SecretBinding
metadata:
  name: my-provider-account
  namespace: garden-dev
provider:
  type: gcp
secretRef:
  name: my-provider-account
```
Pros:

- The dashboard currently adds the `cloudprofile.garden.sapcloud.io/name` label to the SecretBinding on Secret creation and then relies on this label to show the Secret and its type + CloudProfile name. For example, if you remove this label from the SecretBinding, the dashboard stops showing this Secret. More details can be found in https://github.com/gardener/dashboard/issues/781. In this case the dashboard could also show the SecretBinding and use its `provider.type`.

Cons:

- Not clear how to introduce such a change in a backwards-compatible manner. Also not clear how to tackle potential cases where a single Secret and SecretBinding is used for multiple cloud providers.
- Makes the SecretBinding resource coupled to the cloud provider Secret and "denies" potential usage of the SecretBinding resource for non-cloudprovider Secrets (for those, one could use a `corev1.SecretReference` instead).

Implementation proposal:

- Add a `provider.type` field to the SecretBinding resource.
- Introduce validation for the newly introduced field to contain only valid values.
- Adapt the existing SecretBinding controller to maintain its provider type (if set) as a label on the referenced Secret.

I updated my previous comment https://github.com/gardener/gardener-extension-provider-gcp/issues/143#issuecomment-916010872 with an implementation proposal on how to introduce a `provider.type` on the SecretBinding. The proposal itself consists of several stages to make sure that the backwards-incompatible API change is rolled out in a graceful and controlled manner.
Let me know what you think in general about the idea to make the SecretBinding resource provider-specific, and also about the proposal itself.
I like the proposal @ialidzhikov!
> And not clear how to tackle potential cases where a single Secret and SecretBinding is used for multiple cloud providers. If the SecretBinding is used by multiple provider types, generate an appropriate event on the SecretBinding and do not set the provider type field. For such cases Gardener operators can contact the end user and resolve the case by splitting the SecretBinding.
Could we use a list of providers instead of only one?
> Introduce validation for the newly introduced field to contain only valid values.
What are valid values?
> Adapt the existing SecretBinding controller to maintain its provider type (if set) as a label on the referenced Secret.
How would the label be named?
> Could we use a list of providers instead of only one?
My current assumption is that, in general, it is not good practice to reuse a single Secret. At least I would vote for separation of concerns, and in a personal setup I wouldn't put such sensitive data into a single Secret. Let me know if there are cases where we recommend this approach. I agree with you in general that it is kind of an incompatible change. On the other side, we can use the opportunity to impose the good practice if we really have only exceptional cases of misuse. So, I'd better check how many folks depend on reusing a single cloud provider Secret for multiple provider types.
> What are valid values?
Good point. Initially I had in mind checking whether the provider type is a registered one (similar to what we do for Shoot and Seed). But I see that we allow, for example, a CloudProfile to be created with a non-registered provider type. We'd better follow the CloudProfile approach and not introduce a validation for this. Another validation that we can introduce is whether the Shoot `provider.type` matches the SecretBinding `provider.type`.
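For illustration, the proposed cross-check could look roughly like this (plain Go; the function name and signature are assumptions for the sketch, not the actual Gardener validation code):

```go
package main

import "fmt"

// validateProviderType sketches the proposed validation: if the SecretBinding
// declares a provider type, the Shoot's provider type must match it. An empty
// SecretBinding provider type (field not yet set) is accepted, which keeps the
// check backwards compatible during the staged rollout.
func validateProviderType(shootProviderType, secretBindingProviderType string) error {
	if secretBindingProviderType == "" {
		// Field not (yet) set on the SecretBinding, nothing to validate.
		return nil
	}
	if shootProviderType != secretBindingProviderType {
		return fmt.Errorf("shoot provider type %q does not match SecretBinding provider type %q",
			shootProviderType, secretBindingProviderType)
	}
	return nil
}

func main() {
	fmt.Println(validateProviderType("gcp", "gcp"))
	fmt.Println(validateProviderType("aws", "gcp"))
}
```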
> How would the label be named?
I had in mind `provider.shoot.gardener.cloud/<type>=true`, but open for suggestions of course.
/component gardener
/roadmap internal
/milestone 2021-Q4
With https://github.com/gardener/gardener-extension-provider-gcp/pull/396, on a not-that-busy landscape the avg memory utilization of admission-gcp dropped from ~40MiB to ~21MiB.
Let's see what will be the effect for busy landscapes where admission-gcp memory usage is > 200MiB.
On a large landscape, with the rollout of admission-gcp@v1.22.0 the memory usage dropped significantly, from 200-250MiB to 40-60MiB.
We can say that the issue is fixed for admission-gcp. However, I will keep it open to track the progress for the other admission components that have to be adapted in a similar way.
Very nice, well done @ialidzhikov
The other admission components are adapted in a similar way. Hence, we can close this issue.
/close
How to categorize this issue?
/area robustness
/area cost
/priority critical
/platform gcp
What would you like to be added: Since the validation of cloud provider secrets has been introduced (#112), the GCP admission plugin's memory footprint increased by a considerable amount, mainly because of the added caches for Shoots, SecretBindings and Secrets. The required memory depends on the Kubernetes cluster, i.e. the number/size of the stored resources, but we've observed increases from ~20MiB to ~500MiB. Under consideration that such an admission component has multiple replicas and is only responsible for one provider, the runtime costs are too high.
Hence, we should try to:

If 1.) is not feasible, use the API Reader to read Secrets because we expect that Secrets are the most occurring resource kinds in the Garden cluster.