Closed gardener-robot-ci-1 closed 6 years ago
Comment by mliepold Tuesday Nov 07, 2017 at 10:39 GMT
Currently, every user or project owner has to bring their own secret(s) to create clusters. In order to increase adoption of the Gardener, we want to introduce trial clusters. For these, we need to introduce quotas to be able to limit the number of created clusters and resources, and thus the possible costs for a potential sponsor.
The quotas/limits need to be set for (account) secrets belonging to a Gardener project. A project is represented by a K8S namespace "garden-projectName" (with a label "garden.sapcloud.io/role: project").
In order to avoid the need to create specific trial projects and to copy trial secrets, secrets need to be referable from different namespaces (i.e. projects). Also, individual projects that use a trial secret should only be allowed to use a certain share of the overall quota associated with it. Thus, quotas need to be settable on project level, too.
By default every project using a trial quota should have a predefined quota. But projects that have special needs can request an increase of their quota.
Also, created trial clusters should automatically be cleaned up after a defined number of days.
Metrics should be set for the most important chargeable resources of a cloud provider. Initially we specify the following metrics:
The Gardener UI will not support the creation of quotas (in the beginning). It has to be done via the Kubernetes CLI (kubectl) for already existing secrets. For this, a new custom resource "Quota" of the Gardener is used.
```yaml
apiVersion: garden.sapcloud.io/v1beta1
kind: Quota
metadata:
  name: trial-aws-secret-quota
  namespace: garden-trial
spec:
  scope: secret
  clusterLifetimeDays: 14
  metrics:
    cpus: "200"
    gpus: "20"
    memory: 4000Gi
    disk.basic: 8000Gi
    disk.standard: 8000Gi
    disk.premium: 2000Gi
    loadbalancers: "100"
```
Specification:
Note: If auto scaling is configured for a cluster, the metric for cpus assumes the worst case, i.e. the maximum possible number of CPUs per cluster, right from the beginning. This means that even if the actual number of CPUs has not been reached yet, the creation of new clusters may no longer be possible! This ensures that the total costs can never be higher than defined by the quota. The operator of the week should nevertheless monitor the actual number of CPUs used per secret and adjust the numbers if necessary.
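To make the worst-case accounting concrete, here is a minimal sketch (not Gardener code; the field names `count`, `autoScalerMax`, and `cpusPerMachine` are illustrative assumptions) of how a controller could charge a cluster's maximum possible CPU usage against the quota:

```python
# Sketch: charge a shoot's worst-case CPU count against the quota.
# Field names are hypothetical, not the real Gardener API.

def worst_case_cpus(worker_pools):
    """Assume every auto-scaled worker pool runs at its maximum size."""
    total = 0
    for pool in worker_pools:
        max_nodes = pool.get("autoScalerMax", pool["count"])
        total += max_nodes * pool["cpusPerMachine"]
    return total

def fits_quota(cpus_already_used, new_pools, quota_cpus):
    """The new cluster is only allowed if the worst case still fits."""
    return cpus_already_used + worst_case_cpus(new_pools) <= quota_cpus

pools = [
    {"count": 2, "autoScalerMax": 10, "cpusPerMachine": 4},  # auto-scaled pool
    {"count": 3, "cpusPerMachine": 2},                       # fixed-size pool
]
print(worst_case_cpus(pools))        # 10*4 + 3*2 = 46
print(fits_quota(160, pools, 200))   # 160 + 46 > 200 -> False
```

Even though the auto-scaled pool currently runs only 2 nodes, the quota is charged for 10, which is why creation can be rejected before the actual CPU count is reached.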
The quota can be set for the Gardener resource Secret specified by:
It is responsible for updating the status of all defined quotas. To do this, it has to monitor all existing shoot clusters which
The controller also regularly checks all existing resources of kind "Shoot" for the annotation "quota.garden.sapcloud.io/creationTime". If the current date is after this date plus the expiration time of the project quota, the gardener triggers the deletion of this shoot cluster. The user will get no notification before or when the cluster is deleted!
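The lifetime check described above boils down to a simple date comparison. A hedged sketch (the annotation key matches the one named above; timestamp format and function names are assumptions):

```python
# Sketch: decide whether a shoot has exceeded its quota lifetime.
from datetime import datetime, timedelta

ANNOTATION = "quota.garden.sapcloud.io/creationTime"

def is_expired(annotations, lifetime_days, now):
    """True if 'now' is past creation time plus the quota's lifetime."""
    created = datetime.fromisoformat(annotations[ANNOTATION])
    return now > created + timedelta(days=lifetime_days)

annotations = {ANNOTATION: "2017-11-01T10:00:00"}
print(is_expired(annotations, 14, datetime(2017, 11, 20)))  # True -> delete shoot
print(is_expired(annotations, 14, datetime(2017, 11, 10)))  # False -> keep shoot
```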
It checks the current status of the quotas that are used during the creation of a shoot cluster. If a quota would be exceeded by the cluster creation, the creation is prevented and the user gets an appropriate error message.
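The check itself is a comparison of used plus requested amounts against each metric's limit. A minimal sketch (metric names mirror the Quota resource above; values are plain numbers here, whereas real code would have to parse Kubernetes quantities such as "4000Gi"):

```python
# Sketch: reject a shoot creation if any quota metric would be exceeded.

def exceeds_quota(used, requested, limits):
    """Return the list of metrics that would go over their limit."""
    violations = []
    for metric, limit in limits.items():
        if used.get(metric, 0) + requested.get(metric, 0) > limit:
            violations.append(metric)
    return violations

limits = {"cpus": 200, "loadbalancers": 100}
used = {"cpus": 190, "loadbalancers": 10}
requested = {"cpus": 16, "loadbalancers": 2}
print(exceeds_quota(used, requested, limits))  # ['cpus']
```

The returned metric names could feed directly into the error message shown to the user.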
For every secret in a project namespace, a PrivateSecretBinding is created. It binds the secret with a possible quota for the secret. The referenced secret and quota have to be in the same namespace. Also for the Garden projects that can use a trial secret, a CrossSecretBinding is created in the project namespace. It binds the trial secret with the secret quota and the (default) project quota. These referenced objects reside in the trial namespace.
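A PrivateSecretBinding could then look roughly like the following sketch (the resource kind is the one named above, but the exact field names are assumptions, not a confirmed API):

```yaml
apiVersion: garden.sapcloud.io/v1beta1
kind: PrivateSecretBinding
metadata:
  name: my-aws-secret-binding
  namespace: garden-project1
secretRef:
  name: my-aws-secret          # must reside in the same namespace
quotaRefs:
- name: my-aws-secret-quota    # optional quota bound to this secret
```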
At the secret selection during the shoot creation, the Gardener UI will use all secret bindings (PrivateSecretBinding and CrossSecretBinding) in the current project.
`shoots.machinetypes: ["m4.large", "m4.xlarge"]`
Comment by mliepold Friday Nov 17, 2017 at 11:11 GMT
Remarks/proposals during Weekly Kube:
Responses (decided during Operator Sync meeting):
Comment by mvladev Monday Nov 20, 2017 at 11:26 GMT
I'm still searching for the google doc for generic resources. Extended resources can provide a nice starting point on how to setup API types.
This logic should not be part of the garden-controller at all. This is a job for a External Admission WebHook or part of a custom API server. Shoot resources should not be created in the API server, if they are not accepted by the admission plugins.
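As a rough illustration (not Gardener code), a validating admission webhook only has to answer an AdmissionReview request with an allowed/denied verdict; a denied Shoot is never persisted by the API server. The sketch below shows only the response-building half, with hypothetical helper names:

```python
# Sketch: the response half of a validating admission webhook.
# A real webhook would run behind HTTPS and inspect the incoming Shoot spec;
# here only the AdmissionReview answer is constructed.

def review_response(request, allowed, message=""):
    """Build the AdmissionReview response for a given admission request."""
    response = {"uid": request["uid"], "allowed": allowed}
    if not allowed:
        response["status"] = {"message": message}
    return {
        "apiVersion": "admission.k8s.io/v1beta1",
        "kind": "AdmissionReview",
        "response": response,
    }

req = {"uid": "705ab4f5-6393-11e8-b7cc-42010a800002"}
denied = review_response(req, False, "quota exceeded: cpus")
print(denied["response"]["allowed"])  # False
```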
Comment by rfranzke Monday Nov 20, 2017 at 12:00 GMT
Good points, external admission checks seem to be the best way of doing that.
Comment by vlerenc Tuesday Nov 21, 2017 at 04:42 GMT
Whoa, what a relief. Thank you @i068969 for the admission check proposal. That sounds much more in line with Kubernetes, hence I like it very much. Also thank you for the extended resources idea. I am not certain whether it can be leveraged, but also here my hope is, we find a well integrated mechanism for quotas. After all, that's kind of natural to expect from something like Kubernetes.
The other design aspect I am concerned with is the complexity with the shared secrets:
Comment by vlerenc Thursday Nov 23, 2017 at 03:24 GMT
We want to offer trial clusters, but we need quota/limits for those.
Per IaaS account (which we model today, maybe not ideally, as a secret):
Alternatively, for now, we could simplify that with a VM quota, but then we would need to restrict the machine types (since they differ significantly in their resources and prices, especially when looking at the GPU-based machine types).
Per Gardener project (which we model as namespace):
Note: We may have/want to add more resource types over time. Some are infrastructure specific (if not abstracted) like e.g. the disk type.
This way we could achieve the following:
- … `gp2` and 1 TB `io1` Disk, 4 LB services and 200 GB `gp2` and 50 GB `io1` PVCs per cluster, cluster auto-termination after 28 days
- … `gp2` Disk, cluster auto-termination after 7 days
- … `io1` disks, or yet another to extend the lifetime of their clusters

Note: See also #101.
Note: Most of these quotas can be enforced by an admission controller in the garden cluster, but the LB and PVC size quotas will require similar controllers in the respective shoot clusters themselves, which we can handle with lower priority/later. Actually, it would be nice if the LB and PVC size quotas could be specified on the IaaS account level for the entire IaaS account, but that would require the shoot cluster admission controller to call back to the garden cluster, which is currently not possible (not reachable), so let's go with the per-cluster simplification.
Note: We support worker count min/max, i.e. cluster auto-scaling. Check the max requirements against the quota, i.e. assume the worst case.
Note: If quotas are set for both the IaaS account and the Gardener project, both must comply (simplest strategy for now).
Note: Make sure a cluster member can't change these quotas. Later we may allow that, e.g. in a non-trial use case a customer purchases a certain quantity of resources and allots it to its projects.
Note: Secrets are currently placed into the Gardener project and must be readable, but we don't want that. They should be "somewhere else" outside the namespace, so that their contents are protected (not directly visible in the project; when a cluster is created they still appear in the tf-vars of the infrastructure Terraform files, which is bad, but acceptable for now). That would also ease setting the overall quota (per IaaS account). One possible solution to refer to them is described in #102, but it's not elegant as it requires two different code paths in the operator and UI to handle these cases. It would be better if we find a solution where the secret feels the same, whether it's shared or not.
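The note that both quota levels must comply amounts to a simple conjunction: a request passes only if it fits the IaaS account quota and the project quota. A hypothetical sketch for a single metric (cpus), with illustrative numbers:

```python
# Sketch: a request must fit BOTH the IaaS account quota and the project quota.

def fits(used, requested, limit):
    return used + requested <= limit

def admit(requested_cpus, account, project):
    """account and project are (used, limit) pairs for the cpus metric."""
    return (fits(account[0], requested_cpus, account[1])
            and fits(project[0], requested_cpus, project[1]))

print(admit(8, account=(80, 100), project=(15, 20)))  # False: project quota would be exceeded
print(admit(4, account=(80, 100), project=(15, 20)))  # True: fits both
```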
Implementation Proposal:
- Use `ResourceQuota`s, but in our API group (see 1 and 2), to describe quotas per IaaS account and Gardener project.
- We could have separate `ResourceQuota`s and admission controllers (per quota) or combine them into one in the first step, but this may make later extensions, especially when we open source, somewhat harder (still, if it speeds us up, let's go with a combined `ResourceQuota` and admission controller).

Comment by vlerenc Thursday Nov 23, 2017 at 03:29 GMT
Thank you @i068969 also for the extended resources link. I checked it and it seems to be meant for a different use case. It allows including more than just CPUs/memory in the scheduler by advertising new resources (explicitly per node, it states) that can then be requested (per pod). That's not what we are after here with quotas, I believe, but I don't know whether it can be "bent" to our needs (it doesn't look like it, though).
Comment by mliepold Friday Dec 22, 2017 at 15:12 GMT
Problem: There is a problem with keeping the quota status in the status sub-resource of the quota. With the current proposal there is only one quota resource for all projects that use the default quota. Proposals:
```yaml
apiVersion: garden.sapcloud.io/v1beta1
kind: Quota
metadata:
  name: trial-aws-project-quota
  namespace: garden-trial
spec:
  scope: project
  clusterLifetimeDays: 14
  metrics:
    cpu: "20"
    memory: 400Gi
status:
  metrics:
  - project: "garden-project1"
    cpu: "6"
    memory: 20Gi
  - project: "garden-project2"
    cpu: "15"
    memory: 230Gi
```
Comment by mliepold Friday Dec 22, 2017 at 18:14 GMT
The tracking of the Shoot cluster resources specified in the quotas will be implemented gradually:
Comment by vlerenc Friday Dec 22, 2017 at 20:32 GMT
Let's have 2 quota controllers:
Moving to Gardener as this is no dashboard story: https://github.com/gardener/gardener/issues/81.
@mliepold Can you please check which of your comments you would like to move to the new ticket (if you do it, it'll be under your name), e.g.:
To be spec'ed:
By @vlerenc: We want to offer trial clusters, but we need quota/limits for those.