gardener / dashboard

Web-based GUI for Gardener installations.

Introduce Quota/Limits #14

Closed — gardener-robot-ci-1 closed this issue 6 years ago

gardener-robot-ci-1 commented 6 years ago

To be spec'ed:

By @vlerenc: We want to offer trial clusters, but we need quota/limits for those.

Per IaaS account (which we model today, maybe not ideally, as a secret):

Alternatively, for now, we could simplify that with a VM quota, but then we would need to restrict the machine types (since they differ significantly in their resources and prices, especially when looking at the GPU-based machine types).

Per Gardener project (which we model as namespace):

Note: We may have/want to add more resource types over time. Some are infrastructure-specific (if not abstracted), e.g. the disk type.

This way we could achieve the following:

Note: See also #101.
Note: Most of these quotas can be enforced by an admission controller in the garden cluster, but the LB and PVC size quotas will require similar controllers in the respective shoot clusters themselves, which we can handle with lower priority/later. Actually, it would be nice if LB and PVC size quotas could be specified at the IaaS account level for the entire IaaS account, but that would require the shoot cluster admission controller to call back to the garden cluster, which is currently not possible (not reachable), so let's go with the per-cluster simplification.
Note: We support worker count min/max, i.e. cluster auto-scaling. Check the max requirements against the quota, i.e. assume the worst case.
Note: If quotas are set for both the IaaS account and the Gardener project, both must comply (simplest strategy for now).
Note: Make sure a cluster member can't change these quotas. Later we may allow that, e.g. in a non-trial use case a customer purchases a certain quantity of resources and allots it to its projects.
Note: Secrets are currently placed into the Gardener project and must be readable, but we don't want that. They should be "somewhere else" outside the namespace, so that their contents are protected (not directly visible in the project; when a cluster is created they still appear in the tf-vars of the infrastructure Terraform files, which is bad, but acceptable for now). That would also ease setting the overall quota (per IaaS account). One possible solution to reference them is described in #102, but it's not elegant, as it requires two different code paths in the operator and UI to handle these cases. It would be better if we find a solution where the secret feels the same whether it's shared or not.

Implementation Proposal:

gardener-robot-ci-1 commented 6 years ago

Comment by mliepold Tuesday Nov 07, 2017 at 10:39 GMT


Abstract

Currently every user or project owner has to bring their own secret(s) to create clusters. In order to increase adoption of the Gardener, we want to introduce trial clusters. But for these we need to introduce quotas to be able to limit the number of created clusters and resources, and thus the possible costs for a potential sponsor.

Requirements

The quotas/limits need to be set for (account) secrets belonging to a Gardener project. A project is represented by a K8S namespace "garden-projectName" (with a label "garden.sapcloud.io/role: project").
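
For illustration, such a project namespace could look like the following sketch (the namespace name is an example; the role label is the one mentioned above):

apiVersion: v1
kind: Namespace
metadata:
  name: garden-trial            # "garden-" prefix + project name (example)
  labels:
    garden.sapcloud.io/role: project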

In order to avoid the need to create specific trial projects and to copy trial secrets, secrets need to be referenceable from different namespaces (i.e. projects). Also, individual projects that use a trial secret should only be allowed to use a certain share of the overall quota associated with it. Thus, quotas need to be set at the project level too.

By default, every project using a trial quota should have a predefined quota, but projects that have special needs can request an increase of their quota.

Also, created trial clusters should automatically be cleaned up after a defined number of days.

Design

Metrics for quota

Metrics should be set for the most important chargeable resources of a cloud provider. Initially we specify the following metrics:

Defining quotas

The Gardener UI will not support the creation of quotas in the beginning; it has to be done via the K8S CLI for already existing secrets. For this, a new Gardener custom resource "Quota" is used.

Implementation

Quota

apiVersion: garden.sapcloud.io/v1beta1
kind: Quota
metadata:
  name: trial-aws-secret-quota
  namespace: garden-trial
spec:
  scope: secret
  clusterLifetimeDays: 14
  metrics:
    cpus: "200"
    gpus: "20"
    memory: 4000Gi
    disk.basic: 8000Gi
    disk.standard: 8000Gi
    disk.premium: 2000Gi
    loadbalancers: "100"

Specification:

Note: If auto-scaling is configured for a cluster, the metric for cpus uses the worst-case scenario, i.e. the maximum possible number of CPUs per cluster, from the beginning. This means that even if the actual number of CPUs has not been reached yet, the creation of new clusters may no longer be possible! It ensures that the total costs can never be higher than defined by the quota, but the operator of the week should monitor the actual number of CPUs for a secret to possibly adjust the numbers.
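
To illustrate the worst-case calculation, consider a worker pool like the following sketch (field names follow the style of the worker section in a Shoot manifest and may differ per infrastructure; the machine type and its CPU count are assumptions):

workers:
- name: cpu-worker
  machineType: m4.large   # assumed to provide 2 CPUs
  autoScalerMin: 1
  autoScalerMax: 4
# Worst case: autoScalerMax (4) x 2 CPUs = 8 CPUs are counted against the
# cpus metric, even if the cluster currently runs only one node.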

The quota can be set for the Gardener resource Secret specified by:

Controllers

Quota Controller

It is responsible for updating the status of all defined quotas. To do this, it has to monitor all existing shoot clusters which

The controller also regularly checks all existing resources of kind "Shoot" for the annotation "quota.garden.sapcloud.io/creationTime". If the current date is later than this date plus the expiration time of the project quota, the Gardener triggers the deletion of this shoot cluster. The user will get no notification before or when the cluster is deleted!
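
For illustration, the annotation on a Shoot could look like the following sketch (resource names and the timestamp format are assumptions):

apiVersion: garden.sapcloud.io/v1beta1
kind: Shoot
metadata:
  name: trial-cluster           # example name
  namespace: garden-project1    # example project namespace
  annotations:
    quota.garden.sapcloud.io/creationTime: "2017-12-22T15:12:00Z"  # assumed RFC 3339 format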

Quota Admission Controller

It checks the current status of the quotas that are used during the creation of a shoot cluster. If a quota would be exceeded by the cluster creation, the creation is prevented and the user gets an appropriate error message.

Binding Secrets and Quotas

For every secret in a project namespace, a PrivateSecretBinding is created. It binds the secret with a possible quota for the secret. The referenced secret and quota have to be in the same namespace. Also for the Garden projects that can use a trial secret, a CrossSecretBinding is created in the project namespace. It binds the trial secret with the secret quota and the (default) project quota. These referenced objects reside in the trial namespace.
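
As a tentative sketch of the two binding resources (field names and structure are assumptions and may change during implementation):

apiVersion: garden.sapcloud.io/v1beta1
kind: PrivateSecretBinding
metadata:
  name: my-aws-secret
  namespace: garden-project1
secretRef:
  name: my-aws-secret            # secret in the same namespace
quotas:
- name: my-aws-secret-quota      # quota in the same namespace
---
apiVersion: garden.sapcloud.io/v1beta1
kind: CrossSecretBinding
metadata:
  name: trial-aws-secret
  namespace: garden-project1
secretRef:
  name: trial-aws-secret         # trial secret in the trial namespace
  namespace: garden-trial
quotas:
- name: trial-aws-secret-quota   # secret quota in the trial namespace
  namespace: garden-trial
- name: trial-aws-project-quota  # (default) project quota in the trial namespace
  namespace: garden-trial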

For the secret selection during shoot creation, the Gardener UI will use all secret bindings (PrivateSecretBinding and CrossSecretBinding) in the current project.

Open Issues

Extended Scope

Out of Scope

gardener-robot-ci-1 commented 6 years ago

Comment by mliepold Friday Nov 17, 2017 at 11:11 GMT


Remarks/proposals during Weekly Kube:

  1. @d021332: Why not just use annotations instead of a dedicated CRD?
  2. @i068969: K8S is currently discussing a more generic ResourceQuota definition. Martin will provide a link.
  3. @d021770: Instead of providing a Quota CRD, make the account a CRD that contains the account and quota information.

Responses (decided during Operator Sync meeting):

  1. We will stick with a dedicated CRD because it is a more elegant way to manage quotas; K8S itself has its own ResourceQuota resource. But we will try to find a way to keep the effort for the UI low to search and display the secrets (normal + shared).
  2. Implementing a new account resource would be too much effort. Trial quotas should be available soon to increase adoption of the service.
gardener-robot-ci-1 commented 6 years ago

Comment by mvladev Monday Nov 20, 2017 at 11:26 GMT


I'm still searching for the Google doc for generic resources. Extended resources can provide a nice starting point on how to set up API types.

This logic should not be part of the garden-controller at all. This is a job for an External Admission WebHook or part of a custom API server. Shoot resources should not be created in the API server if they are not accepted by the admission plugins.
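
For illustration, registering such an external admission webhook could look roughly like the following sketch (shown with today's stable admission registration API; the webhook name, service reference, and path are placeholders):

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: quota-validator
webhooks:
- name: quota.garden.sapcloud.io          # placeholder
  admissionReviewVersions: ["v1"]
  sideEffects: None
  failurePolicy: Fail
  clientConfig:
    service:
      name: quota-webhook                 # placeholder service
      namespace: garden
      path: /validate-shoot
  rules:
  - apiGroups: ["garden.sapcloud.io"]
    apiVersions: ["v1beta1"]
    operations: ["CREATE"]
    resources: ["shoots"]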

gardener-robot-ci-1 commented 6 years ago

Comment by rfranzke Monday Nov 20, 2017 at 12:00 GMT


Good points, external admission checks seem to be the best way of doing that.

gardener-robot-ci-1 commented 6 years ago

Comment by vlerenc Tuesday Nov 21, 2017 at 04:42 GMT


Whoa, what a relief. Thank you @i068969 for the admission check proposal. That sounds much more in line with Kubernetes, hence I like it very much. Also thank you for the extended resources idea. I am not certain whether it can be leveraged, but also here my hope is that we find a well-integrated mechanism for quotas. After all, that's kind of natural to expect from something like Kubernetes.

The other design aspect I am concerned with is the complexity with the shared secrets:

  1. I am hopeful we can avoid increasing the complexity in the UI, which with the current design would have to make two different calls to get all the secrets
  2. Shared secrets are globally shared, but it would be nicer to bind them to projects/namespaces and thereby limit their reach
  3. Also, when the UI writes the secret into the cluster CRD, it needs to differentiate between a standard and a shared secret with the current design, which makes the UI more complex
  4. Only cosmetic, but I don't like the namespace prefix (if it isn't used already in other Kubernetes contexts) and would prefer to use name and namespace as separate properties (but only if we can't avoid (3))
gardener-robot-ci-1 commented 6 years ago

Comment by vlerenc Thursday Nov 23, 2017 at 03:24 GMT


We want to offer trial clusters, but we need quota/limits for those.

Per IaaS account (which we model today, maybe not ideally, as a secret):

Alternatively, for now, we could simplify that with a VM quota, but then we would need to restrict the machine types (since they differ significantly in their resources and prices, especially when looking at the GPU-based machine types).

Per Gardener project (which we model as namespace):

Note: We may have/want to add more resource types over time. Some are infrastructure-specific (if not abstracted), e.g. the disk type.

This way we could achieve the following:

Note: See also #101.
Note: Most of these quotas can be enforced by an admission controller in the garden cluster, but the LB and PVC size quotas will require similar controllers in the respective shoot clusters themselves, which we can handle with lower priority/later. Actually, it would be nice if LB and PVC size quotas could be specified at the IaaS account level for the entire IaaS account, but that would require the shoot cluster admission controller to call back to the garden cluster, which is currently not possible (not reachable), so let's go with the per-cluster simplification.
Note: We support worker count min/max, i.e. cluster auto-scaling. Check the max requirements against the quota, i.e. assume the worst case.
Note: If quotas are set for both the IaaS account and the Gardener project, both must comply (simplest strategy for now).
Note: Make sure a cluster member can't change these quotas. Later we may allow that, e.g. in a non-trial use case a customer purchases a certain quantity of resources and allots it to its projects.
Note: Secrets are currently placed into the Gardener project and must be readable, but we don't want that. They should be "somewhere else" outside the namespace, so that their contents are protected (not directly visible in the project; when a cluster is created they still appear in the tf-vars of the infrastructure Terraform files, which is bad, but acceptable for now). That would also ease setting the overall quota (per IaaS account). One possible solution to reference them is described in #102, but it's not elegant, as it requires two different code paths in the operator and UI to handle these cases. It would be better if we find a solution where the secret feels the same whether it's shared or not.

Implementation Proposal:

gardener-robot-ci-1 commented 6 years ago

Comment by vlerenc Thursday Nov 23, 2017 at 03:29 GMT


Thank you @i068969 also for the extended resources link. I checked it and it seems to be meant for a different use case. It allows including more than just CPUs/memory in the scheduler by advertising new resources (explicitly per node, as it states) that can then be requested (per pod). That's not what we are after here with quotas, I believe, but I don't know whether it can be "bent" to our needs (it doesn't look like it, though).

gardener-robot-ci-1 commented 6 years ago

Comment by mliepold Friday Dec 22, 2017 at 15:12 GMT


Tracking the quota status

Problem: Keeping the quota status in the status sub-resource of the quota is problematic, because with the current proposal there is only one quota resource for all projects that use the default quota. Proposals:

  1. Keep the quota states of all projects that use the default quota:

     apiVersion: garden.sapcloud.io/v1beta1
     kind: Quota
     metadata:
       name: trial-aws-project-quota
       namespace: garden-trial
     spec:
       scope: project
       clusterLifetimeDays: 14
       metrics:
         cpu: "20"
         memory: 400Gi
     status:
       metrics:
       - project: "garden-project1"
         cpu: "6"
         memory: 20Gi
       - project: "garden-project2"
         cpu: "15"
         memory: 230Gi
  2. Only keep the status at the quota object for the secret, and determine the status for the project quotas on the fly in the admission controller. This way, the expensive calculation of the quota status for the secret is done independently in the quota controller; only the much faster status calculation for the project quota is done in the admission controller during shoot cluster creation.
gardener-robot-ci-1 commented 6 years ago

Comment by mliepold Friday Dec 22, 2017 at 18:14 GMT


For the tracking of the Shoot cluster resources specified in the quotas, a gradual implementation will be done:

  1. The resources are only determined statically. This means only metrics that can be determined from the Shoot manifest can be used. Also, the maximum resource consumption is assumed: e.g. when specifying auto-scaling from 1 to 4 workers with a 2-CPU machine, 8 CPUs are added to the quota metric.
  2. Most resources are still determined statically, but for the number of used CPUs the NodeController is used to deliver the exact number of current nodes and thus the used CPUs.
  3. Dynamically determine all resources specified in the quota metrics. This requires the quota controller to call back to the Garden cluster, which is currently not possible.
gardener-robot-ci-1 commented 6 years ago

Comment by vlerenc Friday Dec 22, 2017 at 20:32 GMT


Let's have 2 quota controllers:

mliepold commented 6 years ago

Event handling

Quota Events:

CrossSecretBindings Events:

Shoot Events:

vlerenc commented 6 years ago

Moving to Gardener as this is not a dashboard story: https://github.com/gardener/gardener/issues/81.

@mliepold Can you please check which of your comments you would like to move to the new ticket (if you do it, it'll be under your name), e.g.: