I prefer the idea of "small is the default for all environments, prod and non-prod; if a bigger size is required, request a justification from the product team."
There is also the idea of Terminating and Non-Terminating pods, which have separate quotas. Terminating pods are things like the deployer pod, builds, and cron jobs. Since these are short-lived, they can reasonably have a larger quota, as it isn't used all the time.
I just ran some Excel quartile functions on one of the project stats spreadsheets. 75% of projects have a CPU request of 1.7 cores or less, and a limit of 6 cores or less.
So I would propose that the small quota is REQ 2 / LIMIT 4 cores. Double for Medium and double again for Large. As the nodes we've specced out have 8 times as much RAM (in GiB) as cores, we can probably use that ratio for quotas as well. So small would be REQ 16Gi / LIMIT 32Gi.
Terminating and Non-Terminating quotas can be the same size as each other.
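For concreteness, here's a minimal sketch of how that split could be expressed as a pair of scoped ResourceQuota objects, using the small sizes proposed above (the object names are hypothetical):

```yaml
# Quota for long-running pods (no activeDeadlineSeconds set).
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-long-running   # hypothetical name
spec:
  scopes:
    - NotTerminating
  hard:
    requests.cpu: "2"
    limits.cpu: "4"
    requests.memory: 16Gi
    limits.memory: 32Gi
---
# Separate quota for short-lived pods (builds, deployer pods, jobs),
# the same size as the long-running one per the suggestion above.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-time-bound     # hypothetical name
spec:
  scopes:
    - Terminating
  hard:
    requests.cpu: "2"
    limits.cpu: "4"
    requests.memory: 16Gi
    limits.memory: 32Gi
```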
Matt just showed me that we can manage quotas at a more global scale in OCP 4 with ClusterResourceQuota, which might be helpful for implementing t-shirt sized quotas.
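For reference, a minimal sketch of what that could look like, assuming project namespaces carry a size label (both the object name and the label are hypothetical):

```yaml
apiVersion: quota.openshift.io/v1
kind: ClusterResourceQuota
metadata:
  name: tshirt-small            # hypothetical name
spec:
  selector:
    labels:
      matchLabels:
        quota-size: small       # hypothetical label on project namespaces
  quota:
    hard:
      requests.cpu: "2"
      limits.cpu: "4"
      requests.memory: 16Gi
      limits.memory: 32Gi
```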
> Selecting more than 100 projects under a single multi-project quota can have detrimental effects on API server responsiveness in those projects.
LOL, nevermind.
Project/namespace provisioning vs. quota for dynamic/"ephemeral" (PR-based) environments.
Understanding that a quota can be increased upon request, what is the maximum? (There may be some technical limitation/constraint.)
From the (Wildfire) predictive services teams:
We’re projecting that our production database is going to be growing by about 80GB a year (so with high availability, assuming 3 replicas, that’s 240GB a year), and if we assume keeping data around for 5 years (there’s no hard answer on that yet – the retention period may be longer) – that puts us at about 1.2TB.
CPU/RAM pod defaults:
CPU/RAM namespace quotas (per namespace):

- Small (provisioned by default for new namespaces). Long-running workload quotas:
  - CPU: 4 cores as request, 6 cores as limit
  - RAM: 16GB as request, 32GB as limit
- Medium (needs to be requested and justified). Long-running workload quotas:
  - CPU: 8 cores as request, 16 cores as limit
  - RAM: 32GB as request, 64GB as limit
- To be continued: spike-workload/time-bound quota (e.g. concurrent builds or cron jobs):
  - CPU: 8 cores as request, 16 cores as limit
  - RAM: 32GB as request, 64GB as limit
- Large (needs to be requested and justified). Long-running workload quotas:
  - CPU: 16 cores as request, 32 cores as limit
  - RAM: 64GB as request, 128GB as limit

Namespace storage quotas:

- Small (provisioned by default for new namespaces): 20 PVCs, 50GB overall storage, with 25GB for backup storage
- Medium: 20 PVCs, 100GB overall storage, with 50GB for backup storage
- Large: 20 PVCs, 200GB overall storage, with 100GB for backup storage
Process for requesting a large quota: to be determined.
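As a sketch, the default Small package above could be expressed as a single per-namespace ResourceQuota like this (the object name and the backup storage class name are assumptions; the Terminating/NotTerminating split discussed earlier is omitted for brevity):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: small-default               # hypothetical name
spec:
  hard:
    requests.cpu: "4"               # long-running workload request
    limits.cpu: "6"                 # long-running workload limit
    requests.memory: 16Gi
    limits.memory: 32Gi
    persistentvolumeclaims: "20"    # PVC count
    requests.storage: 50Gi          # overall storage
    # Backup storage can be capped per storage class; the class name is an assumption.
    backup.storageclass.storage.k8s.io/requests.storage: 25Gi
```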
I think the CPU, Memory, and Storage should each have their own size.
Maybe a team has Small CPU needs, Medium Memory needs, and Large Storage Needs.
We also need to define the Custom size: how is it incremented (maybe always by one "small", or by an order of magnitude?), and what is the upper limit/constraint?
As you already mentioned the long-running vs. time-bound quotas, we also need to talk about the best-effort quota.
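For what it's worth, a best-effort quota would use the BestEffort scope; a quota scoped that way can only cap the pod count, since best-effort pods declare no requests or limits (the name and number below are placeholders):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: best-effort     # hypothetical name
spec:
  scopes:
    - BestEffort        # matches only pods with no requests or limits
  hard:
    pods: "10"          # placeholder cap; the actual number is TBD
```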
Also, maybe due to how CPU usage is aggregated over 100ms intervals (the CFS quota period), a spike in database workload may experience throttling-induced freezes/slowness.
Can we also talk about user project self-provisioning quota? Was the ProjectRequestLimit plugin removed from OpenShift 4?
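As far as I can tell, the ProjectRequestLimit admission plugin from 3.x is gone in OCP 4, and the documented alternative is all-or-nothing: strip the subjects from the self-provisioners cluster role binding so regular users can't create projects at all. Roughly:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: self-provisioners
  annotations:
    # Stops the API server from restoring the default subjects.
    rbac.authorization.kubernetes.io/autoupdate: "false"
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: self-provisioner
subjects: []   # default was the system:authenticated:oauth group; removed here
```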
I like the idea of Medium being what is standard now. Small should be half and large should be double for simplicity.
I'm dealing with a few extremes: Cullen (kyrwwq-prod), which could easily fit into an extra-small, and OrgBook (devex-von-bc-tob-prod), which is already a large and bumps into scaling limitations. However, most environments could likely fit into a small.
Small should likely be the default until the team figures out otherwise.
I also like the idea of making concessions for environments (dev/tools) that are using PR-Based approaches.
@mitovskaol , we didn't have time to talk about the impact of those resources on teams using PR-based pipelines with dynamic environments. While I agree that each environment should be very lightweight, it still needs to be representative of (as close as possible to) production in both architecture and data (it does not need to be a full dataset, just a small yet representative one).
I know I've had teams requesting the ability to simply spin up whole new temporary namespaces (which makes cleanup much easier) as opposed to trying to put multiple environments in the same namespace.
So, the quota will have a direct impact on their ability to have multiple temporary/isolated environments, particularly in their dev namespace.
We will be implementing this model for all namespaces in the Silver cluster and will monitor whether this approach works; if not, the quotas will need adjustment.
I would agree with the notes from Clecio.
We are using the BCDK pipeline, a PR-based pipeline that builds and spins up all apps for the PR and removes them after the changes are tested. So we need Large CPU and memory but Small storage for our Dev. For our Test and Prod, we need Medium CPU and memory and Large storage.
@kuanfandevops We discussed it as a team, and it was found that having different quotas for different namespaces would create too much administrative overhead for us to manage; thus, all namespaces in the project set will be given the same quota, either small, medium, or large. This functionality has now been implemented in the Project Registry; however, we will be asking for justification from the team when they request a quota increase.
In March 2021, the PVC counts for medium and large size quotas were increased from 20 PVCs to 40 PVCs and 60 PVCs respectively.
I've confirmed the updated PVC counts in one of our Medium and Large project sets.
With the goal of making resource usage on the Platform more efficient, a new quota system is needed to allocate resources to projects on the OCP 4 Platform (Silver and Gold/DR clusters). One approach is to have a T-shirt-size-based resource allocation (small, medium, and large), where a certain amount of resources from the three resource categories (CPU/RAM/Storage) is pre-packaged together and applied to all namespaces in a project set.
All new namespaces in OCP 4 will be created with the default quotas (small sizes for CPU/RAM/Storage as defined below). When a team needs to increase their project quota, they can do so in the Project Registry following the pre-determined upgrade path: small -> medium -> large.
- Medium (needs to be requested and justified):
  - Long-running workload quotas: CPU: 8 cores as request, 16 cores as limit; RAM: 32GB as request, 64GB as limit
  - Storage: 40 PVCs, 100GB overall storage, with 50GB for backup storage
- Large (needs to be requested and justified):
  - Long-running workload quotas: CPU: 16 cores as request, 32 cores as limit; RAM: 64GB as request, 128GB as limit
  - Storage: 60 PVCs, 200GB overall storage, with 100GB for backup storage
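To round this out, the "CPU/RAM pod defaults" item earlier in the thread would typically be implemented as a LimitRange in each namespace; the chosen default values aren't recorded here, so the numbers below are placeholders only:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits          # hypothetical name
spec:
  limits:
    - type: Container
      defaultRequest:           # applied when a container omits resources.requests
        cpu: 100m               # placeholder value
        memory: 256Mi           # placeholder value
      default:                  # applied when a container omits resources.limits
        cpu: 250m               # placeholder value
        memory: 512Mi           # placeholder value
```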