bcgov / wps

Wildfire Predictive Services to support decision making in prevention, preparedness, response and recovery
Apache License 2.0

DEV Task: Review required CPU Request limit #4143

Open dgboss opened 1 day ago

dgboss commented 1 day ago

We need to review our required CPU request limits in e1e498-dev and e1e498-test and respond to the Platform Services email before Dec 30. They are basing their CPU quota clawback on a single point-in-time observation of resource usage. Since we didn't have any PRs open in e1e498-dev and nothing running in e1e498-test at that moment, they want to claw back a lot of quota.

From: Platform Services Team Mailbox CITZ:EX <PlatformServicesTeam@gov.bc.ca>
Date: Thursday, November 28, 2024 at 10:39 AM
To: McLoughlin, Neal FOR:EX <Neal.McLoughlin@gov.bc.ca>, Boss, Darren WLRS:EX <Darren.Boss@gov.bc.ca>, Brady, Conor WLRS:EX <Conor.Brady@gov.bc.ca>
Subject: CPU Quota Adjustment Notice for e1e498

Hello Team,

The Platform Services team has identified a mismatch between the amount of CPU quota assigned to your Project, e1e498, and the CPU resources your application is using. We plan to correct this by adjusting your namespace's CPU quota in one month, on December 30th, 2024 at 9 am.

What does this mean?

Currently, your namespace e1e498-dev has 32 core of CPU limit allocated, while your application claims 0.325 core of CPU limit. This setup means that, even at a high load, your application will only use up to 0.325 core. We plan to reduce your namespace quota to 1.5 core, ensuring you'll still have ample CPU available for your needs.

The namespace e1e498-prod is a production environment. Please note that -prod namespace allocations are for current use statistics only, and this update will exclude all -prod namespaces from any quota adjustments. Currently, your namespace e1e498-prod has 32 core of CPU limit allocated, while your application claims 14.38 core of CPU limit. This setup means that, even at a high load, your application will only use up to 14.38 core. We are not gonna reduce your prod namespace quota to 16 core, but please be aware of this and you should only claim CPU that your project needs.

Currently, your namespace e1e498-test has 32 core of CPU limit allocated, while your application claims 3.005 core of CPU limit. This setup means that, even at a high load, your application will only use up to 3.005 core. We plan to reduce your namespace quota to 4 core, ensuring you'll still have ample CPU available for your needs.

Will this affect my application?

Optimizing resource allocation in the Platform Product Registry allows fair access to resources across all teams. By reducing your unused “reserved” CPU, we can support the growing needs of existing teams and onboard new ones effectively.

Will this affect my application?

This adjustment won’t impact your running applications. Everything running in e1e498-dev,e1e498-prod,e1e498-test on 2024-11-27 at 14:02 will continue to run smoothly, with ample space for cron-jobs. Your pod(s) CPU requests and limits will stay the same and no downtime will occur.

What if I need the extra quota for other deployments?

This change won’t affect any deployments you have running now. But if your team plans to run more deployments or resource-intensive tasks (such as virus scans) in this namespace later, please let us know. We appreciate your help in optimizing resource use! Just reply to this email and we’ll keep your namespace out of the quota adjustment.

We appreciate your efforts in optimizing resource use! Reply to this email, and we’ll exclude from the quota adjustment.

When will the change happen?

This change is scheduled for December 30th, 2024 at 9 am.

Is there any action needed on my part?

No action is required, but feel free to reach out with questions or concerns by replying to this email. If you manage multiple Projects with mismatched quotas, you may receive additional emails like this. Note that only dev, test, or tools namespaces are affected by this change—prod namespaces will remain unaffected.

Best regards,

The Platform Services team

dgboss commented 1 day ago

Here's a screenshot of resource utilization in dev for a single PR deployment. Image
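
A rough way to put a number on this is to check how many PR deployments would still fit under a given quota. The sketch below is only an illustration: the quota and claimed figures for e1e498-dev come from the email above, but the per-PR CPU limits in `HYPOTHETICAL_PR_LIMITS` are made-up placeholders, not values read from the screenshot or from our deployment templates. Swap in the real container limits before drawing any conclusions.

```python
from decimal import Decimal


def parse_cpu(quantity: str) -> Decimal:
    """Convert a Kubernetes CPU quantity ("250m" or "1.5") to cores."""
    q = quantity.strip()
    if q.endswith("m"):
        # The "m" suffix means millicores: 1000m == 1 core.
        return Decimal(q[:-1]) / Decimal(1000)
    return Decimal(q)


def pr_deployments_that_fit(quota: str, claimed: str, per_pr_limits: list[str]) -> int:
    """Number of whole PR deployments that fit in the unclaimed CPU limit quota."""
    per_pr = sum((parse_cpu(x) for x in per_pr_limits), Decimal(0))
    if per_pr <= 0:
        raise ValueError("per-PR CPU limits must sum to a positive value")
    headroom = parse_cpu(quota) - parse_cpu(claimed)
    # Never report a negative count when the namespace is already over quota.
    return max(0, int(headroom // per_pr))


# Figures for e1e498-dev taken from the Platform Services email.
DEV_CURRENT_QUOTA = "32"
DEV_PROPOSED_QUOTA = "1.5"
DEV_CLAIMED = "0.325"

# ASSUMPTION: placeholder CPU limits for the containers in one PR deployment.
HYPOTHETICAL_PR_LIMITS = ["500m", "250m", "250m"]

if __name__ == "__main__":
    for label, quota in (("current", DEV_CURRENT_QUOTA), ("proposed", DEV_PROPOSED_QUOTA)):
        n = pr_deployments_that_fit(quota, DEV_CLAIMED, HYPOTHETICAL_PR_LIMITS)
        print(f"{label} quota of {quota} cores: room for {n} PR deployment(s)")
```

With those placeholder limits (1 core per PR deployment), the proposed 1.5 core quota leaves room for one PR deployment, versus 31 under the current 32 cores. The real answer depends on the actual per-PR limits and on how many PRs we expect to have open at once.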