bcgov / WALLY

Search for data, reports and other information to support water allocation decision making.

Reduce WALLY Slack #635

Closed LolandaE closed 2 years ago

LolandaE commented 2 years ago

Describe the task

The pods for WALLY are only using 1.4% of the CPU resources requested. We need to reduce the amount we're requesting. This can be done through the OpenShift console.
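The same change can also be made from the CLI. A minimal sketch, where the deployment config name and the request/limit values are placeholders rather than WALLY's actual settings:

```sh
# Inspect the current resource requests/limits (deployment name is hypothetical)
oc get dc/wally-api -o jsonpath='{.spec.template.spec.containers[*].resources}'

# Lower the CPU request; keep a higher limit to absorb spikes
oc set resources dc/wally-api --requests=cpu=50m --limits=cpu=250m
```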

Acceptance Criteria

Additional context

Notes from Platform Services Team

The pods in your project set d1b5d2 are using about 1.4% of the CPU resources requested. When resources are requested but not used (colloquially known as "slack"), they prevent other teams in the community from using them, limit teams on-boarding to the platform, or require us to unnecessarily purchase additional hardware. I will send out a calendar invite to meet via Teams soon to see how I can help reduce the "slack" in the above-mentioned project set. In the meantime, here are a few pro-tips that are sure to help:

Want to meet at your convenience? Use this link to book some time.

How to spot waste (Slack)?

Open your dev, test, and prod namespaces in the OpenShift web console. Click on the CPU graph and expand the time range to two weeks (2w). If you see a large gap between the blue and yellow lines, then there is work to be done.
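If you prefer the CLI, a rough way to compare actual usage against requests (this assumes the cluster metrics API is available; the namespace is a placeholder):

```sh
# Actual CPU/memory consumption per pod (requires cluster metrics)
oc adm top pods -n <your-namespace>

# Requested CPU per pod, for comparison against the numbers above
oc get pods -n <your-namespace> \
  -o custom-columns=NAME:.metadata.name,CPU_REQ:'.spec.containers[*].resources.requests.cpu'
```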

How to better use CPU resources?

The best way to reduce waste is to make sure your pod's CPU request is no more than 2x what it actually uses; use limit values to account for spikes in usage. Here are several best practices to help eliminate waste:

1. In your deployment manifests, bring your usage and request into alignment. Target setting your request to be a maximum of 2x what your pod uses on any given day. Deal with occasional usage spikes by adjusting your CPU limit, as per point 4 below.
2. Dev and test pods don't need to be provisioned like production. Be aggressive with your requested CPU in dev and test, and as needed a little more generous in production.
3. Consider putting unused environments (dev, test, tools) to sleep if they're no longer under active development. Instructions can be found in the Pro Tips below.
4. Set your limit notably higher than your request to account for usage spikes, latency, or other performance-related issues. Since none of our nodes/servers are at capacity, and as others optimize resource usage, you should be able to consistently get your limit values; OpenShift will always try to give you your limit value.
5. Use different request and limit values for dev, test, and prod. Dev does not need to be provisioned like production, nor does test.
6. Consider keeping just 1 pod for your services in dev, a maximum of 2 in test, and 3 in production. You can also use a Horizontal Pod Autoscaler (HPA) to automatically create new pods when existing ones are busy. (A CLI sketch of points 1, 5, and 6 follows this list.)
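As a concrete illustration of points 1, 5, and 6, here is a rough sketch using the `oc` CLI. The deployment config name, project names, and all values are hypothetical, not taken from WALLY's actual configuration:

```sh
# Dev: small request, modest limit, single replica (values are examples only)
oc set resources dc/wally-api --requests=cpu=25m --limits=cpu=100m -n <project>-dev
oc scale --replicas=1 dc/wally-api -n <project>-dev

# Prod: request roughly 2x observed usage, with a higher limit for spikes
oc set resources dc/wally-api --requests=cpu=100m --limits=cpu=500m -n <project>-prod

# Prod: HPA keeps between 1 and 3 pods, adding pods when average
# CPU usage passes 80% of the requested value
oc autoscale dc/wally-api --min=1 --max=3 --cpu-percent=80 -n <project>-prod
```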

Pro Tips 🤓

- Put a service to sleep until it detects network traffic. This is a great way to conserve resources if you're no longer actively using dev, test, or tools. Use the command oc idle <service-name-here> or see an example here.
- If you won't be using the environment for an extended period, scale down the deployments with the command oc scale --replicas=0 dc/<deployment-config-name-here>
- Use your limit more aggressively. It's only oversized request values that are causing us issues.
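For example, putting a whole dev environment to sleep (service, deployment config, and project names are placeholders):

```sh
# Idle the services in dev; their pods scale to zero and wake
# automatically on the next network request
oc idle wally-api wally-web -n <project>-dev

# Or scale a specific deployment config to zero until it's needed again
oc scale --replicas=0 dc/wally-api -n <project>-dev
```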