frmscoe / General-Issues

This repo exists to track current work and any issues within the FRMS CoE

Scaling dynamically on-demand using the same platform software composition #300

Open Justus-at-Tazama opened 8 months ago

Justus-at-Tazama commented 8 months ago

Story statement

As a [the beneficiary of this feature],
I want [what does the beneficiary want to be able to do?],
So that [what is the benefit or value of the feature?]
And so that [list ALL the benefits, one at a time]

Acceptance criteria

  1. [How will we know that the feature is completely and correctly implemented?]

Justus-at-Tazama commented 8 months ago

Design Authority review

Attending

Jason Darmanovich, Johan Foley, Justus Ortlepp

Problem statement

Solution/Approach

JD: It is important to establish sensible metrics and an approach for evaluating the different tools, so that we can model the problem we are trying to solve during testing.

JD: KEDA (https://keda.sh/) appears to have the functionality we are looking for.
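As a rough illustration of what that functionality looks like in practice, the sketch below builds a hypothetical KEDA ScaledObject for a rule-processor Deployment, scaling on a Prometheus query, and applies it through the Kubernetes Python client. The deployment name, namespace, Prometheus address, query and threshold are all placeholder assumptions, not values agreed in this review.

```python
# Minimal sketch: create a KEDA ScaledObject for a hypothetical
# "rule-processor" Deployment, scaling on a Prometheus query.
# All names, addresses and thresholds below are illustrative only.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster

scaled_object = {
    "apiVersion": "keda.sh/v1alpha1",
    "kind": "ScaledObject",
    "metadata": {"name": "rule-processor-scaler", "namespace": "processing"},
    "spec": {
        "scaleTargetRef": {"name": "rule-processor"},  # Deployment to scale
        "minReplicaCount": 1,
        "maxReplicaCount": 20,
        "triggers": [
            {
                "type": "prometheus",
                "metadata": {
                    "serverAddress": "http://prometheus.monitoring:9090",
                    "query": 'sum(rate(transactions_received_total{app="rule-processor"}[1m]))',
                    "threshold": "100",  # scale out when the query exceeds ~100 per replica
                },
            }
        ],
    },
}

# ScaledObject is a CRD in the keda.sh/v1alpha1 API group
client.CustomObjectsApi().create_namespaced_custom_object(
    group="keda.sh",
    version="v1alpha1",
    namespace="processing",
    plural="scaledobjects",
    body=scaled_object,
)
```

Under the hood KEDA creates and manages a Horizontal Pod Autoscaler for the target workload, which is relevant to the HPA question in the next steps below.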

Next steps

  1. Figure out which metrics are available and how to interpret and act on changes (a sketch of querying the metrics API follows after this list).
  2. Benchmark the Azure platform's time to respond to scaling demand requests.
  3. How does KEDA actually work?
     a. i.e. how does it talk to a new node that has just been added?
     b. How long does it take to scale processors onto the new node?
  4. Ensure the pod scheduler does not prematurely evict running pods in order to move them.
     a. Disable the Horizontal Pod Autoscaler (HPA)?
  5. Test and document.
  6. Once the methodology is established, determine adequate mechanisms for scaling stateful services.
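
For step 1, one way to see which resource metrics the cluster already exposes is to query the metrics.k8s.io API (served by metrics-server); custom and external metrics registered by adapters appear under custom.metrics.k8s.io and external.metrics.k8s.io in the same way. A minimal sketch, assuming cluster access and metrics-server installed; the "processing" namespace is a placeholder:

```python
# Minimal sketch: list current node and pod resource usage from metrics.k8s.io.
# Requires metrics-server in the cluster; purely illustrative.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# Current CPU/memory usage per node
node_metrics = api.list_cluster_custom_object("metrics.k8s.io", "v1beta1", "nodes")
for item in node_metrics["items"]:
    print(item["metadata"]["name"], item["usage"])

# Current CPU/memory usage per pod in a namespace (placeholder: "processing")
pod_metrics = api.list_namespaced_custom_object(
    "metrics.k8s.io", "v1beta1", "processing", "pods"
)
for item in pod_metrics["items"]:
    containers = {c["name"]: c["usage"] for c in item["containers"]}
    print(item["metadata"]["name"], containers)
```

Sampling these readings while replaying a known transaction load would also give a baseline for the Azure scale-up timing benchmark in step 2.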