Netflix-Skunkworks / service-capacity-modeling


Add instance m5d large #4

Closed · szimmer1 closed this 3 years ago

szimmer1 commented 3 years ago

@jolynch a couple of questions:

  1. The EC2 specs in aws.json seem very precise but aren't consistent with anything I've found online. Can you confirm the numbers in this PR look right?
  2. When I test the zookeeper model with a 10 GiB state size, I get a plan with two least_regret clusters, the last of which uses m5d.large instances. This isn't expected IMO, since the model should reject instances with less than 10 GiB of RAM (m5d.large has 8 GiB). Is my understanding correct? (Minimal repro below.)
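
For reference, here's roughly what I ran (a minimal sketch; the model name and interface fields are taken from the repo's README and interface module, so treat the exact signatures as assumptions):

```python
# Sketch: plan ZooKeeper capacity for an uncertain ~10 GiB of state.
from service_capacity_modeling.capacity_planner import planner
from service_capacity_modeling.interface import (
    CapacityDesires,
    DataShape,
    Interval,
)

plan = planner.plan(
    model_name="org.netflix.zookeeper",
    region="us-east-1",
    desires=CapacityDesires(
        service_tier=1,
        data_shape=DataShape(
            # State size is an uncertain range, not a point estimate
            estimated_state_size_gib=Interval(low=1, mid=10, high=100, confidence=0.98),
        ),
    ),
)

# least_regret is ordered from least to most regretful choice
# (attribute names assumed from the interface module)
for choice in plan.least_regret:
    print(choice.candidate_clusters.zonal[0].instance.name)
```
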
jolynch commented 3 years ago

> The EC2 specs in aws.json seem very precise but aren't consistent with anything I've found online. Can you confirm the numbers in this PR look right?

Unfortunately the public data is inaccurate, often saying GiB (base 2) when they mean GB (base 10), and some conversions just aren't close to what we really get when we boot boxes. @arunagrawal84 pointed this out and corrected the existing types, but in general it isn't a very pronounced effect (it introduces an error of <5% on most instances). That being said, if we have running instances we can always trust-but-verify (I left feedback for the two "true" numbers of RAM and disk).
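
For scale, the full base-10 vs base-2 gap looks like this (illustrative numbers, not from aws.json):

```python
# GB (base 10) vs GiB (base 2): an advertised "8 GB" is only ~7.45 GiB,
# so a spec that says GiB but means GB overstates memory by ~7%.
GB = 1000 ** 3   # bytes in a gigabyte (base 10)
GIB = 1024 ** 3  # bytes in a gibibyte (base 2)

advertised_gb = 8
actual_gib = advertised_gb * GB / GIB
print(f"{advertised_gb} GB = {actual_gib:.2f} GiB")  # 8 GB = 7.45 GiB
print(f"gap: {1 - actual_gib / advertised_gb:.1%}")  # gap: 6.9%
```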

> When I test the zookeeper model with a 10 GiB state size, I get a plan with two least_regret clusters, the last of which uses m5d.large instances. This isn't expected IMO, since the model should reject instances with less than 10 GiB of RAM (m5d.large has 8 GiB). Is my understanding correct?

If the desire were certain, this would be unexpected (i.e. if you moved the minimum data size up to 10 GiB and the maximum down from 100 GiB). The range represents the inherent uncertainty in the user input: they might end up with 1 GiB of data or 30 GiB of data. We regret paying for RAM we don't use (or not paying for RAM we need), so over many simulations the least regret model finds that r5d.large is least regretful and m5d.large is second least regretful (presumably due to the cases where we ended up with lots of traffic or less data). The choices are forced to different instance families in case there is insufficient capacity in the first one; we'd never see r5d.large followed by r5d.xlarge, because the point of offering multiple choices is to have a fallback when we can't buy the computer we want.

Least regret is like a weather forecast: it operates on very uncertain inputs that wildly affect the outcome, and it tries to minimize a regret function (a function of paying too much or too little, and of not having enough disk). In this case I think it's reasonable to add to the regret function when we don't have enough memory for the disk requirement.
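
To make that concrete, here's a toy sketch of least-regret selection (my illustration, not the library's actual regret function; the instance prices, regret weights, and sampling distribution are all made up):

```python
# Toy least-regret selection: for each candidate instance, sum regret over
# many simulated "realities" sampled from the uncertain state-size input.
import random

CANDIDATES = {  # name: (mem_gib, hourly_cost) -- rough figures, assumed
    "r5d.large": (16, 0.144),
    "m5d.large": (8, 0.113),
}

def regret(mem_gib: float, cost: float, needed_gib: float) -> float:
    over = max(mem_gib - needed_gib, 0) * 0.01   # paid for RAM we don't use
    under = max(needed_gib - mem_gib, 0) * 0.10  # lacked RAM we needed (worse)
    return cost + over + under

# Sample realities from a skewed distribution spanning roughly 1-100 GiB
random.seed(42)
realities = [random.lognormvariate(2.3, 0.9) for _ in range(10_000)]

totals = {
    name: sum(regret(mem, cost, need) for need in realities)
    for name, (mem, cost) in CANDIDATES.items()
}
print(sorted(totals, key=totals.get))  # ordered least -> most regretful
```
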

jolynch commented 3 years ago

@szimmer1 I fixed up the json and incorporated memory regret. Looking at that second plan, it's OK with m5d.large because in that particular reality there was only about 7.5 GiB of state:

```
ipdb> plan.least_regret[1].requirements.zonal[0].cpu_cores.mid
2.0
ipdb> plan.least_regret[1].requirements.zonal[0].mem_gib.mid
7.465308159055264
ipdb> plan.least_regret[1].requirements.zonal[0].disk_gib.mid
29.861232636221057
```

If we crank up the memory regret, we'll effectively ignore all the possible realities where the lower end of the input happens (e.g. the 1-10 GiB part of the range). If you don't want to consider those realities, I'd just raise the lower bound.
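
Concretely, if 10 GiB really is a hard floor, something like this (a sketch; the Interval bounds are illustrative and the field name is assumed from the current interface):

```python
# Encode the hard floor in the input rather than the regret function: raise
# `low` so no simulated reality has less than 10 GiB of state, which makes
# instances with < 10 GiB of RAM stop looking attractive.
from service_capacity_modeling.interface import DataShape, Interval

data_shape = DataShape(
    estimated_state_size_gib=Interval(low=10, mid=20, high=100, confidence=0.98),
)
```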