Update to Resource While in Scaled Down State

droessmj commented 3 years ago

Is your feature request related to a problem? Please describe.

Behavioral clarification question -- what's the expected behavior if my resource is in a scaled down state and I release my "desired state" ARM template which changes the resource's normal target state?

For example, imagine I'm working with a VM. Normal scaled up state is a Standard_D8_v4, scaled down state is a Standard_D2_v4. While in a scale-down window, I redeploy my resource/resource group with a desired scale up state of Standard_D16_v4. My understanding is that the deployment would apply the newly defined state (VM changed to D16). However, the next time the Engine runs it would queue a scale down for this resource if it were still in the scale-down window. Then the resource would be scaled down to the target down state (VM returns to D2) with the newly defined state (D16) remaining as the target scale up state. Does this understanding align with your own? As you build out e2e tests is this something we could work into the mix?

I'm looking at this from the perspective of an enterprise cloud foundations product owner. Super useful feature set, but at our scale I'm sure we'll need to have answers for the edge cases.

DCMattyG commented 3 years ago

Hey @droessmj, thanks so much for reaching out!

The current way Bellhop functions is based on tags applied to the resource. Two sets of tags matter here, and I'll use them to describe how the scaling operations work. On each target resource, a series of tags prefixed with "setState-" describes the target state when the resource is scaled down; these are what Bellhop reads to determine the dimensions of the resource for the scale down operation. When the resource is scaled down, Bellhop uses another set of tags prefixed with "saveState-" to remember the state of the resource prior to scaling it down so that it can be scaled up as it was before. When the resource is in the window to be scaled down, Bellhop will check for the existence of the "saveState-" tags, and that is how it determines that the particular resource is in the scaled down state.
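To make the tag mechanics above concrete, here is a minimal Python sketch. It is not Bellhop's actual code: the "setState-" and "saveState-" prefixes come from the description above, while the tag suffix `VMSize`, the dictionary layout of a resource, the function names, and the removal of "saveState-" tags on scale up are all assumptions made for illustration.

```python
SET_PREFIX = "setState-"    # tags describing the scaled down target state
SAVE_PREFIX = "saveState-"  # tags remembering the state before scale down


def is_scaled_down(tags):
    """A resource counts as scaled down if any saveState- tag exists."""
    return any(key.startswith(SAVE_PREFIX) for key in tags)


def scale_down(resource):
    """Apply setState- values and remember the old values in saveState- tags.

    `resource` is a hypothetical dict: {"properties": {...}, "tags": {...}}.
    """
    tags = resource["tags"]
    if is_scaled_down(tags):
        return resource  # saveState- tags present: nothing to do

    props = dict(resource["properties"])
    new_tags = dict(tags)
    for key, target in tags.items():
        if key.startswith(SET_PREFIX):
            name = key[len(SET_PREFIX):]
            new_tags[SAVE_PREFIX + name] = props.get(name)  # remember old value
            props[name] = target                            # apply target value
    return {"properties": props, "tags": new_tags}


def scale_up(resource):
    """Restore values from saveState- tags and drop those tags (assumed)."""
    props = dict(resource["properties"])
    new_tags = {}
    for key, value in resource["tags"].items():
        if key.startswith(SAVE_PREFIX):
            props[key[len(SAVE_PREFIX):]] = value
        else:
            new_tags[key] = value
    return {"properties": props, "tags": new_tags}


vm = {
    "properties": {"VMSize": "Standard_D8_v4"},
    "tags": {"setState-VMSize": "Standard_D2_v4"},
}
down = scale_down(vm)
up = scale_up(down)
```

In this sketch, running `scale_down` a second time on an already scaled down resource is a no-op, which mirrors the "check for the existence of the saveState- tags" behavior described above.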

In your scenario, what would happen is that the resource would be in the scaled down state (Standard_D2_v4), and the "saveState-" tags would be set to remember that the previous size was a Standard_D8_v4. Then, presumably an ARM template (or equivalent) would be applied to change the VM to a Standard_D16_v4 while the VM was in a scaled "down" state from the perspective of Bellhop. At this point, since the "saveState-" tags would likely still be applied to the resource (depending on how you update the VM), Bellhop wouldn't take any action and the VM would remain as a Standard_D16_v4 until the scale up window rolls around. At that point the VM would be reverted to a Standard_D8_v4, as that is the metadata that Bellhop placed in the "saveState-" tags.
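The sequence described above can be traced with a small simulation. This is only a sketch of the described behavior, not Bellhop's implementation: the tag names `setState-VMSize` and `saveState-VMSize`, the `redeploy` and `engine_run` helpers, and the assumption that the redeployment leaves existing tags in place are all hypothetical.

```python
SET_TAG = "setState-VMSize"    # assumed tag name for the scale down target
SAVE_TAG = "saveState-VMSize"  # assumed tag name for the remembered size


def redeploy(vm, new_size):
    """Simulate an ARM deployment that changes the size but keeps the tags."""
    return {"size": new_size, "tags": dict(vm["tags"])}


def engine_run(vm, in_scale_down_window):
    """One pass of a hypothetical engine over a single VM."""
    tags = dict(vm["tags"])
    if in_scale_down_window:
        if SAVE_TAG in tags:
            return vm  # looks scaled down already, so no action is taken
        tags[SAVE_TAG] = vm["size"]
        return {"size": tags[SET_TAG], "tags": tags}
    if SAVE_TAG not in tags:
        return vm  # nothing to restore
    restored_size = tags.pop(SAVE_TAG)
    return {"size": restored_size, "tags": tags}


# Starting point: VM already scaled down from D8 to D2.
vm = {
    "size": "Standard_D2_v4",
    "tags": {SET_TAG: "Standard_D2_v4", SAVE_TAG: "Standard_D8_v4"},
}

after_deploy = redeploy(vm, "Standard_D16_v4")              # template applied
during_window = engine_run(after_deploy, in_scale_down_window=True)
after_window = engine_run(during_window, in_scale_down_window=False)

for label, state in [("after deploy", after_deploy),
                     ("during window", during_window),
                     ("after window", after_window)]:
    print(label, state["size"])
```

Under these assumptions the VM stays at Standard_D16_v4 for the rest of the scale-down window and is then set to Standard_D8_v4, so the newly deployed size is lost at scale up.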

In the near future, we'll be moving away from using tags for configuration and likely into a database (such as CosmosDB) which will allow much more robust scaling configurations that would simply result in way too many tags today. Once that happens, we can have more advanced logic to identify if a state change has taken place while the VM was in a scaled down state, and correct things accordingly.

All of that said, I'd love to hear more about your use case and we'll certainly do what we can to incorporate those aspects into the upcoming iterations of Bellhop. I hope that cleared things up for you!

droessmj commented 3 years ago

Great response, @DCMattyG. I did not get deep enough into the implementation code to note the saveState tags. The switch to backing with CosmosDB makes sense to me as does everything else you outlined above.

Would the roadmap include outright deprovisioning of a resource during off hours? This may not be something that's supported across the board as there are data persistence problems to solve for -- although recovery from last backup could feasibly solve for this depending on backup frequency (SQL - probably OK, VM - eh...). This may sound kind of stupid in isolation, but we have a ton of Premium SKU App Service Plans that must be Premium for the security controls provided. Realistically, it could be cheaper 9/10 times to deprovision them per the schedule -- if we can solve for handling the Apps (Function/Web/Logic) built atop them in the interim.

I see you have a work item to build out a scaler template. Are there any specific resources you'd consider out of scope? I see your work item for AKS VMSS and agree with the content thus far. I'm curious if Application Gateways would be a candidate for a Scaler implementation? The v1 SKU seems like a no-brainer, but even the v2 SKU could use some tweaking of autoscale min/maxes with Bellhop (as min capacity increases baseline run rate).

Another roadmap thought would be whether there's a way to allow for an ad-hoc "temporarily return to scale up state for the current scale down window only". I'm thinking of the times where our QA will want to run load tests off-hours. It's not a regular occurrence, but it's regular enough that they'd need an interface to return a given workload to normal scale pre-test, ideally returning to scaled-down state post-test. I'm sure the current interface could support it with some larger process hackery, but I'm thinking a single tag with precedence could also serve the same purpose.
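One way to picture the "single tag with precedence" idea is a sketch like the following. Nothing here exists in Bellhop: the tag name `overrideScaleUp-Until`, its ISO 8601 timestamp value, and the `should_scale_down` helper are purely hypothetical and only illustrate how an override could take precedence over the normal schedule.

```python
from datetime import datetime, timezone

OVERRIDE_TAG = "overrideScaleUp-Until"  # hypothetical tag, not a Bellhop tag


def should_scale_down(tags, in_scale_down_window, now):
    """Decide whether to scale down, letting an unexpired override win."""
    if not in_scale_down_window:
        return False
    until = tags.get(OVERRIDE_TAG)
    if until is not None and now < datetime.fromisoformat(until):
        return False  # override still active: stay at normal scale
    return True


tags = {"overrideScaleUp-Until": "2021-06-01T06:00:00+00:00"}
during_test = datetime(2021, 6, 1, 2, 0, tzinfo=timezone.utc)
after_test = datetime(2021, 6, 1, 7, 0, tzinfo=timezone.utc)

keep_scaled_up = not should_scale_down(tags, True, during_test)
resume_scale_down = should_scale_down(tags, True, after_test)
```

Because the override carries its own expiry, the workload would return to the scaled-down state on the next engine run after the test, without anyone having to remove the tag.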

I'd be interested in hopping in a bit on some of these items above once I have a more clear picture of desired direction. A few more tests would also go a long way towards making this more approachable, imho.

CloudViking commented 3 years ago

Hi @droessmj! To echo @DCMattyG, thank you so much for reaching out to us and engaging on the Bellhop project!

This is definitely a great conversation to have and we appreciate all of the ideas and feedback. Let me see if I have a complete understanding of all of the asks, and I will respond inline:

Complete Azure Resource deprovisioning/shutdown
- As you pointed out this is difficult to provide across the board because of the complexity of data persistence and state. We specifically built Bellhop with this in mind and opted to go with the scale down method as this was far easier to implement and kept the actual Bellhop infrastructure costs down. The idea of recovering from the last backup is interesting but is not a priority for Bellhop right now.
Scaler for Application Gateway
- This is a good feature request and deserves research. Initial pricing indicates that a Basic "Large" App Gateway is ~$233.60/month to run, while a Basic "Small" costs ~$18.25/month. Good idea!
Any Out of Scope Services?
- There are no inherently out-of-scope services for Bellhop. We welcome anyone to submit a Feature Request for a service they would like covered by Bellhop!
- We would also encourage anyone to fork the Bellhop repo, develop any changes they would like to see, and submit a PR back to our Main branch for review. (We should have an initial test framework implemented very soon)
More complete time management
- We are currently working on/investigating how to implement more complete scheduling/time management via Issue #8
- Running tests off hours is a good use case to scale up semi-regularly during a scale-down period
- Is your idea something like a "testState-" tag, and then being able to set that to some semi-regular schedule?

We would be happy to hop on a call with you to discuss further. You can also email the team at bellhop@microsoft.com with any future questions. We look forward to future collaboration.

Thank you!

droessmj commented 3 years ago

@CloudViking I will follow up via email on further discussions.

Re: Idea for test management - I don't have a preferred implementation offhand. I don't know if a temporary superseding tag is better than amending the normal tags to force the size up, or if there are other approaches that could solve the problem. I was mostly thinking through the problems I'd expect to encounter if we embraced this tool.

CloudViking commented 3 years ago

Sounds great! We do really appreciate the engagement and are happy to discuss further use cases, enhancements, and issues anytime.

I opened Issue #46 for the App Gateway Scaler.

Re: Test management - Good thoughts on this. We will absolutely consider any ideas and implementations for this. Even getting a clearer picture of real use cases is helpful.

I will close this issue now. Feel free to reach out with any other questions.

Thank you.

Azure / bellhop

Update to Resource While in Scaled Down State #44