This depends on the way your template was written. If you used the portal or the old cluster templates in our samples, the VM OS VHDs were assigned to one of five storage accounts in round-robin fashion. If you are sure a storage account is not being used by any of your VMs, you can delete it.
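A minimal sketch of how to check that, assuming the older unmanaged-disk templates and using placeholder resource group and scale set names (AzureRM PowerShell module):

```powershell
# List the storage containers this scale set's OS VHDs are spread across.
# "MyClusterRG" and "MyNodeType" are placeholders for your own names.
$vmss = Get-AzureRmVmss -ResourceGroupName "MyClusterRG" -VMScaleSetName "MyNodeType"
$vmss.VirtualMachineProfile.StorageProfile.OsDisk.VhdContainers
```

Any storage account whose containers do not show up in that list (and that does not hold your diagnostics logs) is a candidate for deletion.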
Sure. The way we are looking at this: we need to reduce the footprint of our cluster to save some money (we are going from 6 nodes to 3, the minimum for Bronze, until we get more funding). So in scaling down the VMs (which this article describes nicely), I was thinking about what else we need to do to clean up. That's when I saw the six storage accounts for the scale set still active. I had assumed there was a 1:1 mapping between node IDs and storage accounts, but given the round-robin assignment you describe, I'm worried I could end up removing the wrong storage account.
Our stateless and stateful services did not use local storage directly. The cluster was built about 18 months ago.
My document suggestion is to cover storage account cleanup as a side note so readers know there may be more work to do after reducing the number of scale set nodes.
We have experienced MANY problems with scaling Service Fabric machine counts. It takes forever for the machines to come online (frequently more than an hour), and often the entire cluster just doesn't scale at all: the new machines end up in a hung state, and we are charged for these non-functioning machines. I don't understand how the absolute most basic feature of a system like this doesn't even come close to working.
Whenever you have a production issue, please file a CSS ticket so that we can get to the bottom of what is going on. In general, most of the time is spent waiting for the VMSS instances to come up; once they do, the SF extensions get deployed, and then a fabric upgrade is rolled out. You can see the status of the fabric upgrades in SFX or via the API (https://docs.microsoft.com/en-us/powershell/module/servicefabric/get-servicefabricclusterupgrade?view=azureservicefabricps).
The durability choices you make also determine how fast the scale operations for any VMSS model change run.
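For example, a quick way to watch that rollout from PowerShell (the endpoint below is a placeholder, and a secured cluster would also need certificate parameters):

```powershell
# Connect to the cluster and check the status of the rolling fabric upgrade
Connect-ServiceFabricCluster -ConnectionEndpoint "mycluster.westus.cloudapp.azure.com:19000"
Get-ServiceFabricClusterUpgrade
```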
@quantumtunneling - I want to hear more details about your experiences. Do email me at chackdan@microsoft.com; I would like to set up a call to learn more so that we can improve the current offering.
We realize that cluster operations are not easy, and that is one of the main motivations for us to introduce https://aka.ms/sfmeshpreview, a serverless offering, so that the customer just focuses on the applications and we provide the infrastructure needed.
As for the documentation: @aljo-microsoft is updating it to reflect some of the suggestions we can provide to make cluster scaling operations easier.
Can you provide more details about what a CSS ticket is, where I create one, and how it differs from the ticket that I created?
I get that different machine sizes have different scalability behaviors, but regardless of machine size, we shouldn't be experiencing a failure to scale that leaves the entire cluster unresponsive for 12 hours or so, right? Can you elaborate on what caused this issue? Why is it happening? Why isn't it fixed?
Most importantly, we need more machines ASAP in that cluster. What do we need to do to scale it up?
Here is the support option for production support - https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-support#report-production-issues-or-request-paid-support-for-azure
Once the on-call engineer reviews the logs, they will be able to answer the specific issue you are hitting.
Thanks Chacko. Can you tell me how we can scale the cluster up? We desperately need more machines ASAP. Thanks,
Graham
Hello Graham, what node type are you scaling up? Do you need to scale vertically or horizontally? Are you adding new workloads or keeping existing ones? Are you running in Azure or on-premises? Determining how to scale up your cluster requires business and workload context. I'd be happy to take a call to learn more about this and figure out how we can help you achieve more. The following is some of our documentation on scaling:
https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-cluster-scale-up-down
https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-cluster-programmatic-scaling
https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-cluster-upgrade-primary-nodetype-vm
You should be able to go to the VMSS (mapped to the node type you want to add nodes to) and change the instance count; the VMs should get added to your cluster in 15-30 minutes. If that is not happening, then we need to review your cluster logs, for which you need to file a ticket at https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-support#report-production-issues-or-request-paid-support-for-azure. Personally, I have seen issues only on scale-downs, when the node count drops below what the reliability value dictates or the scale-down causes your services to go unhealthy.
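A minimal sketch of that instance-count change from PowerShell (AzureRM module; the resource group and scale set names are placeholders):

```powershell
# Scale out the scale set behind a node type by raising its capacity
$vmss = Get-AzureRmVmss -ResourceGroupName "MyClusterRG" -VMScaleSetName "WorkerP0"
$vmss.Sku.Capacity = 25   # desired total node count for this node type
Update-AzureRmVmss -ResourceGroupName "MyClusterRG" -VMScaleSetName "WorkerP0" -VirtualMachineScaleSet $vmss
```

Once the new instances come up and the SF extension on them starts, Service Fabric joins them to the cluster as nodes.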
If you are running stateless applications (and you are OK with losing session state on scale-down), you can just add a new VMSS (durability Bronze) and map it to an existing node type. This of course does not work for the primary node type, which can span only one VMSS.
Anyhow, a lot of the available options depend on the kinds of workloads you have running. Graham, if you would like a quick call to explore options, do send me an email and we can set up a time.
Hey aljo,
We had issues scaling up the WorkerP0 node type (all on Azure; we don't have any on-prem clusters). Yesterday at 10:30 AM PST, we manually scaled UP the WorkerP0 node type from 18 machines to 60 machines. Then at 11:30 AM PST, we initiated another scale UP before the first one had finished. We believe that the second scale-up tried to cancel the first (this idea needs verification), and that this was the cause of the cluster becoming unresponsive. Could you verify this idea? Thanks,
-Graham
Hey Chacko, VMs take around 45 minutes, sometimes longer, before they come online, and this has been an ongoing issue since day one. To recap what I told aljo:
We had issues scaling up the WorkerP0 node type (all on Azure; we don't have any on-prem clusters). Yesterday at 10:30 AM PST, we manually scaled UP the WorkerP0 node type from 18 machines to 60 machines. Then at 11:30 AM PST, we initiated another scale UP to 85 machines before the first one had finished. We believe that the second scale-up tried to cancel the first (this idea needs verification), and that this was the cause of the cluster becoming unresponsive. Could you verify this idea? Thanks,
Graham
It sounds like you attempted to scale out your primary node type from 18 to 60 VMs, and then a second Azure Resource Manager deployment attempted to scale from 60 to 85 total VMs before your initial scale-out to 60 had finished?
As mentioned previously, this is the document that describes how to scale out: https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-cluster-scale-up-down
It seems you have mostly understood it, since you are attempting the operation, but it doesn't appear you completed capacity planning if you are scaling out again before the first operation finished.
Our documentation on capacity planning is here: https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-cluster-capacity
To better help you achieve more, support options are defined here: https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-support#report-production-issues-or-request-paid-support-for-azure
You'll notice I've submitted a PR to provide additional context on scaling out, which will be released after our review process completes.
Hi, I noticed that after following the manual remove steps, the storage account for the VM that was removed was still enabled. Should we delete this manually too? May be worth a side note in this article. Regards, Mark.