When the SQL node is joined to the cluster (running xSQLServerSetup with the value 'Addnode' for the Action parameter) it causes a restart of the node. For your scenario, are you seeing this as a parameter for xCluster or as a separate resource? Don't you need the roles on the node when you run 'Addnode'?
The SQL FCI roles are a bit odd: while they need to be installed on a cluster, and the role itself (e.g. Get-ClusterGroup -Name SQLFailoverInstance) can be queried from the secondary nodes, you can't move the group to a node until you've performed the AddNode SQL install on that node. Being a role/group/resource owner is therefore irrelevant to the SQL ADDNODE install action. All the installer cares about is that the server is a member of the cluster and that it has access to a CSV NTFS/ReFS volume.
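To illustrate, from a secondary node that hasn't had ADDNODE run yet (using the SQLFailoverInstance name from the example above; a rough sketch, not exact output), the behaviour looks like this:

```powershell
# Works from any cluster member: the group metadata is cluster-wide.
Get-ClusterGroup -Name 'SQLFailoverInstance'

# This is what won't succeed until the SQL ADDNODE install has been run on
# this node, because the SQL binaries aren't installed locally yet.
Move-ClusterGroup -Name 'SQLFailoverInstance' -Node $env:COMPUTERNAME
```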
The real problem here isn't the cluster itself or even SQL; it's the Storage Spaces Direct setup for the cluster volume that's impacted by the install, which is why I'm requesting the ability to "pause" a node, or put it into maintenance mode, so that it isn't actively hosting anything that needs HA. Unfortunately, the rapid-fire reboots caused by DSC impact S2D negatively in this case and caused active cluster roles to fail because the storage didn't have a chance to replicate. This does self-heal given enough time, but it causes a long enough outage to make the original SQL instance fail.
Part of this would be addressed automatically if failover clustering drained roles (specifically storage ownership) from a node before a user-triggered restart, or if the SQL installer didn't require a reboot after running ADDNODE, but that's not something that can be handled in the context of this resource.
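For reference, the manual equivalent of what I'm asking for is roughly the standard drain/resume pattern from the FailoverClusters module (a sketch only; the -Failback choice is just an example):

```powershell
# Drain all roles (including storage ownership) off this node so S2D isn't
# caught mid-replication by the install's reboot.
Suspend-ClusterNode -Name $env:COMPUTERNAME -Drain -Wait

# ...run the SQL ADDNODE install and let the required reboot happen...

# Bring the node back into rotation once the reboots are finished.
Resume-ClusterNode -Name $env:COMPUTERNAME -Failback Immediate
```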
I'm already working on doing this in a Script resource, but I thought it would be handy for other services out there that run on a Windows failover cluster and need to put a node into an "installer/reboot ready" state to maintain the integrity of the cluster during setup/enforcement, and then, once the services are set up and the reboots are finished, activate the node again.
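Roughly what that Script resource approach looks like (a minimal sketch, assuming the FailoverClusters module is present on the node; the resource names and structure here are mine, not part of any module):

```powershell
Configuration DrainForSqlAddNode
{
    Import-DscResource -ModuleName 'PSDesiredStateConfiguration'

    Node 'Node2'
    {
        # Pause and drain the node so it isn't hosting S2D storage or roles
        # while the SQL ADDNODE install and its reboot run.
        Script DrainClusterNode
        {
            GetScript  = { @{ Result = (Get-ClusterNode -Name $env:COMPUTERNAME).State } }
            TestScript = { (Get-ClusterNode -Name $env:COMPUTERNAME).State -eq 'Paused' }
            SetScript  = { Suspend-ClusterNode -Name $env:COMPUTERNAME -Drain -Wait | Out-Null }
        }

        # ...the xSQLServerSetup resource with Action = 'AddNode' would sit
        # here, with DependsOn = '[Script]DrainClusterNode'...

        # Resume the node once the install and its reboots are done.
        Script ResumeClusterNode
        {
            GetScript  = { @{ Result = (Get-ClusterNode -Name $env:COMPUTERNAME).State } }
            TestScript = { (Get-ClusterNode -Name $env:COMPUTERNAME).State -eq 'Up' }
            SetScript  = { Resume-ClusterNode -Name $env:COMPUTERNAME -Failback Immediate | Out-Null }
            DependsOn  = '[Script]DrainClusterNode'
        }
    }
}
```

The obvious flaw is that every consistency run re-drains and then re-resumes the node, which is part of the tangle described below.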
It would probably need to be a separate feature, somewhat like xWaitForCluster, except that it somehow needs to run both before and after the install DSC, and the "before" action needs to be able to determine whether anything that runs between it and the "after" action is actually in compliance before bringing the node into maintenance mode, to prevent DSC from pausing the node over and over. Honestly, it's a bit of a tangle and a good reason to say no to this request. I guess another method would be to have this "wrap" the action that requires the node to be in that state, but I don't think there's any precedent for doing that, nor am I sure it's a good idea.
This is being done on Azure DS2v2 instances using premium storage as a test bed, so these are neither the slowest nor the fastest machines.
Oh dear, I just realized I specified xCluster. I should have said xFailOverCluster. Sorry.
Bah, I think I'm going to pull this request back out. I was testing the Script resource to effectively do the same thing, and pausing the node apparently breaks the install. This feature request wouldn't help my problem in the long run. Go ahead and close.
Details of the scenario you tried and the problem that is occurring: Building a SQL failover cluster with S2D underpinning it. I'm able to get the primary node configured and completely stood up, but because of the nature of S2D, some of the disk roles are assigned to the node where I'm deploying the AddNode action of SQL, and the restarts are causing the storage to fail because it's associated with the node being restarted. I know the restarts are coming, so it would be nice to drain the roles ahead of time and then start them back up once my DSC installs are done.
The DSC configuration that is using the resource (as detailed as possible): I'm not going to paste the exact DSC, as the existing DSC doesn't really cover the scenario.
Node1 {
    configure cluster (currently using a modified version that takes multiple node values, as S2D can't create disks until a second node is available)
    configure S2D and disks
    install initial SQL cluster instance
    configure SQL settings
}

Node2 {
    waits for cluster
    waits for SQL role
    checks for SQL settings
    (this is where I would like to drain roles off of this node)
    install AddNode <- mandatory restart
    (this is where I would like to resume roles on this node)
    configure FCI preferred owners/groups
}
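For context, the "waits for cluster" step on Node2 is just the stock xWaitForCluster resource; a skeleton of that configuration (the cluster name and retry values are placeholders) would look roughly like this:

```powershell
Configuration Node2Skeleton
{
    Import-DscResource -ModuleName 'xFailOverCluster'

    Node 'Node2'
    {
        # "waits for cluster"
        xWaitForCluster WaitForCluster
        {
            Name             = 'SQLCLUSTER01'   # placeholder cluster name
            RetryIntervalSec = 30
            RetryCount       = 60
        }

        # "waits for SQL role" / "checks for SQL settings" would follow here,
        # then the drain step, the AddNode install (mandatory restart), the
        # resume step, and finally the FCI preferred owners/groups.
    }
}
```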
Version of the Operating System and PowerShell the DSC Target Node is running: Server 2016, PowerShell 5.1
Version of the DSC module you're using, or 'dev' if you're using current dev branch: 1.7.0, but this is new functionality that's not available in the current release.