xCluster: Automatic Offline Cluster Node Removal breaks Cluster Aware Updating and S2D workflows

jambar42 commented 3 years ago

Details of the scenario you tried and the problem that is occurring

When Cluster-Aware Updating reboots a Storage Spaces Direct cluster node, the node is left in an offline state until the storage repair jobs complete. If the DSC resource runs during this window, it removes the node from the failover cluster, which breaks the storage repair. The only way to get the node back into the cluster is to run Clear-ClusterNode.

Verbose logs showing the problem

The DSC configuration that is used to reproduce the issue (as detailed as possible)

The operating system the target node is running

OsName               : Microsoft Windows Server 2019 Datacenter
OsOperatingSystemSKU : DatacenterServerEdition
OsArchitecture       : 64-bit
WindowsVersion       : 1809
WindowsBuildLabEx    : 17763.1.amd64fre.rs5_release.180914-1434
OsLanguage           : en-US
OsMuiLanguages       : {en-US}

Version and build of PowerShell the target node is running

Name                           Value
----                           -----
PSVersion                      5.1.17763.1852
PSEdition                      Desktop
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0...}
BuildVersion                   10.0.17763.1852
CLRVersion                     4.0.30319.42000
WSManStackVersion              3.0
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1

Version of the DSC module that was used

1.16.0

jambar42 commented 3 years ago

I'll work on this item sometime over the next month.

jambar42 commented 3 years ago

https://github.com/dsccommunity/xFailOverCluster/blob/f4c289ae2e09d49c0a69bb081ab55f27c3cdd69e/source/DSCResources/MSFT_xCluster/MSFT_xCluster.psm1#L232

^^^ offending line of code
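For readers who don't want to open the link, here is a minimal sketch of the decision being discussed. It assumes the linked line evicts the target node whenever the cluster reports it as `Down`. `Test-ShouldEvictNode` is a hypothetical helper written for illustration and is not part of xFailOverCluster. The `KeepDownedNodesInCluster` name comes from the proposal later in this thread.

```powershell
# Hypothetical helper (not part of xFailOverCluster): models the eviction
# decision so it can be reasoned about without a live cluster.
function Test-ShouldEvictNode
{
    [CmdletBinding()]
    [OutputType([System.Boolean])]
    param
    (
        # Node state as reported by the cluster, e.g. 'Up', 'Down', 'Paused'.
        [Parameter(Mandatory = $true)]
        [System.String]
        $NodeState,

        # Opt-out proposed in this thread; defaults to the current behavior.
        [Parameter()]
        [System.Boolean]
        $KeepDownedNodesInCluster = $false
    )

    # Only a node reported as 'Down' is ever a candidate for eviction.
    if ($NodeState -ne 'Down')
    {
        return $false
    }

    # With the opt-out set, leave the node alone so storage repair can finish.
    return (-not $KeepDownedNodesInCluster)
}

# Assumed current behavior: a downed node is evicted.
Test-ShouldEvictNode -NodeState 'Down'

# Proposed behavior with the opt-out: the downed node is kept.
Test-ShouldEvictNode -NodeState 'Down' -KeepDownedNodesInCluster $true
```

Defaulting the opt-out to `$false` keeps existing configurations behaving as they do today, which is the non-breaking path discussed in this thread.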

nickgw commented 2 years ago

@johlju I was coming here to create an issue for this because my org has run into it as well. Do you have an opinion on whether we should scrap automatically kicking downed nodes, or add a switch that lets us opt out of kicking them?

The second option maintains current functionality, but imo automatically kicking downed nodes was a bad idea in the first place.

johlju commented 2 years ago

I think I'd rather see a switch named KeepDownedNodesInCluster; when it is $true, the resource does not remove nodes. Then we don't make a breaking change.

nickgw commented 2 years ago

@johlju Made a new PR with KeepDownedNodesInCluster as a parameter!
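For anyone landing here from a search, a configuration using the new parameter might look roughly like the sketch below. It assumes your installed module version already includes `KeepDownedNodesInCluster` on the `xCluster` resource, so check the changelog for your version first. The configuration name, resource instance name, cluster name, and IP address are placeholders.

```powershell
Configuration S2DClusterNode   # placeholder configuration name
{
    param
    (
        [Parameter(Mandatory = $true)]
        [System.Management.Automation.PSCredential]
        $DomainAdministratorCredential
    )

    Import-DscResource -ModuleName xFailOverCluster

    Node 'localhost'
    {
        xCluster 'JoinCluster'   # placeholder instance name
        {
            Name                          = 'Cluster01'           # placeholder cluster name
            StaticIPAddress               = '192.168.100.20/24'   # placeholder address
            DomainAdministratorCredential = $DomainAdministratorCredential

            # Assumed to be available in your module version: do not evict
            # this node while the cluster reports it as down.
            KeepDownedNodesInCluster      = $true
        }
    }
}
```

Leaving the parameter out should keep the existing behavior, where a downed node is removed and re-added.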
