dsccommunity / ActiveDirectoryDsc

This module contains DSC resources for deployment and configuration of Active Directory Domain Services.
MIT License

ADDomain fails after reboot #581

Closed lkt82 closed 4 years ago

lkt82 commented 4 years ago

Details of the scenario you tried and the problem that is occurring

After rebooting a node the state is reported as failed. The problem is that the ADDomain resource fails with "Server instance not found".

Eventually the node reaches the Compliant state.

Verbose logs showing the problem

"Exception":  {
    "Message":  "PowerShell DSC resource MSFT_ADDomain  failed to execute Test-TargetResource functionality with error message: System.InvalidOperationException: Error getting AD domain \u0027pensam.azure\u0027. (ADD0013) ---\u003e System.ArgumentException: Server instance not found on the given port. ---\u003e System.ServiceModel.FaultException: The operation failed because of a bad parameter.\r\n   --- End of inner exception stack trace ---\r\n   at Microsoft.ActiveDirectory.Management.AdwsConnection.ThrowExceptionForFaultDetail(FaultDetail faultDetail, FaultException faultException)\r\n   at Microsoft.ActiveDirectory.Management.AdwsConnection.ThrowException(AdwsFault adwsFault, FaultException faultException)\r\n   at Microsoft.ActiveDirectory.Management.AdwsConnection.SearchAnObject(ADSearchRequest request)\r\n   at Microsoft.ActiveDirectory.Management.AdwsConnection.Search(ADSearchRequest request)\r\n   at Microsoft.ActiveDirectory.Management.ADWebServiceStoreAccess.Microsoft.ActiveDirectory.Management.IADSyncOperations.Search(ADSessionHandle handle, ADSearchRequest request)\r\n   at Microsoft.ActiveDirectory.Management.ADObjectSearcher.GetRootDSE()\r\n   at Microsoft.ActiveDirectory.Management.Commands.ADCmdletBase`1.GetRootDSE()\r\n   at Microsoft.ActiveDirectory.Management.Commands.ADCmdletBase`1.GetConnectedStore()\r\n   at Microsoft.ActiveDirectory.Management.Commands.ADCmdletBase`1.GetCmdletSessionInfo()\r\n   at Microsoft.ActiveDirectory.Management.Commands.ADGetCmdletBase`3.ADGetCmdletBaseProcessCSRoutine()\r\n   at Microsoft.ActiveDirectory.Management.CmdletSubroutinePipeline.Invoke()\r\n   at Microsoft.ActiveDirectory.Management.Commands.ADCmdletBase`1.ProcessRecord()\r\n   --- End of inner exception stack trace --- ",
    "Data":  {},
    "InnerException":  {
        "ErrorRecord":  "System.InvalidOperationException: Error getting AD domain \u0027testad.azure\u0027. (ADD0013) ---\u003e System.ArgumentException: Server instance not found on the given port. ---\u003e System.ServiceModel.FaultException: The operation failed because of a bad parameter.\r\n   --- End of inner exception stack trace ---\r\n   at Microsoft.ActiveDirectory.Management.AdwsConnection.ThrowExceptionForFaultDetail(FaultDetail faultDetail, FaultException faultException)\r\n   at Microsoft.ActiveDirectory.Management.AdwsConnection.ThrowException(AdwsFault adwsFault, FaultException faultException)\r\n   at Microsoft.ActiveDirectory.Management.AdwsConnection.SearchAnObject(ADSearchRequest request)\r\n   at Microsoft.ActiveDirectory.Management.AdwsConnection.Search(ADSearchRequest request)\r\n   at Microsoft.ActiveDirectory.Management.ADWebServiceStoreAccess.Microsoft.ActiveDirectory.Management.IADSyncOperations.Search(ADSessionHandle handle, ADSearchRequest request)\r\n   at Microsoft.ActiveDirectory.Management.ADObjectSearcher.GetRootDSE()\r\n   at Microsoft.ActiveDirectory.Management.Commands.ADCmdletBase`1.GetRootDSE()\r\n   at Microsoft.ActiveDirectory.Management.Commands.ADCmdletBase`1.GetConnectedStore()\r\n   at Microsoft.ActiveDirectory.Management.Commands.ADCmdletBase`1.GetCmdletSessionInfo()\r\n   at Microsoft.ActiveDirectory.Management.Commands.ADGetCmdletBase`3.ADGetCmdletBaseProcessCSRoutine()\r\n   at Microsoft.ActiveDirectory.Management.CmdletSubroutinePipeline.Invoke()\r\n   at Microsoft.ActiveDirectory.Management.Commands.ADCmdletBase`1.ProcessRecord()\r\n   --- End of inner exception stack trace ---",
        "WasThrownFromThrowStatement":  true,
        "Message":  "System.InvalidOperationException: Error getting AD domain \u0027testad.azure\u0027. (ADD0013) ---\u003e System.ArgumentException: Server instance not found on the given port. ---\u003e System.ServiceModel.FaultException: The operation failed because of a bad parameter.\r\n   --- End of inner exception stack trace ---\r\n   at Microsoft.ActiveDirectory.Management.AdwsConnection.ThrowExceptionForFaultDetail(FaultDetail faultDetail, FaultException faultException)\r\n   at Microsoft.ActiveDirectory.Management.AdwsConnection.ThrowException(AdwsFault adwsFault, FaultException faultException)\r\n   at Microsoft.ActiveDirectory.Management.AdwsConnection.SearchAnObject(ADSearchRequest request)\r\n   at Microsoft.ActiveDirectory.Management.AdwsConnection.Search(ADSearchRequest request)\r\n   at Microsoft.ActiveDirectory.Management.ADWebServiceStoreAccess.Microsoft.ActiveDirectory.Management.IADSyncOperations.Search(ADSessionHandle handle, ADSearchRequest request)\r\n   at Microsoft.ActiveDirectory.Management.ADObjectSearcher.GetRootDSE()\r\n   at Microsoft.ActiveDirectory.Management.Commands.ADCmdletBase`1.GetRootDSE()\r\n   at Microsoft.ActiveDirectory.Management.Commands.ADCmdletBase`1.GetConnectedStore()\r\n   at Microsoft.ActiveDirectory.Management.Commands.ADCmdletBase`1.GetCmdletSessionInfo()\r\n   at Microsoft.ActiveDirectory.Management.Commands.ADGetCmdletBase`3.ADGetCmdletBaseProcessCSRoutine()\r\n   at Microsoft.ActiveDirectory.Management.CmdletSubroutinePipeline.Invoke()\r\n   at Microsoft.ActiveDirectory.Management.Commands.ADCmdletBase`1.ProcessRecord()\r\n   --- End of inner exception stack trace ---",
        "Data":  "System.Collections.ListDictionaryInternal",
        "InnerException":  "System.Exception: System.InvalidOperationException: Error getting AD domain \u0027testad.azure\u0027. (ADD0013) ---\u003e System.ArgumentException: Server instance not found on the given port. ---\u003e System.ServiceModel.FaultException: The operation failed because of a bad parameter.\r\n   --- End of inner exception stack trace ---\r\n   at Microsoft.ActiveDirectory.Management.AdwsConnection.ThrowExceptionForFaultDetail(FaultDetail faultDetail, FaultException faultException)\r\n   at Microsoft.ActiveDirectory.Management.AdwsConnection.ThrowException(AdwsFault adwsFault, FaultException faultException)\r\n   at Microsoft.ActiveDirectory.Management.AdwsConnection.SearchAnObject(ADSearchRequest request)\r\n   at Microsoft.ActiveDirectory.Management.AdwsConnection.Search(ADSearchRequest request)\r\n   at Microsoft.ActiveDirectory.Management.ADWebServiceStoreAccess.Microsoft.ActiveDirectory.Management.IADSyncOperations.Search(ADSessionHandle handle, ADSearchRequest request)\r\n   at Microsoft.ActiveDirectory.Management.ADObjectSearcher.GetRootDSE()\r\n   at Microsoft.ActiveDirectory.Management.Commands.ADCmdletBase`1.GetRootDSE()\r\n   at Microsoft.ActiveDirectory.Management.Commands.ADCmdletBase`1.GetConnectedStore()\r\n   at Microsoft.ActiveDirectory.Management.Commands.ADCmdletBase`1.GetCmdletSessionInfo()\r\n   at Microsoft.ActiveDirectory.Management.Commands.ADGetCmdletBase`3.ADGetCmdletBaseProcessCSRoutine()\r\n   at Microsoft.ActiveDirectory.Management.CmdletSubroutinePipeline.Invoke()\r\n   at Microsoft.ActiveDirectory.Management.Commands.ADCmdletBase`1.ProcessRecord()\r\n   --- End of inner exception stack trace ---",
        "TargetSite":  "System.Collections.ObjectModel.Collection`1[System.Management.Automation.PSObject] Invoke(System.Collections.IEnumerable)",
        "StackTrace":  "   at System.Management.Automation.Runspaces.PipelineBase.Invoke(IEnumerable input)\r\n   at System.Management.Automation.PowerShell.Worker.ConstructPipelineAndDoWork(Runspace rs, Boolean performSyncInvoke)\r\n   at System.Management.Automation.PowerShell.Worker.CreateRunspaceIfNeededAndDoWork(Runspace rsToUse, Boolean isSync)\r\n   at System.Management.Automation.PowerShell.CoreInvokeHelper[TInput,TOutput](PSDataCollection`1 input, PSDataCollection`1 output, PSInvocationSettings settings)\r\n   at System.Management.Automation.PowerShell.CoreInvoke[TInput,TOutput](PSDataCollection`1 input, PSDataCollection`1 output, PSInvocationSettings settings)\r\n   at System.Management.Automation.PowerShell.Invoke(IEnumerable input, PSInvocationSettings settings)\r\n   at Microsoft.PowerShell.DesiredStateConfiguration.Internal.ResourceProviderAdapter.ExecuteCommand(PowerShell powerShell, ResourceModuleInfo resInfo, String operationCmd, List`1 acceptedProperties, CimInstance nonResourcePropeties, CimInstance resourceConfiguration, LCMDebugMode debugMode, PSInvocationSettings pSInvocationSettings, UInt32\u0026 resultStatusHandle, Collection`1\u0026 result, ErrorRecord\u0026 errorRecord, PSModuleInfo localRunSpaceModuleInfo)",
        "HelpLink":  null,
        "Source":  "System.Management.Automation",
        "HResult":  -2146233087
    },
    "TargetSite":  null,
    "StackTrace":  null,
    "HelpLink":  null,
    "Source":  null,
    "HResult":  -2146233079
},
"TargetObject":  null,
"CategoryInfo":  {
    "Category":  7,
    "Activity":  "",
    "Reason":  "InvalidOperationException",
    "TargetName":  "",
    "TargetType":  ""
},
"FullyQualifiedErrorId":  "ProviderOperationExecutionFailure",
"ErrorDetails":  null,
"InvocationInfo":  null,
"ScriptStackTrace":  null,
"PipelineIterationInfo":  []

Suggested solution to the issue

Detect that the node is starting and wait for dependencies to be ready

The DSC configuration that is used to reproduce the issue (as detailed as possible)

Node $AllNodes.NodeName
{

    TimeZone TimeZone { IsSingleInstance = 'Yes'; TimeZone = 'W. Europe Standard Time'}

    WaitForDisk DataDisk { DiskId = 2; RetryIntervalSec = 60; RetryCount = 60 }

    Disk DataVolume { DiskId = 2; DriveLetter = $dataDriveLetter; Size = 127.98GB; DependsOn = '[WaitForDisk]DataDisk' }

    WindowsFeature ADDS  {  Name = "AD-Domain-Services" }  

    ADDomain AD
    {
        DomainName                    = $DomainName
        Credential                    = $AdminCreds
        SafemodeAdministratorPassword = $SafeModeAdminCreds
        DomainNetBiosName             = $domainNetbiosName
        ForestMode                    = 'WinThreshold'
        DatabasePath                  = "${dataDriveLetter}:\NTDS"
        LogPath                       = "${dataDriveLetter}:\Logs"
        SysvolPath                    = "${dataDriveLetter}:\SYSVOL"
        DependsOn = '[Disk]DataVolume'
    }

    RemoteDesktopAdmin RemoteDesktopSettings
    {
        IsSingleInstance   = 'yes'
        Ensure             = 'Present'
        UserAuthentication = 'NonSecure'

        DependsOn = "[ADDomain]AD"
    }

    PendingReboot Reboot
    {
        Name = "Prior to installing tools"
        DependsOn = "[ADDomain]AD"
    }

    WaitForADDomain Ready
    {
        DomainName                    = $DomainName
        Credential                    = $AdminCreds
        DependsOn = "[PendingReboot]Reboot"
    }

    WindowsFeature RSAT 
    { 
        Ensure = "Present" 
        Name = "RSAT"

        DependsOn = "[WaitForADDomain]Ready"
    }

    WindowsFeature AdminCenter 
    { 
        Ensure = "Present" 
        Name = "RSAT-AD-AdminCenter"

        DependsOn = "[WindowsFeature]RSAT"
    }

    ADOptionalFeature RecycleBin
    {
        FeatureName                       = "Recycle Bin Feature"
        EnterpriseAdministratorCredential = $AdminCreds
        ForestFQDN                        = $DomainName
    }
}

The operating system the target node is running

OsName               : Microsoft Windows Server 2019 Datacenter
OsOperatingSystemSKU : DatacenterServerEdition
OsArchitecture       : 64-bit
WindowsVersion       : 1809
WindowsBuildLabEx    : 17763.1.amd64fre.rs5_release.180914-1434
OsLanguage           : en-US
OsMuiLanguages       : {en-US}

Version and build of PowerShell the target node is running

PSVersion                      5.1.17763.1007
PSEdition                      Desktop
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0...}
BuildVersion                   10.0.17763.1007
CLRVersion                     4.0.30319.42000
WSManStackVersion              3.0
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1

Version of the DSC module that was used

PSDesiredStateConfiguration 2.12.0
ActiveDirectoryDsc  6.0.0
ComputerManagementDsc 8.0.0
StorageDsc 4.9.0.0

X-Guardian commented 4 years ago

Hi @lkt82, thanks for reporting this. This is a similar issue to #574, and we added an additional exception to the try/catch block around Get-ADDomain for System.InvalidOperationException in PR #577 to solve this. This was released in v6.0.0 of the module. Are you 100% sure you are using v6.0.0 of the ActiveDirectoryDsc module, not one of the pre-release versions?

lkt82 commented 4 years ago

I have just tried to reprovision a new node. I waited until DSC had completed and then ran a manual reboot. The node reports the same error as described.

Output from Get-Module on the node:

Get-Module -ListAvailable -Name ActiveDirectoryDsc

Directory: C:\Program Files\WindowsPowerShell\Modules

ModuleType Version    Name                                ExportedCommands
---------- -------    ----                                ----------------
Manifest   6.0.0      ActiveDirectoryDsc                  Find-DomainController

johlju commented 4 years ago

@lkt82 Can you add verbose messages too, to better see what is happening? See https://github.com/dsccommunity/ActiveDirectoryDsc/issues/574#issuecomment-588420838, but the added code should go before line 113, as the first thing in the catch block.

erictorbenson commented 4 years ago

@johlju and @X-Guardian -- I just ran into this when trying to provision a brand new domain/DC in Azure. I triple-checked that I was using the most current ActiveDirectoryDSC (6.0.0) and am still stuck. The DSC gets down to the VM fine, the domain build happens, but the "domain not found" exception is triggered after the first reboot, and kills my ARM template deployment.

One thing I did find is that in MSFT_ADDomain.psm1, starting at the do loop on line 96, whatever exception is being triggered doesn't fall under the three types specifically called out. Instead it hits the generic catch, which calls New-InvalidOperationException, so it never falls into the retry behavior that I was used to in previous versions of this DSC module. (Azure DCs would always take a few extra minutes to come up after the first install, and the code was able to handle that by retrying several times.) It looks like whatever I'm hitting triggers a System.ArgumentException.

I'm attaching the log showing the verbose output of the error -- it's the same one everyone else is getting ("Server instance not found on the given port") DscExtensionHandler.0.20200405-201642.txt

X-Guardian commented 4 years ago

Hi @erictorbenson. Can you add the following verbose messages at line 113 of MSFT_ADDomain.psm1 (within the GetAdDomainUnexpectedError catch block) so that we can see full details of the exception being thrown.

Write-Verbose -Message "ErrorRecord Details: $($_|fl -Force|Out-String)" -Verbose
Write-Verbose -Message "Exception Details: $($_.Exception|fl -Force|Out-String)" -Verbose
Write-Verbose -Message "Inner Exception Details: $($_.Exception.InnerException|fl -Force|Out-String)" -Verbose
Write-Verbose ('Exception Name: ' + $_.Exception.GetType().FullName) -Verbose
Write-Verbose ('Inner Exception Name: ' + $_.Exception.InnerException.GetType().FullName) -Verbose

erictorbenson commented 4 years ago

> Hi @erictorbenson. Can you add the following verbose messages at line 113 of MSFT_ADDomain.psm1 (within the GetAdDomainUnexpectedError catch block) so that we can see full details of the exception being thrown.

Done! (And I learned that Publish-AzVMDSCConfiguration will download the latest dependencies instead of using what you have in the folder...after running through a whole build. :-)

File is attached. Interestingly I found a positional parameter error: "[ERROR] A positional parameter cannot be found that accepts argument '+'." (I replaced the actual domain name with "redactedfqdn")

Any help would be appreciated...everything in the DSC resource looks OK so it might be something it's calling externally?

DscExtensionHandler.0.20200406-212544.txt

johlju commented 4 years ago

@erictorbenson it was a bug in the snippet that @X-Guardian provided above that you hit instead of the actual error we are looking for. It didn't like the + sign in the last two rows :/. Sorry, but can you change to the following snippet and run again? 🙏

I verified that the snippet below outputs what we are looking for. The last two rows have parentheses now:

Write-Verbose -Message "ErrorRecord Details: $($_|fl -Force|Out-String)" -Verbose
Write-Verbose -Message "Exception Details: $($_.Exception|fl -Force|Out-String)" -Verbose
Write-Verbose -Message "Inner Exception Details: $($_.Exception.InnerException|fl -Force|Out-String)" -Verbose
Write-Verbose ('Exception Name: ' + $_.Exception.GetType().FullName) -Verbose
Write-Verbose ('Inner Exception Name: ' + $_.Exception.InnerException.GetType().FullName) -Verbose
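
The positional parameter error reported earlier in the thread is consistent with how PowerShell parses command arguments. The sketch below illustrates this under the assumption that the failing call had the form `Write-Verbose 'text' + $value -Verbose`; the variable names `$name` and `$bindingError` are illustrative only. In argument mode the `+` is passed as a separate positional argument, while parentheses force the concatenation to be evaluated first.

```powershell
# Illustrative value only; in the resource this would come from $_.Exception.
$name = 'System.ArgumentException'
$bindingError = $null

# Argument mode: 'Exception Name: ', '+' and $name are bound as three separate
# positional arguments, so parameter binding fails on '+'.
try {
    Write-Verbose 'Exception Name: ' + $name -Verbose
}
catch {
    $bindingError = $_
    Write-Warning -Message "Binding failed: $($_.Exception.Message)"
}

# Expression mode: the parentheses make PowerShell evaluate the concatenation
# first and pass the result as a single argument.
Write-Verbose ('Exception Name: ' + $name) -Verbose

# Equivalent alternative using string expansion and the named -Message parameter.
Write-Verbose -Message "Exception Name: $name" -Verbose
```

Either of the last two forms avoids the binding error; the named `-Message` parameter makes the intent slightly more explicit.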

erictorbenson commented 4 years ago

> Sorry, but can you change to the following snippet and run again? 🙏

No problem, I'm happy that someone is able to look at this because it's holding up a project for me!

I do have another clue to contribute though...it took me 3 tries to reproduce this. It has to have something to do with timing because successful attempts worked flawlessly. Maybe it's a startup-order thing where DSC is starting before the basic AD services? It takes a good 4 or 5 minutes for new DCs to start in Azure (and on prem too) after the first reboot. Sorry about the vague issue reporting...I'm new to DSC and everything I've done so far with it has worked pretty well, so I haven't had time to dig into the internals.

DscExtensionHandler.0.20200407-180024.log

X-Guardian commented 4 years ago

Thanks @erictorbenson, it looks like we need to add System.ArgumentException to the list of retry exceptions in the resource.

If you want to prove this yourself, you just need to add [System.ArgumentException] to the list of exceptions in the catch statement at line 103 of the MSFT_ADDomain.psm1 file in the module.
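
As a rough illustration of the pattern being discussed (not the module's actual code), the sketch below retries an action only when it throws one of the exception types listed in a typed catch, and lets everything else propagate. The function `Invoke-WithRetry` and its parameters are hypothetical names introduced for this example, and only two exception types are listed to keep it self-contained.

```powershell
# Hypothetical helper, not part of ActiveDirectoryDsc: retries an action when it
# throws one of the exception types named in the typed catch below.
function Invoke-WithRetry {
    [CmdletBinding()]
    param (
        [Parameter(Mandatory = $true)]
        [scriptblock] $Action,

        [int] $MaxRetries = 5,

        [int] $RetryIntervalSeconds = 30
    )

    $attempt = 0
    while ($true) {
        try {
            return & $Action
        }
        catch [System.InvalidOperationException], [System.ArgumentException] {
            # Treated as transient: the service may still be starting.
            $attempt++
            if ($attempt -gt $MaxRetries) { throw }
            Write-Verbose -Message "Attempt $attempt failed with $($_.Exception.GetType().FullName); retrying."
            Start-Sleep -Seconds $RetryIntervalSeconds
        }
        # Any other exception type is not caught here and propagates to the caller.
    }
}

# Example usage on a domain controller (requires the ActiveDirectory module):
# Invoke-WithRetry -Action { Get-ADDomain -Identity 'testad.azure' -ErrorAction Stop } -Verbose
```

With System.ArgumentException in the list, a "Server instance not found on the given port" failure would lead to another attempt instead of an immediate error, which matches the behavior described in this thread.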

johlju commented 4 years ago

This will soon be published as a preview release. As soon as the pipeline finishes running.

erictorbenson commented 4 years ago

Update for anyone who finds this later via search...Adding the exception to the list worked! It only needed the one retry.

One thing I did notice is that everything comes down to timing after the first reboot. After I moved on from this, other DSC configuration resources I had built (creating a reverse lookup zone in DNS, enabling the AD Recycle Bin, etc.) would also fail with similar "can't find the domain" errors. If you actually watch the DC build in Azure, it'll take several minutes before the "configuring settings" spinning-dots screen goes away. I assume all of the AD cmdlets need something that initializes later.

A "workaround" that let me move on is a simple delay that all the future resources depend on -- in my case it took a 5 minute delay before everything worked correctly. (Obviously this could be a lot more refined, but it does work.)

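            Script WaitForDCToSettleDown {
                DependsOn  = "[ADDomain]ADDomain"
                # Crude fixed delay: give the DC five minutes to finish starting.
                SetScript  = { Start-Sleep -Seconds 300 }
                # GetScript must return a hashtable; Write-Host output is not a valid result.
                GetScript  = { @{ Result = 'Get' } }
                # Always report "not in desired state" so that SetScript runs on every pass.
                TestScript = { $false }
            }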

johlju commented 4 years ago

@erictorbenson Doesn't it work to use WaitForADDomain to wait for the domain? Do you need the Script resource above in addition to the resource below? 🤔 Or do you mean that WaitForADDomain fails like the others because of a similar (or the same) issue as the one discussed here?

    WaitForADDomain Ready
    {
        DomainName                    = $DomainName
        Credential                    = $AdminCreds
        DependsOn = "[PendingReboot]Reboot"
    }

erictorbenson commented 4 years ago

Actually I forgot about WaitForADDomain. :-) The first thing that came into my head to fix the problem fast was a Script that sleeps.

I just redid everything and WaitForADDomain works, BUT, I did have to increase the timeout to 300 seconds, otherwise DSC reboots the DC before it gets a chance to respond and you run the risk of never initializing the DC the whole way before the restarts expire.

Here's what the DC build part of my DSC config looks like now...this looks like a solid way to ensure we don't time out on any of the other AD-dependent elements. (The next element in the config needs to depend on [WaitForADDomain]FirstBoot.)

Thanks @johlju @X-Guardian for the fast help for a relative DSC newbie. Hopefully I can contribute at some point.

            ADDomain ADDomain {
                Credential                    = $DomainCreds
                SafeModeAdministratorPassword = $DsrmCreds
                DomainNetBiosName             = $netbiosDomainName
                DomainName                    = $DomainName
                ForestMode                    = 'WinThreshold'
                DomainMode                    = 'WinThreshold'
                DatabasePath                  = $ntdsDBPath
                LogPath                       = $ntdsLogPath
                SysvolPath                    = $sysvolPath
                DependsOn                     = @("[WindowsFeature]AD-Domain-Services", "[WindowsFeature]DNSServer", "[Disk]DataDisk")
            }

            WaitForADDomain FirstBoot {
                DomainName = $DomainName
                WaitForValidCredentials = $true
                Credential = $domainAdminCredentials
                PsDscRunAsCredential = $domainAdminCredentials
                WaitTimeout = 300
                RestartCount = 3
                DependsOn = "[ADDomain]ADDomain"
            }
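
As a sketch of the dependency chain described above, a later AD-dependent resource could declare `DependsOn` against the wait resource. The fragment below reuses the ADOptionalFeature resource from the configuration in the original report (which had no `DependsOn`) and assumes the variable names `$domainAdminCredentials` and `$DomainName` from the configuration just shown; adjust both to your own configuration.

```powershell
            # Assumption: variable names are taken from the configuration above.
            ADOptionalFeature RecycleBin {
                FeatureName                       = 'Recycle Bin Feature'
                EnterpriseAdministratorCredential = $domainAdminCredentials
                ForestFQDN                        = $DomainName
                # Do not run until the domain has responded after the first reboot.
                DependsOn                         = '[WaitForADDomain]FirstBoot'
            }
```

Any other resource that calls AD cmdlets would presumably need the same `DependsOn` entry, either directly or through another resource that already depends on `[WaitForADDomain]FirstBoot`.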