Closed glennmate closed 7 years ago
I don't think this is timeout related. That would not manifest itself as a WinRMAuthorizationError which is the result of receiving a 401 response from the end point. It is odd that this occurs well into the run as opposed to during the initial connection. Are you performing any operation during the chef-client run that changes the user's password or changes what users can access the machine? Or are you perhaps changing the domain? Those are things I would think could lead to the 401.
Unrelated but worthy of mention, on server 2016 you can remove all of your configuration:
winrm quickconfig -q
winrm set winrm/config/client/auth '@{Basic="true"}'
winrm set winrm/config/client '@{AllowUnencrypted="true"}'
winrm set winrm/config/service '@{AllowUnencrypted="true"}'
winrm set winrm/config/service/auth '@{Basic="true"}'
set-item wsman:\localhost\shell\maxmemorypershellmb 1024
set-item wsman:\localhost\MaxTimeoutms 900000
Start-Service WinRM
set-service WinRM -StartupType Automatic
Except for
set-item wsman:\localhost\MaxTimeoutms 900000
Interesting note, I'll do some internal validation on this issue today and report findings. No changes are being made to the domain or user objects for which authentication via WinRM occurs. My Chef cookbooks are installing and managing SQL Server, WFC and an AOAG via PowerShell DSC.
Just another thought: might be worth checking the event logs on the node and also running knife with -l debug
for more clues.
So I thought this might be a Terraform issue, but I have the same error when executing the commands directly from my workstation. The failure occurs when executing the below block of code in Chef:
dsc_resource "install_sql_#{node['mssql']['instance']}" do
# https://github.com/PowerShell/xSQLServer/blob/dev/DSCResources/MSFT_xSQLServerSetup/MSFT_xSQLServerSetup.schema.mof
resource :xSQLServerSetup
property :Features, node['mssql']['features']
property :InstanceName, node['mssql']['instance']
property :SecurityMode, 'SQL'
property :SourcePath, "#{node['mssql']['datadriveletter']}:\\MSSQL\\Installer"
property :UpdateEnabled, 'False'
property :AgtSvcAccount, credentials
property :FTSvcAccount, credentials
property :SAPwd, credentials
property :SetupCredential, credentials
property :SQLSvcAccount, credentials
property :InstallSharedDir, sqlshared_dir
property :InstallSharedWOWDir, sqlsharedwow_dir
property :InstanceDir, "#{node['mssql']['datadriveletter']}:\\MSSQL\\"
property :InstallSQLDataDir, sqldata_dir
property :SQLUserDBDir, sqldata_dir
property :SQLUserDBLogDir, sqllog_dir
property :SQLTempDBDir, sqltemp_dir
property :SQLTempDBLogDir, sqltemp_dir
property :SQLBackupDir, sqlbackup_dir
property :BrowserSvcStartupType, "Automatic"
property :sqlsysadminaccounts, node['mssql']['sysadmins']
timeout 3600
notifies :run, "powershell_script[restart_MSSQL_instance]", :immediately
end
For context, I am installing SQL 2016 SP1 Developer edition via a downloaded trial ISO onto a vanilla 2k16 VM in Azure. May be unrelated, but the username I'm using to authenticate with WinRM is also the username used in the pscredentials (which is the value of credentials
). This is because I need to have chef run as this user context later to properly execute powershell_script
blocks to configure my replica to join the failover cluster. I tried impersonating the context of the run with Chef powershell_script
resource natives, but that throws a Windows token error about being able to impersonate. For context, this recipe works locally when running Chef-client on the machine.
What happens is that Chef via WinRM runs normally until this resource, then hangs at the console indefinitely, yet continues to run on the node. knife winrm bubbles up the error message when Chef is converged, but the Chef run actually completes successfully. Here's an example:
* seven_zip_archive[SQL ISO] action extract
- Extract F:\MSSQL\Installer/en_sql_server_2016_developer_with_service_pack_1_x64_dvd_9548071.iso => F:\MSSQL\Installer (overwrite=true)
* dsc_resource[NET-Framework-Core] action run
- Perform operation 'Invoke CimMethod' with following parameters, ''methodName' = Resourcetest,'className' =
MSFT_DSCLocalConfigurationManager,'namespaceName' = root/Microsoft/Windows/DesiredStateConfiguration'.
An LCM method call arrived from computer US-AZR-CWSQLA-1 with user sid S-1-5-21-3880659571-2398524605-2072332297-1106.
[US-AZR-CWSQLA-1]: LCM: [ Start Test ] [[WindowsFeature]DirectResourceAccess]
[US-AZR-CWSQLA-1]: [[WindowsFeature]DirectResourceAccess] The operation
'Get-WindowsFeature' started: NET-Framework-Core
[US-AZR-CWSQLA-1]: [[WindowsFeature]DirectResourceAccess] The operation
'Get-WindowsFeature' succeeded: NET-Framework-Core
[US-AZR-CWSQLA-1]: LCM: [ End Test ] [[WindowsFeature]DirectResourceAccess] False in 1.3280 seconds.
[US-AZR-CWSQLA-1]: LCM: [ End Set ] in 1.8910 seconds.
Operation 'Invoke CimMethod' complete.
Time taken for configuration job to complete is 2.949 seconds
Perform operation 'Invoke CimMethod' with following parameters, ''methodName' = Resourceset,'className' =
MSFT_DSCLocalConfigurationManager,'namespaceName' = root/Microsoft/Windows/DesiredStateConfiguration'.
An LCM method call arrived from computer US-AZR-CWSQLA-1 with user sid S-1-5-21-3880659571-2398524605-2072332297-1106.
[US-AZR-CWSQLA-1]: LCM: [ Start Set ] [[WindowsFeature]DirectResourceAccess]
[US-AZR-CWSQLA-1]: [[WindowsFeature]DirectResourceAccess] Installation started...
[US-AZR-CWSQLA-1]: [[WindowsFeature]DirectResourceAccess] Continue with installation?
[US-AZR-CWSQLA-1]: [[WindowsFeature]DirectResourceAccess] Prerequisite processing started...
[US-AZR-CWSQLA-1]: [[WindowsFeature]DirectResourceAccess] Prerequisite processing succeeded.
[US-AZR-CWSQLA-1]: [[WindowsFeature]DirectResourceAccess] Installation succeeded.
[US-AZR-CWSQLA-1]: [[WindowsFeature]DirectResourceAccess] Successfully installed the
feature NET-Framework-Core.
[US-AZR-CWSQLA-1]: LCM: [ End Set ] [[WindowsFeature]DirectResourceAccess] in 104.2570 seconds.
[US-AZR-CWSQLA-1]: LCM: [ End Set ] in 104.3190 seconds.
Operation 'Invoke CimMethod' complete.
Time taken for configuration job to complete is 104.549 seconds
* dsc_resource[install_sql_MSSQLSERVER] action run
ERROR: Failed to authenticate to <FQDN> as <DOMAIN>\<USER>
Response: WinRM::WinRMAuthorizationError
Hint: Make sure to prefix domain usernames with the correct domain name.
Hint: Local user names should be prefixed with computer name or IP address.
EXAMPLE: my_domain\user_namer
ERROR: WinRM::WinRMAuthorizationError: WinRM::WinRMAuthorizationError
So my thoughts:
I'm going to change the WinRM authentication user to something that isn't the credentials
user. See if this allows the DSC block to complete and knife winrm
continue to bubble up stdout of the Chef run. If this doesn't work, then I'm somewhat at a loss. I need this command to exit with a 0 code, but I also don't want it to release my console until the Chef run is completed. I'll post back with my findings shortly. Combing through logs as well.
Same issue kicking off chef via a different user. The WinRM run always disconnects at this resource. I am able to check logs in Event Viewer \ Application and Services Logs \ Microsoft \ Windows \ Windows Remote Management and see the shell opening and completing appropriately as the user context. There are no warnings or errors, and all informational logs are fairly standard (protocol handler, create session, operations, response handling, close sessions - rinse/repeat as needed). Chef logs also show no issues, with the Chef run initiating, going through the motions, and completing successfully.
This only occurs on the first run of the DSC block. Secondary runs do not disconnect. This tells me the xSQLServer resource is breaking something.
Thoughts on next steps? I'm going to look through other log locations for errors that may be contributing to this.
@mwrock I'm guessing this is an issue with DSC and not knife-windows, so we can probably close this. I'm curious as to why WinRM throws that error, though, as that may assist in finding the root cause.
This process is VERY resource intensive and it may be a memory issue. I'm bumping my MaxMemoryPerShellMb
to 4096 and testing.
yeah it sounds like something specific to the resource. Also the default MaxMemoryPerShell on 2016 is effectively unlimited so unless GPO is changing things you should not need to tweak it. Might try the chef discource forums to see if anyone else has contended with this.
It wasn't a memory issue, so I'll keep investigating. If I find the fix I'll report here in the event other users have a similar problem. Thanks for the help @mwrock.
For any lingering internet users who stumble on this issue and find this of use, I worked around the issue (but did not solve). Rather than spend exhaustive time tearing apart the xSQLServerSetup resource of the xSQLServer PowerShell DSC module, I am running this command like so:
resource "null_resource" "primary_node_execute_chef" {
depends_on = ["null_resource.primary_node_configure_runlist"]
provisioner "local-exec" {
command = "knife winrm '${azurerm_public_ip.vm-sqlA-ip.fqdn}' --winrm-shell elevated 'chef-client' -m -x '${var.domain}\\${var.domainUsername}' -P '${var.domainPassword}' & set ERRORLEVEL=0"
}
}
There are dangers to setting the error level, but gets the job done and with minimal present-day risk since we know the run is converging currently.
I'm wrapping a
knife winrm
command with Terraform to run chef-client on nodes. Here's the command:knife winrm 'azure-fqdn-dns-record.com' --winrm-shell elevated 'chef-client' -m -x 'domainname\\domainuser' -P 'domainpassword'"
To give insight into each node's configuration, here's the powershell script I'm running on them before I execute any WinRM commands from my local workstation (this isn't a prod environment):
And here's the WinRM config on the server (2016 Datacenter):
Here's the issue - my WinRM commands are throwing the following error suddenly for what appears to be no reason. My chef runs are long-running (first run can be 60 minutes). Halfway through the run I often get the following error:
The interesting thing is that I authenticate and kick off the run, it's like I'm losing the connection? But the command I'm sending finishes in the background and the node converges successfully. I have the timeout set way longer than when WinRM seems to drop the connection. Any ideas?