chef / knife-windows

Plugin for Chef's knife tool for working with Windows nodes
Apache License 2.0
152 stars 110 forks source link

WinRM random timeouts on long chef-client runs? #431

Closed glennmate closed 7 years ago

glennmate commented 7 years ago

I'm wrapping a knife winrm command with Terraform to run chef-client on nodes. Here's the command: knife winrm 'azure-fqdn-dns-record.com' --winrm-shell elevated 'chef-client' -m -x 'domainname\\domainuser' -P 'domainpassword'"

To give insight into each node's configuration, here's the powershell script I'm running on them before I execute any WinRM commands from my local workstation (this isn't a prod environment):

winrm quickconfig -q
winrm set winrm/config/client/auth '@{Basic="true"}'
winrm set winrm/config/client '@{AllowUnencrypted="true"}'
winrm set winrm/config/service '@{AllowUnencrypted="true"}'
winrm set winrm/config/service/auth '@{Basic="true"}'
set-item wsman:\localhost\shell\maxmemorypershellmb 1024
set-item wsman:\localhost\MaxTimeoutms 900000
Start-Service WinRM
set-service WinRM -StartupType Automatic

And here's the WinRM config on the server (2016 Datacenter):

PS C:\Users\GlennMate> winrm get winrm/config
Config
    MaxEnvelopeSizekb = 500
    MaxTimeoutms = 900000
    MaxBatchItems = 32000
    MaxProviderRequests = 4294967295
    Client
        NetworkDelayms = 5000
        URLPrefix = wsman
        AllowUnencrypted = true
        Auth
            Basic = true
            Digest = true
            Kerberos = true
            Negotiate = true
            Certificate = true
            CredSSP = false
        DefaultPorts
            HTTP = 5985
            HTTPS = 5986
        TrustedHosts
    Service
        RootSDDL = O:NSG:BAD:P(A;;GA;;;BA)(A;;GR;;;IU)S:P(AU;FA;GA;;;WD)(AU;SA;GXGW;;;WD)
        MaxConcurrentOperations = 4294967295
        MaxConcurrentOperationsPerUser = 1500
        EnumerationTimeoutms = 240000
        MaxConnections = 300
        MaxPacketRetrievalTimeSeconds = 120
        AllowUnencrypted = true
        Auth
            Basic = true
            Kerberos = true
            Negotiate = true
            Certificate = false
            CredSSP = false
            CbtHardeningLevel = Relaxed
        DefaultPorts
            HTTP = 5985
            HTTPS = 5986
        IPv4Filter = *
        IPv6Filter = *
        EnableCompatibilityHttpListener = false
        EnableCompatibilityHttpsListener = false
        CertificateThumbprint
        AllowRemoteAccess = true
    Winrs
        AllowRemoteShellAccess = true
        IdleTimeout = 7200000
        MaxConcurrentUsers = 2147483647
        MaxShellRunTime = 2147483647
        MaxProcessesPerShell = 2147483647
        MaxMemoryPerShellMB = 1024
        MaxShellsPerUser = 2147483647

Here's the issue - my WinRM commands are throwing the following error suddenly for what appears to be no reason. My chef runs are long-running (first run can be 60 minutes). Halfway through the run I often get the following error:

ERROR: Failed to authenticate to <AZURE-DNS-FQDN> as $domain\$domainadminuser
Response: WinRM::WinRMAuthorizationError
Hint: Make sure to prefix domain usernames with the correct domain name.
Hint: Local user names should be prefixed with computer name or IP address.
EXAMPLE: my_domain\user_namer
ERROR: WinRM::WinRMAuthorizationError: WinRM::WinRMAuthorizationError

The interesting thing is that I authenticate and kick off the run, it's like I'm losing the connection? But the command I'm sending finishes in the background and the node converges successfully. I have the timeout set way longer than when WinRM seems to drop the connection. Any ideas?

mwrock commented 7 years ago

I don't think this is timeout related. That would not manifest itself as a WinRMAuthorizationError which is the result of receiving a 401 response from the end point. It is odd that this occurs well into the run as opposed to during the initial connection. Are you performing any operation during the chef-client run that changes the user's password or changes what users can access the machine? Or are you perhaps changing the domain? Those are things I would think could lead to the 401.

Unrelated but worthy of mention, on server 2016 you can remove all of your configuration:

winrm quickconfig -q
winrm set winrm/config/client/auth '@{Basic="true"}'
winrm set winrm/config/client '@{AllowUnencrypted="true"}'
winrm set winrm/config/service '@{AllowUnencrypted="true"}'
winrm set winrm/config/service/auth '@{Basic="true"}'
set-item wsman:\localhost\shell\maxmemorypershellmb 1024
set-item wsman:\localhost\MaxTimeoutms 900000
Start-Service WinRM
set-service WinRM -StartupType Automatic

Except for

set-item wsman:\localhost\MaxTimeoutms 900000
glennmate commented 7 years ago

Interesting note, I'll do some internal validation on this issue today and report findings. No changes are being made to the domain or user objects for which authentication via WinRM occurs. My Chef cookbooks are installing and managing SQL Server, WFC and an AOAG via PowerShell DSC.

mwrock commented 7 years ago

Just another thought: might be worth checking the event logs on the node and also running knife with -l debug for more clues.

glennmate commented 7 years ago

So I thought this might be a Terraform issue, but I have the same error when executing the commands directly from my workstation. The failure occurs when executing the below block of code in Chef:

dsc_resource "install_sql_#{node['mssql']['instance']}" do
# https://github.com/PowerShell/xSQLServer/blob/dev/DSCResources/MSFT_xSQLServerSetup/MSFT_xSQLServerSetup.schema.mof
  resource :xSQLServerSetup
  property :Features, node['mssql']['features']
  property :InstanceName, node['mssql']['instance']
  property :SecurityMode, 'SQL'
  property :SourcePath, "#{node['mssql']['datadriveletter']}:\\MSSQL\\Installer"
  property :UpdateEnabled, 'False'
  property :AgtSvcAccount, credentials
  property :FTSvcAccount, credentials
  property :SAPwd, credentials
  property :SetupCredential, credentials
  property :SQLSvcAccount, credentials
  property :InstallSharedDir, sqlshared_dir
  property :InstallSharedWOWDir, sqlsharedwow_dir
  property :InstanceDir, "#{node['mssql']['datadriveletter']}:\\MSSQL\\"
  property :InstallSQLDataDir, sqldata_dir
  property :SQLUserDBDir, sqldata_dir
  property :SQLUserDBLogDir, sqllog_dir
  property :SQLTempDBDir, sqltemp_dir
  property :SQLTempDBLogDir, sqltemp_dir
  property :SQLBackupDir, sqlbackup_dir
  property :BrowserSvcStartupType, "Automatic"
  property :sqlsysadminaccounts, node['mssql']['sysadmins']
  timeout 3600
  notifies :run, "powershell_script[restart_MSSQL_instance]", :immediately
end

For context, I am installing SQL 2016 SP1 Developer edition via a downloaded trial ISO onto a vanilla 2k16 VM in Azure. May be unrelated, but the username I'm using to authenticate with WinRM is also the username used in the pscredentials (which is the value of credentials). This is because I need to have chef run as this user context later to properly execute powershell_script blocks to configure my replica to join the failover cluster. I tried impersonating the context of the run with Chef powershell_script resource natives, but that throws a Windows token error about being able to impersonate. For context, this recipe works locally when running Chef-client on the machine.

What happens is that Chef via WinRM runs normally until this resource, then hangs at the console indefinitely, yet continues to run on the node. knife winrm bubbles up the error message when Chef is converged, but the Chef run actually completes successfully. Here's an example:

   * seven_zip_archive[SQL ISO] action extract
     - Extract F:\MSSQL\Installer/en_sql_server_2016_developer_with_service_pack_1_x64_dvd_9548071.iso => F:\MSSQL\Installer (overwrite=true)
   * dsc_resource[NET-Framework-Core] action run
     - Perform operation 'Invoke CimMethod' with following parameters, ''methodName' = Resourcetest,'className' =

     MSFT_DSCLocalConfigurationManager,'namespaceName' = root/Microsoft/Windows/DesiredStateConfiguration'.

     An LCM method call arrived from computer US-AZR-CWSQLA-1 with user sid S-1-5-21-3880659571-2398524605-2072332297-1106.

     [US-AZR-CWSQLA-1]: LCM:  [ Start  Test     ]  [[WindowsFeature]DirectResourceAccess]

     [US-AZR-CWSQLA-1]:                            [[WindowsFeature]DirectResourceAccess] The operation

     'Get-WindowsFeature' started: NET-Framework-Core

     [US-AZR-CWSQLA-1]:                            [[WindowsFeature]DirectResourceAccess] The operation

     'Get-WindowsFeature' succeeded: NET-Framework-Core

     [US-AZR-CWSQLA-1]: LCM:  [ End    Test     ]  [[WindowsFeature]DirectResourceAccess] False in 1.3280 seconds.

     [US-AZR-CWSQLA-1]: LCM:  [ End    Set      ]    in  1.8910 seconds.

     Operation 'Invoke CimMethod' complete.

     Time taken for configuration job to complete is 2.949 seconds

     Perform operation 'Invoke CimMethod' with following parameters, ''methodName' = Resourceset,'className' =

     MSFT_DSCLocalConfigurationManager,'namespaceName' = root/Microsoft/Windows/DesiredStateConfiguration'.

     An LCM method call arrived from computer US-AZR-CWSQLA-1 with user sid S-1-5-21-3880659571-2398524605-2072332297-1106.

     [US-AZR-CWSQLA-1]: LCM:  [ Start  Set      ]  [[WindowsFeature]DirectResourceAccess]

     [US-AZR-CWSQLA-1]:                            [[WindowsFeature]DirectResourceAccess] Installation started...

     [US-AZR-CWSQLA-1]:                            [[WindowsFeature]DirectResourceAccess] Continue with installation?

     [US-AZR-CWSQLA-1]:                            [[WindowsFeature]DirectResourceAccess] Prerequisite processing started...

     [US-AZR-CWSQLA-1]:                            [[WindowsFeature]DirectResourceAccess] Prerequisite processing succeeded.

     [US-AZR-CWSQLA-1]:                            [[WindowsFeature]DirectResourceAccess] Installation succeeded.

     [US-AZR-CWSQLA-1]:                            [[WindowsFeature]DirectResourceAccess] Successfully installed the

     feature NET-Framework-Core.

     [US-AZR-CWSQLA-1]: LCM:  [ End    Set      ]  [[WindowsFeature]DirectResourceAccess]  in 104.2570 seconds.

     [US-AZR-CWSQLA-1]: LCM:  [ End    Set      ]    in  104.3190 seconds.

     Operation 'Invoke CimMethod' complete.

     Time taken for configuration job to complete is 104.549 seconds

   * dsc_resource[install_sql_MSSQLSERVER] action run
ERROR: Failed to authenticate to <FQDN> as <DOMAIN>\<USER>
Response: WinRM::WinRMAuthorizationError
Hint: Make sure to prefix domain usernames with the correct domain name.
Hint: Local user names should be prefixed with computer name or IP address.
EXAMPLE: my_domain\user_namer
ERROR: WinRM::WinRMAuthorizationError: WinRM::WinRMAuthorizationError

So my thoughts: I'm going to change the WinRM authentication user to something that isn't the credentials user. See if this allows the DSC block to complete and knife winrm continue to bubble up stdout of the Chef run. If this doesn't work, then I'm somewhat at a loss. I need this command to exit with a 0 code, but I also don't want it to release my console until the Chef run is completed. I'll post back with my findings shortly. Combing through logs as well.

glennmate commented 7 years ago

Same issue kicking off chef via a different user. The WinRM run always disconnects at this resource. I am able to check logs in Event Viewer \ Application and Services Logs \ Microsoft \ Windows \ Windows Remote Management and see the shell opening and completing appropriately as the user context. There are no warnings or errors, and all informational logs are fairly standard (protocol handler, create session, operations, response handling, close sessions - rinse/repeat as needed). Chef logs also show no issues, with the Chef run initiating, going through the motions, and completing successfully.

This only occurs on the first run of the DSC block. Secondary runs do not disconnect. This tells me the xSQLServer resource is breaking something.

Thoughts on next steps? I'm going to look through other log locations for errors that may be contributing to this.

glennmate commented 7 years ago

@mwrock I'm guessing this is an issue with DSC and not knife-windows, so we can probably close this. I'm curious as to why WinRM throws that error, though, as that may assist in finding the root cause.

This process is VERY resource intensive and it may be a memory issue. I'm bumping my MaxMemoryPerShellMb to 4096 and testing.

mwrock commented 7 years ago

yeah it sounds like something specific to the resource. Also the default MaxMemoryPerShell on 2016 is effectively unlimited so unless GPO is changing things you should not need to tweak it. Might try the chef discource forums to see if anyone else has contended with this.

glennmate commented 7 years ago

It wasn't a memory issue, so I'll keep investigating. If I find the fix I'll report here in the event other users have a similar problem. Thanks for the help @mwrock.

glennmate commented 7 years ago

For any lingering internet users who stumble on this issue and find this of use, I worked around the issue (but did not solve). Rather than spend exhaustive time tearing apart the xSQLServerSetup resource of the xSQLServer PowerShell DSC module, I am running this command like so:

resource "null_resource" "primary_node_execute_chef" {
  depends_on            = ["null_resource.primary_node_configure_runlist"]

  provisioner "local-exec" {
      command = "knife winrm '${azurerm_public_ip.vm-sqlA-ip.fqdn}' --winrm-shell elevated 'chef-client' -m -x '${var.domain}\\${var.domainUsername}' -P '${var.domainPassword}' & set ERRORLEVEL=0"
  }
}

There are dangers to setting the error level, but gets the job done and with minimal present-day risk since we know the run is converging currently.