WinRb / WinRM

Ruby library for Windows Remote Management
Apache License 2.0
412 stars 117 forks source link

close and cleanup should retry on error and never raise net errors #262

Closed mwrock closed 7 years ago

mwrock commented 7 years ago

This addresses the issues raised in this comment. Vagrant, and perhaps other applications, used to wrap shell commands in its own retryable block that would intercept errors raised in open, command execution, cleanup or shell close. Winrm now only wraps shell open in a retryable block. If a command requests a system reboot, its posible that the cleanup or shell close operations fail and raise an exception in the hosting application.

This PR wraps both cleanup and close in their own retryable block. Further, if all retries fail, it will not raise the error but will send it to the log output.

Signed-off-by: Matt Wrock matt@mattwrock.com

dandunckelman commented 7 years ago

@mwrock I just updated to v1.9.4, which has v2.2.2 of winrm.

No errors like before, but I now get this (truncated above/below log entries):

453.540 ERR DEBUG winrmshell: [WinRM] remote shell created with shell_id: 00D03ABC-B2A9-4A40-981B-27AE97FEDD61
453.584 ERR DEBUG winrmshell: [WinRM] Command created for powershell -ExecutionPolicy Bypass -OutputFormat Text -file "c:\tmp\vagrant-shell.ps1"
453.584 ERR if (!$?) { if($LASTEXITCODE) { exit $LASTEXITCODE } else { exit 1 } } with id: 839F96E1-394A-42A8-AB6E-B5D2BB1535EE
453.584 ERR DEBUG winrmshell: [WinRM] creating command_id: 839F96E1-394A-42A8-AB6E-B5D2BB1535EE on shell_id 00D03ABC-B2A9-4A40-981B-27AE97FEDD61
453.584 ERR DEBUG winrmshell: [WinRM] Waiting for output...
454.517 ERR DEBUG winrmshell: [WinRM] Processing output
454.521 OUT ==> wincon: Node Script: windows-restart.ps1
454.526 OUT ==> wincon:     Hive: HKEY_LOCAL_MACHINE\Software\TechAccelerator
454.530 OUT ==> wincon: Name                           Property                                                                                
454.532 OUT ==> wincon: ----                           --------                                                                                
454.534 OUT ==> wincon: windows-restart.ps1            (default) : 1                                                                           
454.536 OUT ==> wincon: Sleeping for a bit.
454.543 ERR DEBUG winrmshell: [WinRM] Waiting for output...
514.637 ERR DEBUG winrmshell: [WinRM] Processing output
514.640 OUT ==> wincon: Restarting node.
514.642 OUT ==> wincon: Waiting for restart: 1
514.646 ERR DEBUG winrmshell: [WinRM] Waiting for output...
515.340 ERR DEBUG winrmshell: [WinRM] Processing output
515.350 ERR DEBUG winrmshell: [WinRM] Waiting for output...
515.363 ERR DEBUG winrmshell: [WinRM] cleaning up command_id: 839F96E1-394A-42A8-AB6E-B5D2BB1535EE on shell_id 00D03ABC-B2A9-4A40-981B-27AE97FEDD61
802.046 ERR  INFO winrmshell: [WinRM] Exception suppressed: [WSMAN ERROR CODE: 2150858843]: <f:WSManFault Code='2150858843' Machine='172.17.0.13' xmlns:f='http://schemas.microsoft.com/wbem/wsman/1/wsmanfault'><f:Message>The request for the Windows Remote Shell with ShellId 00D03ABC-B2A9-4A40-981B-27AE97FEDD61 failed because the shell was not found on the server. Possible causes are: the specified ShellId is incorrect or the shell no longer exists on the server. Provide the correct ShellId or create a new shell and retry the operation. </f:Message></f:WSManFault>
1087.754 ERR  INFO winrmshell: [WinRM] Exception suppressed: [WSMAN ERROR CODE: 2150858843]: <f:WSManFault Code='2150858843' Machine='172.17.0.13' xmlns:f='http://schemas.microsoft.com/wbem/wsman/1/wsmanfault'><f:Message>The request for the Windows Remote Shell with ShellId 00D03ABC-B2A9-4A40-981B-27AE97FEDD61 failed because the shell was not found on the server. Possible causes are: the specified ShellId is incorrect or the shell no longer exists on the server. Provide the correct ShellId or create a new shell and retry the operation. </f:Message></f:WSManFault>
1087.755 ERR ERROR warden: Error occurred: An error occurred executing a remote WinRM command.
1087.755 ERR Shell: Powershell
1087.755 ERR Command: powershell -ExecutionPolicy Bypass -OutputFormat Text -file "c:\tmp\vagrant-shell.ps1"
1087.755 ERR Message: Bad HTTP response returned from server. Body(if present): (500).
1087.755 ERR  INFO warden: Beginning recovery process...
1087.755 ERR  INFO warden: Recovery complete.
1087.755 ERR ERROR warden: Error occurred: An error occurred executing a remote WinRM command.
1087.755 ERR Shell: Powershell
1087.755 ERR Command: powershell -ExecutionPolicy Bypass -OutputFormat Text -file "c:\tmp\vagrant-shell.ps1"
1087.755 ERR Message: Bad HTTP response returned from server. Body(if present): (500).

You'll see in the logs that we run a script called windows-restart.ps1, which, as you can guess, restarts the VM.

This error makes sense in this example:

The request for the Windows Remote Shell with ShellId 00D03ABC-B2A9-4A40-981B-27AE97FEDD61 failed because the shell was not found on the server.

Possible causes are: the specified ShellId is incorrect or the shell no longer exists on the server

I doubt the shell can persist across a reboot...so, what would you advise we do in this case?

mwrock commented 7 years ago

A shell definitely can't persist across reboots, but I don't think thats the core problem here. Hard to tell from the output above, but I'm guessing that was a "suppressible" and not actually the one that caused vagrant to fail.

Thinking more on your use case, it may make the most sense to add the retryable to vagrant. All this PR did was make sure not to fail if the command cleanup or close failed due to network errors. In the case of a reboot, I'd expect the failure to occur in the command execution. At the lower WinRM gem level, I don't think we would want to suppress that. The higher level app may want to be aware and make decisions from it differently.

In the older versions of vagrant, the entire remote execution lifecycle from open to close was retryable. That would be ideal for commands that could result in a reboot and need to be retried.

I just ran through this myself on 1.9.4 and also noticed at least 3 other non winrm breakages but related to my driver (hyperv) that need fixing so I'll add this while I'm in there.

mwrock commented 7 years ago

When you have a chance and if its sharable, could you include a gist of your reboot script?

dandunckelman commented 7 years ago

@mwrock we are currently staying on Vagrant v1.8.1 because of this.

Here's the script: https://gist.github.com/dandunckelman/1302e0df3354d388a581dc2be94b798c

dandunckelman commented 7 years ago

@mwrock I just tried Vagrant 1.9.5 and got this error output:

248.251 ERR  INFO interface: detail: Running: script: windows-restart.ps1
248.251 ERR  INFO interface: detail:     wincon: Running: script: windows-restart.ps1
248.251 OUT     wincon: Running: script: windows-restart.ps1
248.252 ERR DEBUG winrm: powershell executing:
248.252 ERR powershell -ExecutionPolicy Bypass -OutputFormat Text -file "c:\tmp\vagrant-shell.ps1"
248.252 ERR DEBUG winrmshell: [WinRM] opening remote shell on http://172.17.0.11:55985/wsman
248.427 ERR DEBUG winrmshell: [WinRM] Waiting for output...
248.577 ERR DEBUG winrmshell: [WinRM] Processing output
248.579 ERR DEBUG winrmshell: [WinRM] polling for pipeline state. message: #<WinRM::PSRP::Message:0x00000000d5e6c0 @data="\xEF\xBB\xBF<Obj RefId=\"0\"><MS><Version N=\"protocolversion\">2.2</Version><Version N=\"PSVersion\">2.0</Version><Version N=\"SerializationVersion\">1.1.0.1</Version></MS></Obj>", @destination=1, @type=65538, @pipeline_id="00000000-0000-0000-0000-000000000000", @runspace_pool_id="00000000-0000-0000-0000-000000000000">
248.583 ERR DEBUG winrmshell: [WinRM] polling for pipeline state. message: #<WinRM::PSRP::Message:0x00000002ad2df0 @data="\xEF\xBB\xBF<Obj RefId=\"0\"><MS><Obj N=\"ApplicationPrivateData\" RefId=\"1\"><TN RefId=\"0\"><T>System.Management.Automation.PSPrimitiveDictionary</T><T>System.Collections.Hashtable</T><T>System.Object</T></TN><DCT><En><S N=\"Key\">DebugMode</S><I32 N=\"Value\">1</I32></En><En><S N=\"Key\">DebugStop</S><B N=\"Value\">false</B></En><En><S N=\"Key\">PSVersionTable</S><Obj N=\"Value\" RefId=\"2\"><TNRef RefId=\"0\" /><DCT><En><S N=\"Key\">PSVersion</S><Version N=\"Value\">2.0</Version></En><En><S N=\"Key\">PSCompatibleVersions</S><Obj N=\"Value\" RefId=\"3\"><TN RefId=\"1\"><T>System.Version[]</T><T>System.Array</T><T>System.Object</T></TN><LST><Version>1.0</Version><Version>2.0</Version><Version>3.0</Version><Version>4.0</Version></LST></Obj></En><En><S N=\"Key\">BuildVersion</S><Version N=\"Value\">6.3.9600.17400</Version></En><En><S N=\"Key\">CLRVersion</S><Version N=\"Value\">4.0.30319.42000</Version></En><En><S N=\"Key\">WSManStackVersion</S><Version N=\"Value\">3.0</Version></En><En><S N=\"Key\">PSRemotingProtocolVersion</S><Version N=\"Value\">2.2</Version></En><En><S N=\"Key\">SerializationVersion</S><Version N=\"Value\">1.1.0.1</Version></En></DCT></Obj></En><En><S N=\"Key\">DebugBreakpointCount</S><I32 N=\"Value\">0</I32></En></DCT></Obj></MS></Obj>", @destination=1, @type=135177, @pipeline_id="00000000-0000-0000-0000-000000000000", @runspace_pool_id="00000000-0000-0000-0000-000000000000">
248.583 ERR DEBUG winrmshell: [WinRM] Waiting for output...
248.599 ERR DEBUG winrmshell: [WinRM] Processing output
248.600 ERR DEBUG winrmshell: [WinRM] polling for pipeline state. message: #<WinRM::PSRP::Message:0x000000029cd798 @data="\xEF\xBB\xBF<Obj RefId=\"0\"><MS><I32 N=\"RunspaceState\">2</I32></MS></Obj>", @destination=1, @type=135173, @pipeline_id="00000000-0000-0000-0000-000000000000", @runspace_pool_id="00000000-0000-0000-0000-000000000000">
248.601 ERR DEBUG winrmshell: [WinRM] remote shell created with shell_id: BE59D4CD-D429-4084-B861-E5DDDA2C75C4
248.639 ERR DEBUG winrmshell: [WinRM] Command created for powershell -ExecutionPolicy Bypass -OutputFormat Text -file "c:\tmp\vagrant-shell.ps1"
248.640 ERR if (!$?) { if($LASTEXITCODE) { exit $LASTEXITCODE } else { exit 1 } } with id: 23D454A7-BFC6-4021-9609-2F51AE1D9441
248.640 ERR DEBUG winrmshell: [WinRM] creating command_id: 23D454A7-BFC6-4021-9609-2F51AE1D9441 on shell_id BE59D4CD-D429-4084-B861-E5DDDA2C75C4
248.640 ERR DEBUG winrmshell: [WinRM] Waiting for output...
249.566 ERR DEBUG winrmshell: [WinRM] Processing output
249.571 OUT ==> wincon: Node Script: windows-restart.ps1
249.575 OUT ==> wincon:     Hive: HKEY_LOCAL_MACHINE\Software\TechAccelerator
249.580 OUT ==> wincon: Name                           Property                                                                                
249.583 OUT ==> wincon: ----                           --------                                                                                
249.585 OUT ==> wincon: windows-restart.ps1            (default) : 1                                                                           
249.587 OUT ==> wincon: Sleeping for a bit.
249.594 ERR DEBUG winrmshell: [WinRM] Waiting for output...
309.669 ERR DEBUG winrmshell: [WinRM] Processing output
309.672 OUT ==> wincon: Restarting node.
309.674 OUT ==> wincon: Waiting for restart: 1
309.678 ERR DEBUG winrmshell: [WinRM] Waiting for output...
310.514 ERR DEBUG winrmshell: [WinRM] cleaning up command_id: 23D454A7-BFC6-4021-9609-2F51AE1D9441 on shell_id BE59D4CD-D429-4084-B861-E5DDDA2C75C4
310.535 ERR  INFO winrmshell: [WinRM] 500 returned in cleanup with error: Bad HTTP response returned from server. Body(if present): (500).
310.555 ERR ERROR warden: Error occurred: Bad HTTP response returned from server. Body(if present): (500).
310.555 ERR  INFO warden: Beginning recovery process...
310.555 ERR  INFO warden: Recovery complete.
310.556 ERR ERROR warden: Error occurred: Bad HTTP response returned from server. Body(if present): (500).
310.556 ERR  INFO warden: Beginning recovery process...
310.556 ERR  INFO warden: Recovery complete.
310.556 ERR  INFO warden: Beginning recovery process...
310.556 ERR  INFO warden: Recovery complete.
310.556 ERR  INFO warden: Beginning recovery process...
310.556 ERR  INFO warden: Recovery complete.
310.556 ERR  INFO warden: Beginning recovery process...
310.556 ERR  INFO warden: Recovery complete.
310.556 ERR ERROR warden: Error occurred: Bad HTTP response returned from server. Body(if present): (500).
310.556 ERR  INFO warden: Beginning recovery process...
310.556 ERR  INFO warden: Recovery complete.
310.556 ERR ERROR warden: Error occurred: Bad HTTP response returned from server. Body(if present): (500).
310.556 ERR  INFO warden: Beginning recovery process...
310.557 ERR  INFO warden: Calling recover: #<Vagrant::Action::Builtin::Call:0x00000002ac4980>
310.557 ERR  INFO warden: Beginning recovery process...
310.557 ERR  INFO warden: Recovery complete.
310.557 ERR  INFO warden: Recovery complete.
310.557 ERR  INFO warden: Beginning recovery process...
310.557 ERR  INFO warden: Recovery complete.
310.557 ERR  INFO warden: Beginning recovery process...
310.557 ERR  INFO warden: Recovery complete.
310.557 ERR  INFO warden: Beginning recovery process...
310.557 ERR  INFO warden: Recovery complete.
310.557 ERR ERROR warden: Error occurred: Bad HTTP response returned from server. Body(if present): (500).
310.557 ERR  INFO warden: Beginning recovery process...
310.557 ERR  INFO warden: Recovery complete.
310.557 ERR ERROR warden: Error occurred: Bad HTTP response returned from server. Body(if present): (500).
310.557 ERR  INFO warden: Beginning recovery process...
310.557 ERR  INFO warden: Calling recover: #<Vagrant::Action::Builtin::Call:0x0000000272d768>
310.557 ERR  INFO warden: Beginning recovery process...
310.557 ERR  INFO warden: Recovery complete.
310.557 ERR  INFO warden: Calling recover: #<Vagrant::Action::Builtin::Call:0x00000002773808>
310.557 ERR  INFO warden: Beginning recovery process...
310.558 ERR  INFO warden: Recovery complete.
310.558 ERR  INFO warden: Recovery complete.
310.558 ERR  INFO warden: Beginning recovery process...
310.558 ERR  INFO warden: Recovery complete.
310.558 ERR  INFO warden: Beginning recovery process...
310.558 ERR  INFO warden: Recovery complete.
310.558 ERR ERROR warden: Error occurred: Bad HTTP response returned from server. Body(if present): (500).
310.558 ERR  INFO warden: Beginning recovery process...
310.558 ERR  INFO warden: Recovery complete.
310.558 ERR  INFO warden: Beginning recovery process...
310.558 ERR  INFO warden: Recovery complete.
310.558 ERR ERROR warden: Error occurred: Bad HTTP response returned from server. Body(if present): (500).
310.558 ERR  INFO warden: Beginning recovery process...
310.558 ERR  INFO warden: Recovery complete.
310.558 ERR  INFO warden: Beginning recovery process...
310.558 ERR  INFO warden: Recovery complete.
310.558 ERR  INFO warden: Beginning recovery process...
310.558 ERR  INFO warden: Recovery complete.
310.558 ERR  INFO environment: Released process lock: machine-action-bce3b5d4f2ca76d888b313880e0719ac
310.559 ERR  INFO environment: Running hook: environment_unload
310.560 ERR  INFO runner: Preparing hooks for middleware sequence...
310.561 ERR  INFO runner: 1 hooks defined.
310.561 ERR  INFO runner: Running action: environment_unload #<Vagrant::Action::Builder:0x0000000294ba18>
310.570 ERR /opt/vagrant/embedded/gems/gems/winrm-2.2.3/lib/winrm/http/response_handler.rb:90:in `raise_transport_error': Bad HTTP response returned from server. Body(if present): (500). (WinRM::WinRMHTTPTransportError)
310.570 ERR     from /opt/vagrant/embedded/gems/gems/winrm-2.2.3/lib/winrm/http/response_handler.rb:54:in `raise_if_error'
310.570 ERR     from /opt/vagrant/embedded/gems/gems/winrm-2.2.3/lib/winrm/http/response_handler.rb:36:in `parse_to_xml'
310.571 ERR     from /opt/vagrant/embedded/gems/gems/winrm-2.2.3/lib/winrm/http/transport.rb:186:in `send_request'
310.571 ERR     from /opt/vagrant/embedded/gems/gems/winrm-2.2.3/lib/winrm/shells/power_shell.rb:44:in `close_shell'
310.571 ERR     from /opt/vagrant/embedded/gems/gems/winrm-2.2.3/lib/winrm/shells/base.rb:90:in `close'
310.571 ERR     from /opt/vagrant/embedded/gems/gems/winrm-2.2.3/lib/winrm/connection.rb:46:in `ensure in shell'
310.571 ERR     from /opt/vagrant/embedded/gems/gems/winrm-2.2.3/lib/winrm/connection.rb:46:in `shell'
310.571 ERR     from /opt/vagrant/embedded/gems/gems/vagrant-1.9.5/plugins/communicators/winrm/shell.rb:58:in `powershell'
310.571 ERR     from /opt/vagrant/embedded/gems/gems/vagrant-1.9.5/plugins/communicators/winrm/communicator.rb:148:in `execute'

It seems like re-adding retryable is the answer. Thoughts?

mwrock commented 7 years ago

In order to get the kind of reboot resiliency you are looking for, I do think we'd need to add the retryable back to vagrant. The winrm gem only retries the opening of the shell and not the entire command.

I do have mixed feelings on that. I hesitate because:

  1. While retrying on establishing the initial connection makes sense because that is where most obvious network errors will occur, network errors in the middle of a command may indicate other problems that need to be acted on and also requires that commands be idempotent.
  2. The SSH communicator also only retries the initial connection and not the command itself.

When it comes to reboot use cases, many bake that provisioning into the initial image with packer which has much better and specific reboot handling built in.

If we make the entire open/run/close cycle retryable again, this forces us into a model of opening a new shell for each command which has significant performance implications. Unfortunately we had to revert to this model recently because of the restrictive shell quotas in 2008R2 so thats not so huge, but prevents us from keeping a single shell open in the future.

These are just some thoughts I have on this.