dbroeglin / windows-lab

Windows Automated Lab with Vagrant
MIT License

Failed winRM after install forest on domain controller #1

Closed sjames-au closed 7 years ago

sjames-au commented 8 years ago

Hi,

Just wondering if anyone has had the provision fail due to WinRM errors after the install-forest script is executed. I am thinking it is just a matter of the WinRM service needing to be told to be more patient while waiting for the reboot after this step; maybe something in my ~/.vagrant.d/Vagrantfile is needed?
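For reference, this is the kind of "be more patient" setting I mean; a sketch of a ~/.vagrant.d/Vagrantfile (the value is a guess, not something I have verified against this lab):

```ruby
# ~/.vagrant.d/Vagrantfile -- applied on top of every project Vagrantfile.
# Sketch only: the timeout value here is an assumption.
Vagrant.configure("2") do |config|
  # Allow more time for the guest to come back up after the
  # forest-install reboot (the default is 300 seconds).
  config.vm.boot_timeout = 600
end
```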

Running with --debug doesn't give much detail, but I'm pasting the lines just before the error occurs in case they are useful.

Cheers,

Stewart


==> dc: -------             -------                  --------------              ------
==> dc: Operation comple... DCPromo.General.1                 False             Success
==> dc: Message             Context                  RebootRequired              Status
==> dc: -------             -------                  --------------              ------
==> dc: Operation comple... DCPromo.General.1                 False             Success
ERROR warden: Error occurred: An authorization error occurred while connecting to WinRM.

User: vagrant
Endpoint: http://127.0.0.1:5985/wsman
Message: WinRM::WinRMAuthorizationError
 INFO warden: Beginning recovery process...
 INFO warden: Recovery complete.
ERROR warden: Error occurred: An authorization error occurred while connecting to WinRM.
sjames-au commented 8 years ago

I caught up with a Windows admin today. He advised the authentication error was likely due to the changes the forest installation makes to the authentication subsystem. I've tried tweaking various settings in the Vagrantfile referenced in the vagrant-winrm plugin, but it just doesn't seem to be able to survive the reboot.

There is a chance that the original authentication negotiation is never refreshed: the change in authentication on the server would invalidate any NTLM hashes, and any Kerberos tickets would either be removed by the reboot or invalidated by the forest installation.

Just for completeness, I found it best to leave the forest install script causing a reboot and to use:

vagrant up dc || vagrant reload dc --provision

as my one-line, "crap, I need to refresh" solution.

steffen-harbich-cognitum commented 7 years ago

I'm having the same issue. How did @dbroeglin make it work? Seems to be a general problem. Did you find a better solution @stootles?

dbroeglin commented 7 years ago

That is strange; vagrant uses a local user with credentials vagrant/vagrant to connect to the guest. It should not be affected by the DC promotion part.

I'm assuming you are using the master branch. I had not used that branch for a few months, so I just retested. The only change I had to make is that I no longer have an "ssh" version, so I used this base image: https://dl.dropboxusercontent.com/u/937870/VMs/eval-win2012r2-standard-nocm-1.0.4.box

The result looks OK:

==> dc: Message             Context                  RebootRequired              Status
==> dc: Message             Context                  RebootRequired              Status
==> dc: -------             -------                  --------------              ------
==> dc: Operation comple... DCPromo.General.1                 False             Success
==> dc: Running provisioner: shell...
    dc: Running: provision/03_install_adfs.ps1 as c:\tmp\vagrant-shell.ps1
==> dc:     Directory: C:\
==> dc:
==> dc:

There is an error later on with the ADFS cert, but that is expected. Did you try one of the other branches? The one I worked most with is lab/iis-kerberos-authentication; did you try it? If the issue persists, could you give me some details about your environment, please?

dbroeglin commented 7 years ago

I was cleaning up the branches and doing some documentation and everything worked fine. However, at some point, but only this once, when re-creating the lab I got a similar error:

==> dc01: action is required.
==> dc01: Message             Context                  RebootRequired              Status
==> dc01: -------             -------                  --------------              ------
==> dc01: Operation comple... DCPromo.General.1                 False             Success
==> dc01: Running provisioner: shell...
==> dc01: Forcing shutdown of VM...
==> dc01: Destroying VM and associated drives...
/opt/vagrant/embedded/gems/gems/winrm-1.3.6/lib/winrm/http/response_handler.rb:57:in `raise_if_auth_error': WinRM::WinRMAuthorizationError (WinRM::WinRMAuthorizationError)
    from /opt/vagrant/embedded/gems/gems/winrm-1.3.6/lib/winrm/http/response_handler.rb:50:in `raise_if_error'
    from /opt/vagrant/embedded/gems/gems/winrm-1.3.6/lib/winrm/http/response_handler.rb:35:in `parse_to_xml'
    from /opt/vagrant/embedded/gems/gems/winrm-1.3.6/lib/winrm/http/transport.rb:50:in `send_request'
    from /opt/vagrant/embedded/gems/gems/winrm-1.3.6/lib/winrm/winrm_service.rb:432:in `send_message'
    from /opt/vagrant/embedded/gems/gems/winrm-1.3.6/lib/winrm/winrm_service.rb:290:in `close_shell'
    from /opt/vagrant/embedded/gems/gems/winrm-fs-0.2.3/lib/winrm-fs/core/command_executor.rb:32:in `close'
    from /opt/vagrant/embedded/gems/gems/winrm-fs-0.2.3/lib/winrm-fs/core/upload_orchestrator.rb:74:in `ensure in with_command_executor'
    from /opt/vagrant/embedded/gems/gems/winrm-fs-0.2.3/lib/winrm-fs/core/upload_orchestrator.rb:74:in `with_command_executor'

But then the following vagrant up went as expected. Did you reproduce the issue several times? It might be a timing issue that occurs only if something takes a bit longer than usual. In any case, as I indicated before, the user for the session is vagrant; it is not directly concerned by what happens in the domain we just created. winrm_service.rb:290 looks like an attempt to close the session, so everything that needed to be done has already been done. If you or I can reproduce it, I will open an issue with the vagrant guys.

steffen-harbich-cognitum commented 7 years ago

I did not use exactly your scripts, but very similar ones. The issue was reproducible, but I needed to move on and decided to have two scripts manually invoked in the Windows Server 2012 machine that install AD etc. I then used the final image as a new base box for vagrant as a workaround (and it is better anyway, because the AD installation takes some time). If it is a timing issue, maybe there is a timeout configuration that can be increased? I didn't find a configuration for that.
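The base-box workaround can be sketched with stock vagrant commands (the box and machine names here are illustrative, not the ones from my setup):

```shell
# Sketch: capture the fully provisioned DC as a new base box, so the
# slow AD install happens once rather than on every "vagrant up".
vagrant up dc
vagrant package dc --output win2012r2-ad.box
vagrant box add win2012r2-ad win2012r2-ad.box
# Subsequent Vagrantfiles then use: config.vm.box = "win2012r2-ad"
```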

sjames-au commented 7 years ago

Sorry to have been silent on this. Travel, work, etc. I have never had a vagrant provisioning continue after the reboot. This happened frequently, as I have been playing with Rails/Ruby and needed an AD environment, so I was running destroy/up as needed (or whenever my Rails learning curve blew everything up). Just for reference, I stopped Unix/Linux sysadmin work about 15 years ago, and this is all in aid of getting a better understanding of DevOps while I also learn a new language, for no other reason than I wanted something new under my belt.

I just explored this from fresh builds and the latest master: no change. I created a fresh boxcutter image, the same as reported earlier. I didn't want to build the entire lab, so "vagrant up dc01" is being used. The only change made was to the shared folders directory, as I don't have the same one referenced in the Vagrantfile.

On the second attempt to build, I hand-timed it, wondering if this was caused by a timeout: from the "you will be logged out" message it was 20 seconds, and from the BIOS screen another 30 seconds (50 seconds total) elapsed before vagrant threw the WinRM auth error. It was another 1 minute 43 seconds before the box was logged in.

I then opened User Accounts and Advanced: I have no local accounts. The vagrant account is now a domain account. (And there is a message that on a DC you can't have local accounts in the local accounts snap-in.)

By default Windows does not audit failed logon events (as unbelievable as that sounds; check cisecurity.org if you would like to validate). So I won't have a record of failed attempts in the logs.

I have now run a third time; I realised I hadn't validated that I was running the latest VirtualBox (now running 5.0.26). No change.

Vagrant was updated to 1.8.6 and boxcutter updates were pulled. A fresh eval-win2012r2-standard-nocm-1.0.4 was created. Sadly, bringing up dc01 hit the same WinRM error as before.

I ran with --debug on the fifth run. I can see that there are quite a few reconnect attempts by vagrant. There is no timestamp against these, but it suggests that perhaps my mid-2014 MacBook Pro doesn't reboot the VM as fast as @dbroeglin's machine does, so his vagrant managed to snag a connection before the maximum number of retries or the timeout occurred for the WinRM login.

I hope that makes sense, still have a little jetlag.

dbroeglin commented 7 years ago

Thanks a lot for all that information and the various experiments. I learnt quite a bit about DCs just by reading your comment :-)

I just reran a build on my laptop (mid-2014 too, Core i7, 16 GB) and it works. My environment is:

vagrant 1.8.1 with the following plugins: vagrant-share (1.1.5, system), vagrant-winrm (0.7.0); VirtualBox 5.0.20 r106931

I will upgrade vagrant and virtualbox and try again with my laptop loaded with some other stuff to make it slower.

Did you by any chance keep a dump of the --debug output? I would like to look at it and compare it with what I get on my end.

dbroeglin commented 7 years ago

OK, I still have no idea why the error occurs, but I was able to reproduce the error you mentioned initially:

==> dc01: authoritative parent zone cannot be found or it does not run Windows DNS
==> dc01: server. If you are integrating with an existing DNS infrastructure, you should
==> dc01: manually create a delegation to this DNS server in the parent zone to ensure
==> dc01: reliable name resolution from outside the domain "lab.local". Otherwise, no
==> dc01: action is required.
==> dc01:
==> dc01: authoritative parent zone cannot be found or it does not run Windows DNS
==> dc01: server. If you are integrating with an existing DNS infrastructure, you should
==> dc01: manually create a delegation to this DNS server in the parent zone to ensure
==> dc01: reliable name resolution from outside the domain "lab.local". Otherwise, no
==> dc01: action is required.
INFO interface: info: Message             Context                  RebootRequired              Status
-------             -------                  --------------              ------
Operation comple... DCPromo.General.1                 False             Success

INFO interface: info: ==> dc01: Message             Context                  RebootRequired              Status
==> dc01: -------             -------                  --------------              ------
==> dc01: Operation comple... DCPromo.General.1                 False             Success
==> dc01: Message             Context                  RebootRequired              Status
==> dc01: -------             -------                  --------------              ------
==> dc01: Operation comple... DCPromo.General.1                 False             Success
ERROR warden: Error occurred: An authorization error occurred while connecting to WinRM.

User: vagrant
Endpoint: http://127.0.0.1:5985/wsman
Message: WinRM::WinRMAuthorizationError
INFO warden: Beginning recovery process...
INFO warden: Recovery complete.
ERROR warden: Error occurred: An authorization error occurred while connecting to WinRM.

User: vagrant
Endpoint: http://127.0.0.1:5985/wsman
Message: WinRM::WinRMAuthorizationError
INFO warden: Beginning recovery process...
INFO warden: Calling recover: #<Vagrant::Action::Builtin::HandleForwardedPortCollisions:0x00000101939070>
INFO warden: Recovery complete.
INFO warden: Beginning recovery process...
INFO warden: Recovery complete.
INFO warden: Beginning recovery process...
INFO warden: Recovery complete.
INFO warden: Beginning recovery process...
INFO warden: Recovery complete.
INFO warden: Beginning recovery process...
INFO warden: Recovery complete.
ERROR warden: Error occurred: An authorization error occurred while connecting to WinRM.

It then goes on and on and is never able to connect. I have no clue why this particular run failed. Maybe I should just schedule a reboot right after DC promotion and see if things work more smoothly. Any ideas?

sjames-au commented 7 years ago

From research I did before the initial report, I took away a distinct impression that a reboot during provisioning was not in scope for core vagrant. There are a number of attempts to create a plugin to allow it, but I had no success with them at the time. As you had not referenced any plugins, I figured it was perhaps something I didn't understand.

While our systems seem similar (my CPU is 2.2 GHz, BTW), I think we have a lovely race condition in play. Your VM is possibly booting faster, which could come down to anything from CPU/disk speed through to local software installed or configuration.

I have tried some vagrant reload/reboot plugins. However, according to my Windows guy, once you install the forest the SAM database is in an unstable state. Nothing can authenticate until a reboot occurs and the DC installation is finalised. He does not believe there is a way to log in to a machine right after "install forest" without a reboot. So the reload plugins don't function, because a fresh WinRM session cannot be established (I am watching that happen right now using https://github.com/aidanns/vagrant-reload ).
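For anyone trying the same thing, a minimal sketch of how vagrant-reload gets wired in, per that plugin's README (the script paths are this lab's provisioning scripts; it still fails here because the fresh WinRM session can't authenticate):

```ruby
# Sketch: using the vagrant-reload plugin (github.com/aidanns/vagrant-reload)
# to force a reboot between provisioning steps.
Vagrant.configure("2") do |config|
  config.vm.provision "shell", path: "provision/02_install_forest.ps1"
  # Reboot the guest before the next provisioner runs.
  config.vm.provision :reload
  config.vm.provision "shell", path: "provision/03_install_adfs.ps1"
end
```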

@steffen-harbich-itc has a great workaround listed above: use something with the forest already installed as a base image for vagrant. It makes a lot of sense, though I am not experienced enough to know if it is realistic to change an installed DC's IP or hostname. It would work for my limited needs, though; I have one DC, only ever one DC, and it is usually the content of the DC that I need to refresh.

Reading through the WinRM content of vagrant core (now that they have merged WinRM into core), it seems that they do not advertise extending the maximum number of retries. Inspecting the source on GitHub suggests there is a config.winrm.max_tries value and a retry_delay value as well. I have been attempting to change these settings, but it doesn't seem to do anything: there is no displayed output to suggest they were changed, though if I change 'username' it is changed. ( https://github.com/mitchellh/vagrant/blob/1f88a9737369eda3dede0d3aeb30928adf82a9f7/plugins/communicators/winrm/config.rb )

I'm out of time right now, but I will check through the vagrant source and see if those settings are overridden elsewhere, which might be why they are not referenced in the documentation.
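For reference, what I have been experimenting with looks roughly like this in the Vagrantfile (option names taken from the config.rb linked above; whether the overrides are actually honoured is exactly what is in question, and the values are arbitrary):

```ruby
# Sketch: attempted WinRM communicator overrides. Option names come from
# plugins/communicators/winrm/config.rb; see that file for the defaults.
Vagrant.configure("2") do |config|
  config.winrm.username    = "vagrant"
  config.winrm.password    = "vagrant"
  config.winrm.max_tries   = 60   # attempted override -- seems ignored
  config.winrm.retry_delay = 5    # seconds between attempts
end
```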

sjames-au commented 7 years ago

Just a quick note.

Changing the max_tries limit in the winrm vagrant plugin does not change the results for me; running with --debug, I am not getting any more retries before failure.

dbroeglin commented 7 years ago

I did a few tries after adding Start-Sleep 120 at the end of 02_install_forest.ps1 and was not able to reproduce the error. I thought of that because the server seems to be doing some work asynchronously after Install-ADDSForest has terminated; by waiting a bit, we give it time to finish before trying to continue with the provisioning. Could you try it and see if it solves the issue? It is kinda ugly, but it may work.
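Concretely, the tail of 02_install_forest.ps1 now looks something like this (the lines above the sleep are elided here):

```powershell
# ... forest installation (Install-ADDSForest) above ...

# Give the server time to finish its asynchronous post-promotion work
# before vagrant tries to reconnect over WinRM. Ugly, but it works.
Start-Sleep 120
```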

dbroeglin commented 7 years ago

OK, I ran it about 10 times in a row on my loaded laptop and was not able to reproduce it. Let's hope it works the same on your side.

sjames-au commented 7 years ago

Nice workaround!

I have run this about 10 times locally here, and the Start-Sleep 120 appears to keep the WinRM retry count within vagrant's limits.

I was pretty excited to have the first run without a winrm error - not sure what that says about me.

I think we have identified the reason and a couple of good workarounds. The original limitation of not being able to tell vagrant that a provisioning step will cause a reboot would seem to be the root cause; I'm not game enough to try to take a bite of that one.

dbroeglin commented 7 years ago

Cool ! I pushed the correction to master as c0879465a6c6a19c5c0b16ecb9c848bd73b24e0f

While testing I ran into another weird error: for some reason, 02_install_forest.ps1 seems to be re-executed at some point during the pause. I could not figure out why, so I added a marker file to prevent re-execution. It did not seem to cause any real issue, but still, it is cleaner like that.
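The guard is just a marker file checked at the top of the script; a sketch (the marker path here is illustrative, not necessarily what was committed):

```powershell
# Skip the whole script if the forest was already installed on a
# previous run. The marker path is an assumption for this sketch.
$marker = 'C:\tmp\install_forest.done'
if (Test-Path $marker) { exit 0 }

# ... forest installation and Start-Sleep 120 ...

New-Item -ItemType File -Path $marker -Force | Out-Null
```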

Thanks everybody for your help!