ansible-collections / ansible.windows

Windows core collection for Ansible
https://galaxy.ansible.com/ansible/windows
GNU General Public License v3.0

win_updates gets a winrm timeout when a new network driver is installed. #190

Closed RobVerduijn closed 1 year ago

RobVerduijn commented 3 years ago
SUMMARY

When using win_updates and the windows update contains a network card driver update, the play breaks on a winrm timeout.

I can see in the logs that the update procedure completed successfully and reported which updates installed and which failed. However, this information never reaches the Ansible controller.

This happens when the network card driver is replaced in the windows system. If I force the network card update by hand on the windows system, then run the play again there is no problem.

Restarting the update for a second time after it failed will complete without errors.

I guess the network session stack is dumped on the windows side when the new network card driver is activated, and the linux side of the winrm session somehow fails to notice this.

There are no errors in the logs on either system. Only the winrm timeout on the ansible controller.

ignore_errors: true does not help, and a rescue section on a block does not help either. Is there a way to catch the winrm timeout?

This was done on a Windows 10 Pro x64 guest on VMware. The VMware Tools got an update for their vmxnet3 network card.

A simple workaround would be to tell vmware to update the tools before running the ansible play to update windows. But it is annoying.

ISSUE TYPE
COMPONENT NAME

community.windows.win_updates

ANSIBLE VERSION
ansible 2.9.18
  config file = /home/rob/code/local/ansible.cfg
  configured module search path = ['/home/rob/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules', '/home/rob/code/ansible-freeipa/plugins/modules']
  ansible python module location = /usr/lib/python3.9/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 3.9.2 (default, Feb 20 2021, 00:00:00) [GCC 10.2.1 20201125 (Red Hat 10.2.1-9)]
CONFIGURATION
pipelining = true
inventory = something else from default
OS / ENVIRONMENT

windows 10 pro

STEPS TO REPRODUCE

- Install a VMware Windows 10 Pro guest with old VMware Tools (11.1.5-16724464 or older will do).
- Make sure there is no internet access during the installation, to prevent automatic updates.
- Once you are logged in to the desktop, enable internet access.
- Configure WinRM access for Ansible and make sure your win_updates categories contain Drivers.
- Run a play that updates the Windows 10 guest. There will be a winrm timeout.
- Check the logs on the Windows system to verify whether the driver was updated. If it was not, run the playbook again.

playbook

---
- name: update windows hosts
  hosts: windows
  gather_facts: false
  remote_user: rob
  collections:
  - ansible.windows

  tasks:
  - name: test connectivity
    win_ping:

  - name: include update role
    include_role:
      name: update_windows

roles/update_windows/tasks/main.yml


---
- name: wait for connection
  wait_for_connection:

- name: ensure winrm service has delayed start
  ansible.windows.win_service:
    name: WinRM
    state: started
    start_mode: delayed

- block:
  - name: check for updates windows
    ansible.windows.win_updates:
      category_names:
      - Application
      - Connectors
      - CriticalUpdates
      - DefinitionUpdates
      - DeveloperKits
      - Drivers
      - FeaturePacks
      - Guidance
      - SecurityUpdates
      - ServicePacks
      - Tools
      - UpdateRollups
      - Updates
      - Upgrades
      state: installed
      log_path: c:\ansible_wu.txt
    register: win_update
    ignore_errors: true
  rescue:
  - name: set windows update failure
    set_fact:
      windows_update_fail: true

- name: reboot system
  ansible.windows.win_reboot:
    post_reboot_delay: '60'
    reboot_timeout: '3600'
  when:
  - win_update.reboot_required is defined
  - win_update.reboot_required | bool

EXPECTED RESULTS

A reboot after the update if one is needed, and a Windows 10 VM that has been updated.

ACTUAL RESULTS

A winrm timeout (the timeout is set to 600 seconds).

jborean93 commented 3 years ago

This is really expected: when the network is dropped, the connection closes and WinRM stops any processes that have been spawned. I recently changed the win_updates module in the latest release so that the update process runs as a background task and Ansible polls it for the result over separate connections. This should be enough for it to withstand network drops during an update installation, but I cannot guarantee that it will actually work, as I never explicitly tested it.

If it doesn't work then the only real option available to you is to just block that particular update using reject_list and have it get installed manually.
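
A minimal sketch of that reject_list approach, assuming the driver update can be matched by its title. The pattern and the shortened category list below are hypothetical placeholders, not the real title of the VMware driver update; reject_list entries are treated as regular expressions matched against update titles or KB numbers.

```yaml
- name: install updates but skip the network driver update
  ansible.windows.win_updates:
    category_names:
    - Drivers
    - SecurityUpdates
    # Hypothetical pattern: replace it with the title or KB of the update to block.
    reject_list:
    - '.*VMware.*Net.*'
    state: installed
  register: win_update
```

The rejected update then has to be installed by hand, as suggested above.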

RROZEK93 commented 3 years ago

@jborean93 Do you know which version this was implemented in? Thanks

ducthanh1809 commented 3 years ago

I encountered the same issues, hope to see the new win_updates module be updated soon.

RobVerduijn commented 3 years ago

Hi, using reject_list as a workaround is not an option, since the KB article will always be a different one. Not installing driver updates is also not an option, since this will not be limited to the network driver.

And sadly, when using the latest version of win_updates, the winrm connection still dies with a timeout and it still breaks the play. ignore_errors does not prevent it and a rescue will not catch it.

Rob

agibson2 commented 3 years ago

The fix is to use the newest version of Ansible, which switched to using a scheduled task to do the updates. The newer version packages the main Ansible as ansible-core and the community content separately. I don't know what distro you are using, but EPEL announced a week or so ago that their maintained version of Ansible, which can be used on RHEL, Rocky Linux, Oracle Linux and AlmaLinux, will be migrating from 2.9.x to the latest version sometime in the next few months. I am not sure if that is just for EL8 or if it also applies to EL7 though.

You can also remove Ansible from your system's packaging and manually install ansible-core using pip and ansible-galaxy for ansible.windows collection. I used the instructions for setting it up in the home directory using virtualenv which isolates the pip and ansible-galaxy installs to a user home directory so that it doesn't install into the system.

I am still using 2.9.x from EPEL myself and just don't include driver updates and do them manually but I plan to migrate when EPEL updates ansible to use ansible-core.

jborean93 commented 3 years ago

Based on the comments here this still seems to be an issue. The changes made recently converted the PR to use a scheduled task and Ansible just polls it for the output but unfortunately that's still not enough to fix this problem. More work needs to be done to make the action plugin handle dropped networks to solve this issue.

Unfortunately, for now the only workaround is to not install such updates, which, as mentioned, isn't ideal or very feasible.

agibson2 commented 3 years ago

This bug report is talking about 2.9. The updated code is in ansible-core and not 2.9. Is someone actually saying they are using the latest ansible-core version and experiencing the issue?

jborean93 commented 3 years ago

While the original report is for the code at 2.9, if they said they are using reject_list then they are using the new code as well. It does not surprise me that it cannot handle these drops, as the code doesn't have any explicit handling for this use case when polling the output.

RobVerduijn commented 3 years ago

Hi @jborean93, after another day of happy debugging on the Windows update implementation, I figured out a workaround in my play. In a block section, after the update installs the driver and Ansible loses the connection, I added a wait_for_connection statement with a serious timeout in my rescue section. After a long wait it finally picked up the connection again and the play finished.
To see exactly what I did, see my play here. It's a really dirty play, but it's the best I could come up with to make sure it repeats the Windows update-reboot procedure until it's completed.
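
A rough sketch of what such a block could look like; this is not the linked play. The delay and timeout values and the shortened category list are hypothetical, and it assumes the dropped WinRM session surfaces as a task failure that rescue can catch.

```yaml
- block:
  - name: install updates
    ansible.windows.win_updates:
      category_names:
      - Drivers
      - SecurityUpdates
      state: installed
    register: win_update
  rescue:
  # Runs only if a task in the block fails (assumed here: the winrm timeout).
  - name: wait for the host to become reachable again
    wait_for_connection:
      delay: 30       # hypothetical: seconds to wait before the first check
      timeout: 1800   # hypothetical: give up after 30 minutes
```

Note that the task inside the block does not set ignore_errors: true; with that set, the failure is ignored and the rescue section never runs.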

@agibson2 please keep in mind that not everybody has the luxury of using pip to install ansible-core; some of us are required to work with the latest stable RPM version (no Rawhide). The best we can do is use the collection, even though this gives mixed results in combination with Ansible 2.9.

agibson2 commented 3 years ago

So does that mean you are using 2.9 but with ansible.windows collection that is for the new ansible-core? I didn't know that would even work. That might be the confusion then because the bug report is for 2.9 but the ansible.windows is newer.

EPEL is EOLing ansible 2.9 in a few months supposedly because Ansible 2.9 is going EOL upstream. That is why they are transitioning to ansible-core. I am looking forward to when that happens.

RobVerduijn commented 3 years ago

Well, that makes two of us; writing playbooks that work on Tower becomes increasingly difficult. Red Hat really leaves its customers hanging when it comes to Ansible updates and Tower. We don't really have a choice but to try to mix and match collections with Ansible 2.9, knowing that most of it is unsupported and will very likely break.

Anyway enough whining, as far as I can tell this issue has a workaround now so you can close it.

Just wish Red Hat would release those official ansible-core RPMs. I'm getting tired of this trial-and-error method.

Rob

jborean93 commented 3 years ago

Have you heard about execution environments? They are the mechanism that Red Hat supports going forward: https://www.redhat.com/en/technologies/management/ansible/automation-execution-environments. There is an EE for 2.11 which supports this collection fully. I'm unsure how it all fits together, but this is meant to be the way forward in terms of Red Hat support. This and Ansible Controller (what Tower is now called) are all part of the new Ansible Automation Platform, which covers Ansible beyond 2.9, the legacy setup.

If you need help with trying to navigate these new names and how to start your upgrade so you aren't stuck on 2.9 I recommend you reach out to Red Hat support.

RobVerduijn commented 3 years ago

Yes, I did hear about that, and I already managed to allocate time at work to look into the transition. However, going from Tower 3.8.3 to Ansible Automation Platform 2.0 does not seem to be a trivial change. I also already know that 2.0 does not include the mesh solution that was promised 2 years ago, and since 2.0 is still early access it will probably take a lot longer. I just hope it doesn't require another overhaul of the entire environment.

RobVerduijn commented 2 years ago

Hello, we're currently in the middle of a migration to AAP 2.1.x (this was a rather complex exercise), so now I have the latest EE using the latest ansible.windows collection

and yet the issue is still present

I run win_updates twice, the first time with ignore_unreachable: true so that my playbook does not break with an 'unreachable' error. There was no failure during the updates; it's just the VMware network driver update that fubars the winrm session.

ansible_core = 2.12.1
ansible.windows = 1.9.0

After the 'unreachable' failure the results were not registered. Because my play depends on the results of win_updates being registered, I run win_updates again with 'state: searched', NOT 'state: installed', because the latter will very likely yield an error. After that the results are registered, and my playbook knows that it needs to reboot the server and run the updates again.

unreachable_error:

fatal: [w2k19]: UNREACHABLE! => {
    "changed": false,
    "failed_update_count": 0,
    "filtered_updates": {},
    "found_update_count": 0,
    "installed_update_count": 0,
    "invocation": {
        "module_args": {
            "accept_list": null,
            "category_names": [
                "Application",
                "Connectors",
                "CriticalUpdates",
                "DefinitionUpdates",
                "DeveloperKits",
                "Drivers",
                "FeaturePacks",
                "Guidance",
                "SecurityUpdates",
                "ServicePacks",
                "Tools",
                "UpdateRollups",
                "Updates",
                "Upgrades"
            ],
            "log_path": "c:\\ansible_wu.txt",
            "reboot": false,
            "reboot_timeout": 1200,
            "reject_list": null,
            "server_selection": "default",
            "skip_optional": false,
            "state": "installed",
            "use_scheduled_task": false
        }
    },
    "msg": "winrm connection error: HTTPSConnectionPool(host='192.168.122.166', port=5986): Read timed out. (read timeout=1200)",
    "skip_reason": "Host w2k19 is unreachable",
    "unreachable": true,
    "updates": {}
}

my workaround now:

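        # First pass: install updates and tolerate the host going unreachable
        # when the network driver is replaced.
        - name: Install updates
          ansible.windows.win_updates:
            category_names: "{{ windows_update_category_names }}"
            state: installed
            log_path: c:\ansible_wu.txt
          register: win_update
          ignore_unreachable: true

        # Second pass: search only, so that win_update is registered again
        # after an unreachable result.
        - name: Search for updates to register the results
          ansible.windows.win_updates:
            category_names: "{{ windows_update_category_names }}"
            state: searched
            log_path: c:\ansible_wu.txt
          register: win_update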
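
What the play does with the registered result is described above but not shown. A rough sketch, assuming `win_update` was registered by the search-only task and reusing the reboot condition from the original role; the debug task is only there for illustration.

```yaml
- name: show how many updates the search still found
  ansible.builtin.debug:
    msg: "{{ win_update.found_update_count }} update(s) still pending"

- name: reboot when the search reports a pending reboot
  ansible.windows.win_reboot:
    reboot_timeout: 3600
  when:
  - win_update.reboot_required is defined
  - win_update.reboot_required | bool
```

After the reboot, the install and search tasks would be run again until nothing is left to install.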