Ansible 2.17 returns code 2 when some hosts are failing and rescued

ClausHolbechArista commented 5 months ago

Summary

When using block/rescue functionality in combination with run_once the ansible-playbook command returns code 2 when only some hosts are failing+rescued. When running the same playbook with --limit to only include a failing+rescued host it returns code 0. When running with --limit to only include a succeeding host it also returns code 0. Suspecting changes from #78680 to be causing this.

Issue Type

Bug Report

Component Name

ansible

Ansible Version

$ ansible --version
ansible [core 2.17.0]
  config file = None
  configured module search path = ['/home/holbech/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /home/holbech/.local/lib/python3.10/site-packages/ansible
  ansible collection location = /home/holbech/.ansible/collections:/usr/share/ansible/collections
  executable location = /home/holbech/.local/bin/ansible
  python version = 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (/usr/bin/python)
  jinja version = 3.1.4
  libyaml = True

Configuration

# if using a version older than ansible-core 2.12 you should omit the '-t all'
$ ansible-config dump --only-changed -t all
CONFIG_FILE() = None

OS / Environment

Ubuntu 22.04.4 LTS

Steps to Reproduce

Using a similar playbook as the integration test added in #78680

# inventory.yml
all:
  hosts:
    testhost:
    testhost2:

# playbook.yml
- hosts: testhost,testhost2
  gather_facts: false
  any_errors_fatal: true
  tasks:
    - block:
        - debug:
            msg: Some task running for all hosts
        - fail:
          run_once: true
        - name: any_errors_fatal fails all hosts when any of them fails
          debug:
            msg: SHOULD NOT HAPPEN
      rescue:
        - name: Rescues both hosts
          debug:
            msg: rescue
    - name: You can recover from fatal errors by adding a rescue section to the block.
      debug:
        msg: recovered

Expected Results

Expecting return code 0.

Example from running with ansible-core version 2.16.7

$ ansible-playbook -i inventory.yml playbook.yml ; echo $? 

PLAY [testhost,testhost2] ***************************************************************************************************************************************************************************************************************************************************************

TASK [debug] ****************************************************************************************************************************************************************************************************************************************************************************
ok: [testhost] => {
    "msg": "Some task running for all hosts"
}
ok: [testhost2] => {
    "msg": "Some task running for all hosts"
}

TASK [fail] *****************************************************************************************************************************************************************************************************************************************************************************
fatal: [testhost]: FAILED! => {"changed": false, "msg": "Failed as requested from task"}

TASK [Rescues both hosts] ***************************************************************************************************************************************************************************************************************************************************************
ok: [testhost] => {
    "msg": "rescue"
}
ok: [testhost2] => {
    "msg": "rescue"
}

TASK [You can recover from fatal errors by adding a rescue section to the block.] *******************************************************************************************************************************************************************************************************
ok: [testhost] => {
    "msg": "recovered"
}
ok: [testhost2] => {
    "msg": "recovered"
}

PLAY RECAP ******************************************************************************************************************************************************************************************************************************************************************************
testhost                   : ok=3    changed=0    unreachable=0    failed=0    skipped=0    rescued=1    ignored=0   
testhost2                  : ok=3    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   

0

The last 0 is the return code.

Actual Results

Getting return code 2.

Example from running with ansible-core version 2.17.0

$ ansible-playbook -i inventory.yml playbook.yml ; echo $? 

PLAY [testhost,testhost2] ***************************************************************************************************************************************************************************************************************************************************************

TASK [debug] ****************************************************************************************************************************************************************************************************************************************************************************
ok: [testhost] => {
    "msg": "Some task running for all hosts"
}
ok: [testhost2] => {
    "msg": "Some task running for all hosts"
}

TASK [fail] *****************************************************************************************************************************************************************************************************************************************************************************
fatal: [testhost]: FAILED! => {"changed": false, "msg": "Failed as requested from task"}

TASK [Rescues both hosts] ***************************************************************************************************************************************************************************************************************************************************************
ok: [testhost] => {
    "msg": "rescue"
}

TASK [You can recover from fatal errors by adding a rescue section to the block.] *******************************************************************************************************************************************************************************************************
ok: [testhost] => {
    "msg": "recovered"
}

PLAY RECAP ******************************************************************************************************************************************************************************************************************************************************************************
testhost                   : ok=3    changed=0    unreachable=0    failed=0    skipped=0    rescued=1    ignored=0   
testhost2                  : ok=1    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   

2

The last 2 is the return code.

Code of Conduct

[X] I agree to follow the Ansible Code of Conduct

ansibot commented 5 months ago

Files identified in the description:

bin/ansible

If these files are incorrect, please update the component name section of the description or use the component bot command.

mkrizek commented 5 months ago

TL;DR the following might be a quick fix:

diff --git a/lib/ansible/plugins/strategy/__init__.py b/lib/ansible/plugins/strategy/__init__.py
index efd69efe9b4..7665c8f6ff2 100644
--- a/lib/ansible/plugins/strategy/__init__.py
+++ b/lib/ansible/plugins/strategy/__init__.py
@@ -600,7 +600,7 @@ class StrategyBase:
                     # save the current state before failing it for later inspection
                     state_when_failed = iterator.get_state_for_host(original_host.name)
                     display.debug("marking %s as failed" % original_host.name)
-                    if original_task.run_once:
+                    if original_task.run_once and not original_task.any_errors_fatal:
                         # if we're using run_once, we have to fail every host here
                         for h in self._inventory.get_hosts(iterator._play.hosts):
                             if h.name not in self._tqm._unreachable_hosts:

The underlying issue however I believe is that both run_once and any_errors_fatal are similar features. In the linear strategy run_once implies any_errors_fatal on the run_once task. In https://github.com/ansible/ansible/pull/78680 we simplified the implementation of the any_errors_fatal feature but it resulted in scenarios, like this issue describes, where these two features conflict with each other and hosts are failed twice.

I believe that a part of the problem is that run_once and any_errors_fatal are implemented in different places, in the general results processing in StrategyBase and after the results are process in the linear strategy, respectively. Since host_pinned/free do not support neither any_errors_fatal or run_once it might make sense to move both to the linear strategy although the compatibility for 3rd party strategies would have to be taken into account. Completion of https://github.com/ansible/ansible/issues/73483 would be beneficial to moving this forward too - it seems that the number of issues/PRs linked to that issue are slowly increasing.

ClausHolbechArista commented 5 months ago

FWIW I see the same behavior (RC=0 on 2.16 and RC=2 on 2.17) without setting any_errors_fatal: true.

mkrizek commented 5 months ago

FWIW I see the same behavior (RC=0 on 2.16 and RC=2 on 2.17) without setting any_errors_fatal: true.

Yes, as I said:

In the linear strategy run_once implies any_errors_fatal on the run_once task.

so removing any_errors_fatal from the play (block, task) in such a case does not effectively change anything internally.

ansible / ansible