ansible / proposals

Repository for sharing and tracking progress on enhancement proposals for Ansible.
Creative Commons Zero v1.0 Universal
92 stars 19 forks source link

Abort further task execution when a task fails, with multiple hosts in play #182

Closed csuka closed 3 years ago

csuka commented 3 years ago

Proposal: Abort further task execution, with multiple hosts

Author: K Csuka

Date: 2020-09-08

Motivation

Feels like this feature should be in core design

Problems

In short:

"You are running a task, and let us say there are 10 hosts in your inventory. If any one of those hosts fail, within the executing batch, you want to prevent any further hosts that haven't been queued to not execute, and for the play to abort immediately on any single host failing." This should be usable in an Ansible Role. So using serial is not possible.

Background: I've been developing an Ansible role to configure a database. It is possible to setup a cluster with multiple hosts. The role also supports adding an arbiter.

The arbiter does not has the database application installed, but this host is required in the play.

Imagine three hosts. Host1 is the arbiter, host2 and host3 have the database application installed. Host2 and host3 are setup in a cluster. The hosts are setup by the Ansible role, and Ansible managed. Now, Ansible executes the role for a second/third/fourth/whatever time, and changes a config setting of the database. The database requires a rolling restart. Usually, one writes some thing along the lines of:

- template:
    src: db_config.j2
    dest: /etc/db_config
  register: db_config
  when: not arbiter

- service:
    name: db_application
    state: restarted
  throttle: 1
  when:
    - db_config.changed
    - not arbiter

The downside of this Ansible configuration, is that if the database application fails to start on host2 due to whatever reason, Ansible will also restart the database application on host3. And that is undesired, because if the database application fails on host3 as well, then the cluster is lost. So, for this specific task I'd like Ansible to stop/abort/skip other tasks if the database application has failed to start on a single host in the play.

Solution proposal

Create an ansible task parameter, which aborts further task execution on the other hosts if the task has failed on a host. Example: abort_on_fail

- hosts:
    - host1 # the arbiter
    - host2 # db
    - host3 # db
  tasks:
    - shell: /bin/false
       throttle: 1
       when: not arbiter
       abort_on_fail: true

Would output, in this specific scenario:

PLAY [host1, host2, host3] ***********************************************************************************************************

TASK [shell] *********************************************************************************************************
skipping: [host1]
FAILED: [host2]
skipping: [host3]
NO MORE HOSTS LEFT ***************************************************************************************************

PLAY RECAP ***********************************************************************************************************
host1                      : ok=0    changed=0    unreachable=0    failed=0    skipped=1    rescued=0    ignored=0   
host2                      : ok=0    changed=0    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0
host3                      : ok=0    changed=0    unreachable=0    failed=0    skipped=1    rescued=0    ignored=0  

Dependencies (optional)

Multiple hosts in play.

Testing (optional)

Skip further task execution, when a host fails a task, and the parameter abort_on_fail is set.

Documentation (optional)

It would make sense, yes.

Anything else?

There is a workaround to solve this issue. However, this is not an elegant solution. And this workaround does not work when the first host has to skip this task, due to the definition of 'run_once'.

- hosts: all
  gather_facts: no

  tasks:
    - shell: /bin/false
      loop: "{{ ansible_play_hosts }}"
      register: failing_task
      when: "failing_task | default({}) is not failed"
      delegate_to: "{{ item }}"
      run_once: true
bcoca commented 3 years ago

You can already do this with any_errors_fatal at play level or an meta: end_play task with a conditional.

csuka commented 3 years ago

You can already do this with any_errors_fatal at play level or an meta: end_play task with a conditional.

Unfortunately, it isn't as easy as that.

When using any_errors_fatal the task would still execute on all hosts. This is undesired. I'd like Ansible to stop/abort further execution when a task has failed on one host. And meta: end_play is it's 'own' task. I don't see how that could help me out, since the failing task still would've been executed on all hosts.

felixfontein commented 3 years ago

You could iterate over the hosts manually in a run_once task, and delegate to the host. Then you can use the following construct (I'm using the file module as a conditional fail):

---
- hosts: localhost
  tasks:
  - file:
      path: /tmpASDF
      # state: absent
      state: "{{ 'absent' if item != 'host-2' else 'asdf' }}"
    loop:
      - host-1
      - host-2
      - host-3
      - host-4
      - host-5
    when:
      - item != 'host-4' and (result is undefined or result is not failed)
    register: result

  - meta: end_play
    when: result is failed

  - debug:
      msg: Will not happen

In this case, host-4 is intentionally skipped, and host-2 always produces an error (except if the other state is commented in, then it always succeeds).

If you have 100s of hosts this will be very inefficient, but if you only have a couple of hosts, this should do what you want.

bcoca commented 3 years ago

any_errors_fatal + serial: 1 is what will behave as you expect

csuka commented 3 years ago

@bcoca please read my request carefully. I cannot use serial: 1, as I'm developing a role. Requiring serial: 1 could not be a requirement of using the role.

felixfontein commented 3 years ago

@bcoca that only works well if you can put that task in its own play. It's not possible to do that in a role.

@csuka I think I missed your workaround, which is pretty similar to mine :) For me, my approach also works when skipping the first item.

csuka commented 3 years ago

@felixfontein Ehm.. I don't quite get your playbook. You set hosts: localhost, which is not functional in a role. Also, you're statically defining hosts yourself, which should be done dynamically.

And I cannot get it working properly, can you perhaps share a pastebin of the desired outcome?

Besides these points, do you agree that this function should be a parameter available in Ansible?

felixfontein commented 3 years ago

@csuka I've created a small playbook which can run on any system without knowing anything about the hosts you have. You of course have to loop over the hosts, and add run_once and delegate_to accordingly. When running this over a group of hosts which include host_which_fails, host_which_is_skipped and some more, the playbook looks like:

---
- hosts: all
  tasks:
  - file:
      path: /tmpASDF
      # state: absent
      state: "{{ 'absent' if item != 'host_which_fails' else 'asdf' }}"
    run_once: yes
    delegate_to: "{{ item }}"
    loop: "{{ ansible_play_hosts }}"
    when:
      - item != 'host_which_is_skipped' and (result is undefined or result is not failed)
    register: result

  - meta: end_play
    when: result is failed

  - debug:
      msg: Will not happen
csuka commented 3 years ago

I don't get it, it happens. I quote myself ;

"There is a workaround to solve this issue. However, this is not an elegant solution. And this workaround does not work when the first host has to skip this task, due to the definition of 'run_once'."

host-01:
  host_which_fails: false
  host_which_is_skipped: true

host-02:
  host_which_fails: true
  host_which_is_skipped: false

host-03:
  host_which_fails: false
  host_which_is_skipped: false
TASK [debug] *********************************************************************************
Tuesday 08 September 2020  21:23:39 +0200 (0:00:01.026)       0:00:02.365 *****
ok: [test-multi-01] => {
    "msg": "Will not happen"
}
ok: [test-multi-02] => {
    "msg": "Will not happen"
}
ok: [test-multi-03] => {
    "msg": "Will not happen"
}

Hence my feature request....

felixfontein commented 3 years ago

Have you tried

---
- hosts: all
  tasks:
  - file:
      path: /tmpASDF
      # state: absent
      state: "{{ 'absent' if item != 'host-03' else 'asdf' }}"
    run_once: yes
    delegate_to: "{{ item }}"
    loop: "{{ ansible_play_hosts }}"
    when:
      - item != 'host-01' and (result is undefined or result is not failed)
    register: result

  - meta: end_play
    when: result is failed

  - debug:
      msg: Will not happen

instead of replacing strings with variables?

sivel commented 3 years ago

I'm not sure how likely it would be to see a feature such as abort_on_fail implemented. Under the current architecture of ansible, it would require that abort_on_fail force the task to also use throttle: 1, and add extra state checking specific to just that use case.

I think more realistically, this is something that could be implemented in a custom strategy plugin.

csuka commented 3 years ago

@sivel I disagree. There is no need to have the task force to use throttle: 1. One could also decide to use throttle: 2 for example. So simply checking for throttle is sufficient. It shouldn't be that hard, because Ansible already has this function for other parameters as well.

Again, this feature is desired to be used in a role. So using custom strategy plugins isn't desired. Don't we think this function has to be in Ansible Core?

@felixfontein besides the first host, the other hosts are running the same command, in this case /bin/false. Meaning the configuration you provided wouldn't suffice, right?

sivel commented 3 years ago

Maybe your request isn't specific enough, but what would be the value for abort_on_fail? Your example means that if any single host failed to execute a specific task, that the play would abort and prevent all other hosts from executing. Unfortunately, with throttle/serial > 1, there are already other hosts queued and executing, which means that we cannot ensure that we stop executing immediately, it would be "eventually abort if a currently executing host fails".

In which case, it sounds like setting any_errors_fatal at the task level is what you really want.

felixfontein commented 3 years ago

@csuka if I replace file: ... with shell: /bin/false it still works fine for me, even if the first host is skipped.

csuka commented 3 years ago

@sivel that's exactly what I don't want. The idea is indeed "abort if a currently executing host fails", on task level.

However, I got what I desire indeed working, with the following configuration:

# note that test-multi-01 has host_which_is_skipped: true
---
- hosts:
    - test-multi-01
    - test-multi-02
    - test-multi-03
  tasks:
    - set_fact:
        host_which_is_skipped: "{{ inventory_hostname }}"
      when: host_which_is_skipped

    - shell: /bin/false
      run_once: yes
      delegate_to: "{{ item }}"
      loop: "{{ ansible_play_hosts }}"
      when:
        - item != host_which_is_skipped
        - result is undefined or result is not failed
      register: result

    - meta: end_play
      when: result is failed

    - debug:
        msg: Will not happen

Thanks @felixfontein . Does this feature request still suffice? As in my opinion, this is still a feature I'd like to see in Ansible, and I do see the added value of it.

sivel commented 3 years ago

The idea is indeed "abort if a currently executing host fails", on task level.

You can use any_errors_fatal: true on the task level.

csuka commented 3 years ago

Right, my mistake. But then, the task still fails on all hosts. And does not stop execution when the task has failed on a single host.

sivel commented 3 years ago

Ok, I have had a really difficult time understanding exactly what you want. I'm still not sure I fully understand, so let me try to explain based on what I have randomly guessed at.

You are running a task, and let us say there are 10 hosts in your inventory. If any one of those hosts fail, within the executing batch, you want to prevent any further hosts that haven't been queued to not execute, and for the play to abort immediately on any single host failing?

Assuming the above is correct, I think any_errors_fatal could potentially be made to do that, without introducing a new keyword. Right now any_errors_fatal is only consulted at the end of a full task execution, and not as a short circuit based on the host batch.

csuka commented 3 years ago

@sivel I apologise for not explaining it properly or clear enough. Your 'random guess' is indeed correct. And I managed to do that with the tasks I've 2 comments ago. I'll edit my initial request with your info to clear things up.

bcoca commented 3 years ago

I'm going to close this for now as this seems an extremely rare corner case and more fitting to a custom strategy. If you want to revive this proposal feel free to bring it up in the core public irc meetings https://github.com/ansible/community/issues?q=is:open+label:meeting_agenda+label:core