ansible / proposals

Repository for sharing and tracking progress on enhancement proposals for Ansible.
Creative Commons Zero v1.0 Universal

New `max_fail` parameter for easier and more powerful control over allowing some hosts to fail #133

Open gdubicki opened 6 years ago

gdubicki commented 6 years ago

Proposal: New max_fail parameter for easier and more powerful control over allowing some hosts to fail

Author: Greg Dubicki <@gdubicki>

Date: 2018-07-23

Motivation

Make Ansible rolling updates easier and more powerful by giving better control over how many hosts are allowed to fail, using a new max_fail parameter in place of max_fail_percentage.

Problems

The currently used max_fail_percentage parameter is hard to use and limited in environments where the number of hosts in a host group varies.

Let's say you have this configuration for a rolling update playbook:

serial:
- 1 # test batch
- 30%
- 30%
- 100% # to ensure all the remaining hosts are processed in 4th batch

Let's also assume that you are using this code in 5 datacenters with varying numbers of nodes, from 30 to 100, and that the current number changes because of growth on the one hand and current node issues on the other.

Let's assume that your capacity allows you to treat a rolling update with up to 5 failed nodes in any one datacenter as a successful one.

First of all, the current max_fail_percentage parameter does not let you set a global maximum number of failed nodes, only a per-batch one.

Secondly, you would have to work out a percentage specific to each datacenter and keep that number in sync with the current number of nodes in each one...
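To make both points concrete, here is a minimal Python sketch (not Ansible code). It assumes that percentage entries in serial are taken of the total host count and rounded down, with at least one host per batch, and that max_fail_percentage stops a batch only when the failed share exceeds the value. The helper names batch_sizes and allowed_failures are made up for this illustration.

```python
from typing import List, Union


def batch_sizes(serial: List[Union[int, str]], total_hosts: int) -> List[int]:
    """Split total_hosts into batches following a serial list.

    Assumption: "N%" means N percent of the total host count, rounded down,
    with a minimum of one host; the last entry repeats if hosts remain.
    """
    sizes = []
    remaining = total_hosts
    index = 0
    while remaining > 0:
        spec = serial[min(index, len(serial) - 1)]
        if isinstance(spec, str) and spec.endswith("%"):
            size = int(total_hosts * float(spec[:-1]) / 100) or 1
        else:
            size = int(spec)
        size = min(size, remaining)  # never take more hosts than are left
        sizes.append(size)
        remaining -= size
        index += 1
    return sizes


def allowed_failures(batch_size: int, max_fail_percentage: int) -> int:
    """Largest failure count in a batch that does not exceed the percentage."""
    return batch_size * max_fail_percentage // 100


serial = [1, "30%", "30%", "100%"]
for total in (100, 30):
    sizes = batch_sizes(serial, total)
    limits = [allowed_failures(size, 5) for size in sizes]
    print(total, sizes, limits)
```

Under these assumptions, a single 5% setting lets a 100-node datacenter lose at most one node in each of the larger batches, while a 30-node datacenter cannot lose any, and in neither case is there one limit covering the whole run.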

Solution proposal

Example code for the problem described in the section above would be:

serial:
- 1 # test batch
- 30%
- 30%
- 100% # to ensure all the remaining hosts are processed in 4th batch
max_fail: 5

...or, to ensure that we don't allow the only node in the first batch to fail but allow at most 2 nodes to fail in each of the other batches (note that this is NOT equivalent to the configuration above):

serial:
- 1 # test batch
- 30%
- 30%
- 100% # to ensure all the remaining hosts are processed in 4th batch
max_fail:
- 0 # test batch has to pass
- 2
- 2
- 2
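The difference between the two forms can be sketched in Python. This is only an illustration of the intended behaviour, not an implementation. It assumes that a single number is a limit on the total number of failed hosts across the whole play, that a list gives one limit per batch, and that the play stops as soon as a limit is exceeded. The function name first_aborted_batch is hypothetical.

```python
from typing import List, Optional, Union


def first_aborted_batch(
    failures_per_batch: List[int],
    max_fail: Union[int, List[int]],
) -> Optional[int]:
    """Return the index of the batch that stops the play, or None.

    Assumed semantics (not confirmed by any implementation):
    - int:  limit on the running total of failed hosts in the play
    - list: separate limit for each batch, matched by position
    """
    total_failed = 0
    for index, failed in enumerate(failures_per_batch):
        total_failed += failed
        if isinstance(max_fail, list):
            exceeded = failed > max_fail[index]
        else:
            exceeded = total_failed > max_fail
        if exceeded:
            return index
    return None


# Two failures in each batch after the test batch: 6 failed hosts in total.
failures = [0, 2, 2, 2]
print(first_aborted_batch(failures, 5))             # global limit of 5
print(first_aborted_batch(failures, [0, 2, 2, 2]))  # per-batch limits
```

With these inputs the global limit stops the play in the last batch, because the total reaches 6, while the per-batch limits let it finish. In the other direction, a failure in the test batch passes the global limit but is rejected by the per-batch list.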

Testing (optional)

Unit tests to verify expected behaviour.

Documentation (optional)

Documentation update is needed.

gdubicki commented 6 years ago

Please let me know what you think about it, because I would like to make a PR, but not if you object to the idea and it has no chance of getting merged.

justinotherguy commented 3 years ago

Any progress on this one?

Would look forward to a solution, too.