ansible / proposals

Repository for sharing and tracking progress on enhancement proposals for Ansible.

Limit number of concurrent executions for a single task #129

Closed t-woerner closed 3 years ago

t-woerner commented 6 years ago

Proposal: Limit number of concurrent executions for a single task

Author: Thomas Woerner IRC: twoerner

Date: 2018-07-16

Motivation

In a playbook with many tasks there may be one or more tasks that have issues with parallel execution because of access limitations or conflicts during execution.

For FreeIPA replica deployments we have exactly this issue. There is one task out of more than thirty that cannot be executed more than once or twice in parallel right now. This is due to an access limitation on the server side and also a possible conflict while being executed in parallel.

Limiting the whole playbook execution to one worker with forks: 1 would result in a very long execution time, while the remaining tasks could be executed in parallel.

Problems

Solution proposal

Add a new task attribute, e.g. max_concurrent, that limits the number of concurrent executions of the current task. Add an additional check to StrategyBase._queue_task that resets the current worker id to 0 if task.max_concurrent is greater than 0 and the current worker id is greater than task.max_concurrent. This mirrors what is already done for the global forks setting (StrategyBase._workers): StrategyBase._workers contains forks workers. max_concurrent is a task-specific version of forks.

max_concurrent cannot increase the number of workers that are used to process the tasks in the playbook. It can only limit the already defined number of workers.

The pull request limits concurrent task executions in the same way that single tasks are attached to the available workers, so there should be no behaviour change.

What is the expected behavior for nonlinear strategies?

With strategy:free the expected behaviour should be very similar as with linear: As long as there is one task in the list of tasks that are currently handled or queued, the number of used workers is limited to max_concurrent. With free other tasks that are executed at the same time as a task using max_concurrent are affected as well.

What is the relationship to any_errors_fatal/play serial/run_once/etc?

The only relationship I see is that any_errors_fatal and max_concurrent are both task attributes. There should be no change with max_concurrent, as max_concurrent should not alter any error handling or behaviour.

The relationship to serial is that serial does something similar, but on the playbook level and in an invasive way. It creates serialized batches and then uses TaskQueueManager to run the playbook for each batch. There is no more relationship than doing something similar on a different level.

run_once runs a task on only one host out of play_hosts. This is useful if you can expect exactly the same results from all hosts in play_hosts. max_concurrent, on the other hand, does not limit the execution to max_concurrent hosts. The task is eventually executed on all hosts, but only on max_concurrent hosts at the same time.

forks serializes the playbook execution to forks hosts at a time. It is similar to max_concurrent, but it applies only on a playbook basis.

What is the appropriate name+keyword of the feature?

I used max_concurrent because it best describes what it does. But as there is serial for playbooks, we might also use serial as the final name for max_concurrent. Then, however, we would also need to support percentages for consistency.

Example:

...
  - name: Task to be executed in a serialized way
    serialize_me:
      attr: test
    when: serialized_task_needed
    max_concurrent: 1
...

Testing

It might be necessary to add tests to make sure that the serialization of tasks also works with strategy: free.

Documentation

Documentation is needed for the limitation of concurrent task executions for single tasks. It should be very similar to the forks documentation, but limited to tasks only.

agaffney commented 6 years ago

Wouldn't it make sense to extend the existing play-level serial keyword instead of creating a new one that operates just at the task level?

t-woerner commented 6 years ago

Yes, using the serial keyword would be good. But adding this functionality to PlaybookExecutor would require a far more invasive change. The current implementation is very simple and not invasive.

bcoca commented 6 years ago

Cannot the same result be had with an intermediate play with serial: 1? i.e.:

- hosts: all
  tasks:
   ....
- hosts: all
  serial: 1
  tasks:
    ...
- hosts: all
  tasks:
    ...
dagwieers commented 6 years ago

@bcoca You cannot do this in a role, so it is pretty limited. Besides, serial has the downside that it blocks a whole batch if one host is very slow. Instead, being able to use forks on a single task, block, role or play would be very useful. Rationale: https://github.com/ansible/ansible/issues/24037

t-woerner commented 6 years ago

@bcoca We have playbooks that use several roles. The task that requires the special treatment is within tasks/main.yml of one of the roles. The role is already used in two playbooks and will later also be used in another role via include_role.

With strategy:free the expected behaviour should be very similar as with linear: As long as there is one task in the list of tasks that are currently handled or queued, the number of used workers is limited to max_concurrent. With free other tasks that are executed at the same time as a task using max_concurrent are affected as well.

If you are using serial, then you are already limiting the number of parallel executions. If max_concurrent is set additionally, then the number of concurrently executed tasks is further limited if and only if max_concurrent < serial (serial is taken as a plain number here). If max_concurrent is bigger than serial, then there should be no effective change.
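To illustrate the intended interaction (a sketch only; max_concurrent was a proposed keyword and never merged under this name, and the host pattern and command are placeholders):

```yaml
# Sketch: serial: 4 caps each play batch at 4 hosts; max_concurrent: 2
# would further cap this single task to 2 hosts at a time within the batch.
- hosts: replicas
  serial: 4
  tasks:
    - name: Sensitive task
      command: /usr/sbin/do-enroll      # placeholder command
      max_concurrent: 2
```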

I do not see how max_concurrent could affect max_fail_percentage any more than forks already can.

max_concurrent cannot increase the number of workers that are used to process the tasks in the playbook. It can only limit the already defined number of workers.

max_concurrent will affect per-loop forks if both are specified at the same time and max_concurrent < per-loop-forks.

bcoca commented 6 years ago

@dagwieers alternatively:

- task:
  delegate_to: '{{item}}'
  with_items: '{{ansible_play_hosts}}'
  run_once: true
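Filled out, this sketch might look like the following (the module name and registered variable are placeholders, not from the thread):

```yaml
# Hypothetical: run one task per play host from a single worker,
# pulling per-host data registered by an earlier task via hostvars.
- name: Serialized step across all play hosts
  ipa_replica_step:                       # placeholder module name
    value: "{{ hostvars[item]['resultvar'] }}"
  delegate_to: "{{ item }}"
  with_items: "{{ ansible_play_hosts }}"
  run_once: true
```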
dagwieers commented 6 years ago

@bcoca In our use-case we have a terminal server that can only handle 4 concurrent connections reliably. So we need forks: 4 on a per-task basis.

dagwieers commented 6 years ago

@mpdehaan actually hinted at this possibility here: https://groups.google.com/d/msg/ansible-project/rBcWzXjt-Xc/_QCTljBcCG0J

bcoca commented 6 years ago

The problem is for non-linear strategies, since forks: 1 on task X seems to force all other tasks to wait, which does not seem right to me.

t-woerner commented 6 years ago

@bcoca Using the workaround with delegate_to, with_items and run_once correctly is not that simple when you are using registered results from previous tasks, as we do. We might also need to register the results of one of the affected tasks later. For me, only the execution of the task on the first host succeeded, because the tasks on all hosts got the settings of the first host. Even the succeeded task was marked as failed, and playbook processing stopped completely, including for the host that succeeded. Therefore this is not a possible solution for us.

bcoca commented 6 years ago

@t-woerner to use previous results, you can use hostvars[item]['resultvar'], and to consume results from that task you can do registeredvar['results'][ansible_play_hosts.index(inventory_hostname)].

As for the 'failed' status, you can use the registered variable to make the task 'failed' only if all results failed, e.g.: failed_when: regvar.results | select('failed') | list | length == regvar.results | length
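Put together, the suggested error handling might look like the following sketch (the command, message, and variable names are illustrative, not from the thread; the aggregate check is done in a follow-up task because per-item conditions are evaluated while the loop is still running):

```yaml
# Sketch of the delegate_to/run_once workaround with aggregate error
# handling. '/usr/local/bin/do-step' is a placeholder command.
- name: Run the serialized step on every play host, one at a time
  command: /usr/local/bin/do-step --host {{ item }}
  delegate_to: "{{ item }}"
  with_items: "{{ ansible_play_hosts }}"
  run_once: true
  register: regvar
  ignore_errors: true

- name: Fail only if the step failed on every host
  fail:
    msg: "do-step failed on all hosts"
  when: regvar.results | select('failed') | list | length == regvar.results | length
  run_once: true
```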

t-woerner commented 6 years ago

@bcoca Yes, with a non-linear strategy all tasks are affected while a task with max_concurrent is processed. This is something that I expected. If you want, we could simply limit the use of max_concurrent to the linear strategy.

t-woerner commented 6 years ago

Even if this workaround works for me at some point, with failed and changed shown correctly, it will only be able to handle the max_concurrent: 1 case. There is work under way in FreeIPA to increase the number of reliably deployable parallel replicas beyond 1. Therefore we also need a solution for the max_concurrent: 2 or max_concurrent: 3 cases.

bcoca commented 6 years ago

Well, those can be handled with a serial play, but not 'mid-role'. If nothing else, we have narrowed down the use case that cannot be covered by existing methods.

The limitation to 'linear' does not seem to be required by design; is it just a limitation of how you want to implement it?

t-woerner commented 6 years ago

No, the limitation is not needed by design. The question now is whether a documentation entry is sufficient to note that max_concurrent (or however we name it in the end) will affect other tasks under non-linear strategies.

rodrigobrim commented 6 years ago

The solution with delegate_to doesn't work for my problem. I need a variable loaded from a non-looped module (vmware_guest); max_concurrent solves my problem. With delegate_to, nios_host_record fails:

TASK [configure host entry] **
failed: [testevmbrim002] (item=testevmbrim002) => {"item": "testevmbrim002", "msg": "Failed to connect to the host via ssh: ssh_exchange_identification: Connection closed by remote host\r\n", "unreachable": true}
failed: [testevmbrim002] (item=testevmbrim003) => {"item": "testevmbrim003", "msg": "Failed to connect to the host via ssh: ssh_exchange_identification: Connection closed by remote host\r\n", "unreachable": true}
failed: [testevmbrim002] (item=testevmbrim001) => {"item": "testevmbrim001", "msg": "Failed to connect to the host via ssh: ssh_exchange_identification: Connection closed by remote host\r\n", "unreachable": true}
fatal: [testevmbrim002]: UNREACHABLE! => {"changed": false, "msg": "All items completed", "results": [{"_ansible_ignore_errors": null, "_ansible_item_label": "testevmbrim002", "_ansible_item_result": true, "item": "testevmbrim002", "msg": "Failed to connect to the host via ssh: ssh_exchange_identification: Connection closed by remote host\r\n", "unreachable": true}, {"_ansible_ignore_errors": null, "_ansible_item_label": "testevmbrim003", "_ansible_item_result": true, "item": "testevmbrim003", "msg": "Failed to connect to the host via ssh: ssh_exchange_identification: Connection closed by remote host\r\n", "unreachable": true}, {"_ansible_ignore_errors": null, "_ansible_item_label": "testevmbrim001", "_ansible_item_result": true, "item": "testevmbrim001", "msg": "Failed to connect to the host via ssh: ssh_exchange_identification: Connection closed by remote host\r\n", "unreachable": true}]}

bcoca commented 6 years ago

@rodrigobrim your failures seem unrelated to this discussion; that is a connection issue.

Also, there are no 'loop modules'; loops are produced by Ansible around any module/action.

bcoca commented 3 years ago

implemented as the throttle keyword
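The throttle keyword (available since Ansible 2.9) covers the proposal's use case; the example task from the proposal could be written as:

```yaml
# throttle limits how many workers execute this task concurrently.
- name: Task to be executed in a serialized way
  serialize_me:                  # placeholder module from the proposal
    attr: test
  when: serialized_task_needed
  throttle: 1
```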