ansible / awx

AWX provides a web-based user interface, REST API, and task engine built on top of Ansible. It is one of the upstream projects for Red Hat Ansible Automation Platform.

Implement Job Template Sharding/Splitting/Slicing #2174

Closed matburt closed 5 years ago

matburt commented 5 years ago

This is a work-in-progress, see https://github.com/ansible/awx/issues/1283

AlanCoding commented 5 years ago

Hit this bug:

awx_1        | 13:25:22 celeryd.1   | 2018-08-27 13:25:22,366 ERROR    awx.main.tasks Task awx.main.scheduler.tasks.run_task_manager encountered exception.
awx_1        | 13:25:22 celeryd.1   | Traceback (most recent call last):
awx_1        | 13:25:22 celeryd.1   |   File "/venv/awx/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
awx_1        | 13:25:22 celeryd.1   |     R = retval = fun(*args, **kwargs)
awx_1        | 13:25:22 celeryd.1   |   File "/venv/awx/lib/python2.7/site-packages/celery/app/trace.py", line 438, in __protected_call__
awx_1        | 13:25:22 celeryd.1   |     return self.run(*args, **kwargs)
awx_1        | 13:25:22 celeryd.1   |   File "/awx_devel/awx/main/scheduler/tasks.py", line 31, in run_task_manager
awx_1        | 13:25:22 celeryd.1   |     TaskManager().schedule()
awx_1        | 13:25:22 celeryd.1   |   File "/awx_devel/awx/main/scheduler/task_manager.py", line 693, in schedule
awx_1        | 13:25:22 celeryd.1   |     finished_wfjs = self._schedule()
awx_1        | 13:25:22 celeryd.1   |   File "/awx_devel/awx/main/scheduler/task_manager.py", line 678, in _schedule
awx_1        | 13:25:22 celeryd.1   |     self.spawn_workflow_graph_jobs(running_workflow_tasks)
awx_1        | 13:25:22 celeryd.1   |   File "/awx_devel/awx/main/scheduler/task_manager.py", line 197, in spawn_workflow_graph_jobs
awx_1        | 13:25:22 celeryd.1   |     job.name = "{} - {}".format(job.name, spawn_node.ancestor_artifacts['job_shard'] + 1)
awx_1        | 13:25:22 celeryd.1   | UnicodeEncodeError: 'ascii' codec can't encode character u'\ud007' in position 27: ordinal not in range(128)

AlanCoding commented 5 years ago

Confirmed that the bug will be resolved by:

diff --git a/awx/main/scheduler/task_manager.py b/awx/main/scheduler/task_manager.py
index 4bb0c03d70..c1f71f18d1 100644
--- a/awx/main/scheduler/task_manager.py
+++ b/awx/main/scheduler/task_manager.py
@@ -194,7 +194,7 @@ class TaskManager():
                 kv = spawn_node.get_job_kwargs()
                 job = spawn_node.unified_job_template.create_unified_job(**kv)
                 if 'job_shard' in spawn_node.ancestor_artifacts:
-                    job.name = "{} - {}".format(job.name, spawn_node.ancestor_artifacts['job_shard'] + 1)
+                    job.name = six.text_type("{} - {}").format(job.name, spawn_node.ancestor_artifacts['job_shard'] + 1)
                     job.save()
                 spawn_node.job = job
                 spawn_node.save()
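
For context, a minimal sketch of the failure mode, written for Python 3 so that it runs today. Under Python 2, formatting a byte-string template with a unicode argument implicitly encodes that argument as ASCII, which is the error in the traceback above; the sketch makes that encode step explicit. Both function names are hypothetical and are not part of AWX.

```python
def name_with_bytes_template(name, shard_index):
    # Assumption: emulates Python 2's implicit ASCII encode of a unicode
    # argument passed to a byte-string "{} - {}".format(...).
    ascii_name = name.encode("ascii").decode("ascii")  # raises on non-ASCII
    return "{} - {}".format(ascii_name, shard_index + 1)


def name_with_text_template(name, shard_index):
    # Stand-in for six.text_type("{} - {}").format(...): the template is
    # already text, so no encode step happens.
    return "{} - {}".format(name, shard_index + 1)
```

Calling `name_with_bytes_template` with a job name containing `\ud007` raises `UnicodeEncodeError`, while `name_with_text_template` returns the name with the 1-based shard number appended.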

softwarefactory-project-zuul[bot] commented 5 years ago

Build failed.

softwarefactory-project-zuul[bot] commented 5 years ago

Build failed.

softwarefactory-project-zuul[bot] commented 5 years ago

Build failed.

softwarefactory-project-zuul[bot] commented 5 years ago

Build failed.

softwarefactory-project-zuul[bot] commented 5 years ago

Build failed.

softwarefactory-project-zuul[bot] commented 5 years ago

Build failed.

softwarefactory-project-zuul[bot] commented 5 years ago

Build failed.

softwarefactory-project-zuul[bot] commented 5 years ago

Build failed.

matburt commented 5 years ago

recheck

softwarefactory-project-zuul[bot] commented 5 years ago

Build failed.

matburt commented 5 years ago

recheck

softwarefactory-project-zuul[bot] commented 5 years ago

Build failed.

AlanCoding commented 5 years ago

@wenottingham @kialam I'm staging the changes for the rename of shard->split. I would like to cover these in a single commit (will update the QE tests at the same time), and IMO it will make the most sense to update the UI at the same time. I hope that I can cover all manual work, then do a fairly automated rename of the rest, then verify both tests passing & UI functionality efficiently.

| Field or text | new value |
| --- | --- |
| job_template plus help_text | split_job_template |
| job_shard_count help_text and minimum value | job_split_count or split_job_count minimum of 1 ✅ |
| internal_limit help_text | some help text |
| sharded_jobs related link | split_jobs ✅ |
| internal limit syntax shard0of3 | split1of3 (note, changing to 1 for first) ✅ |
| UI: Edit the shard job template | Edit the split job template *splitting job template?? ✅ |
| Shard Template | Split Template Split Job Template Splitlate ✅ |

There are relatively minor decisions remaining, but I still want to reach a conclusion on those.

matburt commented 5 years ago

recheck

wenottingham commented 5 years ago

Discussing with others, and the intermediate consensus is that 'split' is not the greatest terminology either.

Current suggestions:

  1. Job Slicing -> "job slices" -> "slice count" -> "slice 1 of N"
  2. Distributed Jobs -> "job slices" -> "slice count" -> "slice 1 of N"

softwarefactory-project-zuul[bot] commented 5 years ago

Build succeeded.

AlanCoding commented 5 years ago

Second attempt at an agreeable rename:

| Field or text | new value | even newer value |
| --- | --- | --- |
| job_template w/o help_text | split_job_template | job_template with help_text |
| job_shard_count help_text and minimum value | job_split_count or split_job_count minimum of 1 ✅ | job_slice_count minimum of 1 |
| internal_limit help_text | some help text | replaced by job_slice_count (WJ & J) job_slice_number (J only) |
| sharded_jobs related link | split_jobs ✅ | slice_workflow_jobs ?? (maybe remove) |
| internal limit syntax shard0of3 | split0of3 ✅ (change to 1 for first) ❌ | slice1of3 |
| UI: Edit the shard job template | Edit the split job template ✅ *splitting job template?? | Edit the slice job template |
| Shard Template | Split Template Split Job Template Splitlate ✅ | Slice Job Template |
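
To make the `slice1of3` syntax concrete, here is a rough sketch of how such a token could be parsed and applied to a host list. The helper names are hypothetical, and the distribution rule (every Nth host, starting from the slice's 1-based position) is an assumption for illustration, not a statement of how this branch distributes hosts.

```python
import re

# Matches tokens such as "slice1of3": 1-based slice number, then slice count.
_SLICE_RE = re.compile(r"slice(\d+)of(\d+)")


def parse_slice_token(token):
    """Return (slice_number, slice_count) from a token like 'slice1of3'."""
    match = _SLICE_RE.fullmatch(token)
    if match is None:
        raise ValueError("not a slice token: {!r}".format(token))
    number, count = int(match.group(1)), int(match.group(2))
    if count < 1 or not 1 <= number <= count:
        raise ValueError("slice number out of range: {!r}".format(token))
    return number, count


def hosts_for_slice(hosts, token):
    """Assumed rule: take every count-th host, starting at index number - 1."""
    number, count = parse_slice_token(token)
    return hosts[number - 1::count]
```

With seven hosts and three slices, this rule gives slices of three, two, and two hosts, and every host lands in exactly one slice. Starting the numbering at 1 also means `slice0of3` is rejected rather than silently accepted.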

Scheme for the job & workflow job field names:

# job serialization
{
    "id": 11,
    "type": "job",
    "url": "/api/v2/jobs/11/",
    "related": {...},
    "summary_fields": {...},
    ...
    "job_slice_number": 2,
    "job_slice_count": 5
},
# workflow job serialization
{
    "id": 11,
    "type": "workflow_job",
    "url": "/api/v2/workflow_jobs/11/",
    "related": {...},
    "summary_fields": {...},
    ...
    "job_slice_count": 5
}
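
As a sketch of how a client might consume those two fields, assuming the field names proposed above end up in the API unchanged (the function name is hypothetical):

```python
def describe_slice(record):
    """Return a label such as 'slice 2 of 5', or None if not applicable."""
    # Only job records carry job_slice_number in the proposed scheme;
    # workflow jobs expose job_slice_count alone.
    if record.get("type") != "job":
        return None
    number = record.get("job_slice_number")
    count = record.get("job_slice_count")
    if number is None or count is None:
        return None
    return "slice {} of {}".format(number, count)
```

Given the job serialization above, this returns `slice 2 of 5`; for the workflow job it returns `None`, since only the count is present there.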

The related link (previously sharded_jobs, then split_jobs) would presumably have existed for the UI to use in parallel to the job template RECENT JOBS tab. That's still obtainable through a query filter, and the UI doesn't have any plans of adding such a thing to my knowledge. Maybe we just delete this.

wenottingham commented 5 years ago

"replaced by job_slice_count (WJ & J) job_slice_number (J only)"

I thought we were not having the count on the individual slices?

wenottingham commented 5 years ago

What is the field/naming for how it is split - does the template define the number of slices, or the size of any one slice?
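
To spell out the two readings of that question, a small arithmetic sketch (both function names are hypothetical, and an even distribution of hosts is assumed):

```python
def sizes_for_slice_count(host_total, slice_count):
    """Template fixes the number of slices; sizes follow from the host total."""
    base, extra = divmod(host_total, slice_count)
    # The first `extra` slices each take one additional host.
    return [base + 1 if i < extra else base for i in range(slice_count)]


def count_for_slice_size(host_total, slice_size):
    """Template fixes the size of one slice; the number of slices follows."""
    return -(-host_total // slice_size)  # ceiling division
```

With 50 hosts, a count of 3 gives slices of 17, 17, and 16 hosts, while a size of 20 gives 3 slices. The first reading keeps the number of spawned jobs fixed as the inventory grows; the second keeps the per-job host count bounded.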

AlanCoding commented 5 years ago

oh, right, I was forgetting about the confusion with non-sliced workflow jobs having an extra integer which would be confusingly 1. The other option for workflow jobs we talked about was to have a boolean field is_sliced_job. That's fine with me, but increasingly I am wanting the count to be on the job record as well.

one-t commented 5 years ago

I've identified a bug where hosts that are in groups in an inventory will be targeted in every s(lice|plit|hard).

Steps to reproduce:

  1. Create an inventory and add some hosts (I did 50)
  2. Create an inventory group and add a few hosts to it
  3. Run a job template with a split setting > 1
  4. Observe in the job details that all group hosts are targeted in every shard

I figured this out because I was doing some exploratory testing with user-supplied limits: random hosts were failing when the job was sharded, and I used this playbook to identify them:

---
- hosts: all
  gather_facts: false
  tasks:
    - fail:
      when: ansible_connection != 'local'
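
One plausible way such a symptom could arise, sketched with a toy inventory that maps group names to host lists. This is an illustration under assumed data shapes and hypothetical function names, not a diagnosis of the code in this branch.

```python
def slice_inventory_naive(inventory, slice_number, slice_count):
    """Slices only the 'all' host list; group membership is left untouched."""
    sliced = dict(inventory)
    sliced["all"] = inventory["all"][slice_number - 1::slice_count]
    return sliced


def slice_inventory_filtered(inventory, slice_number, slice_count):
    """Slices 'all', then drops group members that fall outside the slice."""
    kept = set(inventory["all"][slice_number - 1::slice_count])
    return {
        group: [host for host in hosts if host in kept]
        for group, hosts in inventory.items()
    }


def targeted_hosts(inventory):
    """Hosts reachable through any group, i.e. the union of all host lists."""
    return {host for hosts in inventory.values() for host in hosts}
```

With six hosts, a `web` group holding `h2` and `h3`, and three slices, the naive version targets `h2` and `h3` in every slice, which matches the behavior described in the steps above; the filtered version keeps each grouped host in exactly one slice.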

softwarefactory-project-zuul[bot] commented 5 years ago

Build succeeded.

AlanCoding commented 5 years ago

Update: I will keep to the plan of using the is_sliced_job field. This means that orphaned sliced workflow jobs cannot be relaunched. This seems sensible after I thought about it.

softwarefactory-project-zuul[bot] commented 5 years ago

Build failed.

softwarefactory-project-zuul[bot] commented 5 years ago

Build failed.

softwarefactory-project-zuul[bot] commented 5 years ago

Build succeeded.

AlanCoding commented 5 years ago

@kialam It looks like we need to figure out how to integrate these commits into here.

https://github.com/matburt/awx/pull/3

The things still on my agenda with relevance to this branch:

softwarefactory-project-zuul[bot] commented 5 years ago

Build failed.

softwarefactory-project-zuul[bot] commented 5 years ago

Build failed.

softwarefactory-project-zuul[bot] commented 5 years ago

Build succeeded.

softwarefactory-project-zuul[bot] commented 5 years ago

Build succeeded.

softwarefactory-project-zuul[bot] commented 5 years ago

Build succeeded.

softwarefactory-project-zuul[bot] commented 5 years ago

Build failed.

shanemcd commented 5 years ago

recheck

softwarefactory-project-zuul[bot] commented 5 years ago

Build succeeded.

softwarefactory-project-zuul[bot] commented 5 years ago

Build succeeded.

AlanCoding commented 5 years ago

A couple of pre-merge checks...

bash-4.2$ awx-manage makemigrations
No changes detected

Good with migrations.

Tests are looking good.

Will push the new tooltip text and rebase when I have agreement on that.

AlanCoding commented 5 years ago

notes for docs impact:

jakemcdermott commented 5 years ago

recheck

softwarefactory-project-zuul[bot] commented 5 years ago

Build succeeded.

softwarefactory-project-zuul[bot] commented 5 years ago

Build succeeded.

softwarefactory-project-zuul[bot] commented 5 years ago

Build failed.

jakemcdermott commented 5 years ago

recheck

softwarefactory-project-zuul[bot] commented 5 years ago

Build failed.

jakemcdermott commented 5 years ago

recheck

softwarefactory-project-zuul[bot] commented 5 years ago

Build failed.

jakemcdermott commented 5 years ago

recheck

softwarefactory-project-zuul[bot] commented 5 years ago

Build failed.