ansible / awx

AWX provides a web-based user interface, REST API, and task engine built on top of Ansible. It is one of the upstream projects for Red Hat Ansible Automation Platform.
Other
13.93k stars 3.41k forks source link

Scheduler/Dispatcher does not honor all schedules #7659

Open moonrail opened 4 years ago

moonrail commented 4 years ago
ISSUE TYPE
SUMMARY

We are testing AWX/Ansible Tower as a scalable central Deployment-Method & Technical Workflow Handler and ran into an issue with schedules. We have a colocated kubernetes cluster of 15 AWX instances (1485 capacity) & dedicated DB. For testing we've made several testjobs. While trying 100 scheduled Jobs per minute, we've been experiencing Database-Locks (330 seconds while updating "Last Job ID" on Templates) when using the same project, job-template & inventory.

So we've split and reduced it to:

Scheduler should therefore start 48 Jobs per minute.

But it only starts around 10-30 Jobs per minute.

We've couldn't find any errors in logs.

ENVIRONMENT
STEPS TO REPRODUCE

Enable SSH Pipelining in ansible.cfg

Create Playbook to simulate workload:

Create Resources:

EXPECTED RESULTS

48 Jobs per minute

ACTUAL RESULTS

10-30 Jobs per minute, even with a lot of remaining capacity (~75%)

ADDITIONAL INFORMATION

Made a simple script to log job counts. Example output:

Date | Queued Jobs | Queued Diff | Running Jobs | Finished Jobs (not canceled) | Finished Jobs Diff
2020-07-20T15:41:22+02:00 | 41 | 33 | 19 | 625169 | 12
2020-07-20T15:42:23+02:00 | 10 | -31 | 23 | 625204 | 35
2020-07-20T15:43:25+02:00 | 14 | 4 | 11 | 625225 | 21
2020-07-20T15:44:26+02:00 | 14 | 0 | 11 | 625244 | 19
2020-07-20T15:45:27+02:00 | 2 | -12 | 19 | 625267 | 23
2020-07-20T15:46:28+02:00 | 4 | 2 | 15 | 625288 | 21
2020-07-20T15:47:30+02:00 | 13 | 9 | 33 | 625320 | 32
2020-07-20T15:48:31+02:00 | 40 | 27 | 19 | 625337 | 17
2020-07-20T15:49:32+02:00 | 21 | -19 | 19 | 625363 | 26
2020-07-20T15:50:33+02:00 | 15 | -6 | 15 | 625379 | 16
2020-07-20T15:51:35+02:00 | 9 | -6 | 13 | 625405 | 26
2020-07-20T15:52:36+02:00 | 2 | -7 | 17 | 625424 | 19
2020-07-20T15:53:38+02:00 | 32 | 30 | 19 | 625443 | 19
2020-07-20T15:54:39+02:00 | 3 | -29 | 26 | 625478 | 35
2020-07-20T15:55:40+02:00 | 1 | -2 | 12 | 625502 | 24
2020-07-20T15:56:41+02:00 | 2 | 1 | 12 | 625520 | 18
2020-07-20T15:57:42+02:00 | 1 | -1 | 9 | 625537 | 17
2020-07-20T15:58:44+02:00 | 27 | 26 | 11 | 625551 | 14
2020-07-20T15:59:45+02:00 | 3 | -24 | 39 | 625579 | 28
2020-07-20T16:00:46+02:00 | 17 | 14 | 45 | 625604 | 25
2020-07-20T16:01:47+02:00 | 10 | -7 | 53 | 625641 | 37
2020-07-20T16:02:48+02:00 | 6 | -4 | 30 | 625668 | 27
2020-07-20T16:03:50+02:00 | 6 | 0 | 14 | 625693 | 25
2020-07-20T16:04:51+02:00 | 4 | -2 | 14 | 625713 | 20
2020-07-20T16:05:52+02:00 | 9 | 5 | 19 | 625730 | 17

A lot of capacity is remaining:

{
    (...)
    "name": "tower",
    (...)
    "capacity": 1485,
    "committed_capacity": 0,
    "consumed_capacity": 310,
    "percent_capacity_remaining": 79.12,
    (...)
}
blomquisg commented 4 years ago

@ryanpetrello can you take a peek at this one?

kdelee commented 4 years ago

@psuriset of interest to scale team

kdelee commented 4 years ago

I'm interested in this issue so let me know if you want help testing @ryanpetrello (or whoever looks at this)

moonrail commented 4 years ago

Any updates? We are still experiencing this with 14.0.0.

kdelee commented 4 years ago

@moonrail can you confirm that these jobs have allow_simultaneous set to True? (this is how it appears in the API, in the UI it is a checkbox that says "Enable Concurrent Jobs"

moonrail commented 4 years ago

@kdelee Sorry for not mentioning this: Yes, all scheduled Job Templates were configured to run simultaniously.

I have not found any time yet to test patch #8074 in combination with these use cases. If I do, I;ll post my results.

kdelee commented 4 years ago

@moonrail also, do you have update_on_launch set on the inventory sources? Inventory sources run updates 1 at a time, so if each job launch triggers an inventory source, the jobs would end up blocked behind the inventory updates that are blocked on eachother.

moonrail commented 4 years ago

@kdelee For these tests, update on project update was set on for inventory sources. Update on launch was off. Projects were not updated while testing, so syncs should not be the issue.

But if syncs would be the cause - shouldn't then all jobs be at least queued as scheduled, started and wait/fail/lock up because of dependencies (sync, project git update)?

Now that I mention it: SCM Branch Override was off on Projects. Projects were static while testing.

moonrail commented 3 years ago

So I've had time to test again. Now that 15.0.0 is released, I've used this version. It contains #8071 and some other performance optimizations, but unfortunately there is no improvement visible.

I've used:

Playbooks, Repository & Inventory-Content identical to opening post.

Schedules:

Settings: MAXIMUM SCHEDULED JOBS: 10000

Result:

Date | Queued Jobs | Queued Diff | Running Jobs | Finished Jobs (not canceled) | Finished Jobs Diff
2020-10-05T13:11:51+02:00 | 11 | 11 | 20 | 696877 | 696877
2020-10-05T13:12:53+02:00 | 13 | 2 | 37 | 696895 | 18
2020-10-05T13:13:55+02:00 | 16 | 3 | 49 | 696918 | 23
2020-10-05T13:14:56+02:00 | 17 | 1 | 48 | 696930 | 12
2020-10-05T13:15:58+02:00 | 17 | 0 | 47 | 696946 | 16
2020-10-05T13:17:00+02:00 | 16 | -1 | 38 | 696967 | 21
2020-10-05T13:18:02+02:00 | 17 | 1 | 25 | 696990 | 23
2020-10-05T13:19:04+02:00 | 15 | -2 | 21 | 697013 | 23
2020-10-05T13:20:06+02:00 | 20 | 5 | 17 | 697031 | 18
2020-10-05T13:21:07+02:00 | 16 | -4 | 19 | 697052 | 21
2020-10-05T13:22:09+02:00 | 15 | -1 | 27 | 697064 | 12
2020-10-05T13:23:11+02:00 | 16 | 1 | 26 | 697083 | 19
2020-10-05T13:24:13+02:00 | 19 | 3 | 27 | 697104 | 21
2020-10-05T13:25:14+02:00 | 30 | 11 | 47 | 697134 | 30
2020-10-05T13:26:16+02:00 | 37 | 7 | 51 | 697152 | 18
2020-10-05T13:27:18+02:00 | 25 | -12 | 57 | 697174 | 22
2020-10-05T13:28:20+02:00 | 30 | 5 | 57 | 697181 | 7
2020-10-05T13:29:21+02:00 | 35 | 5 | 49 | 697196 | 15
2020-10-05T13:30:23+02:00 | 35 | 0 | 31 | 697220 | 24
2020-10-05T13:31:25+02:00 | 41 | 6 | 39 | 697262 | 42
2020-10-05T13:32:27+02:00 | 44 | 3 | 37 | 697266 | 4
2020-10-05T13:33:29+02:00 | 31 | -13 | 42 | 697283 | 17
2020-10-05T13:34:30+02:00 | 24 | -7 | 45 | 697301 | 18
2020-10-05T13:35:32+02:00 | 27 | 3 | 35 | 697328 | 27
2020-10-05T13:36:34+02:00 | 25 | -2 | 35 | 697348 | 20
2020-10-05T13:37:36+02:00 | 33 | 8 | 32 | 697362 | 14
2020-10-05T13:38:37+02:00 | 26 | -7 | 31 | 697390 | 28
2020-10-05T13:39:39+02:00 | 37 | 11 | 27 | 697412 | 22
2020-10-05T13:40:41+02:00 | 32 | -5 | 58 | 697436 | 24
2020-10-05T13:41:43+02:00 | 61 | 29 | 48 | 697448 | 12
2020-10-05T13:42:45+02:00 | 43 | -18 | 49 | 697473 | 25
2020-10-05T13:43:46+02:00 | 37 | -6 | 51 | 697490 | 17
2020-10-05T13:44:48+02:00 | 53 | 16 | 55 | 697509 | 19
2020-10-05T13:45:50+02:00 | 59 | 6 | 48 | 697521 | 12
2020-10-05T13:46:51+02:00 | 57 | -2 | 52 | 697549 | 28
2020-10-05T13:47:53+02:00 | 67 | 10 | 40 | 697567 | 18
2020-10-05T13:48:55+02:00 | 53 | -14 | 41 | 697602 | 35
2020-10-05T13:49:57+02:00 | 45 | -8 | 48 | 697625 | 23
2020-10-05T13:50:59+02:00 | 44 | -1 | 47 | 697642 | 17
2020-10-05T13:52:00+02:00 | 55 | 11 | 50 | 697644 | 2
2020-10-05T13:53:03+02:00 | 62 | 7 | 62 | 697672 | 28
2020-10-05T13:54:05+02:00 | 70 | 8 | 54 | 697695 | 23
2020-10-05T13:55:07+02:00 | 67 | -3 | 55 | 697714 | 19
2020-10-05T13:56:08+02:00 | 67 | 0 | 53 | 697736 | 22
2020-10-05T13:57:10+02:00 | 66 | -1 | 53 | 697752 | 16
2020-10-05T13:58:12+02:00 | 88 | 22 | 47 | 697765 | 13
2020-10-05T13:59:14+02:00 | 70 | -18 | 62 | 697789 | 24
2020-10-05T14:00:15+02:00 | 68 | -2 | 50 | 697811 | 22
2020-10-05T14:01:17+02:00 | 53 | -15 | 53 | 697830 | 19
2020-10-05T14:02:19+02:00 | 57 | 4 | 42 | 697842 | 12
2020-10-05T14:03:21+02:00 | 51 | -6 | 43 | 697855 | 13
2020-10-05T14:04:23+02:00 | 54 | 3 | 25 | 697879 | 24
2020-10-05T14:05:24+02:00 | 61 | 7 | 16 | 697905 | 26
2020-10-05T14:06:26+02:00 | 52 | -9 | 19 | 697924 | 19
2020-10-05T14:07:28+02:00 | 62 | 10 | 20 | 697939 | 15
2020-10-05T14:08:30+02:00 | 77 | 15 | 24 | 697956 | 17
2020-10-05T14:09:31+02:00 | 56 | -21 | 32 | 697979 | 23
2020-10-05T14:10:33+02:00 | 52 | -4 | 37 | 697990 | 11
2020-10-05T14:11:35+02:00 | 60 | 8 | 34 | 697995 | 5
2020-10-05T14:12:37+02:00 | 50 | -10 | 21 | 698026 | 31

I've watched specific Schedules and noticed, they are executed, but delayed by 3-10 minutes.

So e.g. watching a Schedule it can show "NEXT RUN" 1:16:00 PM even after server time exceeded it, until it would be updated to "NEXT RUN" 1:19:00 PM.