Open moonrail opened 4 years ago
@ryanpetrello can you take a peek at this one?
@psuriset of interest to scale team
I'm interested in this issue so let me know if you want help testing @ryanpetrello (or whoever looks at this)
Any updates? We are still experiencing this with 14.0.0.
@moonrail can you confirm that these jobs have `allow_simultaneous` set to `True`? (This is how it appears in the API; in the UI it is a checkbox that says "Enable Concurrent Jobs".)
@kdelee Sorry for not mentioning this: yes, all scheduled Job Templates were configured to run simultaneously.
I have not yet found time to test patch #8074 in combination with these use cases. If I do, I'll post my results.
@moonrail also, do you have `update_on_launch` set on the inventory sources? Inventory sources run updates one at a time, so if each job launch triggers an inventory update, the jobs would end up blocked behind the inventory updates that are blocked on each other.
@kdelee For these tests, "Update on Project Update" was enabled on the inventory sources and "Update on Launch" was off. Projects were not updated while testing, so syncs should not be the issue.
But if syncs were the cause, shouldn't all jobs at least be queued as scheduled, then start and wait/fail/lock up because of their dependencies (inventory sync, project git update)?
Now that I mention it: SCM Branch Override was off on Projects. Projects were static while testing.
So I've had time to test again. Now that 15.0.0 is released, I've used this version. It contains #8071 and some other performance optimizations, but unfortunately no improvement is visible.
I've used:
- Playbooks, Repository & Inventory-Content identical to the opening post
- Schedules:
- Settings: MAXIMUM SCHEDULED JOBS: 10000

Result:
Date | Queued Jobs | Queued Diff | Running Jobs | Finished Jobs (not canceled) | Finished Jobs Diff
--- | --- | --- | --- | --- | ---
2020-10-05T13:11:51+02:00 | 11 | 11 | 20 | 696877 | 696877
2020-10-05T13:12:53+02:00 | 13 | 2 | 37 | 696895 | 18
2020-10-05T13:13:55+02:00 | 16 | 3 | 49 | 696918 | 23
2020-10-05T13:14:56+02:00 | 17 | 1 | 48 | 696930 | 12
2020-10-05T13:15:58+02:00 | 17 | 0 | 47 | 696946 | 16
2020-10-05T13:17:00+02:00 | 16 | -1 | 38 | 696967 | 21
2020-10-05T13:18:02+02:00 | 17 | 1 | 25 | 696990 | 23
2020-10-05T13:19:04+02:00 | 15 | -2 | 21 | 697013 | 23
2020-10-05T13:20:06+02:00 | 20 | 5 | 17 | 697031 | 18
2020-10-05T13:21:07+02:00 | 16 | -4 | 19 | 697052 | 21
2020-10-05T13:22:09+02:00 | 15 | -1 | 27 | 697064 | 12
2020-10-05T13:23:11+02:00 | 16 | 1 | 26 | 697083 | 19
2020-10-05T13:24:13+02:00 | 19 | 3 | 27 | 697104 | 21
2020-10-05T13:25:14+02:00 | 30 | 11 | 47 | 697134 | 30
2020-10-05T13:26:16+02:00 | 37 | 7 | 51 | 697152 | 18
2020-10-05T13:27:18+02:00 | 25 | -12 | 57 | 697174 | 22
2020-10-05T13:28:20+02:00 | 30 | 5 | 57 | 697181 | 7
2020-10-05T13:29:21+02:00 | 35 | 5 | 49 | 697196 | 15
2020-10-05T13:30:23+02:00 | 35 | 0 | 31 | 697220 | 24
2020-10-05T13:31:25+02:00 | 41 | 6 | 39 | 697262 | 42
2020-10-05T13:32:27+02:00 | 44 | 3 | 37 | 697266 | 4
2020-10-05T13:33:29+02:00 | 31 | -13 | 42 | 697283 | 17
2020-10-05T13:34:30+02:00 | 24 | -7 | 45 | 697301 | 18
2020-10-05T13:35:32+02:00 | 27 | 3 | 35 | 697328 | 27
2020-10-05T13:36:34+02:00 | 25 | -2 | 35 | 697348 | 20
2020-10-05T13:37:36+02:00 | 33 | 8 | 32 | 697362 | 14
2020-10-05T13:38:37+02:00 | 26 | -7 | 31 | 697390 | 28
2020-10-05T13:39:39+02:00 | 37 | 11 | 27 | 697412 | 22
2020-10-05T13:40:41+02:00 | 32 | -5 | 58 | 697436 | 24
2020-10-05T13:41:43+02:00 | 61 | 29 | 48 | 697448 | 12
2020-10-05T13:42:45+02:00 | 43 | -18 | 49 | 697473 | 25
2020-10-05T13:43:46+02:00 | 37 | -6 | 51 | 697490 | 17
2020-10-05T13:44:48+02:00 | 53 | 16 | 55 | 697509 | 19
2020-10-05T13:45:50+02:00 | 59 | 6 | 48 | 697521 | 12
2020-10-05T13:46:51+02:00 | 57 | -2 | 52 | 697549 | 28
2020-10-05T13:47:53+02:00 | 67 | 10 | 40 | 697567 | 18
2020-10-05T13:48:55+02:00 | 53 | -14 | 41 | 697602 | 35
2020-10-05T13:49:57+02:00 | 45 | -8 | 48 | 697625 | 23
2020-10-05T13:50:59+02:00 | 44 | -1 | 47 | 697642 | 17
2020-10-05T13:52:00+02:00 | 55 | 11 | 50 | 697644 | 2
2020-10-05T13:53:03+02:00 | 62 | 7 | 62 | 697672 | 28
2020-10-05T13:54:05+02:00 | 70 | 8 | 54 | 697695 | 23
2020-10-05T13:55:07+02:00 | 67 | -3 | 55 | 697714 | 19
2020-10-05T13:56:08+02:00 | 67 | 0 | 53 | 697736 | 22
2020-10-05T13:57:10+02:00 | 66 | -1 | 53 | 697752 | 16
2020-10-05T13:58:12+02:00 | 88 | 22 | 47 | 697765 | 13
2020-10-05T13:59:14+02:00 | 70 | -18 | 62 | 697789 | 24
2020-10-05T14:00:15+02:00 | 68 | -2 | 50 | 697811 | 22
2020-10-05T14:01:17+02:00 | 53 | -15 | 53 | 697830 | 19
2020-10-05T14:02:19+02:00 | 57 | 4 | 42 | 697842 | 12
2020-10-05T14:03:21+02:00 | 51 | -6 | 43 | 697855 | 13
2020-10-05T14:04:23+02:00 | 54 | 3 | 25 | 697879 | 24
2020-10-05T14:05:24+02:00 | 61 | 7 | 16 | 697905 | 26
2020-10-05T14:06:26+02:00 | 52 | -9 | 19 | 697924 | 19
2020-10-05T14:07:28+02:00 | 62 | 10 | 20 | 697939 | 15
2020-10-05T14:08:30+02:00 | 77 | 15 | 24 | 697956 | 17
2020-10-05T14:09:31+02:00 | 56 | -21 | 32 | 697979 | 23
2020-10-05T14:10:33+02:00 | 52 | -4 | 37 | 697990 | 11
2020-10-05T14:11:35+02:00 | 60 | 8 | 34 | 697995 | 5
2020-10-05T14:12:37+02:00 | 50 | -10 | 21 | 698026 | 31
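Averaged over the whole table, throughput stays well below the expected 48 jobs per minute. A quick sanity check, using only the timestamps and finished-job counts from the first and last rows above:

```python
from datetime import datetime

# First and last samples from the table above: (timestamp, finished jobs)
first_ts, first_finished = datetime.fromisoformat("2020-10-05T13:11:51+02:00"), 696877
last_ts, last_finished = datetime.fromisoformat("2020-10-05T14:12:37+02:00"), 698026

elapsed_min = (last_ts - first_ts).total_seconds() / 60  # ~60.8 minutes
rate = (last_finished - first_finished) / elapsed_min
print(f"{rate:.1f} finished jobs per minute")  # ~18.9, far below the expected 48
```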
I've watched specific Schedules and noticed that they are executed, but delayed by 3-10 minutes.
For example, a Schedule can still show "NEXT RUN" 1:16:00 PM even after server time has passed it, until it is eventually updated to "NEXT RUN" 1:19:00 PM.
ISSUE TYPE
SUMMARY
We are evaluating AWX/Ansible Tower as a scalable central deployment method & technical workflow handler and ran into an issue with schedules. We have a colocated Kubernetes cluster of 15 AWX instances (1485 capacity) and a dedicated DB. For testing we created several test jobs. While trying 100 scheduled jobs per minute, we experienced database locks (330 seconds while updating "Last Job ID" on Templates) when using the same project, job template & inventory.
So we've split and reduced it to:
The scheduler should therefore start 48 Jobs per minute.
But it only starts around 10-30 Jobs per minute.
We couldn't find any errors in the logs.
ENVIRONMENT
STEPS TO REPRODUCE
Enable SSH Pipelining in ansible.cfg
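For reference, pipelining is enabled in `ansible.cfg` like this:

```ini
# ansible.cfg
[ssh_connection]
pipelining = True
```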
Create Playbook to simulate workload:
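The original playbook is not shown above; a hypothetical stand-in that simulates a short-running workload could look like this:

```yaml
# Hypothetical stand-in for the workload playbook (not the original).
- hosts: all
  gather_facts: false
  tasks:
    - name: Simulate work
      command: sleep 10
```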
Create Resources:
EXPECTED RESULTS
48 Jobs per minute
ACTUAL RESULTS
10-30 Jobs per minute, even with a lot of remaining capacity (~75%)
ADDITIONAL INFORMATION
Made a simple script to log job counts. Example output:
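The script itself is not included here; as a rough sketch, the diff columns in the table above can be derived from raw (timestamp, queued, running, finished) samples like this (`diff_rows` is a hypothetical name, not part of the original script):

```python
def diff_rows(samples):
    """Convert (timestamp, queued, running, finished) samples into table
    rows with queued/finished diffs vs. the previous sample. The first
    row's diffs equal the absolute counts, as in the table above."""
    rows, prev_queued, prev_finished = [], None, None
    for ts, queued, running, finished in samples:
        queued_diff = queued - (prev_queued if prev_queued is not None else 0)
        finished_diff = finished - (prev_finished if prev_finished is not None else 0)
        rows.append((ts, queued, queued_diff, running, finished, finished_diff))
        prev_queued, prev_finished = queued, finished
    return rows

# Reproduces the first two rows of the table above
for row in diff_rows([
    ("2020-10-05T13:11:51+02:00", 11, 20, 696877),
    ("2020-10-05T13:12:53+02:00", 13, 37, 696895),
]):
    print(" | ".join(str(col) for col in row))
```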
A lot of capacity is remaining: