Trying to determine if this would be a good performance boost overall.
General Situation:
As the number of offers and scheduled tasks grows, the offer scoring loop gets slower and slower; its complexity scales as {num pending tasks} x {num offers}, so when we get behind, the loop itself puts us further behind
Much of the time in the offer loop is actually spent waiting on locks, during which little actual work is done
A lot of the offer filtering/etc is redone on each run, but could instead be maintained elsewhere as a 'live' view of currently available offers
Design thoughts:
Offers flow directly into a revamped version of the offer cache
Any filtering or joining of offers with their associated agent usages can be done there (a sketch follows this list)
TODO - how to handle the case where the offer cache is turned off (shared cluster environment)
Pending tasks are held in a PriorityBlockingQueue
A preliminary queue is used with more + lighter threads that simply wait on locks
Once locks are acquired, tasks are moved to the actual pending task queue, whose consumer can assume it is behind a lock and do only meaningful work in a tight loop, with all offers/agents/usages/etc cached
Tasks are launched immediately once a match is made, instead of waiting for the rest of the poll cycle to complete (second sketch below)
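To make the offer cache idea concrete, here is a minimal sketch, assuming hypothetical `Offer`/`AgentUsage`/`CachedOffer` types in place of the real Mesos protos and usage objects. Offers are joined with agent usage as they arrive, and expiry is handled lazily on read so the matching loop never blocks:

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical types standing in for the Mesos offer proto and usage data
record Offer(String offerId, String agentId, long expiresAtMillis) {}
record AgentUsage(String agentId, double cpusUsed, double memUsedMb) {}
record CachedOffer(Offer offer, AgentUsage usage) {}

public class LiveOfferCache {
  private final Map<String, CachedOffer> offersById = new ConcurrentHashMap<>();
  private final Map<String, AgentUsage> usageByAgent = new ConcurrentHashMap<>();

  // Offers flow directly in and are joined with agent usage up front,
  // so the matching loop never has to redo the join
  public void addOffer(Offer offer) {
    AgentUsage usage = usageByAgent.getOrDefault(
        offer.agentId(), new AgentUsage(offer.agentId(), 0, 0));
    offersById.put(offer.offerId(), new CachedOffer(offer, usage));
  }

  public void updateUsage(AgentUsage usage) {
    usageByAgent.put(usage.agentId(), usage);
    // Re-join any live offers for this agent so the view stays current
    offersById.replaceAll((id, cached) ->
        cached.offer().agentId().equals(usage.agentId())
            ? new CachedOffer(cached.offer(), usage)
            : cached);
  }

  // Lazily drop expired offers on read, keeping the tight loop lock-free
  public Optional<CachedOffer> get(String offerId) {
    CachedOffer cached = offersById.get(offerId);
    if (cached == null) return Optional.empty();
    if (cached.offer().expiresAtMillis() < System.currentTimeMillis()) {
      offersById.remove(offerId);
      return Optional.empty();
    }
    return Optional.of(cached);
  }
}
```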
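And a rough sketch of the two-stage queue shape, reusing the `LiveOfferCache` above. All names here are hypothetical, and a single `Semaphore` stands in for what would really be per-request locks (a `Semaphore` rather than a `ReentrantLock` only because it can be released from a different thread than the one that acquired it):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.Semaphore;

// Hypothetical pending-task type; real priority ordering is still a TODO
record PendingTask(String taskId, long dueAtMillis) implements Comparable<PendingTask> {
  @Override
  public int compareTo(PendingTask other) {
    return Long.compare(dueAtMillis, other.dueAtMillis);
  }
}

public class TwoStageTaskQueue {
  // Stand-in for the per-request lock(s); one permit here, many locks in reality
  private final Semaphore requestLock = new Semaphore(1);

  // Preliminary queue: many light threads block here purely to wait on locks
  private final PriorityBlockingQueue<PendingTask> preliminary = new PriorityBlockingQueue<>();
  // Actual pending task queue: everything on it already holds its lock
  private final PriorityBlockingQueue<PendingTask> pending = new PriorityBlockingQueue<>();

  private final ExecutorService lockWaiters = Executors.newFixedThreadPool(16);

  public void start() {
    for (int i = 0; i < 16; i++) {
      lockWaiters.submit(this::waitForLocks);
    }
  }

  private void waitForLocks() {
    try {
      while (true) {
        PendingTask task = preliminary.take();
        requestLock.acquire(); // the only thing these threads do is wait
        pending.put(task);     // handoff: lock is now held on the task's behalf
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  // Tight loop: one task at a time, all offers/usages cached, no lock waits
  // here; launches immediately on a match, then releases the lock
  public void matchLoop(LiveOfferCache offers) throws InterruptedException {
    while (true) {
      PendingTask task = pending.take();
      try {
        // score each cached offer against just this task: O(num offers)
        // launch(task, bestOffer) as soon as a match is found
      } finally {
        requestLock.release();
      }
    }
  }

  public void submit(PendingTask task) {
    preliminary.put(task);
  }
}
```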
Upsides:
More responsive task launch timing
Complexity scales as {num offers} within the tight loop, since only a single pending task is worked on at a time
Downsides/Issues:
Is a preliminary queue of threads waiting on locks actually sustainable, or is it a bad idea?
Tricky to handle offers that may expire while we are trying to keep a consistent 'live' view of offers with minimal locking for use in a tight loop
Also need to keep the zk list of pending tasks in sync with the in-memory list. Similar to the leader cache I guess, but split between upcoming pending tasks and 'due now' pending tasks (sketch below)
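One possible shape for the upcoming vs. 'due now' split, reusing `PendingTask`/`TwoStageTaskQueue` from the earlier sketch. A `DelayQueue` holds tasks until they are actually due; the zk writes are left as comments since the persistence side is still open:

```java
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

// Hypothetical wrapper making an upcoming task eligible for the 'due now'
// queue only once its run time has arrived
class UpcomingTask implements Delayed {
  final String taskId;
  final long dueAtMillis;

  UpcomingTask(String taskId, long dueAtMillis) {
    this.taskId = taskId;
    this.dueAtMillis = dueAtMillis;
  }

  @Override
  public long getDelay(TimeUnit unit) {
    return unit.convert(dueAtMillis - System.currentTimeMillis(), TimeUnit.MILLISECONDS);
  }

  @Override
  public int compareTo(Delayed other) {
    return Long.compare(getDelay(TimeUnit.MILLISECONDS), other.getDelay(TimeUnit.MILLISECONDS));
  }
}

public class PendingTaskFeeder {
  private final DelayQueue<UpcomingTask> upcoming = new DelayQueue<>();

  public void add(UpcomingTask task) {
    // zk write would happen here too, so the persisted list mirrors memory
    upcoming.put(task);
  }

  // take() blocks until the head task is actually due, then promotes it to
  // the 'due now' queue (and would remove/update it in zk)
  public void feed(TwoStageTaskQueue dueNow) throws InterruptedException {
    while (true) {
      UpcomingTask task = upcoming.take();
      dueNow.submit(new PendingTask(task.taskId, task.dueAtMillis));
    }
  }
}
```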
Good amount of progress here. Over half of the tests are passing (290/338), and some of the remaining failures just need tests updated to fit the new scheduler style. Big open TODOs:
[ ] Behavior of the offer cache in a shared env (i.e. where the offer cache would normally have been disabled). Maybe this is just a shorter default TTL
[ ] Is there a way we could still batch launches without incurring too much overhead? e.g. use one offer for two tasks to reduce the number of mesos master calls made
[ ] Should offer matching be allowed to run concurrently or should it just be a single thread in a loop to make offer checkout/locking simpler
[ ] Metrics to inspect state of the queue
[ ] Actually implement priority for the priority queue (comparator sketch below)
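For the priority TODO, one option is a Comparator-backed PriorityBlockingQueue; the fields and weighting below are placeholders, not a decided scheme:

```java
import java.util.Comparator;
import java.util.concurrent.PriorityBlockingQueue;

// Hypothetical priority fields; the real weighting is still an open question
record PrioritizedTask(String taskId, double requestPriority, long dueAtMillis) {}

public class PriorityOrderingExample {
  // Order by explicit priority first (higher first), then by due time (earlier first)
  static final Comparator<PrioritizedTask> TASK_ORDER =
      Comparator.comparingDouble(PrioritizedTask::requestPriority).reversed()
          .thenComparingLong(PrioritizedTask::dueAtMillis);

  public static void main(String[] args) {
    PriorityBlockingQueue<PrioritizedTask> queue = new PriorityBlockingQueue<>(11, TASK_ORDER);
    queue.put(new PrioritizedTask("low", 0.25, 1_000));
    queue.put(new PrioritizedTask("high", 0.75, 2_000));
    System.out.println(queue.poll().taskId()); // "high" comes out first
  }
}
```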