Scheduler Logic redesign

DiegoTavares commented 3 years ago

Opening an issue to start drafting a proposal for the new scheduler logic, as discussed in the last TSC meeting (Jul 21).

Problems with the current design

The scheduling logic depends on the database to keep all instances of Cuebot on the same page. This can be a scalability issue: adding more Cuebots also increases the load on the database, which eventually slows down queries.
The current design relies on multiple threadpools with manually configured sizes and limits. Finding the optimal number of threads for each threadpool can be tricky, because the time required to execute each type of task changes with the speed of the database queries, which in turn depends directly on the number of jobs and their status. A better solution would calibrate itself to the load, or would avoid separately tuned threadpools altogether by using some form of back-pressure logic.
The current design allows multiple instances to try to book the same job for multiple hosts, failing only when the database identifies a conflict at the dispatch step. All computation up to the dispatch step is then a waste of CPU time.
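
To make the back-pressure idea in the second problem concrete, here is a minimal sketch, not Cuebot code. It assumes a single bounded queue in place of several tuned threadpools; `run_with_backpressure`, `max_pending` and `num_workers` are hypothetical names chosen for illustration.

```python
import queue
import threading


def run_with_backpressure(tasks, worker, max_pending=4, num_workers=2):
    """Run tasks through a bounded queue; the producer blocks when it is full."""
    pending = queue.Queue(maxsize=max_pending)  # the bound provides back-pressure
    results = []
    results_lock = threading.Lock()

    def consume():
        while True:
            task = pending.get()
            if task is None:  # sentinel: no more work for this worker
                return
            out = worker(task)
            with results_lock:
                results.append(out)

    threads = [threading.Thread(target=consume) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for task in tasks:
        pending.put(task)  # blocks while max_pending tasks are already waiting
    for _ in threads:
        pending.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return results
```

When the workers slow down (for example because database queries are slow), the producer slows down with them instead of piling up work, so there is no per-pool size to recalibrate.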

Proposal

TBD

splhack commented 3 years ago

FIFO scheduling
- https://github.com/AcademySoftwareFoundation/OpenCue/pull/1060
GPU scheduling logic
- https://github.com/AcademySoftwareFoundation/OpenCue/issues/991
- For example, we might want a way to assign CPUs evenly to each GPU (40 CPU cores / 8 GPU units → 5 CPU cores per GPU). We can configure this only if every machine has the same configuration; it is not possible with mixed hardware.
One frame per machine
- https://lists.aswf.io/g/opencue-user/message/411
- how can I achieve that one machine get only one frame and that frame render use all cores on that machine, regardless how many cores the given machine have
Limit per machine
- The current limit works well for licenses, but it is not sufficient for per-machine limits.
- For example, the Arnold license allows running multiple instances on the same machine. https://arnoldsupport.com/2015/02/05/running-multiple-arnold-instances-on-the-same-computer/
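
For the GPU scheduling item above, the even split could be computed per host rather than configured once for the whole farm. This is only a sketch of the arithmetic; `cores_per_gpu` and the host names are hypothetical and not part of OpenCue.

```python
def cores_per_gpu(cpu_cores, gpu_units):
    """Return how many CPU cores to pair with each GPU on a single host."""
    if gpu_units <= 0:
        raise ValueError("host has no GPUs")
    return cpu_cores // gpu_units  # leftover cores stay unassigned


# Hypothetical hosts with different hardware: (CPU cores, GPU units).
hosts = {"gpu-a": (40, 8), "gpu-b": (64, 4)}
per_host = {name: cores_per_gpu(c, g) for name, (c, g) in hosts.items()}
```

Here `gpu-a` matches the 40 / 8 example from the list, and `gpu-b` shows that a machine with a different configuration gets its own ratio.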

splhack commented 3 years ago

In our environment, no matter how many Cuebot instances there are, the frame launching speed is about 8 frames per second. It could be faster if we could avoid this error by improving the scheduling 🙂

frame reservation error, dispatchProcToJob failed to book next frame, com.imageworks.spcue.dispatcher.FrameReservationException: the frame ... was updated by another thread.

An idea. Introduce a new frame state, SCHEDULING or something like that.

findNextDispatchFrames
- Update the frame state to SCHEDULING and set ts_updated to the current timestamp for frames where frame.str_state='WAITING', or where frame.str_state='SCHEDULING' and a certain time has passed since frame.ts_updated. The second condition recovers frames left stranded by a Cuebot crash.
- At the same time, retrieve the updated frames with RETURNING clause.
Schedule the frame!
- the frame ... was updated by another thread won't happen because findNextDispatchFrames is atomic.
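
The claim step described above can be modelled in memory as follows. This is a sketch under assumptions, not the Cuebot implementation: the lock stands in for the database executing one atomic UPDATE with a RETURNING clause, and `FrameTable`, `frame_id` and `stale_after` are hypothetical names.

```python
import threading
from dataclasses import dataclass

WAITING, SCHEDULING = "WAITING", "SCHEDULING"


@dataclass
class Frame:
    frame_id: str
    str_state: str = WAITING
    ts_updated: float = 0.0


class FrameTable:
    """In-memory stand-in for the frame table."""

    def __init__(self, frames, stale_after=60.0):
        self._frames = list(frames)
        self._lock = threading.Lock()  # plays the role of the atomic UPDATE
        self._stale_after = stale_after  # hypothetical timeout in seconds

    def find_next_dispatch_frames(self, now, limit):
        """Claim up to `limit` frames and return their ids."""
        claimed = []
        with self._lock:
            for frame in self._frames:
                if len(claimed) >= limit:
                    break
                # A frame stuck in SCHEDULING for too long is claimable again.
                stale = (frame.str_state == SCHEDULING
                         and now - frame.ts_updated > self._stale_after)
                if frame.str_state == WAITING or stale:
                    frame.str_state = SCHEDULING
                    frame.ts_updated = now
                    claimed.append(frame.frame_id)
        return claimed
```

Two callers never receive the same frame, because a frame leaves the WAITING state in the same step that returns it. Only a stale SCHEDULING frame can be handed out a second time.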

splhack commented 2 years ago

Summarized an experimental optimization and the theory in #1069

To solve the scalability issues in #1012 and #1069, my hunch is that we need some sort of central scheduler process (or one of the Cuebot instances could work like that).

#1012 - need to avoid wasteful duplicated frame scheduling by multiple Cuebot processes and threads.
#1069 - need to reduce the sort cost by keeping a sorted list [insertion: O(log N)], although the linear search to check utilization cannot be avoided.

Possible Logic

Cuebots notify the central scheduler process when
- Cuebot received a job submission, job priority change (group change)
  - The central scheduler process inserts/moves the job to the sorted job list
- Cuebot dispatched a frame
  - The central scheduler process updates the job CPU/GPU utilization
- Cuebot received a frame complete message or Cuebot stopped a frame
  - The central scheduler process updates the job CPU/GPU utilization
  - Also remove the job from the sorted job list if it finished
The central scheduler process also can reconstruct the sorted job list and CPU/GPU utilization from the ground up.
Cuebots get a frame from the central scheduler process for dispatch

Maybe LevelDB is one of the best solutions. Two maps:

Sorted Job list
- Key: the combination of int_priority + ts_started(maybe ULID), or random number for the current round-robin scheduling
- Value: Job UUID, group/job/layer CPU/GPU utilization
Job UUID to the key map
- Key: Job UUID
- Value: the key of sorted Job list
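
The two maps can be sketched in memory like this, as an illustration rather than a LevelDB integration. Assumptions made here: a higher int_priority should come first, the job UUID is appended to the key as a tie-breaker, and `SortedJobIndex` is a hypothetical name.

```python
import bisect


class SortedJobIndex:
    def __init__(self):
        self._keys = []        # sorted (-int_priority, ts_started, job_id) keys
        self._values = {}      # sort key -> utilization record
        self._key_by_job = {}  # job UUID -> sort key (the second map)

    def upsert(self, job_id, int_priority, ts_started, utilization=None):
        """Insert a job, or move it when its priority changed."""
        old_key = self._key_by_job.get(job_id)
        if old_key is not None:
            previous = self._remove_key(old_key)
            if utilization is None:
                utilization = previous  # keep the record when only moving
        key = (-int_priority, ts_started, job_id)
        # bisect locates the slot in O(log N); the Python list still shifts
        # elements on insert, which an ordered key-value store would avoid.
        bisect.insort(self._keys, key)
        self._values[key] = utilization or {}
        self._key_by_job[job_id] = key

    def remove(self, job_id):
        key = self._key_by_job.pop(job_id, None)
        if key is not None:
            self._remove_key(key)

    def _remove_key(self, key):
        del self._keys[bisect.bisect_left(self._keys, key)]
        return self._values.pop(key)

    def jobs_in_order(self):
        return [key[2] for key in self._keys]
```

The second map is what makes a priority change cheap: the old key is looked up by job UUID, removed, and the job is reinserted under its new key.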

DiegoTavares commented 2 years ago

I like the idea of a central scheduler process. We're currently evaluating Redis Streams as an option to handle not only the job queue but also the HostReports. Will update this issue as soon as we have more to share.

oliviascarfone commented 2 years ago

Proposal - High level overview

Central Scheduler Design Logic

Use Redis Streams for incoming HostReports and for dispatching jobs. Redis Streams support persistent storage and ordered events, and can store multiple keys/values per event.

This approach will decouple the processing of HostReports from the dispatch of jobs. Redis Streams with consumer groups guarantee that each message is given to a different consumer (the same message will not reach multiple consumers within the same group). This addresses the current flaw where Cuebot instances try to assign jobs that other Cuebot instances have already dispatched. There will be two types of streams: one in which RQD publishes HostReports that are consumed by Cuebot, and one in which Cuebot publishes available jobs and RQDs consume them. In the latter case, Cuebot will periodically query the database to get a list of jobs that are available for processing.

Logic for Host Reports Queue

RQD acts as producer and will send HostReports to a dedicated Redis Stream for HostReports
- At least once semantics (dropping is less critical, as host reports are sent on an interval)
All Cuebot instances are added to the same Consumer Group and are listening for incoming messages.
HostReports are then stored in the database
In RQD: create connection to Redis server in RqCore module
In Cuebot: create class RedisConsumer, which can be initialized at application start up and connect to the Redis server, and will await incoming messages
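
A rough sketch of the HostReports flow, written in Python for brevity even though Cuebot itself is Java. It assumes a redis-py style client created with `decode_responses=True` and a consumer group that already exists (for example via `xgroup_create`); the stream name, group name and both function names are hypothetical.

```python
import json

STREAM = "host_reports"  # hypothetical stream name
GROUP = "cuebots"        # hypothetical consumer group shared by all Cuebots


def publish_host_report(client, report):
    """RQD side: append one HostReport to the stream and return its entry id."""
    return client.xadd(STREAM, {"payload": json.dumps(report)})


def consume_host_reports(client, consumer_name, handle, count=10):
    """Cuebot side: read new entries for this consumer, handle and acknowledge them."""
    # ">" asks for entries never delivered to any consumer in the group.
    batches = client.xreadgroup(GROUP, consumer_name, {STREAM: ">"}, count=count)
    handled = 0
    for _stream, entries in batches:
        for entry_id, fields in entries:
            handle(json.loads(fields["payload"]))  # e.g. store it in the database
            client.xack(STREAM, GROUP, entry_id)   # acknowledge only after handling
            handled += 1
    return handled
```

Acknowledging after `handle` returns means a report whose handler failed stays pending and can be redelivered, which matches the at-least-once note in the list above.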

Logic for Job Queue

Cuebot acts as the producer. A service such as ZooKeeper elects one instance as the leader; the leader polls the database on an interval for available jobs and publishes them as a priority queue to the Redis Stream dedicated to pending jobs
Limit the number of Cuebot instances accessing the database directly to one (the elected leader instance will access the db)
RQD as consumer gets the highest priority job and determines whether this job can run on that host. RQD acknowledges the job if it can run it; otherwise the job remains in a pending state for other hosts to check.
- Another idea: configure consumer groups based on RQD host characteristics and dispatch jobs to specific consumer groups based on job requirements
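
The host-side check in the job queue logic might look roughly like this. The resource fields and the names `Host`, `PendingJob`, `fits` and `pick_job` are assumptions made for illustration; the proposal does not specify which job requirements RQD would compare.

```python
from dataclasses import dataclass


@dataclass
class Host:
    idle_cores: int
    idle_memory_mb: int
    idle_gpus: int = 0


@dataclass
class PendingJob:
    job_id: str
    priority: int
    min_cores: int
    min_memory_mb: int
    min_gpus: int = 0


def fits(job, host):
    """True when the host has enough idle resources for the job."""
    return (job.min_cores <= host.idle_cores
            and job.min_memory_mb <= host.idle_memory_mb
            and job.min_gpus <= host.idle_gpus)


def pick_job(pending, host):
    """Return the highest-priority pending job that fits the host, or None."""
    for job in sorted(pending, key=lambda j: -j.priority):
        if fits(job, host):
            return job  # RQD would acknowledge this job
    return None  # nothing fits; every job stays pending for other hosts
```

A job that does not fit is simply skipped, so it stays available for other hosts, as described above.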

thunders82 commented 2 years ago

Hi,

It's nice to see you are looking into the scheduler logic redesign. I shared my thoughts on the way GPU nodes are handled in #991, and @splhack kindly pointed me to this thread to share them here.

Currently, if a GPU node is not in use by any GPU job, it will still not accept any CPU job. This is a waste of resources. I was wondering if it would be possible to implement logic similar to the following:

Priority to GPU tasks on GPU nodes:

A CPU job should be assigned to a GPU node only if no more CPU nodes are available and no GPU job is in the queue.
If a CPU job is running on a GPU node and no other GPU node is available --> kill the CPU job and retry it on a different node to make room for the GPU job waiting in the queue.
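
The two rules above can be written as a pair of predicates. This is a sketch of the suggestion only, not existing OpenCue behaviour, and the function and parameter names are hypothetical.

```python
def may_place_cpu_job_on_gpu_node(free_cpu_nodes, queued_gpu_jobs):
    """Rule 1: use a GPU node for CPU work only when nothing else is available
    and no GPU job is waiting."""
    return free_cpu_nodes == 0 and queued_gpu_jobs == 0


def should_preempt_cpu_job(node_runs_cpu_job, free_gpu_nodes, queued_gpu_jobs):
    """Rule 2: kill and retry the CPU job when a GPU job is waiting and no
    other GPU node is free."""
    return node_runs_cpu_job and free_gpu_nodes == 0 and queued_gpu_jobs > 0
```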

What do you think?

Thank you
