Incrementally redesign Cuebot's monolith into multiple services
Motivation
Cuebot's current design doesn't scale well horizontally. Although multiple instances of the service can be load balanced to spread RQD requests, all instances still rely on a single SQL database that can only scale vertically.
The current design relies heavily on the performance of the DispatchQuery, a costly query whose performance degrades as the frames table grows.
We have received feedback from several studios interested in the project that were wary of adding a Java-based application to their stack, as Java is not commonly used in the VFX/animation industry.
Current Design challenges
RQD instances connect directly to Cuebot over gRPC, and each connection stays bound until one of the two sides restarts, which makes redistributing load without an outage a challenge.
The scheduling logic is implemented as a step in the logic that handles RQD reports. This design not only makes the process hard to maintain, it also creates a coupling that impacts performance: any step of the report handling that takes longer than anticipated slows the rate at which frames are booked.
Performance inefficiency arises when multiple nodes attempt to book the same layer. Without a global lock mechanism, conflicts are only resolved at the final step of the booking process. That final check prevents a frame from running on multiple hosts, but the work done by the nodes that lose the conflict is wasted.
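The booking conflict described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Cuebot's actual code: the names FrameTable, first_waiting, try_book and simulate_round are hypothetical, and a plain dictionary stands in for the frames table. Every host selects a candidate from the same view of the table, and the conflict only surfaces at the final booking step.

```python
class FrameTable:
    """Toy stand-in for the frames table: frame id -> host that booked it.

    Hypothetical model for illustration only; not Cuebot's schema.
    """

    def __init__(self, frame_ids):
        self._owner = {frame_id: None for frame_id in frame_ids}

    def first_waiting(self):
        """Selection step: return the first frame that nobody has booked yet."""
        for frame_id, owner in self._owner.items():
            if owner is None:
                return frame_id
        return None

    def try_book(self, frame_id, host):
        """Final step: book the frame only if it is still unbooked."""
        if self._owner[frame_id] is None:
            self._owner[frame_id] = host
            return True
        return False

    def owner(self, frame_id):
        return self._owner[frame_id]


def simulate_round(hosts, table):
    """One booking round without a global lock.

    Phase 1: every host runs the selection step against the same state,
    so they all pick the same candidate frame.
    Phase 2: each host attempts the final booking step; only one succeeds.
    """
    candidates = {host: table.first_waiting() for host in hosts}
    wins, wasted = [], 0
    for host, frame_id in candidates.items():
        if frame_id is not None and table.try_book(frame_id, host):
            wins.append((host, frame_id))
        else:
            wasted += 1  # selection work was done, but no frame was booked
    return wins, wasted


if __name__ == "__main__":
    table = FrameTable([1, 2, 3])
    wins, wasted = simulate_round(["host-a", "host-b", "host-c"], table)
    print(wins, wasted)
```

In this model the frame never runs on two hosts, because the last step rejects the late bookings, yet two of the three selection passes produce nothing even though frames 2 and 3 are still waiting. In the real system the selection step corresponds to the costly DispatchQuery, which is why these late conflicts are expensive.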
Constraints
Proposal