MaterializeInc / materialize

The Cloud Operational Data Store: use SQL to transform, deliver, and act on fast-changing data.
https://materialize.com

Decouple compute controller processing from the coordinator #27656

Open teskje opened 2 months ago

teskje commented 2 months ago

The compute controller is currently mostly implemented as a library whose usage follows a "ready/process" pattern. The coordinator's message loop selects on the ready method, along with other things it waits on. When ready resolves, that signals that the compute controller has work to do, so the coordinator invokes its process method. process then performs any work that has queued up in the compute controller.
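The ready/process pattern described above can be sketched roughly as follows. This is a simplified, synchronous stand-in and not the actual Materialize code: the real ready and process methods are async and the coordinator selects over several sources, whereas here `ToyController`, `Message`, and `run_coordinator` are hypothetical names invented for illustration.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for the compute controller. Work queues up
/// internally and is only performed when `process` is called.
struct ToyController {
    pending: VecDeque<String>,
}

impl ToyController {
    fn new() -> Self {
        Self { pending: VecDeque::new() }
    }

    /// Queue a piece of work, e.g. a response arriving from a replica.
    fn enqueue(&mut self, work: String) {
        self.pending.push_back(work);
    }

    /// Synchronous stand-in for the async `ready` method: reports
    /// whether the controller has work to do.
    fn ready(&self) -> bool {
        !self.pending.is_empty()
    }

    /// Performs all queued work on the caller's thread and returns how
    /// many items were handled. The caller is blocked until this returns.
    fn process(&mut self) -> usize {
        let mut handled = 0;
        while self.pending.pop_front().is_some() {
            handled += 1;
        }
        handled
    }
}

/// Hypothetical inputs to the coordinator's message loop.
enum Message {
    UserQuery(String),
    ControllerInput(String),
}

/// Toy message loop: controller processing runs inline, interleaved
/// with (and therefore delaying) the handling of other messages.
fn run_coordinator(messages: Vec<Message>) -> Vec<String> {
    let mut controller = ToyController::new();
    let mut log = Vec::new();
    for msg in messages {
        match msg {
            Message::UserQuery(q) => log.push(format!("query {}", q)),
            Message::ControllerInput(w) => controller.enqueue(w),
        }
        if controller.ready() {
            let handled = controller.process();
            log.push(format!("processed {} item(s)", handled));
        }
    }
    log
}
```

The point of the sketch is that `process` runs on the same thread as query handling, so any time it takes is time during which no other message is served.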

process is an async method, which means it can in theory block the coordinator for an unbounded amount of time. In practice we hope that it never blocks for long in async calls, but there is no guarantee. There are also no guarantees about how long the synchronous processing takes. Any time spent on the coordinator's message-handling thread delays the handling of other messages, including user queries, which can degrade the responsiveness of the system.

Previously, the compute controller's processing had to be invoked in the coordinator's message loop because it required access to a &mut StorageController. With the introduction of StorageCollections, most of the compute controller no longer depends on external mutable state, so it is possible to decouple it from the message loop.

The proposal is to spawn a separate task for each Instance managing a compute cluster. The top-level ComputeController dispatches commands to the different instances through command queues, and each instance continually reads from its queue and executes the provided commands. This will decouple most of the compute controller's processing from the coordinator.
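A minimal sketch of the proposed shape, under stated assumptions: OS threads and std channels stand in for async tasks and their queues, and the `Command` variants, `InstanceHandle`, and all method names are hypothetical, chosen only to illustrate a top-level controller dispatching to per-instance workers.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical commands sent to an instance.
enum Command {
    CreateDataflow(String),
    DropDataflow(String),
}

/// State owned exclusively by one instance's worker.
struct Instance {
    dataflows: Vec<String>,
}

impl Instance {
    fn handle(&mut self, cmd: Command) {
        match cmd {
            Command::CreateDataflow(name) => self.dataflows.push(name),
            Command::DropDataflow(name) => self.dataflows.retain(|d| *d != name),
        }
    }
}

/// What the top-level controller keeps per instance: the sending half
/// of the command queue and a handle to the worker.
struct InstanceHandle {
    tx: Sender<Command>,
    worker: JoinHandle<Vec<String>>,
}

struct ComputeController {
    instances: HashMap<u64, InstanceHandle>,
}

impl ComputeController {
    fn new() -> Self {
        Self { instances: HashMap::new() }
    }

    /// Spawn a worker (a thread here, a task in the proposal) that
    /// continually reads from its queue and executes commands.
    fn create_instance(&mut self, id: u64) {
        let (tx, rx) = channel::<Command>();
        let worker = thread::spawn(move || {
            let mut instance = Instance { dataflows: Vec::new() };
            // The loop ends once every sender has been dropped.
            for cmd in rx {
                instance.handle(cmd);
            }
            instance.dataflows
        });
        self.instances.insert(id, InstanceHandle { tx, worker });
    }

    /// Dispatch without waiting for execution. Returns false if the
    /// instance is unknown or its worker has stopped.
    fn send(&self, id: u64, cmd: Command) -> bool {
        match self.instances.get(&id) {
            Some(handle) => handle.tx.send(cmd).is_ok(),
            None => false,
        }
    }

    /// Close the queue, wait for the worker to drain it, and return
    /// the instance's final set of dataflows.
    fn drop_instance(&mut self, id: u64) -> Option<Vec<String>> {
        let InstanceHandle { tx, worker } = self.instances.remove(&id)?;
        drop(tx);
        worker.join().ok()
    }
}
```

In this sketch `send` returns as soon as the command is queued, so the caller (the coordinator, in the proposal) never waits on an instance's processing. Commands sent to one instance are still executed in the order they were sent.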

Blockers

This is blocked by https://github.com/MaterializeInc/materialize/issues/24266, the migration of the compute controller to ReadHolds. Having the compute controller run in the background concurrently will make it impossible for the coordinator to rely on read frontiers not advancing during its processing, so we will need to find and resolve all places where it does so (if there still are any). Having the compute controller communicate its requirements in the form of ReadHold capabilities will make this much easier.

teskje commented 2 months ago

Apart from responsiveness concerns, doing this refactor also avoids bugs that come from a disagreement between ready and process, like the one that caused https://github.com/MaterializeInc/materialize/pull/27518.