MaterializeInc / materialize

The Cloud Operational Data Store: use SQL to transform, deliver, and act on fast-changing data.
https://materialize.com

[Evergreen] Dynamic cluster scheduling #13870

Open benesch opened 2 years ago

benesch commented 2 years ago

Background

Today, each cluster in Materialize corresponds to a StatefulSet in Kubernetes with largely static constraints, like "place this service in this AZ" or "use this CPU and memory limit."

This works well for customers who want fine-grained control over their infrastructure. It works less well for customers who don't want that control, and want Materialize to just do the right thing by default.

Proposal

We should create a dynamic cluster scheduler that applies flexible policies. E.g.:

The scheduler needs to effect all these changes without causing downtime. E.g., when moving a replica between AZs, it should spin up a new replica in the new AZ before terminating the old one.
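The ordering constraint above (create the replacement, wait for it to catch up, only then drop the original) can be sketched as a small simulation. This is a minimal illustration, not Materialize's scheduler: `Cluster`, `Replica`, `move_replica`, and the AZ names are all hypothetical, and `wait_until_caught_up` stands in for whatever readiness check a real scheduler would poll.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    az: str
    caught_up: bool = False


@dataclass
class Cluster:
    # Hypothetical in-memory model of a cluster; not a real Materialize API.
    replicas: dict = field(default_factory=dict)
    log: list = field(default_factory=list)  # ordered record of actions taken

    def create(self, name, az):
        self.replicas[name] = Replica(name, az)
        self.log.append(f"create {name} in {az}")

    def wait_until_caught_up(self, name):
        # Stand-in for polling replica readiness; always succeeds here.
        self.replicas[name].caught_up = True
        self.log.append(f"ready {name}")

    def drop(self, name):
        del self.replicas[name]
        self.log.append(f"drop {name}")


def move_replica(cluster, old_name, new_name, new_az):
    """Move a replica to new_az without a gap in serving capacity."""
    cluster.create(new_name, new_az)        # 1. bring up the replacement
    cluster.wait_until_caught_up(new_name)  # 2. wait until it can serve
    cluster.drop(old_name)                  # 3. only now retire the original


cluster = Cluster()
cluster.create("r1", "az-a")
cluster.wait_until_caught_up("r1")
move_replica(cluster, "r1", "r2", "az-b")
print(cluster.log)
```

At every point in the log at least one caught-up replica exists, which is the property the scheduler would need to preserve.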

Outstanding work

cc @jseldess

chuck-alt-delete commented 1 year ago

Trying to clarify: would this capability enable a “self destruct” like capability where someone could create a temporary cluster or temporary source? This kind of functionality could have a big impact on go-to-market strategy.

benesch commented 1 year ago

would this capability enable a “self destruct” like capability where someone could create a temporary cluster or temporary source?

Yep, it totally could!

benesch commented 1 year ago

Posting some very loose syntax proposals from a recent Slack conversation on this topic:

-- Create a cluster with automatically managed replicas.
CREATE CLUSTER foo REPLICATION FACTOR 2, SIZE 'medium';

-- Size up the cluster. This automatically spins up a new replica
-- at the new size, waits for it to catch up,
-- and then spins down the old replica.
ALTER CLUSTER foo SIZE 'large';

-- Add a new replica automatically.
ALTER CLUSTER foo REPLICATION FACTOR 3;

-- Turn off the cluster for the night.
ALTER CLUSTER foo REPLICATION FACTOR 0;

-- One day...
CREATE CLUSTER blah;
-- ...will create a cluster that automatically scales up and down in
-- response to workload.

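One way to read the statements above is as a desired state (`SIZE`, `REPLICATION FACTOR`) that a scheduler reconciles against the replicas that currently exist. The sketch below is only an illustration under that assumption: the `plan` function, the `r1`, `r2`, ... naming scheme, and the action tuples are hypothetical and are not part of the proposal.

```python
def plan(current, size, replication_factor):
    """Return ordered actions to reach the desired state.

    current: dict mapping replica name -> size.
    Creates are listed before drops; a real scheduler would also wait for
    the new replicas to catch up before executing any drop.
    """
    # Replicas already at the desired size can be kept, up to the target count.
    keep = [n for n, s in sorted(current.items()) if s == size][:replication_factor]
    missing = replication_factor - len(keep)

    # Pick fresh names (hypothetical r1, r2, ... scheme) for new replicas.
    used = set(current)
    creates = []
    i = 1
    while len(creates) < missing:
        name = f"r{i}"
        if name not in used:
            creates.append(("create", name, size))
            used.add(name)
        i += 1

    # Anything not kept is surplus or at the wrong size.
    drops = [("drop", n, current[n]) for n in sorted(current) if n not in keep]
    return creates + drops


# ALTER CLUSTER foo SIZE 'large' on a two-replica medium cluster:
print(plan({"r1": "medium", "r2": "medium"}, "large", 2))
# ALTER CLUSTER foo REPLICATION FACTOR 0:
print(plan({"r1": "medium"}, "medium", 0))
```

Under this reading, resizing and changing the replication factor are the same operation: compute the difference between desired and actual replicas, add what is missing, then remove what is left over.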
benesch commented 10 months ago

We have two more specific issues that would require dynamic cluster scheduling: