[Epic] Automated cluster pause/resume

benesch commented 8 months ago

Product outcome

Users can cut costs by configuring their clusters to automatically pause and resume as appropriate for their workload.

Discovery

Background

Pausing/resuming a cluster will be particularly powerful when coupled with the new REFRESH modifier on materialized views. A view that is refreshed only daily needs only use compute resources once a day.

See also @chuck-alt-delete's take on why this feature is important: https://github.com/MaterializeInc/product/issues/215#issuecomment-1413024497.

Design

Here is the barest sketch of a design:

CREATE CLUSTER ... (
    PAUSE = {AUTO | AFTER <interval>},
    RESUME = {AUTO | ON SCHEDULE '<cron expression>' | EVERY <interval>},
    ...
)

The PAUSE operation specifies a policy for "pausing" the cluster (i.e., dropping all replicas of the cluster):

AFTER <interval> indicates that the cluster should be paused at <cluster replica creation time> + <interval>
AUTO indicates that the cluster should pause when it has no remaining obligations. A running SELECT or SUBSCRIBE statement is an obligation. An index, sink, or materialized view with REFRESH = ON COMMIT semantics creates a continual obligation at the next instant. A materialized view with a looser refresh policy creates an obligation at the next scheduled refresh.

The RESUME operation specifies a policy for "resuming" the cluster (i.e., recreating the replicas indicated by the replication factor):

AUTO indicates that the cluster should resume ahead of its next obligation. The exact semantics of "ahead" need to be further specified. Ideally, the cluster would resume far enough ahead of its next obligation that it would be fully hydrated at the time of the obligation.
ON SCHEDULE '<cron expression>' indicates that the cluster should be resumed according to the schedule specified by the cron expression.
EVERY <interval> indicates that the cluster should be resumed on the specified interval.

The PAUSE and RESUME options are only valid for use with managed clusters. While a replica is paused, its status is reported as paused in mz_cluster_replica_statuses.

There are many questions to sort out:

How do we record when a cluster's next obligation is?
How do we determine how long a cluster needs to rehydrate? Presumably by measuring hydration time—but where do we record that time?

Use cases

There are three known major use cases for automated pause/resume:

Reducing the duty cycle of the cluster hosting a REFRESH = EVERY ... materialized view. If a view is only refreshed daily, and it takes only 30m to hydrate the state for the view.
Expiring ephemeral resources. Imagine a cluster created by a developer for some ephemeral work, or a cluster automatically created by CI for a pull request. Specifying PAUSE = AFTER '6 hours' will ensure the cluster is spun down even if the developer forgets about it or if CI fails to execute cleanup successfully.
Reducing the cost of playgrounds/trials. The built-in mz_introspection cluster could be made to spin down automatically when no one is using the account, and could spin back up automatically the next time someone issues a query against it.

Work items

Pending further discovery work.

Decision log

3 March 2024. Split out MVP for REFRESH EVERY matviews (https://github.com/MaterializeInc/materialize/issues/25712).
11 November 2023. Issue created.

benesch commented 8 months ago

A thought from watching @parkerhendo's prototype of pausing/resuming with one button in the console (link): perhaps PAUSE and RESUME should be first-class verbs in the database!

Imagine:

ALTER CLUSTER ... (PAUSED);
ALTER CLUSTER ... (PAUSED = FALSE);

And that state played nice with any scheduled pause/resume—basically just lets you override the specified policy on a one-off basis.

benesch commented 8 months ago

One additional wrinkle: how does pause/resume interact with the max_credit_consumption_rate system limit? Users could easily find themselves in a situation where when they create a cluster schedule they have plenty of credits for each resumption, but as time goes on they use up the credits for other workloads. Should each resumption fail if it would exceed the max credit limit? Or should a cluster with scheduled pause/resume take a "credit hold" for the max credits it will ever need, even if it's not actively using those credits?

pH14 commented 8 months ago

This might have some interesting implications on our Kube capacity, as it allows for more bursty pod workloads than what we have today. I could imagine customers wanting to quickly crank through updating a REFRESH EVERY MV by using a larger-than-average cluster size just for the duration needed (in fact, already saw this floated in one customer discussion). It'd be a bummer for the customer if the cluster doesn't fit within our existing nodes, and then those periodic refreshes need to wait for nodes to spin up before beginning rehydration each time, or maybe we can't even get the capacity.

Could also imagine getting peaks around common business hours, which is something we'd want to prep for long-term.

Can't think of any issues we'd need to address before rolling out the v1, but we'll want to keep tabs on how this feature is being used to understand whether we need to tweak our orchestration going forward.

aalexandrov commented 8 months ago

It'd be a bummer for the customer if the cluster doesn't fit within our existing nodes, and then those periodic refreshes need to wait for nodes to spin up before beginning rehydration each time, or maybe we can't even get the capacity.

I think with the current RESUME syntax proposal we would be able to forecast the schedule with high accuracy, so you should be able to allocate nodes for the requested cluster sizes upfront.

ggevay commented 8 months ago

It'd be a bummer for the customer if the cluster doesn't fit within our existing nodes, and then those periodic refreshes need to wait for nodes to spin up before beginning rehydration each time, or maybe we can't even get the capacity.

Yes, this is very much a valid danger. In the meantime, we discussed this here and here. For now, we are not going to alter the planned semantics (e.g., in logical time we'll still refresh exactly at the moment specified by the user), but we'll make it clear to users that we can't offer any hard guarantees that the refreshes will actually happen on time. If a user needs hard guarantees, we'll advise to use a normal, always on mat view.

In the long run, there are various things we can do to increase the chances of successfully completing the refresh on time, e.g.,

allocate resources up front, especially when multiple users / mat views will need refreshes at the same time (as @aalexandrov noted above);
fall back to less-than-ideal (e.g., more expensive) instance types when we can't get our normal ones;
give control to power users over the tradeoff between cost and chances that the refresh will be completed on time.

hlburak commented 8 months ago

The ALTER CLUSTER syntax made me think about a scenario in which a cluster supporting materialized views with user-specified refresh intervals is not on AUTO scheduling or is and then is overridden by the user. Will we ignore obligations on clusters which Materialize does not schedule until they come online again? How will we handle "missed" batches of updates when that happens?

benesch commented 8 months ago

Will we ignore obligations on clusters which Materialize does not schedule until they come online again?

Yes.

How will we handle "missed" batches of updates when that happens?

The same way we handle "missed" updates due to AWS capacity problems: https://materializeinc.slack.com/archives/C063H5S7NKE/p1700138271783999?thread_ts=1699543250.405409&cid=C063H5S7NKE. I.e., when the replica comes back online, it fulfills all of its missed obligations.

ggevay commented 5 months ago

@benesch

For Q1, as a first version just for REFRESH MVs, I'm thinking about a simplified version of this epic:

Instead of adding new cluster options, we'd simply do auto replica management for a cluster if it has a REFRESH MV. We'd do this even if the cluster has other objects, i.e., we'd assume that any other objects are just in service of the REFRESH MVs. (For example, there can be several REFRESH MVs, in which case the user might want to have indexed views for common subexpressions of these MVs.)
The issue description says "While a replica is paused, its status is reported as paused in mz_cluster_replica_statuses", but I'm thinking to just add/remove a replica instead of creating new replica statuses. This would have the limitation that the user can't create more than 1 replicas (because we'd always create just 1 replica to perform a refresh), but this should be ok for now.

The design and implementation effort for this simplified version would be several times less than the full version in the issue description, so I'd say it's worth it to first do this even if the code is partly thrown away later when implementing the full version.

Later, users can move over to the full version by creating a new cluster with the explicit cluster options, and moving their objects to that cluster. Edit: Or the explicit cluster options could just override the simple version, i.e., if there are explicit cluster options then the simple version wouldn't touch that cluster. And then users wouldn't even need to move objects, but they could just add the explicit options, once available.

What do you think?

benesch commented 5 months ago

I'm all for simplifying the MVP as much as possible, but what I want to avoid is accumulating product/UX debt that we can't remove later. In particular, I'm afraid of doing anything automatically without the user explicitly declaring their intent by typing AUTO somewhere. Automatically managing a cluster when it gets a materialized view on it with REFRESH EVERY feels very surprising to me, and a decision that we can't really undo later because users will have come to rely on that behavior.

My counterproposal would be that we build out support for PAUSE = AUTO, RESUME = AUTO, as in:

CREATE CLUSTER c (PAUSE = AUTO, RESUME = AUTO);

And for the MVP we don't bother supporting any other combinations of PAUSE or RESUME behavior.

The issue description says "While a replica is paused, its status is reported as paused in mz_cluster_replica_statuses", but I'm thinking to just add/remove a replica instead of creating new replica statuses. This would have the limitation that the user can't create more than 1 replicas (because we'd always create just 1 replica to perform a refresh), but this should be ok for now.

I think this sounds reasonable. FWIW, I think supporting replication factors greater than 1 wouldn't be terribly hard. Just create n replicas when you need to, and drop all n when you're done. But also having multiple replicas maintaining cold data doesn't sound that useful, so I'm good with saying that you can only specify PAUSE = AUTO, RESUME = AUTO with REPLICATION FACTOR = 1. Seems backwards compatible to lift that restriction in the future.

The downside here is just UX. As a user looking through mz_cluster_replicas, how will I know that there will be replicas pending scheduling in the future for this cluster? Feels like something we can improve/refine over time, fo rsure.

For example, there can be several REFRESH MVs, in which case the user might want to have indexed views for common subexpressions of these MVs.

This is a very good point. The text in the issue description (largely my interpretation/transcription of a @frankmcsherry idea) contends that an index indicates user intent for fast random access to the data in the index, and therefore an obligation to keep the cluster online all the time. But you're totally right that users might create indexes to promote sharing of resources between different cold materialized views, without any intent or expectation of actually querying those indexes directly.

This feels like the most important point to shake out. Maybe it's worth an hour live conversation with you, me, and @frankmcsherry? (Plus whoever else would like to join from compute.)

The idea that an index must create permanent obligation was possibly just wrong. Maybe users should need to type CREATE INDEX ... WITH (SERVING OBLIGATION = {TRUE | FALSE}), and indexes only create an obligation if SERVING OBLIGATION is TRUE.

Anyway, tl;dr is I feel like there's a way to get to the simple implementation you have in mind while requiring users to be a bit more explicit about their desires, and that explicitness will ultimately allow us to build even more powerful forms of automatic cluster management in the future, without fighting with the MVP we want to build in this quarter.

ggevay commented 5 months ago

Anyway, tl;dr is I feel like there's a way to get to the simple implementation you have in mind while requiring users to be a bit more explicit about their desires, and that explicitness will ultimately allow us to build even more powerful forms of automatic cluster management in the future, without fighting with the MVP we want to build in this quarter.

Great, thank you @benesch! I've modified the work item in the Compute Q1 planning doc to say "Automated Cluster Scheduling for REFRESH MVs".

I'm afraid of doing anything automatically without the user explicitly declaring their intent by typing AUTO somewhere.

Ok, makes sense!

My counterproposal would be that we build out support for PAUSE = AUTO, RESUME = AUTO And for the MVP we don't bother supporting any other combinations of PAUSE or RESUME behavior.

PAUSE = AUTO, RESUME = AUTO might pose some non-trivial design questions that maybe we don't want to make a final decision on for now. One is about the indexes that are not for serving. Another random one is whether RESUME = AUTO would wake up the cluster if the user makes a SELECT to it while it's sleeping.

So, how about instead of PAUSE = AUTO, RESUME = AUTO, we'd have a special option value just for REFRESH MVs, e.g., PAUSE = REFRESH, RESUME = REFRESH (or maybe even just SCHEDULE = REFRESH)? This wouldn't look at any other objects besides the MVs with non-trivial REFRESH options. (I.e., wouldn't care about whether there are any running peeks or subscribes, or ON COMMIT MVs, or any indexes serving REFRESH MVs or on REFRESH MVs, ...) So the user would be essentially saying that she wants to use this cluster only to compute REFRESH MVs, and everything else on this cluster (e.g., indexes) are just in service of these MVs, but don't need to affect the scheduling.

so I'm good with saying that you can only specify PAUSE = AUTO, RESUME = AUTO with REPLICATION FACTOR = 1. The downside here is just UX. As a user looking through mz_cluster_replicas, how will I know that there will be replicas pending scheduling in the future for this cluster? Feels like something we can improve/refine over time, fo rsure.

Yes, we might later switch to replica pause/resume instead of adding/removing replicas. And in the meantime, I'll probably add mz_materialized_view_refresh_history (which would show the next refresh as "pending"), and then the user can see when something will be scheduled by looking at this internal table instead of mz_cluster_replicas.

benesch commented 5 months ago

PAUSE = AUTO, RESUME = AUTO might pose some non-trivial design questions that maybe we don't want to make a final decision on for now. One is about the indexes that are not for serving. Another random one is whether RESUME = AUTO would wake up the cluster if the user makes a SELECT to it while it's sleeping.

So, how about instead of PAUSE = AUTO, RESUME = AUTO, we'd have a special option value just for REFRESH MVs, e.g., PAUSE = REFRESH, RESUME = REFRESH (or maybe even just SCHEDULE = REFRESH)? This wouldn't look at any other objects besides the MVs with non-trivial REFRESH options. (I.e., wouldn't care about whether there are any running peeks or subscribes, or ON COMMIT MVs, or any indexes serving REFRESH MVs or on REFRESH MVs, ...) So the user would be essentially saying that she wants to use this cluster only to compute REFRESH MVs, and everything else on this cluster (e.g., indexes) are just in service of these MVs, but don't need to affect the scheduling.

I think SCHEDULE = REFRESH is good strawman syntax to get your initial PR(s) unblocked! It's succinct, obvious, and specific. Seems like a good way to indicate that a given cluster should be subject to the new autoscheduling logic that you're going to be iterating on.

That said, I'm still hopeful we might be able to sketch out enough of the long term design that we could use something more along the lines of PAUSE = AUTO, RESUME = AUTO as in the original proposal. I'd like to have the option to revisit the syntax at least once before we move to private preview with SCHEDULE = REFRESH.

One thought: perhaps we disallow running SELECTs on clusters that auto-resume for now. That way we could punt on the question of what auto-resuming means for a SELECT. Is there ever a reason that someone would need to run a SELECT on a cluster with REFRESH EVERY materialized views?

I sent a tentative calendar hold for Tuesday for you, me, and @frankmcsherry to discuss, and will follow up on Slack. In the meantime though, is using SCHEDULE = REFRESH as the working syntax enough to get you unblocked?

Yes, we might later switch to replica pause/resume instead of adding/removing replicas. And in the meantime, I'll probably add mz_materialized_view_refresh_history (which would show the next refresh as "pending"), and then the user can see when something will be scheduled by looking at this internal table instead of mz_cluster_replicas.

Unrelated to the above discussion, @morsapaes would like to lodge a request for adding mz_materialized_view_refresh_history (or its moral equivalent) sooner rather than later, to allow for better user visibility (possibly by way of new console features) into the refresh schedule.

ggevay commented 5 months ago

Linking the meeting notes from yesterday: https://www.notion.so/materialize/Compute-meeting-on-automatic-cluster-scheduling-ce353b8af52e449d8784241c4a1c0585

antiguru commented 5 months ago

(For completeness, linking my thoughts from way back on this subject: https://github.com/MaterializeInc/materialize/pull/20437.)

lfest commented 5 months ago

The consensus agreement is that Gabor will implement "Policy V1" outlined in the meeting notes referenced in the comment above.

benesch commented 4 months ago

The consensus agreement is that Gabor will implement "Policy V1" outlined in the meeting notes referenced in the comment above.

I've split out an epic for this (https://github.com/MaterializeInc/materialize/issues/25712), since the compute team is only planning to pick up this small piece of the work. That frees this more general purpose epic up to move to "Later" on the adapter team. cc @chaas

teskje commented 4 months ago

Is there ever a reason that someone would need to run a SELECT on a cluster with REFRESH EVERY materialized views?

Yes, querying introspection sources. If we disallow SELECTs on such clusters, debugging dataflow hydration would be limited to the information we send from the replicas to the controller (i.e. frontiers and operator hydration flags). Also the "cost-intensive objects" table in the Console would stop working.

benesch commented 4 months ago

Noting that once we introduce an mz_probe cluster (https://github.com/MaterializeInc/materialize/issues/25834#issue-2172989086), the mz_system cluster will only need to be live for a few minutes each hour, and would be a prime candidate for automatic pause/resume.

MaterializeInc / materialize