benesch opened 8 months ago
A thought from watching @parkerhendo's prototype of pausing/resuming with one button in the console (link): perhaps `PAUSE` and `RESUME` should be first-class verbs in the database!
Imagine:
```sql
ALTER CLUSTER ... (PAUSED);
ALTER CLUSTER ... (PAUSED = FALSE);
```
And that state would play nicely with any scheduled pause/resume: it would basically just let you override the specified policy on a one-off basis.
One additional wrinkle: how does pause/resume interact with the `max_credit_consumption_rate` system limit? Users could easily find themselves in a situation where, when they create a cluster schedule, they have plenty of credits for each resumption, but as time goes on they use up those credits on other workloads. Should each resumption fail if it would exceed the max credit limit? Or should a cluster with scheduled pause/resume take a "credit hold" for the max credits it will ever need, even if it's not actively using those credits?
This might have some interesting implications for our Kube capacity, as it allows for more bursty pod workloads than what we have today. I could imagine customers wanting to quickly crank through updating a `REFRESH EVERY` MV by using a larger-than-average cluster size just for the duration needed (in fact, this was already floated in one customer discussion). It'd be a bummer for the customer if the cluster doesn't fit within our existing nodes, and then those periodic refreshes need to wait for nodes to spin up before beginning rehydration each time, or maybe we can't even get the capacity.
Could also imagine getting peaks around common business hours, which is something we'd want to prep for long-term.
Can't think of any issues we'd need to address before rolling out the v1, but we'll want to keep tabs on how this feature is being used to understand whether we need to tweak our orchestration going forward.
> It'd be a bummer for the customer if the cluster doesn't fit within our existing nodes, and then those periodic refreshes need to wait for nodes to spin up before beginning rehydration each time, or maybe we can't even get the capacity.
I think with the current `RESUME` syntax proposal we would be able to forecast the schedule with high accuracy, so you should be able to allocate nodes for the requested cluster sizes upfront.
> It'd be a bummer for the customer if the cluster doesn't fit within our existing nodes, and then those periodic refreshes need to wait for nodes to spin up before beginning rehydration each time, or maybe we can't even get the capacity.
Yes, this is very much a valid danger. In the meantime, we discussed this here and here. For now, we are not going to alter the planned semantics (e.g., in logical time we'll still refresh exactly at the moment specified by the user), but we'll make it clear to users that we can't offer any hard guarantees that the refreshes will actually happen on time. If a user needs hard guarantees, we'll advise using a normal, always-on materialized view.
In the long run, there are various things we can do to increase the chances of successfully completing the refresh on time, e.g.,
The `ALTER CLUSTER` syntax made me think about a scenario in which a cluster supporting materialized views with user-specified refresh intervals is not on `AUTO` scheduling, or is and is then overridden by the user. Will we ignore obligations on clusters which Materialize does not schedule until they come online again? How will we handle "missed" batches of updates when that happens?
> Will we ignore obligations on clusters which Materialize does not schedule until they come online again?
Yes.
> How will we handle "missed" batches of updates when that happens?
The same way we handle "missed" updates due to AWS capacity problems: https://materializeinc.slack.com/archives/C063H5S7NKE/p1700138271783999?thread_ts=1699543250.405409&cid=C063H5S7NKE. I.e., when the replica comes back online, it fulfills all of its missed obligations.
@benesch
For Q1, as a first version just for REFRESH MVs, I'm thinking about a simplified version of this epic:
The design and implementation effort for this simplified version would be several times less than the full version in the issue description, so I'd say it's worth it to first do this even if the code is partly thrown away later when implementing the full version.
Later, users can move over to the full version by creating a new cluster with the explicit cluster options, and moving their objects to that cluster. Edit: Or the explicit cluster options could just override the simple version, i.e., if there are explicit cluster options then the simple version wouldn't touch that cluster. And then users wouldn't even need to move objects, but they could just add the explicit options, once available.
What do you think?
I'm all for simplifying the MVP as much as possible, but what I want to avoid is accumulating product/UX debt that we can't remove later. In particular, I'm afraid of doing anything automatically without the user explicitly declaring their intent by typing `AUTO` somewhere. Automatically managing a cluster when it gets a materialized view on it with `REFRESH EVERY` feels very surprising to me, and a decision that we can't really undo later because users will have come to rely on that behavior.
My counterproposal would be that we build out support for `PAUSE = AUTO, RESUME = AUTO`, as in:

```sql
CREATE CLUSTER c (PAUSE = AUTO, RESUME = AUTO);
```

And for the MVP we don't bother supporting any other combinations of `PAUSE` or `RESUME` behavior.
The issue description says "While a replica is paused, its status is reported as `paused` in `mz_cluster_replica_statuses`", but I'm thinking to just add/remove a replica instead of creating new replica statuses. This would have the limitation that the user can't create more than one replica (because we'd always create just one replica to perform a refresh), but this should be ok for now.
I think this sounds reasonable. FWIW, I think supporting replication factors greater than 1 wouldn't be terribly hard. Just create n replicas when you need to, and drop all n when you're done. But also, having multiple replicas maintaining cold data doesn't sound that useful, so I'm good with saying that you can only specify `PAUSE = AUTO, RESUME = AUTO` with `REPLICATION FACTOR = 1`. Seems backwards compatible to lift that restriction in the future.
The downside here is just UX. As a user looking through `mz_cluster_replicas`, how will I know that there will be replicas pending scheduling in the future for this cluster? Feels like something we can improve/refine over time, for sure.
For example, there can be several REFRESH MVs, in which case the user might want to have indexed views for common subexpressions of these MVs.
This is a very good point. The text in the issue description (largely my interpretation/transcription of a @frankmcsherry idea) contends that an index indicates user intent for fast random access to the data in the index, and therefore an obligation to keep the cluster online all the time. But you're totally right that users might create indexes to promote sharing of resources between different cold materialized views, without any intent or expectation of actually querying those indexes directly.
This feels like the most important point to shake out. Maybe it's worth an hour live conversation with you, me, and @frankmcsherry? (Plus whoever else would like to join from compute.)
The idea that an index must create a permanent obligation was possibly just wrong. Maybe users should need to type `CREATE INDEX ... WITH (SERVING OBLIGATION = {TRUE | FALSE})`, and indexes only create an obligation if `SERVING OBLIGATION` is `TRUE`.
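For concreteness, here's what that strawman might look like in use. This is purely hypothetical syntax from this discussion: the `SERVING OBLIGATION` option does not exist, and the index and view names are illustrative only.

```sql
-- Hypothetical syntax: an index created only to share a common
-- subexpression between several REFRESH MVs. With
-- SERVING OBLIGATION = FALSE, it would not obligate the cluster
-- to stay online for ad hoc queries against it.
CREATE INDEX shared_subexpr_idx
    ON common_subexpr_view (customer_id)
    WITH (SERVING OBLIGATION = FALSE);
```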
Anyway, tl;dr is I feel like there's a way to get to the simple implementation you have in mind while requiring users to be a bit more explicit about their desires, and that explicitness will ultimately allow us to build even more powerful forms of automatic cluster management in the future, without fighting with the MVP we want to build in this quarter.
> Anyway, tl;dr is I feel like there's a way to get to the simple implementation you have in mind while requiring users to be a bit more explicit about their desires, and that explicitness will ultimately allow us to build even more powerful forms of automatic cluster management in the future, without fighting with the MVP we want to build in this quarter.
Great, thank you @benesch! I've modified the work item in the Compute Q1 planning doc to say "Automated Cluster Scheduling for REFRESH MVs".
> I'm afraid of doing anything automatically without the user explicitly declaring their intent by typing AUTO somewhere.
Ok, makes sense!
> My counterproposal would be that we build out support for PAUSE = AUTO, RESUME = AUTO. And for the MVP we don't bother supporting any other combinations of PAUSE or RESUME behavior.
PAUSE = AUTO, RESUME = AUTO might pose some non-trivial design questions that maybe we don't want to make a final decision on for now. One is about the indexes that are not for serving. Another random one is whether RESUME = AUTO would wake up the cluster if the user makes a SELECT to it while it's sleeping.
So, how about instead of PAUSE = AUTO, RESUME = AUTO, we'd have a special option value just for REFRESH MVs, e.g., PAUSE = REFRESH, RESUME = REFRESH (or maybe even just SCHEDULE = REFRESH)? This wouldn't look at any other objects besides the MVs with non-trivial REFRESH options. (I.e., it wouldn't care about whether there are any running peeks or subscribes, or ON COMMIT MVs, or any indexes serving REFRESH MVs or on REFRESH MVs, ...) The user would essentially be saying that she wants to use this cluster only to compute REFRESH MVs, and that everything else on this cluster (e.g., indexes) is just in service of these MVs and doesn't need to affect the scheduling.
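As a concrete sketch of this proposal (the `SCHEDULE = REFRESH` option is only the strawman floated here and does not exist; the cluster, view, and table names are illustrative, while the `REFRESH EVERY` view option follows the existing syntax):

```sql
-- Hypothetical: a cluster used only to compute REFRESH MVs.
CREATE CLUSTER refresh_cluster (SIZE = '400cc', SCHEDULE = REFRESH);

-- The scheduler would resume the cluster for each refresh of this MV
-- and pause it in between; other objects on the cluster (e.g.,
-- supporting indexes) would not affect the schedule.
CREATE MATERIALIZED VIEW daily_rollup
    IN CLUSTER refresh_cluster
    WITH (REFRESH EVERY '1 day')
    AS SELECT date_trunc('day', ts) AS day, count(*) AS n
       FROM events
       GROUP BY 1;
```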
> so I'm good with saying that you can only specify PAUSE = AUTO, RESUME = AUTO with REPLICATION FACTOR = 1. The downside here is just UX. As a user looking through mz_cluster_replicas, how will I know that there will be replicas pending scheduling in the future for this cluster? Feels like something we can improve/refine over time, for sure.
Yes, we might later switch to replica pause/resume instead of adding/removing replicas. And in the meantime, I'll probably add `mz_materialized_view_refresh_history` (which would show the next refresh as "pending"), and then the user can see when something will be scheduled by looking at this internal table instead of `mz_cluster_replicas`.
> PAUSE = AUTO, RESUME = AUTO might pose some non-trivial design questions that maybe we don't want to make a final decision on for now. One is about the indexes that are not for serving. Another random one is whether RESUME = AUTO would wake up the cluster if the user makes a SELECT to it while it's sleeping.
>
> So, how about instead of PAUSE = AUTO, RESUME = AUTO, we'd have a special option value just for REFRESH MVs, e.g., PAUSE = REFRESH, RESUME = REFRESH (or maybe even just SCHEDULE = REFRESH)? This wouldn't look at any other objects besides the MVs with non-trivial REFRESH options. (I.e., wouldn't care about whether there are any running peeks or subscribes, or ON COMMIT MVs, or any indexes serving REFRESH MVs or on REFRESH MVs, ...) So the user would be essentially saying that she wants to use this cluster only to compute REFRESH MVs, and everything else on this cluster (e.g., indexes) are just in service of these MVs, but don't need to affect the scheduling.
I think `SCHEDULE = REFRESH` is good strawman syntax to get your initial PR(s) unblocked! It's succinct, obvious, and specific. Seems like a good way to indicate that a given cluster should be subject to the new autoscheduling logic that you're going to be iterating on.

That said, I'm still hopeful we might be able to sketch out enough of the long-term design that we could use something more along the lines of `PAUSE = AUTO, RESUME = AUTO` as in the original proposal. I'd like to have the option to revisit the syntax at least once before we move to private preview with `SCHEDULE = REFRESH`.
One thought: perhaps we disallow running `SELECT`s on clusters that auto-resume for now. That way we could punt on the question of what auto-resuming means for a `SELECT`. Is there ever a reason that someone would need to run a `SELECT` on a cluster with `REFRESH EVERY` materialized views?
I sent a tentative calendar hold for Tuesday for you, me, and @frankmcsherry to discuss, and will follow up on Slack. In the meantime though, is using `SCHEDULE = REFRESH` as the working syntax enough to get you unblocked?
> Yes, we might later switch to replica pause/resume instead of adding/removing replicas. And in the meantime, I'll probably add `mz_materialized_view_refresh_history` (which would show the next refresh as "pending"), and then the user can see when something will be scheduled by looking at this internal table instead of `mz_cluster_replicas`.
Unrelated to the above discussion, @morsapaes would like to lodge a request for adding `mz_materialized_view_refresh_history` (or its moral equivalent) sooner rather than later, to allow for better user visibility (possibly by way of new console features) into the refresh schedule.
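To illustrate the kind of visibility this would give, here is a hypothetical query against such a table. Neither the table nor its columns exist yet; the schema, column names, and status values below are guesses purely for illustration.

```sql
-- Hypothetical: list upcoming ("pending") refreshes, soonest first,
-- so a user (or the console) can see when clusters will be scheduled.
SELECT materialized_view_name, scheduled_at, status
FROM mz_internal.mz_materialized_view_refresh_history
WHERE status = 'pending'
ORDER BY scheduled_at;
```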
Linking the meeting notes from yesterday: https://www.notion.so/materialize/Compute-meeting-on-automatic-cluster-scheduling-ce353b8af52e449d8784241c4a1c0585
(For completeness, linking my thoughts from way back on this subject: https://github.com/MaterializeInc/materialize/pull/20437.)
The consensus agreement is that Gabor will implement "Policy V1" outlined in the meeting notes referenced in the comment above.
> The consensus agreement is that Gabor will implement "Policy V1" outlined in the meeting notes referenced in the comment above.
I've split out an epic for this (https://github.com/MaterializeInc/materialize/issues/25712), since the compute team is only planning to pick up this small piece of the work. That frees this more general purpose epic up to move to "Later" on the adapter team. cc @chaas
> Is there ever a reason that someone would need to run a SELECT on a cluster with REFRESH EVERY materialized views?
Yes, querying introspection sources. If we disallow SELECTs on such clusters, debugging dataflow hydration would be limited to the information we send from the replicas to the controller (i.e. frontiers and operator hydration flags). Also the "cost-intensive objects" table in the Console would stop working.
Noting that once we introduce an `mz_probe` cluster (https://github.com/MaterializeInc/materialize/issues/25834#issue-2172989086), the `mz_system` cluster will only need to be live for a few minutes each hour, and would be a prime candidate for automatic pause/resume.
Product outcome
Users can cut costs by configuring their clusters to automatically pause and resume as appropriate for their workload.
Discovery
Background
Pausing/resuming a cluster will be particularly powerful when coupled with the new `REFRESH` modifier on materialized views. A view that is refreshed only daily needs to use compute resources only once a day.

See also @chuck-alt-delete's take on why this feature is important: https://github.com/MaterializeInc/product/issues/215#issuecomment-1413024497.
Design
Here is the barest sketch of a design:

- The `PAUSE` operation specifies a policy for "pausing" the cluster (i.e., dropping all replicas of the cluster):
  - `AFTER <interval>` indicates that the cluster should be paused at `<cluster replica creation time> + <interval>`.
  - `AUTO` indicates that the cluster should pause when it has no remaining obligations. A running `SELECT` or `SUBSCRIBE` statement is an obligation. An index, sink, or materialized view with `REFRESH = ON COMMIT` semantics creates a continual obligation at the next instant. A materialized view with a looser refresh policy creates an obligation at the next scheduled refresh.
- The `RESUME` operation specifies a policy for "resuming" the cluster (i.e., recreating the replicas indicated by the replication factor):
  - `AUTO` indicates that the cluster should resume ahead of its next obligation. The exact semantics of "ahead" need to be further specified. Ideally, the cluster would resume far enough ahead of its next obligation that it would be fully hydrated at the time of the obligation.
  - `ON SCHEDULE '<cron expression>'` indicates that the cluster should be resumed according to the schedule specified by the cron expression.
  - `EVERY <interval>` indicates that the cluster should be resumed on the specified interval.
- The `PAUSE` and `RESUME` options are only valid for use with managed clusters.
- While a replica is paused, its status is reported as `paused` in `mz_cluster_replica_statuses`.
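For concreteness, the sketched policies might be spelled like this. All of this is hypothetical syntax from the sketch above; none of these options are implemented, and the cluster names and sizes are illustrative.

```sql
-- Pause a dev cluster six hours after its replicas are created.
CREATE CLUSTER dev (SIZE = '100cc', PAUSE = AFTER '6 hours');

-- Pause when there are no remaining obligations; resume ahead of the
-- next obligation (e.g., the next scheduled refresh).
CREATE CLUSTER nightly (SIZE = '400cc', PAUSE = AUTO, RESUME = AUTO);

-- Pause when idle; resume on weekday mornings via a cron schedule.
CREATE CLUSTER business_hours (
    SIZE = '400cc',
    PAUSE = AUTO,
    RESUME = ON SCHEDULE '0 8 * * 1-5'
);
```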
There are many questions to sort out:
Use cases
There are three known major use cases for automated pause/resume:

- Supporting a `REFRESH = EVERY ...` materialized view. If a view is only refreshed daily, and it takes only 30m to hydrate the state for the view, the cluster need only run for a small fraction of each day.
- Ephemeral development and CI clusters: `PAUSE = AFTER '6 hours'` will ensure the cluster is spun down even if the developer forgets about it or if CI fails to execute cleanup successfully.
- The `mz_introspection` cluster could be made to spin down automatically when no one is using the account, and could spin back up automatically the next time someone issues a query against it.

Work items
Pending further discovery work.
See also
Decision log
- Scoped the initial version to `REFRESH EVERY` matviews (https://github.com/MaterializeInc/materialize/issues/25712).