MaterializeInc / materialize

The data warehouse for operational workloads.
https://materialize.com

Dataflow resource limits #4257

Open cuongdo opened 3 years ago

cuongdo commented 3 years ago

One way to avoid OOMs is to let users specify resource limits for dataflows, which Materialize enforces. Memory limits might be expressed in records (we have record counts now) or bytes (which requires much more upfront work). Runtime limits, for one-off dataflows, might be expressed in wall-clock time or CPU time.
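For concreteness, here is a minimal sketch of what a per-dataflow limit specification and check might look like, in Rust with hypothetical names (max_records, max_wall_time, and the check function are illustrations, not anything that exists in Materialize today):

```rust
use std::time::{Duration, Instant};

// Hypothetical per-dataflow limits; none of these names exist in
// Materialize today. A record limit is feasible with the record
// counts we have now; a byte limit would need more upfront work.
#[derive(Default)]
struct DataflowLimits {
    max_records: Option<usize>,      // memory limit, expressed in records
    max_wall_time: Option<Duration>, // runtime limit for one-off dataflows
}

// What a violated limit would report back to the user.
enum LimitViolation {
    Records { used: usize, limit: usize },
    WallTime { elapsed: Duration, limit: Duration },
}

// Periodically invoked by the dataflow layer with current usage.
fn check(
    limits: &DataflowLimits,
    records_in_use: usize,
    started_at: Instant,
) -> Option<LimitViolation> {
    if let Some(limit) = limits.max_records {
        if records_in_use > limit {
            return Some(LimitViolation::Records { used: records_in_use, limit });
        }
    }
    if let Some(limit) = limits.max_wall_time {
        let elapsed = started_at.elapsed();
        if elapsed > limit {
            return Some(LimitViolation::WallTime { elapsed, limit });
        }
    }
    None
}
```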

This is a stub for a more detailed product spec.

@frankmcsherry: could you document your thoughts about what's needed to make progress on this?

@awang: we should consider this task, or tasks that contribute to this goal, for 0.6.

cc @benesch @umanwizard

Memory limits will depend on #815. Runtime limits will depend on #2392.

krishmanoh2 commented 3 years ago

If we could have a basic top-level parameter like max_memory_gb, it would be simpler to use.
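As an illustration of the suggestion (max_memory_gb is hypothetical, not an existing Materialize parameter), a sketch of a single process-wide budget in Rust:

```rust
// Hypothetical top-level knob; `max_memory_gb` is not a real
// Materialize parameter. The sketch just shows the shape: one
// process-wide budget that tracked usage is compared against.
struct Config {
    max_memory_gb: Option<u64>,
}

impl Config {
    fn memory_budget_bytes(&self) -> Option<u64> {
        self.max_memory_gb.map(|gb| gb * 1024 * 1024 * 1024)
    }
}

// Returns true once tracked usage exceeds the configured budget.
fn over_budget(cfg: &Config, bytes_in_use: u64) -> bool {
    cfg.memory_budget_bytes()
        .map_or(false, |budget| bytes_in_use > budget)
}
```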

umanwizard commented 3 years ago

@krishmanoh2 That would indeed be simple, but I'm not sure it would be useful. Most users will be running Materialize alone on its own box, and it will be okay if it uses all the memory available. At that point, it doesn't really matter whether Materialize crashes from exceeding the parameter or from the Linux OOM killer reaping it.

Per-dataflow policies seem more useful, as they would prevent one bad query from bringing down the system.

Either solution seems pretty difficult/involved to implement.

krishmanoh2 commented 3 years ago

In reality, we do not know what else will be running on the system. With this parameter set to a safe default, we can prevent OOMs, which goes toward first impressions: a stable system is perceived as more reliable. It also allows a user to size their environment correctly, which helps with cost estimation.

If this parameter were set, how would mz behave when the threshold is exceeded?

umanwizard commented 3 years ago

Can the user already limit the memory used by the process with something like ulimit?
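For reference, a minimal sketch of what ulimit -v amounts to, assuming Linux and the libc crate: capping the process's virtual address space so that large allocations fail instead of the whole box being OOM-killed.

```rust
// Roughly what `ulimit -v` does under the hood (Linux, `libc` crate
// assumed): cap the virtual address space of the process before it
// starts doing real work.
fn cap_virtual_memory(bytes: u64) -> std::io::Result<()> {
    let lim = libc::rlimit {
        rlim_cur: bytes, // soft limit
        rlim_max: bytes, // hard limit
    };
    // SAFETY: `lim` is a valid, initialized rlimit for this call.
    let rc = unsafe { libc::setrlimit(libc::RLIMIT_AS, &lim) };
    if rc == 0 {
        Ok(())
    } else {
        Err(std::io::Error::last_os_error())
    }
}
```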

krishmanoh2 commented 3 years ago

It could be done, though it is not preferred; in my experience with other products, depending on the user to set this variable themselves is not common.

Flink has it as a parameter as well - https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/memory/mem_setup.html

benesch commented 3 years ago

I think the broader concern here, which I imagine we all agree on, is that it should be simple and straightforward to configure dataflow resource limits. If the limits are manually configured one dataflow at a time, I can see how that would be a drag.

Regardless, I think this discussion is putting the cart before the horse. We don't presently understand how to do dataflow resource limiting at all. (Unless someone wants to pipe up!) I think we need, at minimum, the bones of an implementation before we get too deep into the design of specific configuration variables.

chaas commented 10 months ago

Bumping this: it came up again recently as a potentially valuable feature to avoid OOMing when running expensive queries. It would be a better experience to slow down a query and emit a warning that the resource limit was hit, just prior to, and instead of, OOMing. That way the user can kill the query and either size up their cluster or write a more efficient query. The notice is important, though, so a user knows that they hit a limit rather than just thinking Materialize is slow.
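A rough sketch of that shape, with hypothetical names (warn_fraction and the eprintln stand in for whatever threshold policy and notice channel Materialize would actually use):

```rust
// Hypothetical soft-limit check: warn once, just prior to the hard
// limit, instead of letting the process OOM. The user can then kill
// the query and size up the cluster or rewrite the query.
struct SoftLimit {
    hard_limit_bytes: u64,
    warn_fraction: f64, // e.g. 0.9 warns at 90% of the hard limit
    warned: bool,
}

impl SoftLimit {
    fn observe(&mut self, bytes_in_use: u64) {
        let threshold =
            (self.hard_limit_bytes as f64 * self.warn_fraction) as u64;
        if !self.warned && bytes_in_use >= threshold {
            self.warned = true;
            // Stand-in for a real notice sent back to the SQL client.
            eprintln!(
                "NOTICE: query is using {bytes_in_use} of {} budgeted \
                 bytes; kill it, size up the cluster, or rewrite it",
                self.hard_limit_bytes
            );
        }
    }
}
```

Actually slowing the dataflow down, rather than just warning, is the harder part and is deliberately left out of this sketch.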