MaterializeInc / materialize

The data warehouse for operational workloads.
https://materialize.com

Warn users about creating a view with a cross join #4176

Open wangandi opened 3 years ago

wangandi commented 3 years ago

Requested by a prospect.

If there exist potential pitfalls that a user could run into that would result in a lot of memory being used, but that may be difficult for us to fix any time soon, could we at least emit a warning at view creation time containing the workaround?

For example: in the case of #4171, a useful warning that we could emit would read as, WARNING: "bar"."a" (type "int4") has been implicitly cast to type "int8". Index(es) on "bar(a)" will not be used by this view. Either recreate "bar" with "a" as an "int8" column, or replace index(es) with index(es) on "bar(a::int8)".
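As a concrete sketch of that scenario (only `bar.a` and its types come from the issue; the table `foo` and its `int8` column `b` are assumed for illustration):

```sql
-- bar.a is int4 and has an index on it.
CREATE TABLE bar (a int4);
CREATE INDEX bar_a_idx ON bar (a);

-- Joining against an int8 column implicitly casts bar.a to int8,
-- so bar_a_idx cannot be used by the resulting dataflow.
CREATE MATERIALIZED VIEW v AS
    SELECT * FROM foo JOIN bar ON foo.b = bar.a;  -- foo.b is int8

-- Workaround from the proposed warning: index the casted expression,
-- or recreate bar with a as an int8 column.
CREATE INDEX bar_a_int8_idx ON bar (a::int8);
```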

Some kind of SQL linter should be able to detect problematic queries and generate an appropriate warning message.

wangandi commented 3 years ago

Personal thoughts: Given that milestone 0.6 includes adding guardrails to constrain the amount of damage a poorly written query can do, what do people think about going one step further with the warning and not materializing the view immediately if we think it's bad?

My rough idea is:

- CREATE VIEW bad <bad_query>: no change from prior behavior besides emitting the warning.
- CREATE MATERIALIZED VIEW bad <bad_query>: create the view, but don't materialize it, and emit a message: View was created but not materialized due to <warning>. If you are certain that you want to materialize this view, run "CREATE INDEX on bad".
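Under this proposal, an interactive session might look something like the following (the NOTICE wording is only a sketch):

```sql
CREATE MATERIALIZED VIEW bad AS SELECT * FROM t1, t2;
-- NOTICE: view was created but not materialized due to <warning>.
-- If you are certain that you want to materialize this view, run:

CREATE INDEX ON bad;  -- explicitly opts in to materializing the view
```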

Advantages I perceive of doing the above, in addition to enabling #2392:

benesch commented 3 years ago

The idea of issuing warnings/notices about bad plans seems great to me!

I don't like the idea of attempting to prevent bad views, though—I think it sets a bar for ourselves that will be impossible to meet. It is impossible in general to prohibit all "bad" plans. Not only is the meaning of "bad" extremely subjective, but the space of plans is just far too complicated to ever be sure that we've covered a meaningful subset of the bad plans we might generate.

There are only two bulletproof solutions to avoiding OOMs that I see. The first is to test all queries on a staging server first, so that you can estimate their performance. The second is to build dataflow resource limits within Materialize, so that we can kill/halt dataflows that exceed their limits.

I fear that if we build some heuristics to reject some classes of "bad" plans, users will become too reliant on being able to chuck whatever they want at production Materialize instances without reviewing the plans or testing on a staging server first, and will be very upset when they stumble across a bad plan that we don't yet detect.

benesch commented 3 years ago

Also, it seems to me that CREATE VIEW bad ...; SELECT * FROM bad is just as likely to be dangerous as CREATE MATERIALIZED VIEW bad?

wangandi commented 3 years ago

Yes, CREATE VIEW bad ...; SELECT * FROM bad would be just as dangerous, but my thought is that hopefully the user would see the warning and think about it before running the SELECT * FROM bad. Also, I'm hoping it is generally known that CREATE VIEW anything_non_trivial; SELECT * FROM anything_non_trivial is bad.

cuongdo commented 3 years ago

Putting all queries on a staging server is good practice, but it puts all the onus on the user. It's something we should recommend, but I believe that we should do more to give the user a chance to prevent an errant view or query from bringing down their Materialize instance.

Are dataflow resource limits possible in the 0.6 timeframe (~6-8 weeks from now)?

Also, I think we should further consider adding heuristics to reject bad plans. A linter for plans won't be perfect, but it will reduce the chances of user error. Similarly, linters for various programming languages don't catch everything, but they help prevent classes of errors from occurring.

benesch commented 3 years ago

Right, but any effort spent on writing a linter is effort that could instead be spent improving the optimizer. I think there are some very targeted cases that are quick enough wins to be worth it (like cross joins), but beyond that I’m not so sure.

As I mentioned above I feel pretty strongly that such a linter needs to be advisory only. If a bad plan generates a hard error via some syntaxes and not others, I think that’s going to be confusing to users and hard for automated tools to cope with.


damienhollis commented 3 years ago

From a user perspective, I think a warning is OK, but you need to be careful not to give end users the impression that you are able to detect all situations where there will be high memory use; otherwise they may assume the absence of a warning means the view won't use a lot of memory.

I would prefer if materialize was able to guard against using too much memory. I agree that using a staging environment is good practice if we are building the views. However, I can imagine a situation where we might automatically generate materialized views in production based on summary data wanted by an end user. In such a case, we'd really want to be able to just create the view in production without concern of crashing materialize. That being said, it is unclear what we'd do if materialize didn't create the view because of a lack of memory. I guess ideally (and I realise this is a completely different feature) materialize would overflow to disk (like a database does) to handle the extra memory requirements.

frankmcsherry commented 3 years ago

There are only two bulletproof solutions to avoiding OOMs that I see.

Another one that seems not-bad to me is to first perform the diagnostic queries that would inform you about the expected resources. For example, before performing a join, you can count the number of records with each key, and multiply the results. Before performing a group by, you could count-distinct the keys themselves (though, this could OOM similarly).
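That kind of diagnostic can be phrased as an ordinary query. A sketch for an equi-join on a key `k` (table and column names assumed): counting records per key on each side, multiplying, and summing gives exactly the number of rows the join would produce.

```sql
-- Per-key counts on each side; sum of products over matching keys
-- is the exact output cardinality of the equi-join on k.
SELECT sum(l.cnt * r.cnt) AS estimated_join_rows
FROM (SELECT k, count(*) AS cnt FROM left_input GROUP BY k) AS l
JOIN (SELECT k, count(*) AS cnt FROM right_input GROUP BY k) AS r
  ON l.k = r.k;
```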

I think only the "harden dataflows" option is bulletproof. Anything planning-based will have to deal with the fact that data change. As inputs grow over time, even for a fixed dataflow, memory use will probably climb (as businesses get more customers, more products, do more things).

At the same time, I don't think this is a new problem. RDBMSs have had to deal with similar issues before. It starts as a tool, not an autonomous data management infrastructure. We should file down the pointy edges, but it doesn't need to start with self-piloting.

frankmcsherry commented 3 years ago

CREATE MATERIALIZED VIEW bad <bad_query>: create the view, but don't materialize it.

I kind of like this, actually. At least, it's super easy to type CREATE MATERIALIZED VIEW without thinking much (by definition, you are typing it before telling the system what the view is). CREATE INDEX is certainly an escape hatch. I'm personally more a fan of "stage queries first; load data in progressively", but if we had a good query linter that could point out e.g. cross joins this seems like a good place to warn folks.

cuongdo commented 3 years ago

I think @benesch's idea is a solid one:

The second is to build dataflow resource limits within Materialize, so that we can kill/halt dataflows that exceed their limits.

I'm not clear on the feasibility of this, but I'll file a separate issue if one doesn't already exist.

frankmcsherry commented 3 years ago

I think @benesch's idea is a solid one:

Can you clarify which one?

frankmcsherry commented 3 years ago

The issue for shutting down dataflows is https://github.com/MaterializeInc/materialize/issues/400

benesch commented 3 years ago

I'm not sure I said this explicitly before, but I have a lot of concerns about any design whereby CREATE MATERIALIZED VIEW does not actually create a materialized view. It will be exceptionally confusing for CREATE MATERIALIZED VIEW to succeed but create a non-materialized view. It might work OK in the specific case of interactive use via psql, but automated tooling doesn't typically check notices/warnings, nor are the messages structured enough to be actionable for a machine. Ditto for non-psql shells—I'll bet mzcli doesn't handle warnings/notices correctly.

I think the best thing to do is to just issue an advisory notice/warning and hope the user sees it.

If we feel we must add more of a guardrail, I think the one way we can build this into interactive use in Materialize is to have a session variable like safe_mode that defaults to on for interactive sessions and off for non interactive sessions. (How we determine whether a session is interactive is a bit of an open question, but maybe psql exposes something, or we can sniff the application_name, or something...) If safe_mode is on, we can outright reject CREATE MATERIALIZED VIEW <cross join expr> with instructions on how to type CREATE VIEW ...; CREATE DEFAULT INDEX ON .... This solves the surprise issue (CREATE MATERIALIZED VIEW would not sometimes silently create a non-materialized view), but I think this has the potential to be annoying as often as it is helpful.
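A sketch of how that might look interactively (safe_mode and the error wording are hypothetical; none of this is implemented):

```sql
SET safe_mode = on;  -- hypothetical; would default to on for interactive sessions

CREATE MATERIALIZED VIEW bad AS SELECT * FROM t1, t2;
-- ERROR: refusing to materialize a plan containing a cross join
-- HINT: if you are certain, run:
--   CREATE VIEW bad AS SELECT * FROM t1, t2;
--   CREATE DEFAULT INDEX ON bad;
```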

cuongdo commented 3 years ago

I've narrowed the scope of this issue to just warning about cross joins. If there are other warnings we'd like to introduce, let's file separate issues for those. Also assigned to 0.6. cc @awang

maddyblue commented 3 years ago

Unassigning the milestones until we figure out what to do with this class of issue (there's a bunch).