frankmcsherry / blog

Some notes on things I find interesting and important.
1.96k stars 178 forks source link

arrangements fixes #47

Closed maddyblue closed 4 years ago

maddyblue commented 4 years ago

Is there a list of all expressions and if they are stateless or not? this might make it easier for people looking at explain plans to figure out what could create arrangements.

Arrangements are indexed by some key. do those keys point to a full copy of the original row, some subsets of its rows, or a reference to a single copy of the value? for example, assume i had some table with an INT key and some very large string that only ever appears in a projection, is that large string copied around to each arrangement in which the int key is present, or is only a reference of it or the row its in copied around? the blog post says a few times memory is related to number of keys, but skips this detail.

Is it possible to tell the optimizer to prefer plans that reduce memory or reduce cpu? do such plans exist or even make sense? Under "inspect query plans" it says you may need to "reframe your query". Is that a thing an optimizer could help with?

Is Reduce produced mostly/only by GROUP BY?

Is TopK produced only by ORDER BY LIMIT?

Under Threshold exprs: this is the only use of the word accumulations and it is undefined, what does accumulations mean? SQL example?

Under "Inspect query plans": first use of "stateful expression". are these the same as arrangements? The asks for the user here ("see if you can recognize ...", "determine if you believe ...") require expertise and experience that may be unachievable to most users. Might be better to reframe those requests and/or add some specific bad things to look for?

There's a lot of good, specific suggestions for ways users can change their queries that mz cannot automatically do because they might be unsafe. Is it worth automating this detection so users can dump their query into a thing and it says "hey, if your data has property X then you should change to ". For example it could have a rule that looks for UNION and tells the user "if your data doesn't require dedups, change this to UNION ALL". This could be some new flag to EXPLAIN.

frankmcsherry commented 4 years ago

These changes look great. Here are some question answers, though the meta-point that they remain questions the reader might have still looms.

Is there a list of all expressions and if they are stateless or not? this might make it easier for people looking at explain plans to figure out what could create arrangements.

In the code, is the best statement I can make. They change a bit, and there are proposals to add some new ones. I'm not sure committing to them here helps vs ensuring that each stateful operator does show up on the list (which I think is true).

Arrangements are indexed by some key. do those keys point to a full copy of the original row, some subsets of its rows, or a reference to a single copy of the value?

Arrangements do not know about "original rows". They contain the data that you present at them, which can be both good and bad. For example, if you project out the nasty string, it never shows up in an arrangement. That is good. If you don't project out the string and build five indexes, we'll have five copies of the string (v one).

There is some blog text about how to use DD to do string interning, and generally "late materialization" is a thing one can do for joins, trading away query complexity (adding in values later requires more operators) for reduced resources overall.

Is it possible to tell the optimizer to prefer plans that reduce memory or reduce cpu? do such plans exist or even make sense? Under "inspect query plans" it says you may need to "reframe your query". Is that a thing an optimizer could help with?

Not presently, yes, and yes. The optimizer currently applies transformation to reduce worst-case memory footprint, and does not apply transformation that might increase this. It has space to become substantially smarter in the future.

Is Reduce produced mostly/only by GROUP BY?

I believe this is correct. A degenerate form of Reduce comes from Distinct which is introduced in correlated subqueries, but the easiest way to get a Reduce is GROUP BY.

Is TopK produced only by ORDER BY LIMIT?

I don't know of a way that it shows up otherwise (the correlated subquery case still requires an ORDER BY LIMIT in the subquery).

Under Threshold exprs: this is the only use of the word accumulations and it is undefined, what does accumulations mean? SQL example?

Example added (informally, if you tried to implement EXCEPT ALL by subtracting the second collection from the first, things might go negative. Threshold corrects that).

Under "Inspect query plans": first use of "stateful expression". are these the same as arrangements?

Meant to be the complement of "stateless" but will clarify!

The asks for the user here ("see if you can recognize ...", "determine if you believe ...") require expertise and experience that may be unachievable to most users. Might be better to reframe those requests and/or add some specific bad things to look for?

I don't think we are there yet, unfortunately. I agree that the final form should look more like that.

Is it worth automating this detection so users can dump their query into a thing and it says "hey, if your data has property X then you should change to ". For example it could have a rule that looks for UNION and tells the user "if your data doesn't require dedups, change this to UNION ALL". This could be some new flag to EXPLAIN.

This sounds good, but we'll have to see whether we have enough automated knowledge for this. For example, the UNION to UNION ALL thing applies equally well to normal SQL queries; what tools / idioms would users expect there to help them improve their performance?