Open jkosh44 opened 3 years ago
Addressing some things based on my own observations, I agree with the refactor of `LIMIT` as a property, since it has much the same behavior as an `ORDER BY` (an additional node in the output, but one which can also be propagated down to lower nodes). To my knowledge, the only property initially envisioned was a sort property, so at the moment I don't think there are any new properties that could not be coupled to a particular node or otherwise pushed down. `LIMIT`s in particular should be able to be pushed down to the child `Get`, though I'm not sure how nested queries break this, as it seems they currently do.
Our current rules for `PropertySort` should not break if a `PropertyLimit` is added, because properties of a specific type are accessed with the `GetPropertyOfType` function in the `PropertySet` class. If this is not done anywhere (i.e. the code just indexes into the property set), it should be replaced with that function. However, this information will be more useful with the idea of an 'optional property': one that might be satisfied by a child node, and that will no longer require the output of a new Limit or OrderBy node where the property existed.
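To illustrate why lookup by type is more robust than indexing, here is a minimal C++ sketch. The classes are simplified stand-ins that only borrow the names `PropertySet`, `PropertySort`, `PropertyLimit`, and `GetPropertyOfType` from this thread; `PropertyType`, `Type()`, `AddProperty`, `GetLimit`, and all signatures are assumptions for illustration, not the project's actual API.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical property kinds; the real optimizer may define these differently.
enum class PropertyType { SORT, LIMIT };

class Property {
 public:
  virtual ~Property() = default;
  virtual PropertyType Type() const = 0;
};

class PropertySort : public Property {
 public:
  PropertyType Type() const override { return PropertyType::SORT; }
};

class PropertyLimit : public Property {
 public:
  explicit PropertyLimit(int limit) : limit_(limit) {}
  PropertyType Type() const override { return PropertyType::LIMIT; }
  int GetLimit() const { return limit_; }

 private:
  int limit_;
};

class PropertySet {
 public:
  void AddProperty(std::unique_ptr<Property> property) {
    // Assumed invariant: at most one property of each type per set.
    assert(GetPropertyOfType(property->Type()) == nullptr);
    properties_.push_back(std::move(property));
  }

  // Returns the property of the requested type, or nullptr if the set has none.
  // Callers do not depend on the order in which properties were added.
  const Property *GetPropertyOfType(PropertyType type) const {
    for (const auto &property : properties_) {
      if (property->Type() == type) return property.get();
    }
    return nullptr;
  }

 private:
  std::vector<std::unique_ptr<Property>> properties_;
};
```

With this shape, a rule that asks for `PropertyType::SORT` keeps working whether or not a limit property was added first, whereas code that reads the first element of the set would silently pick up the wrong property.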
A bunch of property-related ideas are listed on #1421. If it's ok with you, I'd like to add a link to this issue in that checklist.
@thepinetree Feel free to link this issue.
I guess I don't fully understand the difference between when we create a `PropertySort` here: https://github.com/cmu-db/noisepage/blob/520e0da7437d9e970e0fcb144a2f9939c35ba560/src/traffic_cop/traffic_cop_util.cpp#L36-L63 and here: https://github.com/cmu-db/noisepage/blob/520e0da7437d9e970e0fcb144a2f9939c35ba560/src/optimizer/child_property_deriver.cpp#L92-L103. I had sort of assumed that they were similar ways of creating a property for the query in different places, but I guess that they are used for different reasons?
For the outer query a `PropertySort` will be created in both places, but for nested queries a `PropertySort` will only be created in the second block of code, which might be the cause of this error.
Ignoring the following method (because I don't really understand it), https://github.com/cmu-db/noisepage/blob/520e0da7437d9e970e0fcb144a2f9939c35ba560/src/optimizer/child_property_deriver.cpp#L92-L103, we only derive properties for the outer query by checking if it has an `ORDER BY` clause: https://github.com/cmu-db/noisepage/blob/520e0da7437d9e970e0fcb144a2f9939c35ba560/src/traffic_cop/traffic_cop_util.cpp#L36-L63
If we want to support a `PropertyLimit`, or support `PropertySort` for nested queries, or add another property that applies to nested queries, I think that we'll need a new way of initially deriving and storing properties: one that can derive properties for the outer query and all inner queries and then store those properties separately somehow. I'm not sure what the best way to do this is, though. Maybe deriving the properties during the binder, because we already traverse through all outer and nested queries. Maybe visiting all the statement nodes an additional time to get all the properties.
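As a rough sketch of the "visit every query and store its properties separately" idea, the traversal could look something like the following. Everything here is hypothetical: `SelectStatement`, `QueryProperties`, `DeriveProperties`, and the per-query `id` are made-up simplifications for illustration and do not correspond to the project's parser or optimizer classes. The same recursion could in principle run inside the binder or as a separate pass.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <vector>

// Hypothetical, simplified statement node; not the project's parser classes.
struct SelectStatement {
  int id = 0;                 // assumed to be unique per (sub)query
  bool has_order_by = false;  // true if this query block has an ORDER BY
  int limit = -1;             // -1 means "no LIMIT clause"
  std::vector<std::unique_ptr<SelectStatement>> subqueries;
};

// Properties recorded for one query block.
struct QueryProperties {
  bool sort = false;
  int limit = -1;
};

// Visits the outer query and every nested query once, storing the
// properties of each query block separately, keyed by its id.
void DeriveProperties(const SelectStatement &stmt,
                      std::map<int, QueryProperties> *out) {
  assert(out != nullptr);
  QueryProperties props;
  props.sort = stmt.has_order_by;
  props.limit = stmt.limit;
  (*out)[stmt.id] = props;
  for (const auto &subquery : stmt.subqueries) {
    DeriveProperties(*subquery, out);
  }
}
```

Keying the result by query block is only one option; the point of the sketch is that a nested `ORDER BY` or `LIMIT` ends up recorded next to the query it belongs to, instead of only the outermost query being inspected.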
https://github.com/jkosh44/noisepage/commit/fefcc60c4646a66019e31df16974b46405faa024 fixes this issue, though it's probably still worth considering how to propagate properties down into nested queries.
The change should wait until after #1422 is merged or be integrated into #1422. It's possible depending on the final state of #1422 that this change is no longer necessary (@thepinetree tagging for visibility).
Bug Report
Summary
According to the SQL standard, `ORDER BY` is not allowed in nested queries (I can't actually find the SQL Standard to confirm this, but I found a handful of sites that have said this. The best source I can find is this MariaDB post: https://mariadb.com/kb/en/why-is-order-by-in-a-from-subquery-ignored/). Different systems handle this in different ways. From some brief research, these seem to be the possible options when we encounter an `ORDER BY` clause in a nested query:
- Reject the query, since `ORDER BY`s aren't allowed in nested queries
- Ignore the `ORDER BY`
- Execute the `ORDER BY`
- Execute the `ORDER BY` only if it's accompanied by a `LIMIT` clause.
An `ORDER BY` in a nested query would not change the result of the query unless it's accompanied by a `LIMIT`.
Our system will ignore an `ORDER BY` clause unless it is in the outermost query: https://github.com/cmu-db/noisepage/blob/520e0da7437d9e970e0fcb144a2f9939c35ba560/src/traffic_cop/traffic_cop_util.cpp#L36-L63 or unless it is accompanied by a `LIMIT` clause: https://github.com/cmu-db/noisepage/blob/520e0da7437d9e970e0fcb144a2f9939c35ba560/src/optimizer/child_property_deriver.cpp#L92-L103
When we actually encounter both an `ORDER BY` and a `LIMIT` in a nested query, the database crashes, specifically with a segfault on this line: https://github.com/cmu-db/noisepage/blob/520e0da7437d9e970e0fcb144a2f9939c35ba560/src/include/optimizer/group_expression.h#L111
`lowest_cost_table_` does not contain `requirements`.
Environment
OS: Ubuntu (LTS) 20.04
Compiler: GCC 7.0+
CMake Profile: `Debug`
Jenkins/CI: N/A
Steps to Reproduce
Run a query whose nested query contains both an `ORDER BY` and a `LIMIT`.
Expected Behavior
Actual Behavior
Discussion
ORDER BY Effects
I just briefly wanted to present some examples of when an `ORDER BY` could influence the result IF it was executed.
For all of the following queries assume the following two tables have been created and filled with random data.
Here the `ORDER BY` makes no difference. The result of the nested query is part of an `IN` predicate and therefore treated as an unordered list, so it doesn't matter how it's ordered.
Here the `ORDER BY` does make a difference. We only include the top two values of `a` in the right side of the `IN` predicate.
This is probably a stupid query, but the `ORDER BY` does make a difference. The result of this query should be the same as `SELECT a FROM foo ORDER BY a LIMIT 2;`, however if we ignore the `ORDER BY` this is not necessarily the case. I actually tested this on postgres and it seems like the `ORDER BY` is executed.
    postgres=# SELECT * FROM (SELECT a FROM foo ORDER BY a) AS f LIMIT 2;
     a
     1
     10
    (2 rows)
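To make the last example concrete, here is a small C++ model of `SELECT * FROM (SELECT a FROM foo ORDER BY a) AS f LIMIT 2;` with the inner `ORDER BY` either executed or ignored. This is only a sketch of the semantics, not optimizer code: `SubqueryThenLimit` is a made-up helper, the rows in `Example` are invented, and "ignored" is modeled as returning rows in storage order, which is just one of the orders a real system might produce.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Models SELECT * FROM (SELECT a FROM foo [ORDER BY a]) AS f LIMIT n.
// `rows` holds foo.a in storage order; `honor_order_by` says whether the
// inner ORDER BY is actually executed before the outer LIMIT is applied.
std::vector<int> SubqueryThenLimit(std::vector<int> rows, bool honor_order_by,
                                   std::size_t limit) {
  if (honor_order_by) std::sort(rows.begin(), rows.end());
  if (rows.size() > limit) rows.resize(limit);
  return rows;
}

// Usage sketch with made-up rows.
void Example() {
  const std::vector<int> foo = {42, 10, 1, 17};
  // Executing the ORDER BY yields the two smallest values.
  assert((SubqueryThenLimit(foo, true, 2) == std::vector<int>{1, 10}));
  // Ignoring it yields whatever two rows come first.
  assert((SubqueryThenLimit(foo, false, 2) == std::vector<int>{42, 10}));
}
```

The two results differ for these rows, which is why ignoring the nested `ORDER BY` is not guaranteed to match `SELECT a FROM foo ORDER BY a LIMIT 2;`.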