cmu-db / noisepage

Self-Driving Database Management System from Carnegie Mellon University
https://noise.page
MIT License
1.75k stars 502 forks source link

Nested ORDER BY and LIMIT crashes the DB #1423

Open jkosh44 opened 3 years ago

jkosh44 commented 3 years ago

Bug Report

Summary

According to the SQL standard ORDER BY is not allowed in nested queries (I can't actually find the SQL Standard to confirm this, but I found a handful of sites that have said this. The best source I can find is this MariaDB post: https://mariadb.com/kb/en/why-is-order-by-in-a-from-subquery-ignored/). Different systems handle this in different ways. From some brief research, these seem to be the possible options when we encounter an ORDER BY clause in a nested query:

  1. Respond back with an error saying that nested ORDER BYs aren't allowed in nested queries
  2. Ignore the nested ORDER BY
  3. Execute the nested ORDER BY
  4. Only execute the nested ORDER BY if it's accompanied by a LIMIT clause.
    • The rationale behind option 4 (I think) is that often, but not always, an ORDER BY in a nested query would not change the result of the query unless it's accompanied by a LIMIT.

Our system will ignore an ORDER BY clause unless it is in the outermost query: https://github.com/cmu-db/noisepage/blob/520e0da7437d9e970e0fcb144a2f9939c35ba560/src/traffic_cop/traffic_cop_util.cpp#L36-L63 or unless it is accompanied by a LIMIT clause: https://github.com/cmu-db/noisepage/blob/520e0da7437d9e970e0fcb144a2f9939c35ba560/src/optimizer/child_property_deriver.cpp#L92-L103

When we actually encounter both an ORDER BY and a LIMIT in a nested query then the database crashes. Specifically with a segfault on this line: https://github.com/cmu-db/noisepage/blob/520e0da7437d9e970e0fcb144a2f9939c35ba560/src/include/optimizer/group_expression.h#L111 lowest_cost_table_ does not contain requirements.

Environment

OS: Ubuntu (LTS) 20.04

Compiler: GCC 7.0+

CMake Profile: Debug

Jenkins/CI: N/A

Steps to Reproduce

  1. Compile with the following args: -DCMAKE_BUILD_TYPE=Debug -DNOISEPAGE_USE_ASAN=ON
  2. Run noisepage with parallel execution turned off noisepage -parallel_execution=false
  3. Execute a nested query with an ORDER BY and a LIMIT

Expected Behavior

postgres=# CREATE TABLE foo (a int);
CREATE TABLE
postgres=# CREATE TABLE bar (a int);
CREATE TABLE
postgres=# SELECT a FROM foo WHERE a IN (SELECT a FROM bar ORDER BY a LIMIT 1);
 a 
---
(0 rows)

Actual Behavior

noisepage=# CREATE TABLE foo (a int);
CREATE TABLE
noisepage=# CREATE TABLE bar (a int);
CREATE TABLE
noisepage=# SELECT a FROM foo WHERE a IN (SELECT a FROM bar ORDER BY a LIMIT 1);
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

Discussion

ORDER BY Effects

I just briefly wanted to present some examples of when an ORDER BY could influence the result IF it was executed

For all of the following queries assume the following two tables have been created and filled with random data

CREATE TABLE foo (a int);
CREATE TABLE bar (a int);
  1. SELECT * FROM foo WHERE a IN (SELECT a FROM bar ORDER BY a);

    Here the ORDER BY makes no difference. The result of the nested query is part of an IN predicate and therefore treated as an unordered list, so it doesn't matter how it's ordered.

  2. SELECT * FROM foo WHERE a IN (SELECT a FROM bar ORDER BY a LIMIT 2);

    Here the ORDER BY does make a difference. We only include the top two values of a in the right side of the IN predicate.

  3. SELECT * FROM (SELECT a FROM foo ORDER BY a) AS f LIMIT 2;

    This is probably a stupid query, but the ORDER BY does make a difference. The result of this query should be the same as SELECT a FROM foo ORDER BY a LIMIT 2;, however if we ignore the ORDER BY this is not necessarily the case. I actually tested this on postgres and it seems like the ORDER BY is executed.

    
    postgres=# CREATE TABLE foo (a int);
    CREATE TABLE
    postgres=# INSERT INTO foo VALUES(10);
    INSERT 0 1
    postgres=# INSERT INTO foo VALUES(15);
    INSERT 0 1
    postgres=# INSERT INTO foo VALUES(1);
    INSERT 0 1
    postgres=# SELECT a FROM foo;
    a  
    ----
    10
    15
    1
    (3 rows)

postgres=# SELECT * FROM (SELECT a FROM foo ORDER BY a) AS f LIMIT 2; a

1 10 (2 rows)



### Properties
`ORDER BY` clauses are implemented with properties in the optimizer. That means that while there is an `OrderByPlanNode` for physical plans (https://github.com/cmu-db/noisepage/blob/master/src/include/planner/plannodes/limit_plan_node.h), there is no LogicalOrderBy operator in the logical operator trees. Instead we have a `PropertySort` property (https://github.com/cmu-db/noisepage/blob/master/src/include/optimizer/properties.h).

Currently `PropertySort` is the only property in the optimizer and the implementation for deriving this property (linked above in this issue) is very specific to `ORDER BY` and very tightly coupled to the `LogicalLimit` operator. Understanding this will probably be important to fixing this issue. Also I have the following concerns with this approach (other than this bug that it's causing):
- I found it pretty confusing and it took me awhile to figure out how and why `PropertySort` was so tightly coupled with `LIMIT`. I think a comment or some documentation would go a long way here.
- This may make adding additional properties more difficult. 
  - If every property has it's own unique place and method of being derived then it can be very difficult to maintain and understand. 
  - It's possible that not every property will be so tightly coupled to some other node like `ORDER BY` is. That means that the approach we use in deriving `PropertySort`, where we save the information in some other node, MIGHT not be possible. I don't really know enough about properties though to know if this is true.
  - In #1414 we want to convert `LIMIT` into a property. This may break our current derivation of `PropertySort`.

I don't know enough about the optimizer to know if these are valid concerns, but thought they warranted some discussion. They probably don't need to be resolved for this issue but are somewhat related so I thought I'd bring them up.
thepinetree commented 3 years ago

Addressing some things based on my own observations, I agree with the refactor of LIMIT as a property, since it has much the same behavior of an ORDER BY (an additional node in the output, but one which can also be propagated down to lower nodes).

To my knowledge, the only property initially envisoned was a Sort Property, so at the moment I don't think that any new properties would not be able to be coupled to a particular node, or pushed down otherwise. LIMITs in particular should be able to be pushed down to the child Get though I'm not sure how nested queries break this, as it seems the currently do.

Our current rules for PropertySort should not break if a PropertyLimit is added because properties of a specific type are accessed with the GetPropertyOfType function in the PropertySet class, though if this is not done anywhere (i.e. just indexed into the property set), it should be replaced as such. However, this information will be more useful with the idea of an 'optional property,' one that might be satisfied by a child node, and will no longer require the output of a new Limit or OrderBy node where the property existed.

A bunch of property-related ideas are listed on #1421, if ok with you, I'd like to add a link to this issue in that checklist.

jkosh44 commented 3 years ago

@thepinetree Feel free to link this issue.

I guess I don't fully understand the difference between when we create a PropertySort here: https://github.com/cmu-db/noisepage/blob/520e0da7437d9e970e0fcb144a2f9939c35ba560/src/traffic_cop/traffic_cop_util.cpp#L36-L63 and here: https://github.com/cmu-db/noisepage/blob/520e0da7437d9e970e0fcb144a2f9939c35ba560/src/optimizer/child_property_deriver.cpp#L92-L103 I had sort of assumed that they were similar ways of creating a property for the query in different places, but I guess that they are used for different reasons?

For the outer query a PropertySort will be created in both places, but for nested queries a PropertySort will only be created in the second block of code, which might be the cause of this error.

jkosh44 commented 3 years ago

Ignoring the following method (because I don't really understand it), https://github.com/cmu-db/noisepage/blob/520e0da7437d9e970e0fcb144a2f9939c35ba560/src/optimizer/child_property_deriver.cpp#L92-L103 we only derive properties for the outer query by checking if it has an ORDER BY clause https://github.com/cmu-db/noisepage/blob/520e0da7437d9e970e0fcb144a2f9939c35ba560/src/traffic_cop/traffic_cop_util.cpp#L36-L63

If we want to support a PropertyLimit, or support PropertySort for nested queries, or add another property that applies to nested queries I think that we'll need a new way of initially deriving and storing properties. A way that can derive properties for the outer and all inner queries and then stores those properties separately somehow. I'm not sure what the best way to do this is though. Maybe deriving the properties during the binder because we already traverse through all outer and nested queries. Maybe visiting all the statement nodes an additional time to get all the properties.

jkosh44 commented 3 years ago

https://github.com/jkosh44/noisepage/commit/fefcc60c4646a66019e31df16974b46405faa024 Fixes this issue. Though it's probably still worth considering how to propagate properties down into nested queries.

The change should wait until after #1422 is merged or be integrated into #1422. It's possible depending on the final state of #1422 that this change is no longer necessary (@thepinetree tagging for visibility).