AlaSQL / alasql

AlaSQL.js - JavaScript SQL database for browser and Node.js. Handles both traditional relational tables and nested JSON data (NoSQL). Export, store, and import data from localStorage, IndexedDB, or Excel.
http://alasql.org
MIT License
7.02k stars 654 forks source link

OUTER JOIN implementation is incomplete. #518

Open billschaller opened 8 years ago

billschaller commented 8 years ago
if(typeof exports === 'object') {
    var assert = require("assert");
    var alasql = require('..');
  var _ = require('lodash');
} else {
    __dirname = '.';
};

describe('Test OUTER JOIN on more than 2 tables', function() {

  it('1. Test OUTER JOIN on more than 2 tables', function(done) {
    var data = {
      COLORS: [{id:1,name:'red'},{id:2,name:'blue'},{id:3,name:'orange'}],            
      FRUITS:[{id:1,name:'apple'},{id:2,name:'grape'},{id:3,name:'orange'}],
      MASCOTS:[{id:1,name:'redsox'},{id:2,name:'whitesox'},{id:3,name:'orange'}]
    };

    res = alasql('select t0.name t0n ,t1.name t1n, t2.name t2n from ? t0 outer join ? t1 on t1.name = t0.name outer join ? t2 on t2.name = t0.name or t2.name = t1.name',[data.COLORS,data.FRUITS, data.MASCOTS]);
    assert.deepEqual(res.data, [
    {t0n: 'red', t1n: undefined, t2n: undefined},
    {t0n: 'blue', t1n: undefined, t2n: undefined},
    {t0n: 'orange', t1n: 'orange', t2n: 'orange'},
    {t0n: undefined, t1n: 'apple', t2n: undefined},
    {t0n: undefined, t1n: 'grape', t2n: undefined},
    {t0n: undefined, t1n: undefined, t2n: 'redsox'},
    {t0n: undefined, t1n: undefined, t2n: 'whitesox'}
    ]);
    done();
  });
});
billschaller commented 8 years ago

The outer join implementation should include all rows that match the predicate, and all rows from both the right and left which do not match the predicate. While evaluating not matched rows, the join code seems to backtrack up one join level, then include rows that don't match, but with garbage data from earlier in the join stack.

I think maybe if you add a var query.matched, and track what rows are matched (or not matched) during the entire join, and then at the end go back and include all non-matched rows, it would fix the issue. Another solution might be to just force evaluation of a NULL row for the left side of all outer joins if the end of the loop is reached and all rows on the right side have not been matched.

Not sure how best to solve -- any thoughts?

agershun commented 8 years ago

Current implementation gives:

[ { t0n: 'red', t1n: undefined, t2n: undefined },
  { t0n: 'red', t1n: undefined, t2n: 'redsox' },
  { t0n: 'red', t1n: undefined, t2n: 'whitesox' },
  { t0n: 'red', t1n: undefined, t2n: 'orange' },
  { t0n: 'blue', t1n: undefined, t2n: undefined },
  { t0n: 'blue', t1n: undefined, t2n: 'redsox' },
  { t0n: 'blue', t1n: undefined, t2n: 'whitesox' },
  { t0n: 'blue', t1n: undefined, t2n: 'orange' },
  { t0n: 'orange', t1n: 'orange', t2n: 'orange' },
  { t0n: 'orange', t1n: undefined, t2n: 'redsox' },
  { t0n: 'orange', t1n: undefined, t2n: 'whitesox' },
  { t0n: undefined, t1n: 'apple', t2n: undefined },
  { t0n: undefined, t1n: 'grape', t2n: undefined } ]

I will check.

agershun commented 8 years ago

The current algorithm works on the similar way you described:

The outer join implementation should include all rows that match the predicate, and all rows from both the right and left which do not match the predicate. While evaluating not matched rows, the join code seems to backtrack up one join level, then include rows that don't match, but with garbage data from earlier in the join stack.

I will check the output of other SQL realizations.

billschaller commented 8 years ago

I think the issue is in this block: https://github.com/agershun/alasql/blob/develop/src/39dojoin.js#L145

I think this works for a single join, but once there are more than one join, it starts emitting garbage because it only clears the scope for the middle level. Nested loop join gets complex in this case. Either you'd have to fully materialize each join before evaluating the next, or you would have to track whether every row of every source has been matched. If not matched, it would be evaluated with any prior table aliases in the scope set to null. This would handle cases where your output should have {t0n: undefined, t1n: 'purple', t2n: 'purple'} or {t0n: undefined, t1n: undefined, t2n: 'purple'}.

PostgreSQL won't even do an outer join without a sargable predicate, because it only does merge joins and hash joins for outer joins. I will grab the output of SQL server today and post it here. I don't think mysql does full outer joins at all, which is horrible to try and work around.

billschaller commented 8 years ago

amending the previous example to add rows existing in both t1 and t2 but not t0:

if(typeof exports === 'object') {
    var assert = require("assert");
    var alasql = require('..');
  var _ = require('lodash');
} else {
    __dirname = '.';
};

describe('Test OUTER JOIN on more than 2 tables', function() {

  it('1. Test OUTER JOIN on more than 2 tables', function(done) {
    var data = {
      COLORS: [{id:1,name:'red'},{id:2,name:'blue'},{id:3,name:'orange'}],            
      FRUITS:[{id:1,name:'apple'},{id:2,name:'grape'},{id:3,name:'orange'},{id:4,name:'peaches'}],
      MASCOTS:[{id:1,name:'redsox'},{id:2,name:'whitesox'},{id:3,name:'orange'},{id:4,name:'peaches'}]
    };

    res = alasql('select t0.name t0n ,t1.name t1n, t2.name t2n from ? t0 outer join ? t1 on t1.name = t0.name outer join ? t2 on t2.name = t0.name or t2.name = t1.name',[data.COLORS,data.FRUITS, data.MASCOTS]);
    assert.deepEqual(res.data, [
    {t0n: 'red', t1n: undefined, t2n: undefined},
    {t0n: 'blue', t1n: undefined, t2n: undefined},
    {t0n: 'orange', t1n: 'orange', t2n: 'orange'},
    {t0n: undefined, t1n: 'apple', t2n: undefined},
    {t0n: undefined, t1n: 'grape', t2n: undefined},
    {t0n: undefined, t1n: 'peaches', t2n: 'peaches'},
    {t0n: undefined, t1n: undefined, t2n: 'redsox'},
    {t0n: undefined, t1n: undefined, t2n: 'whitesox'}
    ]);
    done();
  });
});

Gives:

[  
  {"t0n":"red","t1n":undefined,"t2n":undefined},
  {"t0n":"red","t1n":undefined,"t2n":"redsox"},
  {"t0n":"red","t1n":undefined,"t2n":"whitesox"},
  {"t0n":"red","t1n":undefined,"t2n":"orange"},
  {"t0n":"red","t1n":undefined,"t2n":"peaches"},
  {"t0n":"blue","t1n":undefined,"t2n":undefined},
  {"t0n":"blue","t1n":undefined,"t2n":"redsox"},
  {"t0n":"blue","t1n":undefined,"t2n":"whitesox"},
  {"t0n":"blue","t1n":undefined,"t2n":"orange"},
  {"t0n":"blue","t1n":undefined,"t2n":"peaches"},
  {"t0n":"orange","t1n":"orange","t2n":"orange"},
  {"t0n":"orange","t1n":undefined,"t2n":"redsox"},
  {"t0n":"orange","t1n":undefined,"t2n":"whitesox"},
  {"t0n":"orange","t1n":undefined,"t2n":"peaches"},
  {"t0n":undefined,"t1n":"apple","t2n":undefined},
  {"t0n":undefined,"t1n":"grape","t2n":undefined},
  {"t0n":undefined,"t1n":"peaches","t2n":"peaches"}
]
billschaller commented 8 years ago

Test script for SQL server:

create table colors (id int, name varchar(255))
create table fruits (id int, name varchar(255))
create table mascots (id int, name varchar(255))
go
insert into colors(id, name) values
(1, 'red'),
(2, 'blue'),
(3, 'orange')
insert into fruits(id, name) values
(1, 'apple'),
(2, 'grape'),
(3, 'orange'),
(4, 'peaches')
insert into mascots(id, name) values
(1, 'redsox'),
(2, 'whitesox'),
(3, 'orange'),
(4, 'peaches')
select t0.name t0n, t1.name t1n, t2.name t2n
from colors t0
full outer join fruits t1 on t1.name = t0.name
full outer join mascots t2 on t2.name = t1.name or t2.name = t0.name

Returns:

t0n t1n t2n
red NULL NULL
blue NULL NULL
orange orange orange
NULL apple NULL
NULL grape NULL
NULL peaches peaches
NULL NULL redsox
NULL NULL whitesox

Query plan: http://puu.sh/mlG2l/78e785be4a.png

billschaller commented 8 years ago

PostgreSQL test script:

drop table if exists colors;
drop table if exists fruits;
drop table if exists mascots;
create table colors (id int, name varchar(255));
create table fruits (id int, name varchar(255));
create table mascots (id int, name varchar(255));
insert into colors(id, name) values
(1, 'red'),
(2, 'blue'),
(3, 'orange');
insert into fruits(id, name) values
(1, 'apple'),
(2, 'grape'),
(3, 'orange'),
(4, 'peaches');
insert into mascots(id, name) values
(1, 'redsox'),
(2, 'whitesox'),
(3, 'orange'),
(4, 'peaches');
with t1 as (select COALESCE(t0.name, t1.name) AS name, t0.name as t0n, t0.id as t0id, t1.name as t1n, t1.id as t1id FROM colors t0 full outer join fruits t1 on t1.name = t0.name)
select t0n, t1n, t2.name as t2n
from t1
full outer join mascots t2 on t2.name = t1.name
order by t0id, t1id, t2.id

Like I said, Postgresql doesn't support full outer join with a predicate like OR , so I use a cte to materialize the first join.

Returns:

t0n t1n t2n
red NULL NULL
blue NULL NULL
orange orange orange
NULL apple NULL
NULL grape NULL
NULL peaches peaches
NULL NULL redsox
NULL NULL whitesox

Using the same exact query as postgres above (minus "full") in alasql gives the correct result.

[  
  {"t0n":"red","t1n":undefined,"t2n":undefined},
  {"t0n":"blue","t1n":undefined,"t2n":undefined},
  {"t0n":"orange","t1n":"orange","t2n":"orange"},
  {"t0n":undefined,"t1n":"apple","t2n":undefined},
  {"t0n":undefined,"t1n":"grape","t2n":undefined},
  {"t0n":undefined,"t1n":"peaches","t2n":"peaches"},
  {"t0n":undefined,"t1n":undefined,"t2n":"redsox"},
  {"t0n":undefined,"t1n":undefined,"t2n":"whitesox"}
]

This basically verifies that OUTER JOIN between 2 tables works as expected. It is only when more than 2 tables are being joined that alasql misbehaves.

agershun commented 8 years ago
  1. I just found a bug in parser, and now we can use FULL keyword in queries (in combination FULL OUTER JOIN).
  2. I will check these queries on January 11, 2016 in the office with MS SQL Server, to be sure.
  3. Thank you!
billschaller commented 8 years ago

@agershun Great!

_Very_ cool library you have going here, it is a really powerful concept and I hope you keep driving towards a 1.0 :smile:

agershun commented 8 years ago

Oracle also returns 8 lines (unfortunately I can not figure out which lines). Below is the test script:

create table colors (id int, name varchar(255));
create table fruits (id int, name varchar(255));
create table mascots (id int, name varchar(255));

insert into colors(id, name) values (1, 'red');
insert into colors(id, name) values (2, 'blue');
insert into colors(id, name) values (3, 'orange');

insert into fruits(id, name) values (1, 'apple');
insert into fruits(id, name) values (2, 'grape');
insert into fruits(id, name) values (3, 'orange');
insert into fruits(id, name) values (4, 'peaches');

insert into mascots(id, name) values (1, 'redsox');
insert into mascots(id, name) values (2, 'whitesox');
insert into mascots(id, name) values (3, 'orange');
insert into mascots(id, name) values (4, 'peaches');

select t0.name t0n, t1.name t1n, t2.name t2n 
from colors t0
full outer join fruits t1 on t1.name = t0.name
full outer join mascots t2 on t2.name = t1.name or t2.name = t0.name;

I tried to use thisOracle Terminal tool.

agershun commented 8 years ago

This join-procedure was the hardest part of AlaSQL...

mathiasrw commented 8 years ago

I love this - im gone for few hours and all THIS happens...

billschaller commented 8 years ago

To be honest, I'm pretty sure multiple joins cannot be handled safely with this approach. The outer join operations (left, right, full) are neither associative nor commutative, so it's not correct to evaluate them recursively like this. You could probably hack together an algo that could do it, but it would have to maintain so much state information that the performance would suffer, and it would be hard as hell to maintain.

I notice there are very few tests in your suite involving more than 1 join. This behavior will exist in multi-part left and right joins too. Might also be present in the anti and semi joins, but I haven't examined those very thoroughly.

What needs to happen here is that multiple joins need to be processed in a tree.

So if I have something like:

Select * from a
outer natural join b
outer natural join c

You would end up with a plan like:

     ______
     |join|
     ------
    /      \
_____    ______
| c |    |join|
-----    ------
        /      \
     _____    _____
     | b |    | a |
     -----    -----

Logically, the join between a and b would be evaluated, and then the result would be joined with c. The join between a and b does not need to be materialized completely before beginning to evaluate the join with c. While evaluating c, the join process would need to consume rows from the a|b join. Those rows can be produced sort of as a coroutine.

Basically, what you need is a query planner to determine the order of evaluation of the joins (with inner joins you can evaluate them in whatever order you wish, with outer joins they are order-dependent). There are a ton of cases also where it would be much faster to do a merge or hash join instead of a nested loop join. This is less critical when everything is in memory, but for large relations it can still have a huge impact on performance.

Implement each join operation as an operator that holds internal state, takes a left source, a right source, and a predicate. The interface for each operator would just be a fetch() method that evaluates the inputs and returns the next row, or false when completed.

Start by implementing separate operators for nested loop join modes. The query planner can simply assemble the join in the order it is written, taking into account the references in the predicates.

Once that's all working, you can add sort operators, merge or hash join operators, and cost based query planning if you really want to push performance.

Sorry to be dropping bombs :(

Happy to chat more any time you like, I'm on freenode, nick is zeroedin-bill.

mathiasrw commented 8 years ago

There are a ton of cases also where it would be much faster to do a merge or hash join instead of a nested loop join.

:+1:

agershun commented 8 years ago

This complexity is the reason why OUTER JOIN realized only in SQL Server and Oracle (and not in MySQL, Postgres, and SQLite). At the same time this is a reason, why we could not find much tests on multiple OUTER joins (nor RIGHT, SEMI or ANTI joins). Will work on it. Thank you!

billschaller commented 8 years ago

Postgres does support full outer join, just not with a non sargable predicate. Meaning, the ON clause has to work with a hash or merge join.

If you look at the SQL server query plan, you can see that the planner has actually transformed the full outer joins into a combination of left outer joins and left anti semi joins, all nested loops. With the tiny dataset in the test, it is using all nested loop joins. If you run that same query with a lot of data (>10000), you would see it start to use merge and hash joins instead, but probably in the same structure.

Check out the join tests for postgres, it is a good suite, and has some more interesting tests than sqlite: https://github.com/postgres/postgres/tree/master/src/test/regress

mathiasrw commented 8 years ago

https://github.com/postgres/postgres/tree/master/src/test/regress

Im looking into how they verify the tests so we can reuse them.

For others who like a little overview of the world of JOINs: http://blog.codinghorror.com/a-visual-explanation-of-sql-joins/

billschaller commented 8 years ago

Also have a look at the executor implementation for some interesting reading.

https://github.com/postgres/postgres/tree/master/src/backend/executor

agershun commented 8 years ago

@mathiasrw About JOIN diagrams: Unfortunately, I could not find any diagram about how should look like JOIN on JOIN scheme...

mathiasrw commented 8 years ago

JOIN on JOIN scheme

Like several joins in a row? Isnt it like a plus or minus - you can only do two at the time and then do a join again on the result?

billschaller commented 8 years ago

@mathiasrw outer join is not associative or commutative so yeah you can pretty much only do 2 at once. There are some papers on map/reduce algos to do it, but that stuff is over my head.

agershun commented 8 years ago

I temporarily comment the assert line in the test (to prevent TrevisCI annoying messages), but this problem is still in the focus.

I want to prepare set of tests with SQL Server on multiple JOINs and some explanation Vienna diagrams.

@mathiasrw We can add this limitation to the README.md file.

billschaller commented 8 years ago

@agershun That sounds like a good first step - be careful of violating the SQL Server EULA (reverse engineering) in the process.

Using PostgreSQL as a basis for comparison or modelling on most behaviors is probably better than SQL Server just for legal purposes.

mathiasrw commented 8 years ago

I temporarily comment the assert line in the test (to prevent TrevisCI annoying messages), but this problem is still in the focus.

If it.(... is changed to it.skip(... it will jump the test and we can keep an eye on how many tests are skipped (its displayed in result)

@mathiasrw We can add this limitation to the README.md file.

Check !

agershun commented 8 years ago

@mathiasrw I will use this it.skip()notation.

agershun commented 8 years ago

I am not a lazy guy! :)

See this:

Good news: second INNER and LEFT joins are OK. Bad news: second RIGHT and OUTER joins are wrong :(

mathiasrw commented 8 years ago

WOW !

That is a LOT of good examples and a very good test.

Could you do the right join as a left join where you swap the order of the tables?

billschaller commented 8 years ago

@mathiasrw

Think of right joins like the - operator.

1 + 2 == 2 + 1

1 - 2 =/= 2 - 1

You could in fact switch it and make it a left join, but you'd also have to rearrange any other joins in the query in a complex way to retain the meaning of the logic. Most query plan optimizers do these sorts of rearrangements, but it does require some pretty robust logic and architecture to accomplish.