erikgrinaker / toydb

Distributed SQL database in Rust, written as a learning project
Apache License 2.0
5.99k stars 555 forks source link

Hash Join Issue? #61

Closed yywe closed 2 weeks ago

yywe commented 9 months ago

Hi @erikgrinaker,

Thank you for open source this great project. I would like to say this is the best resource to learn database principles. Meanwhile, the code quality is awesome, neat and elegant. Every piece is excellent.

I have a question for one of the optimizer, as shown below:

                (Expression::Field(a, a_label), Expression::Field(b, b_label)) => {
                    let (left_field, right_field) = if a < left_size {
                        ((a, a_label), (b - left_size, b_label))
                    } else {
                        ((b, b_label), (a - left_size, a_label))
                    };
                    Ok(Node::HashJoin { left, left_field, right, right_field, outer })
                }

when it is equal join, here we switch to hash join. based on Field(a) and Field(b). Not sure if I miss anything, but what if the corresponding field (column) are not unique ? if there are duplicate values in the fields. I feel the hash join will miss some rows?

The hash map is created by collect the key value pairs:

let right: HashMap<Value, Row> = rrows .map(|res| match res { Ok(row) if row.len() <= r => { Err(Error::Internal(format!("Right index {} out of bounds", r))) } Ok(row) => Ok((row[r].clone(), row)), Err(err) => Err(err), }) .collect::<Result<_>>()?;

if there are duplicate values, I think only the last pair will exist.

do I miss anything? Appreciate any feedback. Thanks!

erikgrinaker commented 9 months ago

Yep, you're right, nice find!

toydb> create table a (id int primary key, value string not null);
Created table a
toydb> create table b (id int primary key, value string not null);
Created table b
toydb> insert into a values (1, 'a'), (2, 'b');
Created 2 rows
toydb> insert into b values (1, 'b'), (2, 'b');
Created 2 rows
toydb> select a.id, b.id from a join b on a.value = b.value;
2|2
toydb> explain select a.id, b.id from a join b on a.value = b.value;
Projection: a.id, b.id
└─ HashJoin: inner on a.value = b.value
   ├─ Scan: a
   └─ Scan: b
laxjesse commented 8 months ago

encountered this while playing around a little and wanted to to share in case it was helpful. I believe that this test should actually return duplicate results e.g.

 Result: ["id", "title", "genre", "studio", "rating"]
 [Integer(10), String("Inception"), String("Science Fiction"), String("Warner Bros"), Float(8.8)]
+[Integer(10), String("Inception"), String("Science Fiction"), String("Warner Bros"), Float(8.8)]
+[Integer(1), String("Stalker"), String("Science Fiction"), String("Mosfilm"), Float(8.2)]
 [Integer(1), String("Stalker"), String("Science Fiction"), String("Mosfilm"), Float(8.2)]
 [Integer(4), String("Heat"), String("Action"), String("Warner Bros"), Float(8.2)]
+[Integer(4), String("Heat"), String("Action"), String("Warner Bros"), Float(8.2)]
+[Integer(6), String("Solaris"), String("Science Fiction"), String("Mosfilm"), Float(8.1)]
 [Integer(6), String("Solaris"), String("Science Fiction"), String("Mosfilm"), Float(8.1)]
 [Integer(7), String("Gravity"), String("Science Fiction"), String("Warner Bros"), Float(7.7)]
+[Integer(7), String("Gravity"), String("Science Fiction"), String("Warner Bros"), Float(7.7)]
+[Integer(9), String("Birdman"), String("Comedy"), String("Warner Bros"), Float(7.7)]
 [Integer(9), String("Birdman"), String("Comedy"), String("Warner Bros"), Float(7.7)]
 [Integer(5), String("The Fountain"), String("Science Fiction"), String("Warner Bros"), Float(7.2)]
+[Integer(5), String("The Fountain"), String("Science Fiction"), String("Warner Bros"), Float(7.2)]

(also just to reiterate, thanks for open sourcing this! it's great fun to play with)

erikgrinaker commented 2 weeks ago

This should be addressed now, but I'm going to have a closer look at this logic and add more test cases in a while.