arana-db / arana

Arana is a Cloud Native Database Proxy. It can also be deployed as a Database mesh sidecar.
http://arana-docs.rtfd.io/
Apache License 2.0
292 stars, 92 forks

Implement a low-level HASH-JOIN dataset #622

Open jjeffcaii opened 1 year ago

jjeffcaii commented 1 year ago

The HASH-JOIN dataset API could look similar to the code below:

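func HashJoin(left, right Dataset, joinColumns []string, opts ...Option) Dataset {
    // NOTE: the []string type and the Option type are placeholders; the exact signature is still open.
    //  ...
}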

The HASH-JOIN should contain two phases:

  1. build hash chunks from the left dataset; a chunk looks like a hash map: key=hash(values of join_columns), value=rows
  2. probe with each row in the right dataset: compute its hash key, then check the rows in the matching chunk one by one to see whether they really match
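The two phases above can be sketched roughly as follows. This is only an illustration, not Arana's actual Dataset API: `Row`, `Pair`, `hashKey`, `sameValues` and `hashJoin` are hypothetical names, rows are modeled as plain string maps, and FNV-1a is an arbitrary choice of hash function. Note that a hash hit is confirmed by comparing the real column values, since different values can share a hash key.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Row is a hypothetical row representation: column name -> value.
type Row map[string]string

// Pair is one joined result: a left row and the right row it matched.
type Pair struct{ Left, Right Row }

// hashKey hashes the values of the given join columns of a row.
func hashKey(r Row, cols []string) uint64 {
	h := fnv.New64a()
	for _, c := range cols {
		h.Write([]byte(r[c]))
		h.Write([]byte{0}) // separator, so ("ab","c") and ("a","bc") differ
	}
	return h.Sum64()
}

// sameValues compares the join column values pairwise.
func sameValues(l Row, lCols []string, r Row, rCols []string) bool {
	for i := range lCols {
		if l[lCols[i]] != r[rCols[i]] {
			return false
		}
	}
	return true
}

// hashJoin runs the build phase on left and the probe phase on right.
func hashJoin(left, right []Row, leftCols, rightCols []string) []Pair {
	// Phase 1 (build): key=hash(values of join columns), value=rows.
	chunks := make(map[uint64][]Row)
	for _, l := range left {
		k := hashKey(l, leftCols)
		chunks[k] = append(chunks[k], l)
	}
	// Phase 2 (probe): look up the chunk, then check candidates one by one.
	var out []Pair
	for _, r := range right {
		for _, l := range chunks[hashKey(r, rightCols)] {
			if sameValues(l, leftCols, r, rightCols) {
				out = append(out, Pair{l, r})
			}
		}
	}
	return out
}

func main() {
	left := []Row{{"id": "a", "x": "5"}, {"id": "b", "x": "6"}}
	right := []Row{{"id": "j", "y": "5"}, {"id": "k", "y": "8"}}
	for _, p := range hashJoin(left, right, []string{"x"}, []string{"y"}) {
		fmt.Println(p.Left["id"], p.Right["id"]) // prints "a j"
	}
}
```

The sketch materializes both inputs as slices for brevity; a real dataset would more likely stream the probe side row by row.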

some docs:

A tiny example: we have two datasets, and we want to execute SQL like: select foo.id, bar.id from foo join bar on foo.x = bar.y

--- Dataset foo
id  x
a   5
b   6
c   7

--- Dataset bar
id  y
j   5
k   8

  1. build from foo: with the hash method x -> x%2, we get a map like { 0: [b-6], 1: [a-5, c-7] }
  2. probe from bar: j-5 checks chunk[key=1] and k-8 checks chunk[key=0]
  3. finally, only a-5 and j-5 have equal join values (c-7 and b-6 share a chunk with the probe rows but fail the row-by-row check), bingo!
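The tiny example can be reproduced with a few lines of Go. This is a toy sketch with hypothetical names (`Row`, `bucket`, `build`, `probe`); it hard-codes the x%2 hash method from the example and uses integer join values only.

```go
package main

import "fmt"

// Row is a hypothetical two-column row: an id plus the join value (x or y).
type Row struct {
	ID  string
	Val int
}

// bucket is the toy hash method from the example: v -> v%2.
func bucket(v int) int { return v % 2 }

// build groups the rows of the build side (foo) into chunks by bucket.
func build(rows []Row) map[int][]Row {
	chunks := make(map[int][]Row)
	for _, r := range rows {
		k := bucket(r.Val)
		chunks[k] = append(chunks[k], r)
	}
	return chunks
}

// probe checks each probe-side row (bar) against its chunk, one row at a
// time, and returns the matching (foo.id, bar.id) pairs.
func probe(chunks map[int][]Row, rows []Row) [][2]string {
	var out [][2]string
	for _, r := range rows {
		for _, l := range chunks[bucket(r.Val)] {
			if l.Val == r.Val { // same chunk is not enough; values must be equal
				out = append(out, [2]string{l.ID, r.ID})
			}
		}
	}
	return out
}

func main() {
	foo := []Row{{"a", 5}, {"b", 6}, {"c", 7}}
	bar := []Row{{"j", 5}, {"k", 8}}

	chunks := build(foo)
	fmt.Println(chunks[0])          // [{b 6}]
	fmt.Println(chunks[1])          // [{a 5} {c 7}]
	fmt.Println(probe(chunks, bar)) // [[a j]]
}
```

With only two buckets almost every probe row lands in a non-empty chunk, which is why the equality check inside `probe` does the real filtering here.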
wang1309 commented 1 year ago

pls assign to me