BlazingDB / blazingsql

BlazingSQL is a lightweight, GPU accelerated, SQL engine for Python. Built on RAPIDS cuDF.
https://blazingsql.com
Apache License 2.0
1.93k stars 183 forks source link

[FEA] Semi Join #87

Closed VibhuJawa closed 4 years ago

VibhuJawa commented 4 years ago

I would like for blazing-sql to support semi join .

Semijoins are a way to filter a rowset based on the inclusion of its rows in another rowset.

SQL equivalent:

SELECT * FROM A WHERE A.key IN (SELECT B.key FROM B) pattern. 

There are two variants: LEFT SEMIJOIN and RIGHT SEMIJOIN .

Details about semi join from :

https://docs.microsoft.com/en-us/u-sql/statements-and-expressions/select/from/joins/semijoin

felipeblazing commented 4 years ago

I just tried the following and it worked for me:

import cudf
bc = BlazingContext()
bc.s3('bsql_data', bucket_name='blazingsql-colab', access_key_id='AKIAJGB3SR3IXU3TE5WA', secret_key='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')

bc.create_table('nation', 's3://bsql_data/tpch_sf1/nation/0_0_0.parquet')
result = bc.sql('select * from nation where nation.n_nationkey in ( select other.n_nationkey from nation as other where n_nationkey = 16)').get()
print(result.columns)

The output i got was


0           16  MOZAMBIQUE            0   

                                       n_comment  
0  s. ironic, unusual asymptotes wake blithely r  ```
felipeblazing commented 4 years ago

Can you show me a complete example where this is not working how you would expect?

VibhuJawa commented 4 years ago

Sorry for being unclear , I would like left-join to work natively i.e , i would like below sql query to work.

  SELECT e.EmpName, e.DepID  
  FROM @employees AS e  
  LEFT SEMIJOIN (SELECT (int?) DepID AS DepID, DepName FROM @departments) AS d  
  ON e.DepID == d.DepID;  

Question

Will there be performance implications for using SELECT * FROM A WHERE A.key IN (SELECT B.key FROM B) pattern instead of left-semi join or do you expect the performance to remain the same ?

felipeblazing commented 4 years ago

this is the relational algebra it generates. I am not sure what the semi join would generate but I am guessing something very similar.

  LogicalJoin(condition=[=($0, $4)], joinType=[inner])
    LogicalTableScan(table=[[main, nation]])
    LogicalAggregate(group=[{0}])
      BindableTableScan(table=[[main, nation]], filters=[[=($0, 16)]], projects=[[0]], aliases=[[n_nationkey]])
felipeblazing commented 4 years ago

In the future we will be able to optimize out LogicalAggregate(group=[{0}]) using the CBO when the column in question is unique but that is currently not supported.

VibhuJawa commented 4 years ago

this is the relational algebra it generates. I am not sure what the semi join would generate but I am guessing something very similar.

  LogicalJoin(condition=[=($0, $4)], joinType=[inner])
    LogicalTableScan(table=[[main, nation]])
    LogicalAggregate(group=[{0}])
      BindableTableScan(table=[[main, nation]], filters=[[=($0, 16)]], projects=[[0]], aliases=[[n_nationkey]])

Thanks for looking into this, will update on this issue if using this pattern is not as performent as we would like on my use-case .