Aircloak / aircloak

This repository contains the Aircloak Air frontend as well as the code for our Cloak query and anonymization platform
2 stars 0 forks source link

Should explain union between 3+ queries be flattened? #4856

Open edongashi opened 3 years ago

edongashi commented 3 years ago
explain select value from integers union select value from integers union select value from integers
query plan
----------
query (standard)
  --> union (standard)
    --> union (standard)
      --> integers (standard)
        --> integers (non-personal table)
    --> union (standard)
      --> integers (standard)
        --> integers (non-personal table)
  --> union (standard)
    --> integers (standard)

Should all unions be at the same level like joins are?

explain select count(*)
from integers i1
inner join integers i2 on i1.user_id = i2.user_id
inner join integers i3 on i1.user_id = i3.user_id
query plan
----------
query (standard)
  --> i1 (standard)
    --> integers (non-personal table)
  --> i2 (standard)
    --> integers (non-personal table)
  --> i3 (standard)
    --> integers (non-personal table)
cristianberneanu commented 3 years ago

I'm not even sure joins should be on the same level. The order of the operations might matter in some cases. How is Postgres doing it?

edongashi commented 3 years ago

With empty dummy tables:

explain select count(*) from t1 inner join t2 on t1.id=t2.id inner join t3 on t1.id=t3.id;
                                  QUERY PLAN                                  
------------------------------------------------------------------------------
 Aggregate  (cost=190.03..190.04 rows=1 width=8)
   ->  Hash Join  (cost=134.75..183.66 rows=2550 width=0)
         Hash Cond: (t1.id = t3.id)
         ->  Hash Join  (cost=67.38..109.58 rows=2550 width=8)
               Hash Cond: (t1.id = t2.id)
               ->  Seq Scan on t1  (cost=0.00..35.50 rows=2550 width=4)
               ->  Hash  (cost=35.50..35.50 rows=2550 width=4)
                     ->  Seq Scan on t2  (cost=0.00..35.50 rows=2550 width=4)
         ->  Hash  (cost=35.50..35.50 rows=2550 width=4)

With a more real database:

explain select * 
from sales.Store
inner join sales.Customer on sales.Store.BusinessEntityID = sales.Customer.StoreID
inner join sales.SalesOrderHeader on sales.Customer.CustomerID = sales.SalesOrderHeader.CustomerID
inner join sales.SalesOrderDetail on sales.SalesOrderHeader.SalesOrderID = sales.SalesOrderDetail.SalesOrderID;
                                                                       QUERY PLAN                                                                        
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=673.43..11086.32 rows=121317 width=1010)
   Hash Cond: (customer.storeid = store.businessentityid)
   ->  Hash Join  (cost=612.65..10706.95 rows=121317 width=535)
         Hash Cond: (salesorderheader.customerid = customer.customerid)
         ->  Merge Join  (cost=0.70..9776.48 rows=121317 width=495)
               Merge Cond: (salesorderheader.salesorderid = salesorderdetail.salesorderid)
               ->  Index Scan using "PK_SalesOrderHeader_SalesOrderID" on salesorderheader  (cost=0.29..1549.26 rows=31465 width=431)
               ->  Index Scan using "PK_SalesOrderDetail_SalesOrderID_SalesOrderDetailID" on salesorderdetail  (cost=0.42..6632.30 rows=121317 width=64)
         ->  Hash  (cost=364.20..364.20 rows=19820 width=40)
               ->  Seq Scan on customer  (cost=0.00..364.20 rows=19820 width=40)
   ->  Hash  (cost=52.01..52.01 rows=701 width=475)
         ->  Seq Scan on store  (cost=0.00..52.01 rows=701 width=475)
(12 rows)

Worthy of note that join order matters in DBs. Postgres uses heuristics (GEQO) to determine a good join order. When there's a low number of paths it does an exhaustive search to find the optimal solution.

cristianberneanu commented 3 years ago

I gave this more thought and I think the joins display is the problematic part. Besides the fact that the join tree is collapsed, the information about which joins are emulated is not shown either. Not convinced it is worth fixing though.

edongashi commented 3 years ago

For postgres UNION returns a flat plan. The union/append operator is monoidal, meaning it's associative.

postgres=# explain select * from t1 union select * from t2 union select * from t3;
                            QUERY PLAN                            
------------------------------------------------------------------
 HashAggregate  (cost=240.38..316.88 rows=7650 width=4)
   Group Key: t1.id
   ->  Append  (cost=0.00..221.25 rows=7650 width=4)
         ->  Seq Scan on t1  (cost=0.00..35.50 rows=2550 width=4)
         ->  Seq Scan on t2  (cost=0.00..35.50 rows=2550 width=4)
         ->  Seq Scan on t3  (cost=0.00..35.50 rows=2550 width=4)
(6 rows)

For JOINs it's reasonable that it's a tree in DBs, because performance can vary between different paths. It also shows JOIN condition between relations, which we don't currently do. I think showing join conditions in the plan would be good.

Though if we indicate any join priority it may be misleading because the DB to which it's offloaded may decide on a different order. At least we should show the order it's done when emulated.

cristianberneanu commented 3 years ago

The union/append operator is monoidal, meaning it's associative.

This is not true in the general case.

(select 1 union distinct select 2) union all select 1;

is different from

select 1 union distinct (select 2 union all select 1);

For JOINs it's reasonable that it's a tree in DBs, because performance can vary between different paths.

I don't think the JOIN performance is relevant here. OUTER JOINs are not associative.

edongashi commented 3 years ago

I didn't know we supported union all. Should that be indicated in the explain plan? Currently both of the following return the same explanation:

explain select count(*) from test union all select count(*) from test
explain select count(*) from test union distinct select count(*) from test
cristianberneanu commented 3 years ago

Should that be indicated in the explain plan?

Yes, I think that would make sense.

edongashi commented 3 years ago

I think we shouldn't try to strictly respect associativity.

The original idea of explain has been to tell types of queries and see what's emulated: https://attack.aircloak.com/docs/#/datastores?id=emulation-overview

I'm not sure how emulated joins work and if we currently represent those.