Closed djouallah closed 1 year ago
Thanks for raising this issue. We should perhaps check if this is related to the Sessions feature by trying ad-hoc query execution instead and comparing the results.
@djouallah there some bug in Generate TPCH Parquet files
section. Can help fixed that, so I can continue debug?
sorry fixed
There are two issues:
So:
print(engine.query(value, 'Debug'), flush=True)
select
--Query19
sum(l_extendedprice* (1 - l_discount)) as revenue
from
file("./0.1/lineitem/*.parquet") as line
join
file("./0.1/part/*.parquet") as part
ON p_partkey = l_partkey
where (
p_brand = 'Brand#12'
and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')
and l_quantity >= 1 and l_quantity <= 1 + 10
and p_size between 1 and 5
and l_shipmode in ('AIR', 'AIR REG')
and l_shipinstruct = 'DELIVER IN PERSON'
) or ( p_partkey = l_partkey
and p_brand = 'Brand#23'
and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')
and l_quantity >= 10 and l_quantity <= 10 + 10
and p_size between 1 and 10
and l_shipmode in ('AIR', 'AIR REG')
and l_shipinstruct = 'DELIVER IN PERSON'
) or ( p_partkey = l_partkey
and p_brand = 'Brand#34'
and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')
and l_quantity >= 20 and l_quantity <= 20 + 10
and p_size between 1 and 15
and l_shipmode in ('AIR', 'AIR REG')
and l_shipinstruct = 'DELIVER IN PERSON'
);
file("./0.1/lineitem/*.parquet") as line
is Ok.Here is the "fixed" version Colab Notebook: https://colab.research.google.com/drive/1OhuO2Lyey3IEoT-Tg_kmBRd72DvRVW0R?usp=sharing
The original slow SQL looks like this:
CREATE DATABASE IF NOT EXISTS db_xxx ENGINE = Atomic;
create view db_xxx.lineitem as select * from file("./0.1/lineitem/*.parquet", Parquet);
create view db_xxx.orders as select * from file("./0.1/orders/*.parquet", Parquet);
create view db_xxx.customer as select * from file("./0.1/customer/*.parquet", Parquet);
create view db_xxx.nation as select * from file("./0.1/nation/*.parquet", Parquet);
create view db_xxx.part as select * from file("./0.1/part/*.parquet", Parquet);
create view db_xxx.partsupp as select * from file("./0.1/partsupp/*.parquet", Parquet);
create view db_xxx.region as select * from file("./0.1/region/*.parquet", Parquet);
create view db_xxx.supplier as select * from file("./0.1/supplier/*.parquet", Parquet);
select
sum(l_extendedprice* (1 - l_discount)) as revenue
from db_xxx.lineitem, db_xxx.part
where ( p_partkey = l_partkey
and p_brand = 'Brand#12'
and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')
and l_quantity >= 1 and l_quantity <= 1 + 10
and p_size between 1 and 5
and l_shipmode in ('AIR', 'AIR REG')
and l_shipinstruct = 'DELIVER IN PERSON'
) or ( p_partkey = l_partkey
and p_brand = 'Brand#23'
and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')
and l_quantity >= 10 and l_quantity <= 10 + 10
and p_size between 1 and 10
and l_shipmode in ('AIR', 'AIR REG')
and l_shipinstruct = 'DELIVER IN PERSON'
) or ( p_partkey = l_partkey
and p_brand = 'Brand#34'
and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')
and l_quantity >= 20 and l_quantity <= 20 + 10
and p_size between 1 and 15
and l_shipmode in ('AIR', 'AIR REG')
and l_shipinstruct = 'DELIVER IN PERSON'
);
I will raise an issue for the implicit JOIN slow issue later.
Thanks for your help
for some reason it get really slow randomly even for very small data ?
reproducible example here
https://colab.research.google.com/drive/1o07UEUUAxcdR9WCdYt53mVZBvqqbqzxn#scrollTo=FT8O269BJNfE