chdb-io / chdb

chDB is an in-process OLAP SQL Engine 🚀 powered by ClickHouse
https://clickhouse.com/docs/en/chdb
Apache License 2.0
2.03k stars 72 forks source link

TPCH_SF0.1 get stuck randomly #73

Closed djouallah closed 1 year ago

djouallah commented 1 year ago

for some reason it get really slow randomly even for very small data ?

reproducible example here

https://colab.research.google.com/drive/1o07UEUUAxcdR9WCdYt53mVZBvqqbqzxn#scrollTo=FT8O269BJNfE

lmangani commented 1 year ago

Thanks for raising this issue. We should perhaps check if this is related to the Sessions feature by trying ad-hoc query execution instead and comparing the results.

auxten commented 1 year ago

@djouallah there some bug in Generate TPCH Parquet files section. Can help fixed that, so I can continue debug?

djouallah commented 1 year ago

sorry fixed

auxten commented 1 year ago

There are two issues:

  1. print content is buffered too much in Colab.
  2. Query 19 taking too much time. As a result, chDB seems to be stuck randomly at output query 14~18.

So:

  1. To solve the first issue, you can try print with flush:
    print(engine.query(value, 'Debug'), flush=True)
  2. The second issue seems to be a ClickHouse issue of slow query on this kind of implicit JOIN. The original SQL you wrote takes about 100s on my 104 cores host. Change SQL to an explicit JOIN could fix that:
    select
    --Query19
    sum(l_extendedprice* (1 - l_discount)) as revenue
    from 
    file("./0.1/lineitem/*.parquet") as line
    join
    file("./0.1/part/*.parquet") as part
    ON p_partkey = l_partkey
    where ( 
        p_brand = 'Brand#12'
        and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')
        and l_quantity >= 1 and l_quantity <= 1 + 10
        and p_size between 1 and 5
        and l_shipmode in ('AIR', 'AIR REG')
        and l_shipinstruct = 'DELIVER IN PERSON'
    ) or ( p_partkey = l_partkey
        and p_brand = 'Brand#23'
        and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')
        and l_quantity >= 10 and l_quantity <= 10 + 10
        and p_size between 1 and 10
        and l_shipmode in ('AIR', 'AIR REG')
        and l_shipinstruct = 'DELIVER IN PERSON'
    ) or ( p_partkey = l_partkey
        and p_brand = 'Brand#34'
        and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')
        and l_quantity >= 20 and l_quantity <= 20 + 10
        and p_size between 1 and 15
        and l_shipmode in ('AIR', 'AIR REG')
        and l_shipinstruct = 'DELIVER IN PERSON'
    );
  3. Another thing is the VIEW is not very necessary on TPCH SF0.1 test. Something like file("./0.1/lineitem/*.parquet") as line is Ok.

Here is the "fixed" version Colab Notebook: https://colab.research.google.com/drive/1OhuO2Lyey3IEoT-Tg_kmBRd72DvRVW0R?usp=sharing

The original slow SQL looks like this:

CREATE DATABASE IF NOT EXISTS db_xxx ENGINE = Atomic;
create view db_xxx.lineitem      as select * from file("./0.1/lineitem/*.parquet", Parquet);
create view db_xxx.orders        as select * from file("./0.1/orders/*.parquet", Parquet);
create view db_xxx.customer      as select * from file("./0.1/customer/*.parquet", Parquet);
create view db_xxx.nation        as select * from file("./0.1/nation/*.parquet", Parquet);
create view db_xxx.part          as select * from file("./0.1/part/*.parquet", Parquet);
create view db_xxx.partsupp      as select * from file("./0.1/partsupp/*.parquet", Parquet);
create view db_xxx.region        as select * from file("./0.1/region/*.parquet", Parquet);
create view db_xxx.supplier      as select * from file("./0.1/supplier/*.parquet", Parquet);
select
sum(l_extendedprice* (1 - l_discount)) as revenue
from db_xxx.lineitem, db_xxx.part
where ( p_partkey = l_partkey
        and p_brand = 'Brand#12'
        and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')
        and l_quantity >= 1 and l_quantity <= 1 + 10
        and p_size between 1 and 5
        and l_shipmode in ('AIR', 'AIR REG')
        and l_shipinstruct = 'DELIVER IN PERSON'
    ) or ( p_partkey = l_partkey
        and p_brand = 'Brand#23'
        and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')
        and l_quantity >= 10 and l_quantity <= 10 + 10
        and p_size between 1 and 10
        and l_shipmode in ('AIR', 'AIR REG')
        and l_shipinstruct = 'DELIVER IN PERSON'
    ) or ( p_partkey = l_partkey
        and p_brand = 'Brand#34'
        and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')
        and l_quantity >= 20 and l_quantity <= 20 + 10
        and p_size between 1 and 15
        and l_shipmode in ('AIR', 'AIR REG')
        and l_shipinstruct = 'DELIVER IN PERSON'
    );
auxten commented 1 year ago

I will raise an issue for the implicit JOIN slow issue later.

djouallah commented 1 year ago

Thanks for your help