chdb-io / chdb

chDB is an in-process OLAP SQL Engine 🚀 powered by ClickHouse
https://clickhouse.com/chdb
Apache License 2.0
2.13k stars 75 forks

Query csv with group by very slow #31

Closed xbsura closed 1 year ago

xbsura commented 1 year ago

Describe the situation

import chdb
res = chdb.query('select count(*) cnt from file("/Users/xbsura/Downloads/organizations-2000000.csv", CSVWithNames) group by Name order by cnt desc', 'CSV')

wc -l /Users/xbsura/Downloads/organizations-2000000.csv
2000001 /Users/xbsura/Downloads/organizations-2000000.csv

head /Users/xbsura/Downloads/organizations-2000000.csv
Index,Organization Id,Name,Website,Country,Description,Founded,Industry,Number of employees
1,391dAA77fea9EC1,Daniel-Mcmahon,https://stuart-rios.biz/,Cambodia,Focused eco-centric help-desk,2013,Sports,1878
2,9FcCA4A23e6BcfA,"Mcdowell, Tate and Murray",http://jacobs.biz/,Guyana,Front-line real-time portal,2018,Legal Services,9743
3,DB23330238B7B3D,"Roberts, Carson and Trujillo",http://www.park.com/,Jordan,Innovative hybrid data-warehouse,1992,Hospitality,7537
4,bbf18835CFbEee7,"Poole, Jefferson and Merritt",http://hayden.com/,Cocos (Keeling) Islands,Extended regional Graphic Interface,1991,Food Production,9974

This query needs more than 1 minute to finish, and it uses more than 100 GB of memory.

Expected performance

For a 200MB file, less than 1 second would be reasonable.
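For anyone trying to reproduce the numbers above, here is a minimal sketch of a timing and peak-memory harness around the reported query. The helper names (`build_query`, `timed`, `peak_rss_mib`, `run_report_query`) are hypothetical and not part of chDB; only `chdb.query(sql, 'CSV')` comes from the report. The memory figure relies on the Unix-only `resource` module, so it is an approximation of process-wide peak usage, not a chDB-specific metric.

```python
import resource
import sys
import time


def build_query(csv_path):
    """Return the GROUP BY query from the report for the given CSV path."""
    return (
        f'select count(*) cnt from file("{csv_path}", CSVWithNames) '
        "group by Name order by cnt desc"
    )


def timed(fn, *args):
    """Call fn(*args) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def peak_rss_mib():
    """Peak resident set size of this process in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def run_report_query(csv_path):
    """Run the reported query through chdb and print time and peak memory."""
    import chdb  # imported lazily so the helpers above work without chdb

    res, seconds = timed(chdb.query, build_query(csv_path), "CSV")
    print(f"{seconds:.2f} s, peak RSS {peak_rss_mib():.0f} MiB")
    return res
```

Calling `run_report_query("/Users/xbsura/Downloads/organizations-2000000.csv")` in a fresh interpreter gives one time and one peak-memory reading that can be compared across chdb versions.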

auxten commented 1 year ago

Where can we download organizations-2000000.csv?

xbsura commented 1 year ago

https://media.githubusercontent.com/media/datablist/sample-csv-files/main/files/organizations/organizations-2000000.zip

download from here

auxten commented 1 year ago

This is a serious bug; any CSV data above 200MB with aggregation can reproduce it. Will dig into this soon.

lmangani commented 1 year ago

Test seems to be passing with chdb 0.8.0 and libchdb 0.8.0 which includes #32 by @auxten

lmangani commented 1 year ago

@xbsura could you kindly retest and confirm the latest release fixes the reported issue? Thanks for your report!
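Before retesting, it may help to confirm that the installed package is at least 0.8.0, the release mentioned above. The following is a rough sketch, assuming the package is installed under the distribution name `chdb`; the helper names are hypothetical, and the version comparison only looks at the leading numeric part of each dot-separated component.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Turn '0.8.0' into (0, 8, 0), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def has_fix(installed, minimum="0.8.0"):
    """True if the installed version is at least the minimum version."""
    return parse_version(installed) >= parse_version(minimum)


def installed_chdb_version():
    """Return the installed chdb version string, or None if not installed."""
    try:
        return version("chdb")
    except PackageNotFoundError:
        return None
```

For example, `has_fix(installed_chdb_version() or "0")` returns `False` when chdb is missing or older than 0.8.0.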

xbsura commented 1 year ago

> @xbsura could you kindly retest and confirm the latest release fixes the reported issue? Thanks for your report!

Confirmed fixed, thanks.