chdb-io / chdb

chDB is an in-process OLAP SQL Engine 🚀 powered by ClickHouse
https://clickhouse.com/docs/en/chdb
Apache License 2.0
2.01k stars 72 forks source link

Create table view/ import from pandas in a session #258

Open DanielMao1 opened 3 weeks ago

DanielMao1 commented 3 weeks ago

Hello! : ) I am working on some feature engineering tasks and I find your great job. But I encountered some API limitations. Such APIs are supported by clickhouse and duckdb, but not supported by chDB. So I would like to discuss this issue.

Use case

When I using clickhouse, my use case is like:

import pandas as pd
import clickhouse_connect

# 1. load some dataframes
ads = pd.DataFrame({
    'ad_id': [1, 2, 3, 4, 5],
    'userid': [101, 102, 103, 104, 105]
})

users = pd.DataFrame({
    'userid': [101, 102, 103, 106, 107]
})

pageviews = pd.DataFrame({
    'userid': [101, 102, 103, 101, 105],
    'platform': [1, 2, 1, 2, 1]
})
# 2. import dataframes into clickhouse
client = clickhouse_connect.get_client(host='localhost')

client.command('DROP TABLE IF EXISTS ads')
client.command('DROP TABLE IF EXISTS users')
client.command('DROP TABLE IF EXISTS pageviews')

client.command('CREATE TABLE ads (ad_id Int64, userid Int64) ENGINE = Memory')
client.command('CREATE TABLE users (userid Int64) ENGINE = Memory')
client.command('CREATE TABLE pageviews (userid Int64, platform Int64) ENGINE = Memory')

client.insert_df('ads', ads)
client.insert_df('users', users)
client.insert_df('pageviews', pageviews)
# generate temporary results
temp_query = """
SELECT
    ads.ad_id,
    ads.userid
FROM ads
JOIN users ON ads.userid = users.userid
"""

client.command('DROP TABLE IF EXISTS temp_ads_users')
client.command('CREATE TABLE temp_ads_users ENGINE = Memory AS ' + temp_query)

# generate final results
final_query = """
SELECT
    temp_ads_users.ad_id,
    mode() AS most_common_platform
FROM temp_ads_users
JOIN pageviews ON temp_ads_users.userid = pageviews.userid
GROUP BY temp_ads_users.ad_id
"""

result_df = client.query_df(final_query)
print(result_df)

Describe the solution you'd like

*In chDB, I tried the API for a whole day but I can not find any APIs like clickhouse's insert_df nor any APIs like duckdb's `duckdb.sql("INSERT INTO my_table BY NAME SELECT FROM my_df")`. I woule like such APIs so that I can import my data into servers, and reuse my analysis query.** I do know if it is easy to wrap up on clickhouse c++ library or it is to be implemented in c++. (By the way, the clickhouse insert df is not efficient, that's the reason I try chdb)

Describe alternatives you've considered

I looked the official example do the sql look like this, but the problem is my query may consists of create temp table statement, which cannot be done by chdb.dataframe (It can only query but not create). Instead, I would like to insert the dataframe into db (logically in a session), and in this session I can do many SQLs.

import chdb.dataframe as cdf
import pandas as pd

employees = pd.read_csv("employees.csv")
departments = pd.read_csv("departments.csv")

query_sql = """
select
  emp_id, first_name,
  dep.name as dep_name,
  salary
from __emp__ as emp
    join __dep__ as dep using(dep_id)
order by salary desc;
"""

res = cdf.query(sql=query_sql, emp=employees, dep=departments)
print(res, end="")

Additional context

Look forward to you reply : )

l1t1 commented 2 weeks ago

create temporary table seems works, but cannot select it.

>>> from chdb.session import Session
>>>
>>> db = Session()
>>> db.query("create database db")

>>> db.query("use db")
>>> x="create temporary table data (id UInt32, x UInt32) engine MergeTree order by id sample by id as select number+1 as id, randUniform(1, 100) as x from numbers(10000);"
>>> db.query(x)                                                                         
>>> y='select avg(x) as "avg", round(quantile(0.95)(x), 2) as p95 from data sample 0.1;'>>> db.query(y)
Code: 60. DB::Exception: Table db.data does not exist. (UNKNOWN_TABLE)

>>> x="create table data (id UInt32, x UInt32) engine MergeTree order by id sample by id as select number+1 as id, randUniform(1, 100) as x from numbers(10000);"
>>> db.query(x)                                                                         
>>> db.query(y)                                                                         50.2891,95
l1t1 commented 2 weeks ago

try Query on Pandas DataFrame of https://clickhouse.com/docs/en/chdb/install/python

import chdb
import pandas as pd
df = pd.DataFrame(
    {
        "a": [1, 2, 3, 4, 5, 6],
        "b": ["tom", "jerry", "auxten", "tom", "jerry", "auxten"],
    }
)

chdb.query("SELECT b, sum(a) FROM Python(df) GROUP BY b ORDER BY b").show()
auxten commented 2 weeks ago

try Query on Pandas DataFrame of https://clickhouse.com/docs/en/chdb/install/python

import chdb
import pandas as pd
df = pd.DataFrame(
    {
        "a": [1, 2, 3, 4, 5, 6],
        "b": ["tom", "jerry", "auxten", "tom", "jerry", "auxten"],
    }
)

chdb.query("SELECT b, sum(a) FROM Python(df) GROUP BY b ORDER BY b").show()

chdb.dataframe is going to be deprecated soon Please use the example above to handle dataframe with chDB

l1t1 commented 2 weeks ago

still can't select from table

>>>
>>> chdb.query("create table a engine=Memory as SELECT b, sum(a) FROM Python(df) GROUP BY b ORDER BY b").show()

>>> chdb.query("select * from a").show()                                                Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/dist-packages/chdb/__init__.py", line 78, in query
    raise ChdbError(res.error_message())
chdb.ChdbError: Code: 60. DB::Exception: Unknown table expression identifier 'a' in scope SELECT * FROM a. (UNKNOWN_TABLE)
l1t1 commented 2 weeks ago

when use session, the result is empty.

>>> from chdb.session import Session
>>> db = Session()
>>> db.query("create database db")

>>> db.query("use db")

>>> import chdb
>>> import pandas as pd
>>> df = pd.DataFrame(
...     {
...         "a": [1, 2, 3, 4, 5, 6],
...         "b": ["tom", "jerry", "auxten", "tom", "jerry", "auxten"],
...     }
... )
>>> db.query("create table a engine=Memory as SELECT b, sum(a) FROM Python(df) GROUP BY b ORDER BY b").show()

>>> db.query("select * from a").show()
l1t1 commented 2 weeks ago

engine=Log works. engine=Memory did not insert rows

>>> db.query("select count() from a")
0

>>> db.query("create table a1 engine=Log as SELECT b, sum(a) FROM Python(df) GROUP BY b ORDER BY b").show()

>>> db.query("select count() from a1")                                                  3

>>> db.query("select * from a1")                                                        "auxten",9
"jerry",7
"tom",5