chdb-io / chdb

chDB is an in-process OLAP SQL Engine 🚀 powered by ClickHouse
https://clickhouse.com/docs/en/chdb
Apache License 2.0
2.03k stars 72 forks source link

Add support for stateful query #55

Closed djouallah closed 9 months ago

djouallah commented 1 year ago

can we have something like this

chdb.query("create or replace view xxx as select 42")

chdb.query(" select from xxx")

lmangani commented 1 year ago

Hello @djouallah

This is a moving target as we expand chdb feature, but here are a few ways off the top of my head.

Multiquery

chdb statements are multiquery, so if you logic allows you can execute your statements sequentially:

chdb.query("create or replace view xxx as select 42; select from xxx;")

Temporary File

You can use temporary dataframe tables to exchange data between calls using the virtual __table__ selector:

import chdb.dataframe as cdf
import pandas as pd
tbl = cdf.Table(dataframe=pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']}))
ret_tbl = tbl.query('select * from __table__')
print(ret_tbl)
print(ret_tbl.query('select b, sum(a) from __table__ group by b'))

Pipe

Last but not least, you can pipe data in/out of chdb executions using stdin/out (from the script or anything else)

import sys
import pandas as pd
import pyarrow as pa
import chdb
import subprocess

df = pd.DataFrame({'a': [2, 2], 'b': [4, 4]})
# convert Dataframe to Arrow stream
data = pa.BufferOutputStream()
pa.RecordBatchStreamWriter(data, pa.Table.from_pandas(df).schema).write_table(pa.Table.from_pandas(df))
data = data.getvalue()
query = '''
CREATE TABLE table ENGINE = File(ArrowStream, stdin); SELECT a + b as sum FROM table;
'''
# Spawn the subprocess and pass custom input from the parent process
def spawn_chdb(query, stdin, format):
    proc = subprocess.Popen(['python3', '-m', 'chdb', query, format], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    custom_input = data
    stdout = proc.communicate(input=stdin)[0].decode('utf-8')
    return stdout

# Call the function to spawn the subprocess with custom input
res = spawn_chdb(query, data, "Dataframe")
# Run the query
print(res)

Hope this helps. If I missed any other way, the team will fill in :)

djouallah commented 1 year ago

thanks, but that does not replace the functionality of a view, and please don't tag is as a question, it is a feature request.

lmangani commented 1 year ago

can we have something like this

that's a question right there 😄 if its a feature request, please add it to the title anytime.

djouallah commented 1 year ago

interesting, it seems to be working, at least i don't get errors but no result :) https://colab.research.google.com/drive/1o07UEUUAxcdR9WCdYt53mVZBvqqbqzxn#scrollTo=MDTt42NXEL5_

auxten commented 1 year ago

@djouallah There is an issue, the chdb default db is not saved in the session(default behavior in clickhouse-local). If you want to preserve it, you can create a new db and create table, view in the new db.

It might be wired for now, I can fix that in the v0.12 release. Any suggestion?

djouallah commented 1 year ago

ideally something like this will be great, I prefer not to hard code the db to the table name, and show method will be really nice for python compatibility

from chdb import session as chs
sess = chs.Session()
sess.query("CREATE DATABASE IF NOT EXISTS db_xxx ENGINE = Atomic")
sess.query(f'create view db_xxx.lineitem as select * from file("./{sf}/lineitem/*.parquet", Parquet)')
sess.query("use db_xxx")
sess.query("SELECT count(*) FROM lineitem").show()
auxten commented 1 year ago

Great suggestion about use db_xxx. But why ret.show() for the result, similar to print(ret)? Any reference who supports this?

djouallah commented 1 year ago

Spark, duckdb, datafusion, it is not a very big deal , just a personal preference, don't worry about :)

use db_xxx. will be really nice for SQL compatibility

djouallah commented 1 year ago

using 0.14.1, still does not works :(

lmangani commented 1 year ago

@djouallah this release was focused on UDF support, but don't worry use support is still on the way!