chdb-io / chdb

chDB is an in-process OLAP SQL Engine 🚀 powered by ClickHouse
https://clickhouse.com/chdb
Apache License 2.0
2.13k stars, 75 forks

Python function bindings #43

Open danthegoodman1 opened 1 year ago

danthegoodman1 commented 1 year ago

If we could bind Python functions the way DuckDB does (https://duckdb.org/docs/api/python/function.html), that would be extremely useful for writing custom functions.

I'd love to see support for functions, aggregate functions, and table functions.

lmangani commented 1 year ago

Hello @danthegoodman1, and thanks for the proposal. ClickHouse supports user-defined functions and external UDFs, so this should eventually be possible. We probably need to address https://github.com/chdb-io/chdb/issues/26 first.

Note: one potential way to achieve this would be to add some helpers that temporarily mount any such function (not limited to Python) as a memfd endpoint, like we do for virtual tables, and use it as a UDF.

lmangani commented 1 year ago

Found a related issue while experimenting: https://github.com/ClickHouse/ClickHouse/issues/51089

auxten commented 1 year ago

Thanks for the advice. UDF in Python will be super useful. I will do some exploration. The simple target is supporting something like:

import chdb

def udf_add(x: int, y: int) -> int:
    return x+y

chdb.reg_udf('add', udf_add)
chdb.query('select add(1, 2)')

lmangani commented 1 year ago

Relevant topic presentation: https://presentations.clickhouse.com/meetup74/ai/#cover

lmangani commented 1 year ago

Update on this thread:

danthegoodman1 commented 1 year ago

@lmangani is there an example we can link and close with?

lmangani commented 1 year ago

@danthegoodman1 this was just a blocker being removed. There is still quite a bit of work to be done before it's usable, stay tuned!

auxten commented 1 year ago

The UDF feature is merged (#100)! UDAFs and table functions are on the way.

danthegoodman1 commented 1 year ago

Parentheses on the decorator seem atypical? Keep in mind I’m not a Python expert :)

danthegoodman1 commented 1 year ago

What’s the behavior on non-native types like UInt128?

auxten commented 1 year ago

chdb_udf takes a return_type argument. Sure, even UInt256 is OK.

from chdb.udf import chdb_udf
from chdb import query

@chdb_udf(return_type="UInt256")
def sum_udf(lhs, rhs):
    return int(lhs) + int(rhs)

print(query("select sum_udf(12,22)", "Debug"))
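
On the parentheses question above: in Python, a decorator that accepts arguments is usually written as a function that returns the real decorator, which is why it is called with parentheses. Below is a minimal sketch of that pattern. The name typed_udf, its default of "String", and the return_type attribute are hypothetical stand-ins for illustration only; this is not how chdb_udf is implemented.

```python
import functools


def typed_udf(return_type="String"):
    """Hypothetical decorator factory (not chdb's implementation)."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args):
            return func(*args)

        # Remember the declared type so a registry could look it up later.
        wrapper.return_type = return_type
        return wrapper

    return decorator


@typed_udf(return_type="UInt256")
def sum_udf(lhs, rhs):
    return int(lhs) + int(rhs)


@typed_udf()  # the call is still needed when relying on the default
def concat_udf(lhs, rhs):
    return f"{lhs}{rhs}"


print(sum_udf("12", "22"), sum_udf.return_type)
print(concat_udf("a", "b"), concat_udf.return_type)
```

The outer call runs first and returns the decorator, which is then applied to the function. Without the parentheses, the function itself would be passed in as return_type.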

Declaring UInt256 or UInt128 seems to be OK, but when a large number is returned, I get an error like the following:

root@154aa2d3682d:/# python test_udf.py
2023.09.05 06:34:16.363767 [ 446 ] {} <Debug> Application: Working directory created: /tmp/clickhouse-local-446-1693895656-10147068596235331766
Setting up /tmp/clickhouse-local-446-1693895656-10147068596235331766/tmp/ to store temporary data in it
Added users.xml access storage 'users.xml', path: 
Loading config file '/tmp/tmpewo8quh0/udf_config.xml'.
2023.09.05 06:34:16.374282 [ 446 ] {} <Debug> ConfigProcessor: Processing configuration file '/tmp/tmpewo8quh0/udf_config.xml'.
2023.09.05 06:34:16.374543 [ 446 ] {} <Debug> ConfigProcessor: Saved preprocessed configuration to '/tmp/clickhouse-local-446-1693895656-10147068596235331766/preprocessed_configs/_tmp_tmpewo8quh0_udf_config.xml'.
Will load 'sum_udf' because always_load_everything flag is set.
Will load the object 'sum_udf' immediately, force = false, loading_id = 1
Start loading object 'sum_udf'
Supposed update time for 'sum_udf' is never (loaded, lifetime 0)
Next update time for 'sum_udf' was set to 294247-01-10 04:00:54
00000000-0000-0000-0000-0000000001be Authenticating user 'default' from 127.0.0.1:0
00000000-0000-0000-0000-0000000001be Authenticated with global context as user 94309d50-4f52-5250-31bd-74fecac179db
00000000-0000-0000-0000-0000000001be Creating session context with user_id: 94309d50-4f52-5250-31bd-74fecac179db
Settings: readonly = 0, allow_ddl = true, allow_introspection_functions = false
List of all grants: GRANT SHOW, SELECT, INSERT, ALTER, CREATE, DROP, UNDROP TABLE, TRUNCATE, OPTIMIZE, BACKUP, KILL QUERY, KILL TRANSACTION, MOVE PARTITION BETWEEN SHARDS, SYSTEM, dictGet, displaySecretsInShowAndSelect, INTROSPECTION, SOURCES, CLUSTER ON *.*
List of all grants including implicit: GRANT SHOW, SELECT, INSERT, ALTER, CREATE, DROP, UNDROP TABLE, TRUNCATE, OPTIMIZE, BACKUP, KILL QUERY, KILL TRANSACTION, MOVE PARTITION BETWEEN SHARDS, SYSTEM, dictGet, displaySecretsInShowAndSelect, INTROSPECTION, SOURCES, CLUSTER ON *.*
select sum_udf(340282366920938463463374607431768211455,22)
00000000-0000-0000-0000-0000000001be Creating query context from session context, user_id: 94309d50-4f52-5250-31bd-74fecac179db, parent context user: default
Query span trace_id for opentelemetry log: 00000000-0000-0000-0000-000000000000
(from 0.0.0.0:0, user: ) select sum_udf(340282366920938463463374607431768211455,22) (stage: Complete)
2023.09.05 06:34:16.375159 [ 446 ] {788933f1-ab23-454d-95c9-ae0845eeb46f} <Trace> ShellCommand: Will start shell command '/tmp/tmpewo8quh0/sum_udf.py' with arguments '/tmp/tmpewo8quh0/sum_udf.py'
2023.09.05 06:34:16.378851 [ 446 ] {788933f1-ab23-454d-95c9-ae0845eeb46f} <Trace> ShellCommand: Started shell command '/tmp/tmpewo8quh0/sum_udf.py' with pid 455
2023.09.05 06:34:16.378975 [ 446 ] {788933f1-ab23-454d-95c9-ae0845eeb46f} <Trace> ParallelParsingInputFormat: Parallel parsing is used
2023.09.05 06:34:16.394621 [ 446 ] {788933f1-ab23-454d-95c9-ae0845eeb46f} <Trace> ShellCommand: Try wait for shell command pid 455 with timeout 10
Code: 1. DB::Exception: Function 'sum_udf': wrong result, expected 1 row(s), actual 0: While processing sum_udf(3.402823669209385e38, 22). (UNSUPPORTED_METHOD) (version 23.6.1.1) (from 0.0.0.0:0) (in query: select sum_udf(340282366920938463463374607431768211455,22)), Stack trace (when copying this message, always include the lines below):

0. Poco::Exception::Exception(String const&, int) @ 0x00000000189dc45a in /usr/local/lib/python3.11/site-packages/chdb/_chdb.cpython-311-x86_64-linux-gnu.so
1. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x00000000104411d5 in /usr/local/lib/python3.11/site-packages/chdb/_chdb.cpython-311-x86_64-linux-gnu.so
2. DB::Exception::Exception<String, unsigned long&, unsigned long&>(int, FormatStringHelperImpl<std::type_identity<String>::type, std::type_identity<unsigned long&>::type, std::type_identity<unsigned long&>::type>, String&&, unsigned long&, unsigned long&) @ 0x0000000012ff7001 in /usr/local/lib/python3.11/site-packages/chdb/_chdb.cpython-311-x86_64-linux-gnu.so
3. ? @ 0x00000000168ce30a in /usr/local/lib/python3.11/site-packages/chdb/_chdb.cpython-311-x86_64-linux-gnu.so
4. DB::IFunction::executeImplDryRun(std::vector<DB::ColumnWithTypeAndName, std::allocator<DB::ColumnWithTypeAndName>> const&, std::shared_ptr<DB::IDataType const> const&, unsigned long) const @ 0x000000000af709ea in /usr/local/lib/python3.11/site-packages/chdb/_chdb.cpython-311-x86_64-linux-gnu.so
5. DB::FunctionToExecutableFunctionAdaptor::executeDryRunImpl(std::vector<DB::ColumnWithTypeAndName, std::allocator<DB::ColumnWithTypeAndName>> const&, std::shared_ptr<DB::IDataType const> const&, unsigned long) const @ 0x000000000af706ae in /usr/local/lib/python3.11/site-packages/chdb/_chdb.cpython-311-x86_64-linux-gnu.so
6. DB::IExecutableFunction::executeWithoutLowCardinalityColumns(std::vector<DB::ColumnWithTypeAndName, std::allocator<DB::ColumnWithTypeAndName>> const&, std::shared_ptr<DB::IDataType const> const&, unsigned long, bool) const @ 0x00000000145fcd51 in /usr/local/lib/python3.11/site-packages/chdb/_chdb.cpython-311-x86_64-linux-gnu.so
7. DB::IExecutableFunction::defaultImplementationForConstantArguments(std::vector<DB::ColumnWithTypeAndName, std::allocator<DB::ColumnWithTypeAndName>> const&, std::shared_ptr<DB::IDataType const> const&, unsigned long, bool) const @ 0x00000000145fc8c5 in /usr/local/lib/python3.11/site-packages/chdb/_chdb.cpython-311-x86_64-linux-gnu.so
8. DB::IExecutableFunction::executeWithoutLowCardinalityColumns(std::vector<DB::ColumnWithTypeAndName, std::allocator<DB::ColumnWithTypeAndName>> const&, std::shared_ptr<DB::IDataType const> const&, unsigned long, bool) const @ 0x00000000145fccf5 in /usr/local/lib/python3.11/site-packages/chdb/_chdb.cpython-311-x86_64-linux-gnu.so
9. DB::IExecutableFunction::executeWithoutSparseColumns(std::vector<DB::ColumnWithTypeAndName, std::allocator<DB::ColumnWithTypeAndName>> const&, std::shared_ptr<DB::IDataType const> const&, unsigned long, bool) const @ 0x00000000145fd5f5 in /usr/local/lib/python3.11/site-packages/chdb/_chdb.cpython-311-x86_64-linux-gnu.so
10. DB::IExecutableFunction::execute(std::vector<DB::ColumnWithTypeAndName, std::allocator<DB::ColumnWithTypeAndName>> const&, std::shared_ptr<DB::IDataType const> const&, unsigned long, bool) const @ 0x00000000145fe67b in /usr/local/lib/python3.11/site-packages/chdb/_chdb.cpython-311-x86_64-linux-gnu.so
11. DB::ActionsDAG::addFunctionImpl(std::shared_ptr<DB::IFunctionBase const> const&, std::vector<DB::ActionsDAG::Node const*, std::allocator<DB::ActionsDAG::Node const*>>, std::vector<DB::ColumnWithTypeAndName, std::allocator<DB::ColumnWithTypeAndName>>, String, std::shared_ptr<DB::IDataType const>, bool) @ 0x0000000014c8d98f in /usr/local/lib/python3.11/site-packages/chdb/_chdb.cpython-311-x86_64-linux-gnu.so
12. DB::ActionsDAG::addFunction(std::shared_ptr<DB::IFunctionOverloadResolver> const&, std::vector<DB::ActionsDAG::Node const*, std::allocator<DB::ActionsDAG::Node const*>>, String) @ 0x0000000014c8d131 in /usr/local/lib/python3.11/site-packages/chdb/_chdb.cpython-311-x86_64-linux-gnu.so
13. DB::ScopeStack::addFunction(std::shared_ptr<DB::IFunctionOverloadResolver> const&, std::vector<String, std::allocator<String>> const&, String) @ 0x0000000014e4d347 in /usr/local/lib/python3.11/site-packages/chdb/_chdb.cpython-311-x86_64-linux-gnu.so
14. DB::ActionsMatcher::Data::addFunction(std::shared_ptr<DB::IFunctionOverloadResolver> const&, std::vector<String, std::allocator<String>> const&, String) @ 0x0000000014e5a9f3 in /usr/local/lib/python3.11/site-packages/chdb/_chdb.cpython-311-x86_64-linux-gnu.so
15. DB::ActionsMatcher::visit(DB::ASTFunction const&, std::shared_ptr<DB::IAST> const&, DB::ActionsMatcher::Data&) @ 0x0000000014e4f650 in /usr/local/lib/python3.11/site-packages/chdb/_chdb.cpython-311-x86_64-linux-gnu.so
16. DB::ActionsMatcher::visit(DB::ASTExpressionList&, std::shared_ptr<DB::IAST> const&, DB::ActionsMatcher::Data&) @ 0x0000000014e56889 in /usr/local/lib/python3.11/site-packages/chdb/_chdb.cpython-311-x86_64-linux-gnu.so
17. DB::InDepthNodeVisitor<DB::ActionsMatcher, true, false, std::shared_ptr<DB::IAST> const>::doVisit(std::shared_ptr<DB::IAST> const&) @ 0x0000000014e45135 in /usr/local/lib/python3.11/site-packages/chdb/_chdb.cpython-311-x86_64-linux-gnu.so
18. DB::ExpressionAnalyzer::getRootActions(std::shared_ptr<DB::IAST> const&, bool, std::shared_ptr<DB::ActionsDAG>&, bool) @ 0x0000000014e26c71 in /usr/local/lib/python3.11/site-packages/chdb/_chdb.cpython-311-x86_64-linux-gnu.so
19. DB::SelectQueryExpressionAnalyzer::appendSelect(DB::ExpressionActionsChain&, bool) @ 0x0000000014e31c50 in /usr/local/lib/python3.11/site-packages/chdb/_chdb.cpython-311-x86_64-linux-gnu.so
20. DB::ExpressionAnalysisResult::ExpressionAnalysisResult(DB::SelectQueryExpressionAnalyzer&, std::shared_ptr<DB::StorageInMemoryMetadata const> const&, bool, bool, bool, std::shared_ptr<DB::FilterDAGInfo> const&, std::shared_ptr<DB::FilterDAGInfo> const&, DB::Block const&) @ 0x0000000014e3780f in /usr/local/lib/python3.11/site-packages/chdb/_chdb.cpython-311-x86_64-linux-gnu.so
21. DB::InterpreterSelectQuery::getSampleBlockImpl() @ 0x0000000015628c8d in /usr/local/lib/python3.11/site-packages/chdb/_chdb.cpython-311-x86_64-linux-gnu.so
22. ? @ 0x00000000156229b5 in /usr/local/lib/python3.11/site-packages/chdb/_chdb.cpython-311-x86_64-linux-gnu.so
23. DB::InterpreterSelectQuery::InterpreterSelectQuery(std::shared_ptr<DB::IAST> const&, std::shared_ptr<DB::Context> const&, std::optional<DB::Pipe>, std::shared_ptr<DB::IStorage> const&, DB::SelectQueryOptions const&, std::vector<String, std::allocator<String>> const&, std::shared_ptr<DB::StorageInMemoryMetadata const> const&, std::shared_ptr<DB::PreparedSets>) @ 0x000000001561d806 in /usr/local/lib/python3.11/site-packages/chdb/_chdb.cpython-311-x86_64-linux-gnu.so
24. DB::InterpreterSelectQuery::InterpreterSelectQuery(std::shared_ptr<DB::IAST> const&, std::shared_ptr<DB::Context> const&, DB::SelectQueryOptions const&, std::vector<String, std::allocator<String>> const&) @ 0x000000001561a957 in /usr/local/lib/python3.11/site-packages/chdb/_chdb.cpython-311-x86_64-linux-gnu.so
25. DB::InterpreterSelectWithUnionQuery::buildCurrentChildInterpreter(std::shared_ptr<DB::IAST> const&, std::vector<String, std::allocator<String>> const&) @ 0x00000000156b12c6 in /usr/local/lib/python3.11/site-packages/chdb/_chdb.cpython-311-x86_64-linux-gnu.so
26. DB::InterpreterSelectWithUnionQuery::InterpreterSelectWithUnionQuery(std::shared_ptr<DB::IAST> const&, std::shared_ptr<DB::Context>, DB::SelectQueryOptions const&, std::vector<String, std::allocator<String>> const&) @ 0x00000000156af6bf in /usr/local/lib/python3.11/site-packages/chdb/_chdb.cpython-311-x86_64-linux-gnu.so
27. ? @ 0x000000001597cfd5 in /usr/local/lib/python3.11/site-packages/chdb/_chdb.cpython-311-x86_64-linux-gnu.so
28. DB::InterpreterFactory::get(std::shared_ptr<DB::IAST>&, std::shared_ptr<DB::Context>, DB::SelectQueryOptions const&) @ 0x000000001597c735 in /usr/local/lib/python3.11/site-packages/chdb/_chdb.cpython-311-x86_64-linux-gnu.so
29. ? @ 0x000000001595b6da in /usr/local/lib/python3.11/site-packages/chdb/_chdb.cpython-311-x86_64-linux-gnu.so
30. DB::executeQuery(String const&, std::shared_ptr<DB::Context>, bool, DB::QueryProcessingStage::Enum) @ 0x0000000015958cbc in /usr/local/lib/python3.11/site-packages/chdb/_chdb.cpython-311-x86_64-linux-gnu.so
31. DB::LocalConnection::sendQuery(DB::ConnectionTimeouts const&, String const&, std::unordered_map<String, String, std::hash<String>, std::equal_to<String>, std::allocator<std::pair<String const, String>>> const&, String const&, unsigned long, DB::Settings const*, DB::ClientInfo const*, bool, std::function<void (DB::Progress const&)>) @ 0x00000000164c7583 in /usr/local/lib/python3.11/site-packages/chdb/_chdb.cpython-311-x86_64-linux-gnu.so

Peak memory usage (for query): 37.13 MiB.
Unloading 'sum_udf' because its configuration has been removed or detached
2023.09.05 06:34:16.532681 [ 446 ] {} <Debug> Application: Removing temporary directory: /tmp/clickhouse-local-446-1693895656-10147068596235331766
Code: 1. DB::Exception: Function 'sum_udf': wrong result, expected 1 row(s), actual 0: While processing sum_udf(3.402823669209385e38, 22). (UNSUPPORTED_METHOD)
2023.09.05 06:34:16.533059 [ 446 ] {} <Debug> Application: Uninitializing subsystem: Logging Subsystem
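
One plausible reading of the log, offered as a guess rather than a confirmed diagnosis: the exception message shows the first argument as 3.402823669209385e38, so the literal seems to have been turned into a floating-point value before it reached the UDF. If the script receives that text, int() would raise, the script would exit without printing a row, and that would match "expected 1 row(s), actual 0". The sketch below reproduces only the Python side, using the argument text copied from the log; the try_udf helper is hypothetical.

```python
# Argument text exactly as it appears in the exception message.
ARG_FROM_LOG = "3.402823669209385e38"
# The integer literal from the original query (2**128 - 1).
ORIGINAL = 340282366920938463463374607431768211455


def sum_udf(lhs, rhs):
    # Same body as the UDF in the comment above.
    return int(lhs) + int(rhs)


def try_udf(lhs, rhs):
    """Hypothetical helper: return the result, or the error text if int() fails."""
    try:
        return sum_udf(lhs, rhs)
    except ValueError as exc:
        return f"ValueError: {exc}"


failed = try_udf(ARG_FROM_LOG, "22")   # float-formatted text is rejected by int()
ok = try_udf(str(ORIGINAL), "22")      # plain decimal digits parse fine
lossy = int(float(ARG_FROM_LOG))       # going through float cannot recover the value

print(failed)
print(ok)
print(lossy == ORIGINAL)
```

If this reading is right, the value loses its integer form before the UDF sees it, so the declared return type would not be the cause of the failure.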