chdb-io / chdb

chDB is an in-process OLAP SQL Engine 🚀 powered by ClickHouse
https://clickhouse.com/chdb
Apache License 2.0

Question about UDF: is a calculation between two Python dicts supported? #118

Closed zhuzhuyan93 closed 8 months ago

zhuzhuyan93 commented 1 year ago

When I run this query:

cdf.query("select similarity(vec, vec2) as score from __tb1__", tb1=tbl3)  

I get: Code: 1. DB::Exception: Function 'similarity': wrong result, expected 8 row(s), actual 0: while executing 'FUNCTION similarity(vec :: 0, vec2 :: 1) -> similarity(vec, vec2) String : 2'. (UNSUPPORTED_METHOD)

The UDFs are defined as:

from chdb.udf import chdb_udf
import numpy as np    

@chdb_udf()
def similarity(a, b):
    a = dict(a) 
    a = {k: 0 if v is None else v for k, v in a.items()}
    b = dict(b) 
    b = {k: 0 if v is None else v for k, v in b.items()}
    if len(b) < len(a): 
        a, b = b, a
    res = 0
    for key, a_value in a.items():
        res += a_value * b.get(key, 0)
    z = (sum(map(lambda x: x * x, list(a.values()))) ** .5) * (sum(
        map(lambda x: x * x, list(b.values()))) ** .5)
    if res == 0 or z == 0:
        return 0.0
    else:
        return np.round(res / z, 6)

@chdb_udf()
def add2(v1, v2):
    return int(v1) + int(v2)    

similarity({'A': 1.0, 'B': None, 'C': 2.0, 'D': None}, {'B': 2, 'D': 0.5})

auxten commented 8 months ago

Decorator for chDB Python UDF (User Defined Function).

  1. The function should be stateless. So, only UDFs are supported, not UDAFs (User Defined Aggregation Functions).
  2. The default return type is String. To change it, pass the return type as an argument. It should be one of the ClickHouse data types: https://clickhouse.com/docs/en/sql-reference/data-types
  3. The function should take in arguments of type String. As the input is TabSeparated, all arguments are strings.
  4. The function will be called for each line of input. Something like this:

    import sys

    def sum_udf(lhs, rhs):
        return int(lhs) + int(rhs)

    # Each input line carries one row's arguments, separated by tabs.
    for line in sys.stdin:
        args = line.strip().split('\t')
        lhs = args[0]
        rhs = args[1]
        print(sum_udf(lhs, rhs))
        sys.stdout.flush()

  5. The function should be a pure Python function. You SHOULD import all Python modules used IN THE FUNCTION.

    def func_use_json(arg):
        import json
        ...

  6. The Python interpreter used is the same as the one used to run the script. It is obtained from sys.executable.
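Applied to the question, these rules suggest why `similarity` produced no rows: it calls `dict()` on string arguments and relies on a module-level `numpy` import. Below is a minimal sketch of a version that follows the rules. It assumes each argument reaches the function as a JSON object string, for example if the query serializes the columns first. That is an assumption, not something confirmed in this thread, and the actual TabSeparated rendering of the columns may differ. The decorator is left out so the snippet runs without chDB. In real use it would carry `@chdb_udf` with a Float64 return type passed as item 2 describes.

```python
def similarity(a, b):
    # Imports live inside the function body (item 5).
    import json
    import math

    def parse(text):
        # Arguments arrive as strings (item 3). Assumption: they are JSON objects.
        # JSON null is treated as 0, mirroring the original None handling.
        return {k: 0 if v is None else v for k, v in json.loads(text).items()}

    a, b = parse(a), parse(b)
    if len(b) < len(a):
        a, b = b, a  # iterate over the smaller dict
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values()))
    if dot == 0 or norm == 0:
        return 0.0
    return round(dot / norm, 6)


# Simulate one TabSeparated input line, as in the loop from item 4.
line = '{"A": 1.0, "B": null, "C": 2.0, "D": null}\t{"B": 2, "D": 0.5}\n'
args = line.rstrip("\n").split("\t")
print(similarity(args[0], args[1]))  # prints 0.0: the non-null keys do not overlap
```

The sketch uses the built-in `round` and `math.sqrt` instead of NumPy, so the function depends only on the standard library of the interpreter named in item 6.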