chdb-io / chdb

chDB is an in-process OLAP SQL Engine 🚀 powered by ClickHouse
https://clickhouse.com/chdb
Apache License 2.0
2.13k stars 75 forks source link

Long living client #108

Closed Dennitz closed 11 months ago

Dennitz commented 1 year ago

As requested in Discord, adding my draft implementation for a long-living client.

This branch is based off an older revision, e503e5b7f49ab8025caace75c255f138dbc508fc to be precise. So there are conflicts that would need to be resolved.

Also, the code for the chdb Python library is currently commented out, as I'm only using libchdb personally and wanted to avoid compile errors.

Please see this PR as a starting point or just as inspiration, and feel free to make any necessary changes that would be needed to get this integrated.


The main difference of my fork is that it allows creating a “chdb instance” (a LocalServer object), and reusing that for multiple queries.

The difference can probably be seen best based on the new interface:

chdb_init_result * chdb_connect(int argc, char ** argv);
void chdb_disconnect(ChdbLocalServerPtr obj);
chdb_local_result * chdb_query(ChdbLocalServerPtr obj, char * query, char * format);
void chdb_free_result(ChdbLocalServerPtr obj, chdb_local_result * result);

You'd call chdb_connect once which returns a pointer to a client. You'd then call chdb_query (each time followed by chdb_free_result) any number of times, passing in the pointer to that client. And at the end call chdb_disconnect.

I implemented this mainly because it means queries finish quite a bit faster in certain cases. But a nice side effect is that this also adds support for stateful queries like USE some_db.

How do queries get faster? I noticed that the cleanup function of LocalServer can take quite a bit of time. And this gets amplified with each additional created MergeTree table. In my tests (on M1 Mac), each (empty) MergeTree table adds about 0.15s of cleanup time, even when the table isn’t actually used in the respective query. This happens also in clickhouse-local, so it’s not specific to chdb.

The cleanup function normally gets executed after each query before returning results in chdb. In the fork it gets executed only when destroying the instance, i.e. when calling chdb_disconnect. So on each individual query, there is no need to wait for cleanup.

I guess this is essentially the same difference as between running clickhouse local --path . --query "select 1" (cleanup right after the query), and running the query in an interactive session started with clickhouse local --path . (cleanup only when killing the session).


This PR (unfortunately) also contains some other changes, as it would have been hard to separate those now.

The other changes are:

Dennitz commented 1 year ago

Also in case it's useful, I use this in Rust through this binding:

use std::ffi::{CStr, CString};
use std::os::raw as libc;

use log::debug;

use crate::Error::GenericError;

#[repr(C)]
struct chdb_local_result {
    buf: *mut libc::c_char,
    len: usize,
    error_message: *mut libc::c_char,
}

#[repr(C)]
struct chdb_init_result
{
    local_server: *mut libc::c_void,
    error_message: *mut libc::c_char,
}

#[link(name = "chdb")]
extern "C" {
    fn chdb_free_result(local_server: *mut libc::c_void, result: *mut chdb_local_result);
    fn chdb_connect(argc: i32, argv: *const *const libc::c_char) -> *mut chdb_init_result;
    fn chdb_disconnect(local_server: *mut libc::c_void);
    fn chdb_query(
        local_server: *mut libc::c_void,
        query: *const libc::c_char,
        format: *const libc::c_char,
    ) -> *mut chdb_local_result;
}

pub struct RawChdbClient {
    local_server: *mut libc::c_void,
}

impl RawChdbClient {
    pub fn connect(path: &str) -> crate::Result<RawChdbClient> {
        let argv: [*const libc::c_char; 3] = [
            CString::new("clickhouse").unwrap().into_raw(),
            CString::new("--multiquery").unwrap().into_raw(),
            CString::new(format!("--path={}", path)).unwrap().into_raw(),
        ];

        let result = unsafe { chdb_connect(3, argv.as_ptr()) };

        unsafe {
            drop(CString::from_raw(argv[0] as *mut libc::c_char));
            drop(CString::from_raw(argv[1] as *mut libc::c_char));
            drop(CString::from_raw(argv[2] as *mut libc::c_char));
        }

        if result.is_null() {
            debug!("connect returned nullptr");
            return Err(GenericError("Could not connect to chdb".to_string()));
        }

        unsafe {
            if !(*result).error_message.is_null() {
                let error = CStr::from_ptr((*result).error_message).to_string_lossy().into_owned();
                debug!("result has error message: {}", error);
                return Err(GenericError(error));
            }
        }

        unsafe {
            if (*result).local_server.is_null() {
                return Err(GenericError("Could not connect to chdb, nullptr returned".to_string()));
            }
        }

        Ok(RawChdbClient { local_server: unsafe { (*result).local_server } })
    }

    pub fn query(&self, query: &str, format: &str) -> crate::Result<Option<Vec<u8>>> {
        debug!("Rust: starting chdb query process for query: {}", query);
        let query_c = CString::new(query).unwrap();
        let format_c = CString::new(format).unwrap();
        let result = unsafe {
            chdb_query(self.local_server, query_c.as_ptr(), format_c.as_ptr())
        };
        debug!("Rust: running query done.");

        if result.is_null() {
            debug!("result is null");
            return Ok(None);
        }

        unsafe {
            if !(*result).error_message.is_null() {
                let error = CStr::from_ptr((*result).error_message).to_string_lossy().into_owned();
                chdb_free_result(self.local_server, result);
                debug!("result has error messsage: {}", error);
                return Err(GenericError(error));
            }
        }

        unsafe {
            if (*result).buf.is_null() {
                chdb_free_result(self.local_server, result);
                return Err(GenericError("Unknown error occurred".to_string()));
            }
        }

        let slice = unsafe { std::slice::from_raw_parts((*result).buf as *const u8, (*result).len) };
        let output = slice.to_vec();

        unsafe { chdb_free_result(self.local_server, result) };

        Ok(Some(output))
    }

    pub fn disconnect(&self) {
        unsafe {
            chdb_disconnect(self.local_server)
        }
    }
}

Where crate::Result is a type wrapping a custom Error from my project:

#[derive(Debug, thiserror::Error)]
pub enum Error {
    #[error("{0}")]
    GenericError(String),
    // More errors here...
}

pub type Result<T> = std::result::Result<T, Error>;
lmangani commented 1 year ago

@Dennitz thanks! 💪 This is useful. We'll most likely use this draft as a reference to reimplement the changes! As of the rust part, any chance you could merge your binding into chdb-rust to make it a proper crate for libchdb?

Dennitz commented 1 year ago

As of the rust part, any chance you could merge your binding into chdb-rust to make it a proper crate for libchdb?

Sure, once the other changes make it into chdb, I can do that.

Dennitz commented 1 year ago

One important thing to note: This implementation is not thread safe.

So all chdb_query and chdb_free_result calls and the chdb_disconnect call must be made from the same thread as the respective chdb_connect call.

CLAassistant commented 11 months ago

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
0 out of 3 committers have signed the CLA.

:x: lmangani
:x: nmreadelf
:x: Dennitz
You have signed the CLA already but the status is still pending? Let us recheck it.

auxten commented 11 months ago

the main branch is rebased, will try to use interactive clickhouse-local to optimize chdb later