ak-coram / cl-duckdb

Common Lisp CFFI wrapper around the DuckDB C API
https://github.com/ak-coram/cl-duckdb/blob/main/README.org
MIT License
36 stars 1 forks source link

How to read in data in batches or in columns? #24

Closed sirherrbatka closed 1 year ago

sirherrbatka commented 1 year ago

Hello, I am working on vellum-duckdb right now. I am wondering if there is way to not load the whole output of the query at once. My initial, proof of concept copy-from looks like this:

https://github.com/sirherrbatka/vellum-duckdb/blob/main/src/code.lisp

ak-coram commented 1 year ago

@sirherrbatka: This is on my TODO list, but I didn't get around to working out how the API should look like yet. We currently use data chunks for efficiency reasons instead of the row-based API functions, so you're basically loading 1024 (default DuckDB vector size) values from of one column at a time.

https://github.com/ak-coram/cl-duckdb/blob/c5b4a776bd4ae0f6f27132da3c0c2289db74ffc7/duckdb.lisp#L372

Basically the translate-result, translate-chunk and translate-vector functions are responsible for all of this (see duckdb.lisp), so maybe they could be adapted to fit your use-case. I'm also open for suggestions on what higher-level API you'd like to see for this in cl-duckdb.

ak-coram commented 1 year ago

Of course SQL LIMIT and OFFSET can be used as well to load parts of the results, but I imagine this is not what you're looking for.

sirherrbatka commented 1 year ago

Nah, melding with query itself seems a bit.

ak-coram commented 1 year ago

@sirherrbatka: could you please describe exactly what you're trying to achieve? Hopefully we can work something out fairly quickly :)

sirherrbatka commented 1 year ago

Right now I want to copy query results into the vellum data frame, so I can get support for parquet loading and other goodies offered duckdb. The thing is, I want to avoid wasting my memory. I will figure out what else I can do after I learn more.

ak-coram commented 1 year ago

Maybe you could use the lower-level API functions to directly write these results into your data frame: I recommend checking out the functions I mentioned (translate-result, etc.) for an example on how they work. If you need additional symbols exported from the library for this just let me know.

sirherrbatka commented 1 year ago

I will try, thanks!

sirherrbatka commented 1 year ago

Ok, it works. duckdb:handle probably should be exported from the package… How are NULLs marked in cl-duckdb?

ak-coram commented 1 year ago

NULLs are converted to nil, but so are boolean false values. Please see this section of the docs:

https://github.com/ak-coram/cl-duckdb#nil-as-boolean-false-vs-nil-as-null

The check for NULL values happens in duckdb::translate-vector via duckdb-api:duckdb-validity-row-is-valid:

https://duckdb.org/docs/api/c/data_chunk.html#duckdb_validity_row_is_valid

sirherrbatka commented 1 year ago

Ok! I just wanted to say that I am fond of the postmodern solution to this problem, namely: ALWAYS use symbol :NULL to indicate NULL value. That's a nice solution because it sidesteps "can't tell false from null" issue you also have stumbled upon.