Support DuckDB as computation backend

Kotlin / dataframe

Structured data processing in Kotlin

https://kotlin.github.io/dataframe/overview.html

Apache License 2.0

799 stars 55 forks source link

Support DuckDB as computation backend #186

Open mmerdes opened 1 year ago

mmerdes commented 1 year ago

If I understand Kotlin dataframes correctly, computations are done directly in the JVM. For many use-cases, the optimized data storage and vectorized computations of DuckDB could be very useful in terms of performance and memory consumption.

So maybe there could be a way to move computations to DuckDB via Apache Arrow, which offers a columnar memory format, zero-copy reads, streaming, and an extensive Java API.

The goal would be to perform analytical computations against DuckDB directly from Kotlin in a way similar to the one described here for Python and R: https://arrow.apache.org/blog/2021/12/03/arrow-duckdb/

Looking forward to hear what you think about the feasibility of such an approach.

koperagen commented 1 year ago

Hi!

If I understand Kotlin dataframes correctly, computations are done directly in the JVM.

Yes

Looking forward to hear what you think about the feasibility of such an approach.

First of all, thank you for sharing the post about DuckDB. I didn't know about it. I'm aware of Apache Arrow and we indeed consider implementing other backend as a long term goal of the library, so this information should be useful :) Although, i wouldn't expect other backends to be implemented in near future as we are more focused on improving experience for in-memory computation use case and experimenting with more advanced tooling

mmerdes commented 1 year ago

@koperagen, Thank you for the super-fast feedback!

koperagen commented 1 year ago

Hi again. I was exploring posibilities for different computational backends and couldn't fully understand purpose and implemenetation of integration between Apache Arrow and duckDB. I assume scope of these libraries intersect greatly, but providing zero-copy integration enables some convenience..? If you use Apache Arrow <-> duckDB integration and call something like filter, result will be computed by Arrow? Also, i couldn't find if this integration is available in Java API, only Python. Is that true? @mmerdes

mmerdes commented 1 year ago

hi @koperagen, sorry for the delay! yes, Arrow does have support for Java: https://arrow.apache.org/docs/java/

mmerdes commented 1 year ago

If I am not mistaken, Arrow is more about the storage layout/format (a columnar format optimized for computations) than about the computations themselves. This format is the same for storing in memory and transmitting over the network. So in case of huge data sets, a lot of time is saved by not (de)serializing data when crossing the border between applications and even between local memory and network buffers.

mmerdes commented 1 year ago

So while arrow greatly speeds up transferring the data around the computation is still done in the respective environment (e.g., DuckDB in your original example). Hope this helps.

For more solid advice, it might be helpful to contact the Apache Arrow people. I could imagine they would be interested in more integrations and able to give some pointers.

koperagen commented 1 year ago

Your explanations made things a lot clearer, thank you. I'll continue researching this topic :)

koperagen commented 1 year ago

I found that DuckDB Java API also offers export / import of data in Apache Arrow format. So, if we manage to fit DataFrame model into Apache Arrow format, it should be possible to move your data between duckdb / kotlin code without copying. This way one can compute something in duckdb, then move to Kotlin and perform some operations, then again offload heavy computation to duckdb. You're proposing this kind of integration or something else? Maybe you have some specific use cases that can be enabled with it?

mmerdes commented 1 year ago

Yes, along these lines. Key is the copy-less transfer between the environments which allows for selecting the optimal computation environment for the task at hand. This should be generally useful beyond specific use cases.