mmerdes opened 1 year ago
Hi!

> If I understand Kotlin dataframes correctly, computations are done directly in the JVM.

Yes

> Looking forward to hearing what you think about the feasibility of such an approach.
First of all, thank you for sharing the post about DuckDB. I didn't know about it. I'm aware of Apache Arrow, and we do consider implementing other backends as a long-term goal of the library, so this information should be useful :) That said, I wouldn't expect other backends to be implemented in the near future, as we are more focused on improving the experience for the in-memory computation use case and on experimenting with more advanced tooling.
@koperagen, Thank you for the super-fast feedback!
Hi again. I was exploring possibilities for different computational backends and couldn't fully understand the purpose and implementation of the integration between Apache Arrow and DuckDB. I assume the scopes of these libraries overlap greatly, but providing zero-copy integration enables some convenience..? If you use the Apache Arrow <-> DuckDB integration and call something like filter, will the result be computed by Arrow? Also, I couldn't find whether this integration is available in the Java API, or only in Python. Is that true? @mmerdes
Hi @koperagen, sorry for the delay! Yes, Arrow does have support for Java: https://arrow.apache.org/docs/java/
If I am not mistaken, Arrow is more about the storage layout/format (a columnar format optimized for computations) than about the computations themselves. This format is the same for storing in memory and for transmitting over the network. So in the case of huge data sets, a lot of time is saved by not (de)serializing data when crossing the boundary between applications, and even between local memory and network buffers.
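To illustrate what that columnar layout looks like from the JVM side, here is a minimal sketch using Arrow's Java API from Kotlin (assuming the `arrow-vector` and `arrow-memory` artifacts are on the classpath; the column name is made up):

```kotlin
import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.IntVector

fun main() {
    // The allocator manages the off-heap buffers that hold the columnar data.
    RootAllocator().use { allocator ->
        IntVector("prices", allocator).use { vector ->
            vector.allocateNew(3)
            // Values are stored contiguously in Arrow's standard layout, so any
            // Arrow-aware consumer can read these buffers without (de)serialization.
            intArrayOf(10, 20, 30).forEachIndexed { i, v -> vector.setSafe(i, v) }
            vector.setValueCount(3)
            for (i in 0 until vector.valueCount) {
                println(vector.get(i))
            }
        }
    }
}
```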
So while Arrow greatly speeds up moving the data around, the computation is still done in the respective environment (e.g., DuckDB in your original example). Hope this helps.
For more solid advice, it might be helpful to contact the Apache Arrow people. I could imagine they would be interested in more integrations and able to give some pointers.
Your explanations made things a lot clearer, thank you. I'll continue researching this topic :)
I found that the DuckDB Java API also offers export/import of data in the Apache Arrow format. So, if we manage to fit the DataFrame model into the Apache Arrow format, it should be possible to move data between DuckDB and Kotlin code without copying. This way one can compute something in DuckDB, then move to Kotlin and perform some operations, then again offload heavy computation to DuckDB. Are you proposing this kind of integration, or something else? Maybe you have some specific use cases that it would enable?
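For reference, the export direction (DuckDB -> JVM) might look roughly like this from Kotlin. This is only a sketch: it assumes a recent DuckDB JDBC driver exposing `arrowExportStream` (which returns an object castable to Arrow's `ArrowReader`) plus the Arrow Java libraries on the classpath, and I haven't verified the exact signatures:

```kotlin
import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.ipc.ArrowReader
import org.duckdb.DuckDBResultSet
import java.sql.DriverManager

fun main() {
    // In-memory DuckDB database.
    DriverManager.getConnection("jdbc:duckdb:").use { conn ->
        conn.createStatement().use { stmt ->
            val rs = stmt.executeQuery("SELECT * FROM range(10)") as DuckDBResultSet
            RootAllocator().use { allocator ->
                // Hand DuckDB's result off as an Arrow stream: batches are
                // exposed to the JVM without row-by-row copying.
                (rs.arrowExportStream(allocator, 1024) as ArrowReader).use { reader ->
                    while (reader.loadNextBatch()) {
                        println(reader.vectorSchemaRoot.contentToTSVString())
                    }
                }
            }
        }
    }
}
```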
Yes, along these lines. The key is the zero-copy transfer between the environments, which allows selecting the optimal computation environment for the task at hand. This should be generally useful beyond specific use cases.
If I understand Kotlin dataframes correctly, computations are done directly in the JVM. For many use-cases, the optimized data storage and vectorized computations of DuckDB could be very useful in terms of performance and memory consumption.
So maybe there could be a way to move computations to DuckDB via Apache Arrow, which offers a columnar memory format, zero-copy reads, streaming, and an extensive Java API.
The goal would be to perform analytical computations against DuckDB directly from Kotlin in a way similar to the one described here for Python and R: https://arrow.apache.org/blog/2021/12/03/arrow-duckdb/
Looking forward to hearing what you think about the feasibility of such an approach.
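To make the proposal a bit more concrete, the opposite direction (letting DuckDB query Arrow data produced on the JVM side, analogous to the Python/R flow in that blog post) might be sketched as below. `registerArrowStream` on the DuckDB JDBC connection and the `arrow-c-data` classes are assumptions based on my reading of the two projects' Java APIs, not verified:

```kotlin
import org.apache.arrow.c.ArrowArrayStream
import org.apache.arrow.c.Data
import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.ipc.ArrowReader
import org.duckdb.DuckDBConnection
import java.sql.DriverManager

// 'reader' is any Arrow stream produced on the JVM side
// (e.g. a DataFrame exported to the Arrow format).
fun countWithDuckDB(reader: ArrowReader): Long {
    RootAllocator().use { allocator ->
        ArrowArrayStream.allocateNew(allocator).use { stream ->
            // Export the JVM-side stream through the Arrow C Data interface.
            Data.exportArrayStream(allocator, reader, stream)
            (DriverManager.getConnection("jdbc:duckdb:") as DuckDBConnection).use { conn ->
                // DuckDB now sees the Arrow data as a relation, without copying it.
                conn.registerArrowStream("kotlin_data", stream)
                conn.createStatement().use { stmt ->
                    stmt.executeQuery("SELECT count(*) FROM kotlin_data").use { rs ->
                        rs.next()
                        return rs.getLong(1)
                    }
                }
            }
        }
    }
}
```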