
[Feature Request]: Add support for DuckDB #25198

Open · rdjonsson opened this issue 3 weeks ago

rdjonsson commented 3 weeks ago

Add support for DuckDB

(Related to #17012, since DuckDB supports parquet files natively)

Description: DuckDB is an in-process database (similar to SQLite). Besides its SQL capabilities being useful for data preparation tasks, it also has impressive support for importing data from files, including CSV, JSON, and Parquet. Extensions provide even more import functions, including for many different geodata formats.

Is this issue currently blocking your progress? In one sense, not at all: I have not really started to use Chapel for anything serious yet. But on the other hand, not having to worry about data input and output would speed up adoption if I wanted to use it.

I'm unfortunately not familiar enough with C and Chapel to tell whether it would be straightforward to provide a wrapper on top of the existing C API for DuckDB.

bradcray commented 3 weeks ago

@rdjonsson: Since you mention data preparation, CSV, and Parquet, I wanted to make sure you are familiar with Arkouda, which supports an in-memory columnar format, DataFrames, CSV, and Parquet as a Python library whose implementation is in Chapel. That's not to suggest it does everything DuckDB does, but if all you wanted were Parquet, CSV, and DataFrames, it'd be a good thing to check out.

Calling from Chapel to C is reasonably easy (often as easy as writing `extern { #include "myCheader.h" }` and making the calls). Of course, often a next step is to raise the level of abstraction from C to something that leverages Chapel features and types more (to avoid the need to explicitly take and pass pointers, e.g.).
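For instance, here's a minimal (untested) sketch along those lines; the exact Chapel-side type mappings for DuckDB's opaque handles may need adjusting:

```chapel
// Minimal sketch of the extern-block approach with DuckDB's C API
// (untested; compile with the header and library on the appropriate paths).
use CTypes;

extern {
  #include "duckdb.h"
}

proc main() {
  var db: duckdb_database;
  // ":memory:" asks DuckDB for an in-memory database.
  if duckdb_open(":memory:".c_str(), c_ptrTo(db)) != DuckDBSuccess then
    halt("duckdb_open failed");
  writeln("opened an in-memory DuckDB database");
  duckdb_close(c_ptrTo(db));
}
```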

For me, a big question in supporting DuckDB from within Chapel would be whether you would want it to do something smart or global in a distributed-memory setting, or whether you are only interested in using it for single-process ("single-locale" in Chapel terminology) executions. Please let us know.

rdjonsson commented 3 weeks ago

@bradcray Thanks for the feedback! I'll outline my use case and try to clarify what I see DuckDB solving. (apologies, it ran a bit long)

The code I'm thinking of porting to (rewriting in) Chapel is an agent-based simulation of household travel and activity patterns. At the moment it exists in a couple of versions written in C# and F#. They rely on MKL for vector/matrix maths where it makes sense, and we have put quite some work into the algorithms to get them to run fast on a single machine. We also used a C# wrapper for MPI for a while to scale out to a compute cluster, but it was a very brittle piece of code that we can't see ourselves maintaining.

But some of the applications and future directions we have for the simulation system will push it over the limit of what we can run on a workstation, so I am looking at clusters again. The input data typically consists of lists of households and agents, land use information, and information about the transport network. Households, agents, and land use information are reasonable to represent as records, and their numbers are somewhere in the thousands to the millions at a regional/metropolitan scale of model. The network is sometimes represented as a graph, and sometimes as matrices [origin, destination, mode] of information such as travel time, cost, etc. None are especially challenging in size by themselves, but the book-keeping gets a bit complicated, since an agent can live in one zone, work in another, and potentially travel between each and every pair of zones for other activities. In terms of size, I think a single-locale database would be sufficient.

We are solving a Markov Decision Problem (or a Bellman equation, if you will) with the state space and the actions connecting the states defined by the data above (e.g. you start a trip from home (agent record), choose to go shopping in a zone with shops in it (land use record), and arrive there after a given travel time (network matrix)). I have come to think that this book-keeping can make good use of ranges and domains in Chapel. This is also why I do not think Arkouda necessarily fits our use case: we want to be able to be smart about both the matrix/vector computations and the traversal of the state-action network in the algorithm, which I think is something that Chapel itself is a better fit for than Python.
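(Schematically, the recursion is the usual Bellman form

$$V(s) = \max_{a \in A(s)} \Big[ r(s,a) + \beta \sum_{s'} P(s' \mid s,a)\, V(s') \Big]$$

with the states $s$ and feasible actions $A(s)$ built from the household, land use, and network data above.)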

The thing I am envisaging DuckDB helping with is the task of translating the data files (CSV, Parquet, GeoPackage, etc.) into the appropriate Chapel data structures for efficient parallel computing. Since this is an active research tool, we constantly try to bring in new data and try new things with the algorithms, and rewriting boilerplate code to parse input files is tedious and error-prone. So I think, at least initially, my ask is precisely the 'raise the level of abstraction' part of your answer. What would a nice API look like, for instance, if I want to take the data from a query and partition it across many locales for parallel processing? Or reshape data from a database (or Parquet) table into a multidimensional matrix? We have to constantly translate between the geospatial view (files suitable for GIS or visualisation tools) and the abstract state-space view suitable for fast computations, both on the input and the output side.
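Purely as a sketch of the kind of thing I imagine (every wrapper name here is made up; nothing like this exists today):

```chapel
// Hypothetical wish-list API; duckQuery and duckMatrix are invented names.
use BlockDist;

record Trip {                          // record matching the query's columns
  var origin, destination: int;
  var travelTime: real;
}

// Run a query and land the rows in a Block-distributed array of records:
const trips = duckQuery("SELECT origin, destination, travel_time FROM trips",
                        eltType=Trip, targetLocales=Locales);  // hypothetical

// Pivot a long-format table into an [origin, destination] travel-time matrix:
const tt = duckMatrix("SELECT origin, destination, travel_time FROM times",
                      rowKey="origin", colKey="destination");  // hypothetical
```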

A separate benefit that I can see is that I often have to bring in data that is awkward to view, either because it is in a binary format or because it is too big to open. With DuckDB you can import the data and run the queries you need to produce the data in the shape the model wants in a separate notebook, but with a 'SELECT TOP 25' just to check, before you run the full Chapel model.

Longer term, I could perhaps see some multi-locale uses as well. If you have database tasks (e.g. joins or spatial intersections) that are easy to express as SQL but a lot of work to implement from scratch, maybe spinning up a bunch of DuckDB databases and partitioning out the data to them could be useful. Parallel reading and processing of input could be another. But that maybe makes more sense as a separate issue for a later time.
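Roughly what I have in mind (a sketch only; the per-locale partition files named part-&lt;id&gt;.parquet are made up, while read_parquet() itself is a documented DuckDB function):

```chapel
// Sketch: one in-memory DuckDB instance per locale, each ingesting a
// (hypothetical) per-locale partition of the Parquet inputs.
use CTypes;

extern {
  #include "duckdb.h"
}

coforall loc in Locales do on loc {
  var db: duckdb_database;
  var con: duckdb_connection;
  var res: duckdb_result;

  if duckdb_open(":memory:".c_str(), c_ptrTo(db)) != DuckDBSuccess then
    halt("open failed on locale ", here.id);
  if duckdb_connect(db, c_ptrTo(con)) != DuckDBSuccess then
    halt("connect failed on locale ", here.id);

  // read_parquet() is built into DuckDB; the partition naming is invented.
  const q = "SELECT * FROM read_parquet('part-" + here.id:string +
            ".parquet')";
  if duckdb_query(con, q.c_str(), c_ptrTo(res)) == DuckDBSuccess {
    // ... copy columns into local Chapel arrays here ...
    duckdb_destroy_result(c_ptrTo(res));
  }
  duckdb_disconnect(c_ptrTo(con));
  duckdb_close(c_ptrTo(db));
}
```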

bradcray commented 3 weeks ago

Thanks for the additional information and context for your work, @rdjonsson; it is very helpful.

How big is the DuckDB API, either overall or in terms of what you'd want/need access to from Chapel for your work (or potentially for a proof of concept of some initial aspect of the work, to test the waters)?

And: is this something you'd be interested in discussing in a synchronous meeting with some of us on the Chapel side, to try to scope out what getting a proof of concept up and running would look like? We're a little busy with this month's 2.1 release at the moment, but when that quiets down, it'd be interesting to see how we could help here and what level of effort it would take (either doing it ourselves, or teaching you enough to figure out how to wrap it yourself).

rdjonsson commented 3 weeks ago

I'd be happy to meet if that helps. I updated my GitHub profile to include a link to my university page with contact info. Let's try to set something up via email.

The C API is fairly low-level, so it is a bit hard to tell how much of it you would need to touch for a proof-of-concept query; I suspect it would only be a handful of function calls. But for a more serious attempt, I suspect there are clever things you can do using more of it, for instance getting efficient data copying by knowing how the data is stored in DuckDB.
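For the proof of concept, I imagine the handful is roughly open/connect/query plus a value accessor and the matching cleanup calls, something like this (a sketch based on the documented C API; the Chapel-side string handling is a guess on my part):

```chapel
// Proof-of-concept query via the documented C API entry points; Chapel-side
// details (string conversion, exact type mappings) are guesses.
use CTypes;

extern {
  #include "duckdb.h"
}

proc main() {
  var db: duckdb_database;
  var con: duckdb_connection;
  var res: duckdb_result;

  if duckdb_open(":memory:".c_str(), c_ptrTo(db)) != DuckDBSuccess then
    halt("duckdb_open failed");
  if duckdb_connect(db, c_ptrTo(con)) != DuckDBSuccess then
    halt("duckdb_connect failed");
  if duckdb_query(con, "SELECT 6 * 7;".c_str(), c_ptrTo(res)) != DuckDBSuccess
    then halt("duckdb_query failed");

  // duckdb_value_varchar returns a malloc'd C string that the caller frees.
  var cell = duckdb_value_varchar(c_ptrTo(res), 0, 0);
  writeln("result: ", string.createCopyingBuffer(cell));
  duckdb_free(cell);

  duckdb_destroy_result(c_ptrTo(res));
  duckdb_disconnect(c_ptrTo(con));
  duckdb_close(c_ptrTo(db));
}
```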

It is well documented, which should be a big help: https://duckdb.org/docs/api/c/overview

bradcray commented 2 weeks ago

@rdjonsson : Finally got back to this and reached out by email—let me know if it doesn't reach you.