apache / arrow-adbc

Database connectivity API standard and libraries for Apache Arrow
https://arrow.apache.org/adbc/
Apache License 2.0
366 stars 93 forks source link

Implement ingestion of list types for SQLite #2212

Open danielballan opened 1 week ago

danielballan commented 1 week ago

What feature or improvement would you like to see?

In https://github.com/apache/arrow-adbc/issues/2066 @skarakuzu and I outlined a use case for streaming LIST data into PostgreSQL and SQLite via ADBC. Support for PostgreSQL was recently added. (Thanks, @paleolimbot!)

Would it be justifiable to add LIST support for SQLite as well? Our use case is embedded scientific data collection scenarios, where it is not feasible to deploy PostgreSQL but we'd like a compatible-as-possible data model. In #2066 had floated an idea of using a SQLite JSON or JSONB column for this, as SQLite is generally loose about types and lacks any array/list type.

We would be happy to try to contribute this if that is feasible.

WillAyd commented 1 week ago

Does SQLite actually have a JSON type or just functions for handling JSON strings/blobs? I see the latter in the documentation but not the former

https://sqlite.org/datatype3.html https://sqlite.org/json1.html

Unless I am overlooking something I think these would need to be stored as strings (?)

danielballan commented 6 days ago

Right. Evidently it does have a distinct internal data structure (JSONB) which is uses for some functionality but the data would be stored as a string.

paleolimbot commented 6 days ago

I think the main issue here is that the infrastructure for serializing an arbitrary list to JSON doesn't currently exist (in ADBC or nanoarrow). We probably don't need a JSON library to write simple things (e.g., lists of integers), but we do need to be sure that bit is well-tested (and that we want that scope to exist in ADBC).

lidavidm commented 4 days ago

Hmm, lists of integers or other simple things should be fine (though, JSON doesn't have integers!) but I'd worry that could scope creep into half a JSON writer. I wonder if we need to start considering some sort of plugin mechanism so we can optionally ship bits like this? (Alternatively, is there a JSON library light enough that we could embed it and not worry about the dependency?)

danielballan commented 3 days ago

Actually, I wonder SQLite has the functionality we need in its C API. Does json_array solve the serialization problem?

paleolimbot commented 2 days ago

Alternatively, is there a JSON library light enough that we could embed it and not worry about the dependency?

I have never had problems with nlohmann_json in nanoarrow (https://github.com/nlohmann/json), although we might not need it provided we can use to_chars() for locale-independent floating-point-to-string conversion (strings require escaping and binary require some special handling: https://github.com/apache/arrow-nanoarrow/blob/7c1fb36029a92ee039d2235be07c70d8726b2a7f/src/nanoarrow/testing/testing.cc#L54-L69 )

Does json_array solve the serialization problem?

I suppose we could generate a query with parameters like SELECT json_array(?, ?, ?, ?), bind the values, and grab the result? Unless json_array() is in the C API?

lidavidm commented 2 days ago

Yeah, I think it'd have to be a query. I glanced around and nothing seems to be in the header.

danielballan commented 2 days ago

Ah, got it. :+1: