JuliaDatabases / SQLite.jl

A Julia interface to the SQLite library
https://juliadatabases.org/SQLite.jl/stable

[Question] How can I save a vector as raw binary blob? #338

Closed maxfreu closed 4 months ago

maxfreu commented 7 months ago

Hi! I have a dataframe column containing vectors of 10 Int16s. I would like to save the vectors as 20 bytes of blob data. How can I do that? Right now I work around it by converting the reinterpreted chars to a string, but that has issues with null termination etc.

quinnj commented 6 months ago

It's a little hard to tell what you're trying to do; can you share some example code showing what you would like to do, or what you're currently doing and the problems you're having? Having a concrete code example to work with helps in answering your question.

maxfreu commented 6 months ago

Actually, my question was imprecise. It's more directed towards how blob data can be written without unnecessary copies.

using DataFrames, SQLite

# My data looks like this, just with 80 million rows:
data = [rand(UInt16, 10) for _ in 1:10]

# What I want is to write the data as a contiguous blob (NOT serialized Julia structs).
# I achieve this like so:
data2blob(v) = collect(reinterpret(UInt8, v))

df = DataFrame(:foo => data2blob.(data))
db = SQLite.DB("deleteme.sqlite")
SQLite.load!(df, db, "foo")
close(db)

The resulting file has a blob column with the correct data written to it. However, I'd like to avoid calling collect on 1.6 GB of data. But when I leave it out, like so:

df = DataFrame(:foo => reinterpret.(UInt8, data))

the Julia types get serialized somehow before being written. This kind of makes sense, but then I can't read the data from other programs anymore. Maybe it would be good to special-case reinterpret arrays of basic integer types somewhere in the code?
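For reference, the type difference between the two variants shows up in plain Julia, without SQLite.jl involved. This is a minimal sketch; the variable names are illustrative and not taken from the package:

```julia
# One row of the data: 10 UInt16 values, i.e. 20 bytes.
v = rand(UInt16, 10)

# reinterpret returns a lazy view over the same memory, not a new Vector.
lazy_bytes = reinterpret(UInt8, v)

# collect copies the bytes into a freshly allocated Vector{UInt8}.
copied_bytes = collect(lazy_bytes)

println(typeof(lazy_bytes))                    # a Base.ReinterpretArray type
println(lazy_bytes isa Vector{UInt8})          # false
println(lazy_bytes isa AbstractVector{UInt8})  # true
println(copied_bytes isa Vector{UInt8})        # true
println(length(lazy_bytes))                    # 20
```

So a method that only accepts Vector{UInt8} would not match the reinterpreted view, even though both hold the same 20 bytes.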

quinnj commented 6 months ago

Yeah, that makes sense to me. It might just be that we're only supporting Vector{UInt8}, but we could relax that to AbstractVector{UInt8} so those get stored as blobs too.
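The effect of relaxing the signature can be illustrated with a toy dispatch example. The functions storage_strict and storage_relaxed below are hypothetical stand-ins invented for this sketch; they are not SQLite.jl's actual binding code, which may be structured differently:

```julia
# Hypothetical stand-ins, NOT SQLite.jl's real API.
# Strict version: only a concrete Vector{UInt8} is treated as a blob.
storage_strict(x::Vector{UInt8}) = :blob
storage_strict(x) = :serialized   # fallback for everything else

# Relaxed version: any AbstractVector{UInt8} is treated as a blob.
storage_relaxed(x::AbstractVector{UInt8}) = :blob
storage_relaxed(x) = :serialized  # fallback for everything else

v = rand(UInt16, 10)
lazy_bytes = reinterpret(UInt8, v)  # lazy view, no copy

println(storage_strict(collect(lazy_bytes)))  # :blob
println(storage_strict(lazy_bytes))           # :serialized
println(storage_relaxed(lazy_bytes))          # :blob
```

One caveat with such a relaxation: not every AbstractVector{UInt8} is backed by contiguous memory (a range or a strided view, for example), so a real implementation would presumably still need to copy in those cases before handing the bytes to the C library.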

maxfreu commented 6 months ago

Oh yes, relaxing to AbstractVector{UInt8} is way better than specializing for reinterpret arrays. Where would that go?