ancapdev / LightBSON.jl

High performance encoding and decoding of BSON data in Julia
MIT License

Multi dimensional arrays? #7

Open jw3126 opened 2 years ago

jw3126 commented 2 years ago

First of all, thanks a lot for creating this package. I think it is awesome to have a less magical alternative to BSON.jl. The following snippet gives an error:

```julia
import LightBSON
arr = rand(2,2)
path = "hello.bson"
LightBSON.bson_write(path, arr)
LightBSON.bson_read(typeof(arr), path)
```

```
MethodError: no method matching Matrix{Float64}()
```

I think it would be awesome if loading and saving of Base.Array conveniently worked out of the box.

ancapdev commented 2 years ago

BSON itself doesn't have a concept of multi-dimensional arrays, so that's why I haven't added it. It would require encoding some metadata about the array, and then it's no longer plain BSON, which is what this package is targeted at.

BSON array support is in general not very strong (arrays are actually encoded as documents with a numeric key for each element; see the spec). What I've used myself for array data is zfp compression via ZfpCompression.jl, stored as BSON binary objects. This takes care of dimensionality and type encoding (for the supported numeric types), along with compression if the data has some correlation to it. What might be useful would be if we could standardize on some BSON binary subtypes across the ecosystem.
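To illustrate the array encoding mentioned above: per the BSON spec, an array is stored as a document whose keys are the element indices written as decimal strings, starting at "0". The following is only a conceptual sketch in plain Julia; the helper name `as_array_document` is hypothetical and is not part of LightBSON.

```julia
# Hypothetical helper, not a LightBSON API: shows how a flat vector
# maps onto the key/value pairs of a BSON array document.
# BSON keys are zero-based decimal strings, while Julia indices are one-based.
as_array_document(v::AbstractVector) = [string(i - 1) => x for (i, x) in enumerate(v)]

pairs_view = as_array_document([10, 20, 30])
```

Because every element carries its own key and type tag, this layout has no place for shape information and is comparatively bulky for large numeric arrays.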

jw3126 commented 2 years ago

OK, thanks a lot for the clarifications. These are reasonable design decisions. Thanks also for the ZfpCompression.jl recommendation. In my case, the arrays contain structs that can in turn contain other arrays, so I think ZfpCompression is not quite applicable.

> What might be useful would be if we could standardize on some BSON binary subtypes across the ecosystem.

What does this mean? A couple of conventions to serialize, say, an array as (size=size(arr), data=vec(arr))?
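The (size, data) convention floated here can be sketched in plain Julia as follows. This is a minimal illustration under assumptions, not something LightBSON provides: the names `pack_array` and `unpack_array` are hypothetical, and whether the resulting named tuple can be passed directly to `bson_write` is not established in this thread.

```julia
# Hypothetical helpers, not part of LightBSON.
# Flatten an N-dimensional array into its shape plus a flat vector.
# `vec` flattens in column-major order, which `reshape` reverses exactly.
pack_array(arr::AbstractArray) = (size = collect(size(arr)), data = vec(arr))

# Rebuild the original array from the (size, data) pair.
unpack_array(packed) = reshape(collect(packed.data), Tuple(packed.size))

arr = reshape(collect(1.0:6.0), 2, 3)
packed = pack_array(arr)
restored = unpack_array(packed)
```

Since only the flat vector and the shape are stored, the element type can be anything that is itself serializable, including structs that hold further arrays.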

ancapdev commented 2 years ago

> > What might be useful would be if we could standardize on some BSON binary subtypes across the ecosystem.
>
> What does this mean? A couple of conventions to serialize, say, an array as (size=size(arr), data=vec(arr))?

Binary values in BSON have an associated subtype tag: some subtypes are defined in the spec, some are reserved, and some are open for user extension. There's been some discussion on MongoDB support tickets and other forums about adding new spec-defined subtypes as a means to support, e.g., typed arrays. It doesn't sound like that would be useful for your use case, but in general it would make storing large arrays of primitive types much more efficient.
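As a rough illustration of the subtype tag: in the BSON spec a binary value is laid out as a little-endian int32 byte count, one subtype byte, and then the raw bytes, with subtypes 0x80 and above left to users. The sketch below builds that payload by hand for a `Vector{Float64}`. The function names and the choice of 0x80 as a "Float64 array" tag are assumptions made for this example, not an agreed convention or a LightBSON API, and the element bytes are written in host byte order, so a little-endian machine is assumed.

```julia
# Hypothetical subtype tag picked from the user-defined range (0x80 to 0xFF).
const FLOAT64_ARRAY_SUBTYPE = 0x80

# Encode a BSON binary value: int32 length (little-endian), subtype, raw bytes.
function encode_binary_value(data::Vector{Float64}; subtype::UInt8 = FLOAT64_ARRAY_SUBTYPE)
    payload = collect(reinterpret(UInt8, data))  # element bytes in host order
    io = IOBuffer()
    write(io, htol(Int32(length(payload))))
    write(io, subtype)
    write(io, payload)
    return take!(io)
end

# Decode the same layout back into (subtype, values).
function decode_binary_value(bytes::Vector{UInt8})
    io = IOBuffer(bytes)
    len = Int(ltoh(read(io, Int32)))
    subtype = read(io, UInt8)
    payload = read(io, len)
    return subtype, collect(reinterpret(Float64, payload))
end

encoded = encode_binary_value([1.0, 2.0])
subtype, values = decode_binary_value(encoded)
```

Here two Float64 values cost 16 payload bytes plus a 5-byte header, with no per-element key or type tag, which is why typed-array subtypes would be more efficient for large primitive arrays.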

jw3126 commented 2 years ago

OK, I see. Thanks, @ancapdev!