ancapdev / LightBSON.jl

High performance encoding and decoding of BSON data in Julia
MIT License
20 stars 3 forks source link
bson bson-format julia julia-package

LightBSON

Build Status codecov

High performance encoding and decoding of BSON data.

What It Is

What It Is Not

High Level API

Performance sensitive use cases are the focus of this package, but a high level API is available for where performance is less of a concern. The high level API will also perform optimally when reading simple types without strings, arrays, or other fields that require allocation.

Low Level API

Example

buf = UInt8[]
writer = BSONWriter(buf)
writer["x"] = Int64(123)
close(writer)
reader = BSONReader(buf)
reader["x"][Int64] # 123

Nested Documents

Nested documents are written by assigning a field with a function that takes a nested writer. Nested fields are read by index.

buf = UInt8[]
writer = BSONWriter(buf)
writer["x"] = nested_writer -> begin
    nested_writer["a"] = 1
    nested_writer["b"] = 2
end
close(writer)
reader = BSONReader(buf)
reader["x"]["a"][Int64] # 1
reader["x"]["b"][Int64] # 2

Arrays

Arrays are written by assigning array values, or by generators. Arrays can be materialized to Julia arrays, or be accessed by index.

buf = UInt8[]
writer = BSONWriter(buf)
writer["x"] = Int64[1, 2, 3]
writer["y"] = (Int64(x * x) for x in 1:3)
close(writer)
reader = BSONReader(buf)
reader["x"][Vector{Int64}] # [1, 2, 3]
reader["y"][Vector{Int64}] # [1, 4, 9]
reader["x"][2][Int64] # 2
reader["y"][2][Int64] # 4

Dict and Any

Where performance is not a concern, elements can be materialized to dictionaries (recursively) or to the most appropriate Julia type for leaf fields. Dictionaries can also be written directly.

buf = UInt8[]
writer = BSONWriter(buf)
writer[] = LittleDict{String, Any}("x" => Int64(1), "y" => "foo")
close(writer)
reader = BSONReader(buf)
reader[Dict{String, Any}] # Dict{String, Any}("x" => 1, "y" => "foo")
reader[LittleDict{String, Any}] # LittleDict{String, Any}("x" => 1, "y" => "foo")
reader[Any] # LittleDict{String, Any}("x" => 1, "y" => "foo")
reader["x"][Any] # 1
reader["y"][Any] # "foo"

Arrays With Nested Documents

Generators can be used to directly write nested documents in arrays.

buf = UInt8[]
writer = BSONWriter(buf)
writer["x"] = (
    nested_writer -> begin
        nested_writer["a"] = Int64(x)
        nested_writer["b"] = Int64(x * x)
    end
    for x in 1:3
)
close(writer)
reader = BSONReader(buf)
reader["x"][Vector{Any}] # Any[LittleDict{String, Any}("a" => 1, "b" => 1), LittleDict{String, Any}("a" => 2, "b" => 4), LittleDict{String, Any}("a" => 3, "b" => 9)]
reader["x"][3]["b"][Int64] # 9

Read Abstract Types

The abstract types Number, Integer, and AbstractFloat can be used to materialize numeric fields to the most appropriate Julia type under the abstract type.

buf = UInt8[]
writer = BSONWriter(buf)
writer["x"] = Int32(123)
writer["y"] = 1.25
close(writer)
reader = BSONReader(buf)
reader["x"][Number] # 123
reader["x"][Integer] # 123
reader["y"][Number] # 1.25
reader["y"][AbstractFloat] # 1.25

Unsafe String

String fields can be materialized as WeakRefString{UInt8}, aliased as UnsafeBSONString. This will create a string object with pointers to the underlying buffer without performing any allocations. User must take care of GC safety.

buf = UInt8[]
writer = BSONWriter(buf)
writer["x"] = "foo"
close(writer)
reader = BSONReader(buf)
s = reader["x"][UnsafeBSONString]
GC.@preserve buf s == "foo" # true

Binary

Binary fields are represented with BSONBinary.

buf = UInt8[]
writer = BSONWriter(buf)
writer["x"] = BSONBinary([0x1, 0x2, 0x3], BSON_SUBTYPE_GENERIC_BINARY)
close(writer)
reader = BSONReader(buf)
x = reader["x"][BSONBinary]
x.data # [0x1, 0x2, 0x3]
x.subtype # BSON_SUBTYPE_GENERIC_BINARY

Unsafe Binary

Binary fields can be materialized as UnsafeBSONBinary for zero alocations, where the data field is an UnsafeArray{UInt8, 1} with pointers into the underlying buffer. User must take care of GC safety.

buf = UInt8[]
writer = BSONWriter(buf)
writer["x"] = BSONBinary([0x1, 0x2, 0x3], BSON_SUBTYPE_GENERIC_BINARY)
close(writer)
reader = BSONReader(buf)
x = reader["x"][UnsafeBSONBinary]
GC.@preserve buf x.data == [0x1, 0x2, 0x3] # true
x.subtype # BSON_SUBTYPE_GENERIC_BINARY

Iteration

Fields can be iterated with foreach() or using the Transducers.jl APIs. Fields are represented as Pair{UnsafeBSONString, BSONReader}.

buf = UInt8[]
writer = BSONWriter(buf)
writer["x"] = Int64(1)
writer["y"] = Int64(2)
writer["z"] = Int64(3)
close(writer)
reader = BSONReader(buf)
reader |> Map(x -> x.second[Int64]) |> sum # 6
foreach(x -> println(x.second[Int64]), reader) # 1\n2\n3\n

Indexing

BSON field access involves a linear scan to find the matching field name. Depending on the size and structure of a document, and the fields being accessed, it migh be preferable to build an index over the fields first, to be re-used on every access.

BSONIndex provides a very light weight partial (collisions evict previous entries) index over a document. It is designed to be re-used from document to document, by means of a constant time reset. IndexedBSONReader wraps a reader and an index to accelerate field access in a document. Index misses fall back to the wrapped reader.

buf = UInt8[]
writer = BSONWriter(buf)
writer["x"] = Int64(1)
writer["y"] = Int64(2)
writer["z"] = Int64(3)
close(writer)
index = BSONIndex(128)
# Index is built when IndexBSONReader is constructed
reader = IndexedBSONReader(index, BSONReader(buf))
reader["z"][Int64] # 3 -- accessed by index

empty!(buf)
writer = BSONWriter(buf)
writer["a"] = Int64(1)
writer["b"] = Int64(2)
writer["c"] = Int64(3)
close(writer)
# Index can be re-used
reader = IndexedBSONReader(index, BSONReader(buf))
reader["b"][Int64] # 2 -- accessed by index
reader["x"] # throws KeyError

Validation

BSONReader can be configured with a validator to use during the processing of the input document; 3 are provided:

Structs

Structs can be automatically translated to and from BSON, provided all their fields can be represented in BSON. Traits functions are used to select the mode of conversion, and these serve as an extension point for user defined types.

Generic

Provided bson_simple(T) is false, serialization will use the StructTypes.jl API to iterate fields of T and to construct T. See StructTypes.jl for more details.

Simple

For simple types using the bson_simple(T) trait will generate faster serialization code. This needs only be explicitly set if StructTypes.StructType(T) is also defined.

struct NestedSimple
    a::Int64
    b::Float64
end

struct Simple
    x::String
    y::NestedSimple
end

StructTypes.StructType(::Type{Simple}) = StructTypes.Struct()
LightBSON.bson_simple(::Type{Simple}) = true

buf = UInt8[]
writer = BSONWriter(buf)
writer["simple"] = Simple("foo", NestedSimple(123, 1.25))
close(writer)
reader = BSONReader(buf)
reader["simple"][Simple] # Simple("foo", NestedSimple(123, 1.25))

# Structs can also be written to the root of the document
buf = UInt8[]
writer = BSONWriter(buf)
writer[] = Simple("foo", NestedSimple(123, 1.25))
close(writer)
reader = BSONReader(buf)
reader[Simple] # Simple("foo", NestedSimple(123, 1.25))

Schema Evolution

For long term persistence or long lived APIs, it may be advisable to encode information about schema versions in documents, and implement ways to evolve schemas through time. Specific strategies for schema evolution are beyond the scope of this package to advise or impose, rather extension points are provided for users to implement the mechanisms best fit to their use cases.

struct Evolving1
    x::Int64
end

LightBSON.bson_schema_version(::Type{Evolving1}) = Int32(1)

struct Evolving2
    x::Int64
    y::Float64
end

# Construct from old version, defaulting new fields
Evolving2(old::Evolving1) = Evolving2(old.x, NaN)

LightBSON.bson_schema_version(::Type{Evolving2}) = Int32(2)

function LightBSON.bson_read_versioned(::Type{Evolving2}, v::Int32, reader::AbstractBSONReader)
    if v == 1
        Evolving2(bson_read_unversioned(Evolving1, reader))
    elseif v == 2
        bson_read_unversioned(Evolving2, reader)
    else
        # Real world application may instead want a mechanism to allow forward compatibility,
        # e.g., by encoding breaking vs non-breaking change info in the version
        error("Unsupported schema version $v for Evolving")
    end
end

const Evolving = Evolving2

buf = UInt8[]
writer = BSONWriter(buf)
# Write old version
writer[] = Evolving1(123)
close(writer)
reader = BSONReader(buf)
# Read as new version
reader[Evolving] # Evolving2(123, NaN)

Named Tuples

Named tuples can be read and written like any struct type.

buf = UInt8[]
writer = BSONWriter(buf)
writer[] = (; x = "foo", y = 1.25, z = (; a = Int64(123), b = Int64(456)))
close(writer)
reader = BSONReader(buf)
reader[@NamedTuple{x::String, y::Float64, z::@NamedTuple{a::Int64, b::Int64}}] # (x = "foo", y = 1.25, z = (a = 123, b = 456))

Faster Buffer

Since BSONWriter itself is immutable, it makes frequent calls to resize the underlying array to track the write head position. Unfortunately at present, this is not a well optimized operation in Julia, resolving to C-calls for manipulating the state of the array.

BSONWriteBuffer wraps a Vector{UInt8} to track size purely in Julia and avoid most of these calls. It implements the minimum API necessary for use with BSONReader and BSONWriter, and is not for use as a general Array implementation.

buf = BSONWriteBuffer()
writer = BSONWriter(buf)
writer["x"] = Int64(123)
close(writer)
reader = BSONReader(buf)
reader["x"][Int64] # 123
buf.data # Underlying array, may be longer than length(buf)

Performance

Performance naturally will depend very much on the nature of data being processed. The main overarching goal with this package is to enable the highest possible performance where the user requires it and is willing to sacrifice some convenience to achieve their target.

General advice for high performance BSON schema, such as short field names, avoiding long arrays or documents, and using nesting to reduce search complexity, all apply.

For LightBSON.jl specifically, prefer strings over symbols for field names, use unsafe variants rather than allocating strings and buffers where possible, reuse buffers and indexes, use BSONWriteBuffer rather than plain Vector{UInt8}, and enable bson_simple(T) for all applicable types.

Here's an example benchmark, reading and writing a named tuple with nesting (run on a Ryzen 5950X).

x = (;
    f1 = 1.25,
    f2 = Int64(123),
    f3 = now(UTC),
    f4 = true,
    f5 = Int32(456),
    f6 = d128"1.2",
    f7 = (; x = uuid4(), y = 2.5)
)

@btime begin
    buf = $(UInt8[])
    empty!(buf)
    writer = BSONWriter(buf)
    writer[] = $x
    close(writer)
end # 78.730 ns (0 allocations: 0 bytes)

@btime begin
    buf = $(BSONWriteBuffer())
    empty!(buf)
    writer = BSONWriter(buf)
    writer[] = $x
    close(writer)
end # 58.303 ns (0 allocations: 0 bytes)

buf = UInt8[]
writer = BSONWriter(buf)
writer[] = x
close(writer)
@btime BSONReader($buf)[$(typeof(x))] # 103.825 ns (0 allocations: 0 bytes)
@btime BSONReader($buf, UncheckedBSONValidator())[$(typeof(x))] # 102.549 ns (0 allocations: 0 bytes)
@btime IndexedBSONReader($(BSONIndex(128)), BSONReader($buf))[$(typeof(x))] # 86.876 ns (0 allocations: 0 bytes)
@btime IndexedBSONReader($(BSONIndex(128)), BSONReader($buf, UncheckedBSONValidator()))[$(typeof(x))] # 84.987 ns (0 allocations: 0 bytes)
@btime IndexedBSONReader($(BSONIndex(128)), BSONReader($buf)) # 47.896 ns (0 allocations: 0 bytes)
@btime $(IndexedBSONReader(BSONIndex(128), BSONReader(buf)))[$(typeof(x))] # 41.434 ns (0 allocations: 0 bytes)

We can observe BSONWriteBuffer makes a material difference to write performance, while indexing, in this case, has a reasonable effect on read performance even for this small document. In the final two lines we can see that indexing and reading the indexed document breaks down roughly half/half. Using the unchecked validator has a smaller impact, and must be used with caution.

ObjectId

BSONObjectId implements the BSON ObjectId type in accordance with the spec. New IDs for the process can be generated by construction, or using bson_object_id_range(n) for more efficient batch generation.

id1 = BSONObjectId()
DateTime(id1) # The UTC timestamp for when id1 was generated, truncated to seconds
id2 = BSONObjectId()
id1 != id2 # true, IDs are unique
collect(bson_object_id_range(5)) # 5 IDs with equal timestamp and different sequence field

Related Packages