My understanding is that this crate is optimized for reading into the owning enum `Value`, which encapsulates all types, much like serde_json does.
This design makes it expensive to deserialize into other formats. Specifically, the main challenge is that, given rows of strings, each row has to allocate a new `String` because `Value` owns its `String`, even when the values were already read into a `Vec<u8>` during buffering + decompression (see `Reader` and `Block`). This happens for all non-copy types (`String`, `Bytes`, `Fixed` and `Enum`).
Formats such as arrow have their own way of storing strings. Thus, currently, using `Value::String(String)` adds a large overhead to interoperate (I can quantify it if we need to). Specifically, the data path from a `Read` of an uncompressed file with one column of strings to `Value` yields an extra allocation per row per column of non-copy types (`String`, `Bytes`, `Fixed` and `Enum`). Usually, it is more efficient to read into a `Vec<u8>` and produce non-owned values from it.
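To make the cost concrete, here is a rough sketch of an owning decode of a single string; the function name is hypothetical (not the crate's actual code) and, for brevity, it assumes the zigzag length fits in a single byte:

```rust
// Hypothetical sketch: decoding a string into an owning value copies
// out of the block buffer, so every row of every string column pays
// one fresh allocation.
fn decode_owned(buf: &[u8]) -> Result<(String, usize), String> {
    // Avro strings: zigzag-varint length, then UTF-8 bytes.
    // Single-byte varint assumed for brevity.
    let raw = *buf.first().ok_or("empty buffer")? as i64;
    let len = ((raw >> 1) ^ -(raw & 1)) as usize; // zigzag decode
    let end = 1 + len;
    let bytes = buf.get(1..end).ok_or("buffer too short")?;
    // This copy is the per-row, per-column allocation described above.
    let s = String::from_utf8(bytes.to_vec()).map_err(|e| e.to_string())?;
    Ok((s, end))
}
```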
(There are other tricks to avoid re-allocating a `Vec<u8>` per block + decompression, which can be discussed separately.)
One way to address this is to declare a "non-owned" `Value` (e.g. `ValueRef::String(&'a str)`), and have the decoder take the form `decode<'a>(&'a [u8]) -> Result<(ValueRef<'a>, usize)>`, where `usize` is the number of bytes read.
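A minimal sketch of that shape, with hypothetical names and, again, a single-byte zigzag length for brevity:

```rust
// Hypothetical sketch of a non-owned value and a borrowing decoder.
#[derive(Debug, PartialEq)]
enum ValueRef<'a> {
    String(&'a str),
    Bytes(&'a [u8]),
}

// Decode one Avro string by borrowing from the block buffer:
// no allocation, just a UTF-8-validated slice into `buf`.
fn decode<'a>(buf: &'a [u8]) -> Result<(ValueRef<'a>, usize), String> {
    let raw = *buf.first().ok_or("empty buffer")? as i64;
    let len = ((raw >> 1) ^ -(raw & 1)) as usize; // zigzag decode (single byte for brevity)
    let end = 1 + len;
    let bytes = buf.get(1..end).ok_or("buffer too short")?;
    let s = std::str::from_utf8(bytes).map_err(|e| e.to_string())?;
    Ok((ValueRef::String(s), end))
}
```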
Since each row block declares its number of bytes upfront, and we already read them into a `buf: Vec<u8>` as part of the buffering + decompression in `Reader`, we can leverage this `buf` to decode into non-owning `ValueRef`s, thereby avoiding the allocation per row per column.
We can offer `impl<'a> From<ValueRef<'a>> for Value` that calls `to_string()` (and the like for other types such as `Bytes`), if the user wishes to remove the lifetime at the cost of an allocation.
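One possible shape of that conversion, using minimal stand-ins for both enums (the real ones have many more variants):

```rust
// Minimal stand-ins for illustration only.
#[derive(Debug, PartialEq)]
enum ValueRef<'a> {
    String(&'a str),
    Bytes(&'a [u8]),
}

#[derive(Debug, PartialEq)]
enum Value {
    String(String),
    Bytes(Vec<u8>),
}

impl<'a> From<ValueRef<'a>> for Value {
    fn from(v: ValueRef<'a>) -> Self {
        match v {
            // The allocation now happens only here, at the user's request.
            ValueRef::String(s) => Value::String(s.to_string()),
            ValueRef::Bytes(b) => Value::Bytes(b.to_vec()),
        }
    }
}
```

This keeps the hot decoding path allocation-free while preserving the existing owned API as an opt-in.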
This is a pretty large change, though, because we would also need to change the signatures of `zag_i64` and the like to read from `&[u8]` instead of `R: Read`, since they must return how many bytes were consumed in order to advance the `&[u8]`.
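For example, a slice-based `zag_i64` could look roughly like this (a sketch, not the crate's actual code, which presumably reads from `R: Read`):

```rust
// Zigzag-varint decoding over a slice, returning (value, bytes_consumed)
// so the caller can advance its &[u8] cursor.
fn zag_i64(buf: &[u8]) -> Result<(i64, usize), String> {
    let mut n: u64 = 0;
    let mut shift = 0u32;
    for (i, &b) in buf.iter().enumerate() {
        n |= ((b & 0x7f) as u64) << shift;
        if b & 0x80 == 0 {
            // Undo zigzag: (n >> 1) ^ -(n & 1)
            let v = ((n >> 1) as i64) ^ -((n & 1) as i64);
            return Ok((v, i + 1));
        }
        shift += 7;
        if shift >= 64 {
            return Err("varint too long".to_string());
        }
    }
    Err("unexpected end of buffer".to_string())
}
```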
I recently did this type of exercise as part of a complete re-write of the `parquet` crate, parquet2, where this exact same problem emerges: a column is composed of pages (avro "blocks") that are decompressed and decoded individually; it is advantageous to read each page into a `Vec<u8>` before decompressing and decoding it, but we should not perform further allocations when not needed, especially for strings, bytes, etc.
The result of that re-write was a 5-10x speedup, which is why I am cautiously confident that this effort may be worthwhile here too.
There's the same kind of suboptimal handling in the serde module; e.g. it uses `visit_str` instead of `visit_borrowed_str`.
Looks like this (and possibly other issues) should be ported over to the jira repo.
This is a follow-up of #193.