flavray / avro-rs

Avro client library implementation in Rust

Reduce allocations in reading #195

Open jorgecarleitao opened 2 years ago

jorgecarleitao commented 2 years ago

This is a follow up of #193.

My understanding is that this crate is optimized for reading into (the owning enum) Value, which encapsulates all types, much like e.g. serde_json's Value does.

This design makes it expensive to deserialize into other formats. Specifically, the main challenge is that, given rows of strings, each row has to allocate a new String because Value owns its String, even when the bytes were already allocated into a Vec<u8> during buffering + decompression (see Reader and Block). This happens for all non-copy types (String, Bytes, Fixed and Enum).

Formats such as Arrow have their own way of storing strings. Thus, currently, using Value::String(String) adds a large overhead (which I can quantify if needed) to interoperate. Specifically, the data path from a Read of an uncompressed file with 1 column of strings to Value is:

Read - [read] -> Vec<u8> - [decompress] -> Vec<u8> - [decode] -> String - [move] -> Value

This yields an extra allocation per row per column of non-copy types (String, Bytes, Fixed and Enum).
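To make the extra allocation concrete, here is a minimal sketch (illustrative names, not this crate's API) of decoding one string value that already sits in the decompressed block buffer, owned vs. borrowed:

```rust
// Minimal sketch, not this crate's API: `buf` is the already-decompressed
// block buffer, and `pos`/`len` locate one string value inside it.

// Owning path (what happens today): one extra heap allocation per value.
fn decode_owned(buf: &[u8], pos: usize, len: usize) -> Result<String, std::str::Utf8Error> {
    let s = std::str::from_utf8(&buf[pos..pos + len])?;
    Ok(s.to_string()) // allocates, even though the bytes already live in `buf`
}

// Borrowing path: a UTF-8-validated view into `buf`, no allocation.
fn decode_borrowed(buf: &[u8], pos: usize, len: usize) -> Result<&str, std::str::Utf8Error> {
    std::str::from_utf8(&buf[pos..pos + len])
}
```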

Usually, it is more efficient to read into a Vec<u8> and produce non-owned values from it; in this case the data path would be

Read - [read] -> Vec<u8> - [decompress] -> Vec<u8> - [decode] -> &'a str -> Value<'a>

(There are other tricks to avoid re-allocating a Vec<u8> per Block + decompression, which can be discussed separately.)

One way to address this is to declare a "non-owned" Value (e.g. ValueRef::String(&'a str)) and have the decoder be of the form decode<'a>(&'a [u8]) -> Result<(ValueRef<'a>, usize)>, where the usize is the number of bytes read.
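A rough sketch of that shape (hypothetical names; the variant list is abridged and the schema argument is omitted):

```rust
// Sketch only: the non-copy variants borrow from the block buffer; the
// remaining variants (Union, Array, Map, Record, ...) are elided here.
enum ValueRef<'a> {
    Null,
    Boolean(bool),
    Int(i32),
    Long(i64),
    Float(f32),
    Double(f64),
    String(&'a str),
    Bytes(&'a [u8]),
    Fixed(usize, &'a [u8]),
    Enum(i32, &'a str),
}

// Placeholder for whatever error type the decoder would use.
type Error = Box<dyn std::error::Error>;

// Proposed decoder shape: parse one value from the front of `buf` and report
// how many bytes were consumed, so the caller can advance its slice.
// (Signature only; a real decoder would also take the schema.)
fn decode<'a>(buf: &'a [u8]) -> Result<(ValueRef<'a>, usize), Error> {
    unimplemented!("sketch only")
}
```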

Since each row block declares its number of bytes upfront, and we already read those bytes into a buf: Vec<u8> as part of the buffering + decompression in Reader, we can leverage this buf to decode into non-owning ValueRefs, thereby avoiding the allocation per row per column.
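Building on the sketch above (reusing the hypothetical ValueRef, decode and Error), the per-block loop could look roughly like this:

```rust
// Per-block driver: `buf` is one block's decompressed buffer held by the
// Reader and `n_rows` the row count declared in the block header. Every
// ValueRef borrows from `buf`, so no per-row allocation takes place.
fn decode_block<'a>(buf: &'a [u8], n_rows: usize) -> Result<Vec<ValueRef<'a>>, Error> {
    let mut pos = 0;
    let mut rows = Vec::with_capacity(n_rows);
    for _ in 0..n_rows {
        let (value, consumed) = decode(&buf[pos..])?;
        pos += consumed;
        rows.push(value);
    }
    Ok(rows)
}
```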

We can offer impl<'a> From<ValueRef<'a>> for Value that calls to_string() (and the like for other types such as Bytes), if the user wishes to drop the lifetime at the cost of an allocation.
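For instance, roughly (only the variants from the sketch above; the real Value has more):

```rust
impl<'a> From<ValueRef<'a>> for Value {
    fn from(v: ValueRef<'a>) -> Self {
        match v {
            ValueRef::Null => Value::Null,
            ValueRef::Boolean(b) => Value::Boolean(b),
            ValueRef::Int(i) => Value::Int(i),
            ValueRef::Long(i) => Value::Long(i),
            ValueRef::Float(f) => Value::Float(f),
            ValueRef::Double(f) => Value::Double(f),
            // The allocations happen here, and only if the user asks for them.
            ValueRef::String(s) => Value::String(s.to_string()),
            ValueRef::Bytes(b) => Value::Bytes(b.to_vec()),
            ValueRef::Fixed(size, b) => Value::Fixed(size, b.to_vec()),
            ValueRef::Enum(i, s) => Value::Enum(i, s.to_string()),
        }
    }
}
```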

This is a pretty large change, though, because we would also need to change the signatures of zag_i64 and the like to read from &[u8] instead of R: Read, since they must return how many bytes were consumed so the caller can advance the &[u8].
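For illustration only, a slice-based counterpart of zag_i64 could look like this (a sketch; the crate's current function reads from R: Read instead):

```rust
/// Hypothetical slice-based zigzag/varint decoder: reads one value from the
/// front of `buf` and returns it together with the number of bytes consumed,
/// so the caller can advance its slice.
fn zag_i64_slice(buf: &[u8]) -> Result<(i64, usize), String> {
    let mut n: u64 = 0;
    let mut shift = 0u32;
    for (i, &byte) in buf.iter().enumerate() {
        if shift >= 64 {
            return Err("varint is too long for an i64".to_string());
        }
        n |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            // zigzag-decode: maps 0, 1, 2, 3, ... back to 0, -1, 1, -2, ...
            let value = (n >> 1) as i64 ^ -((n & 1) as i64);
            return Ok((value, i + 1));
        }
        shift += 7;
    }
    Err("unexpected end of buffer while decoding a varint".to_string())
}
```

The caller advances its slice with pos += consumed after each call, which is the same role the usize plays in the proposed decode signature.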

I recently did this type of exercise as part of a complete re-write of the parquet crate, parquet2, where this exact same problem emerges: a column is composed of pages (avro's "blocks") that are decompressed and decoded individually; it is advantageous to read each page into a Vec<u8> before decompressing and decoding it, but we should not perform further allocations when they are not needed, especially for strings, bytes, etc.

The result of that re-write was a 5-10x speedup, which is why I am cautiously confident that this effort may be worth it here too.

Ten0 commented 2 years ago

There's the same kind of suboptimal handling in the serde module, e.g. it uses visit_str instead of visit_borrowed_str. Looks like this (and possibly other issues) should be ported over to the Jira repo.
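For reference, a minimal sketch of the difference (not this crate's actual serde code): a Deserializer over data that outlives it can hand the Visitor a &'de str via visit_borrowed_str, whereas visit_str only gives it a transient &str, forcing a copy if the value is to be kept.

```rust
use serde::de::{self, Visitor};

// Sketch, not this crate's serde code: a deserializer over borrowed data
// with lifetime 'de.
struct StrDeserializer<'de> {
    input: &'de str,
}

impl<'de> de::Deserializer<'de> for StrDeserializer<'de> {
    type Error = de::value::Error;

    fn deserialize_any<V>(self, visitor: V) -> Result<V::Value, Self::Error>
    where
        V: Visitor<'de>,
    {
        visitor.visit_borrowed_str(self.input) // zero-copy: visitor may keep the &'de str
        // vs. `visitor.visit_str(self.input)`, which would prevent borrowing
    }

    serde::forward_to_deserialize_any! {
        bool i8 i16 i32 i64 u8 u16 u32 u64 f32 f64 char str string bytes
        byte_buf option unit unit_struct newtype_struct seq tuple tuple_struct
        map struct enum identifier ignored_any
    }
}
```

With visit_borrowed_str, zero-copy targets such as &str or Cow<'de, str> can deserialize without an extra allocation.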