Open ali2992 opened 5 years ago
Another approach might be to use `std::borrow::Cow`. Specifically, `Cow<'schema, str>`. I suspect `Rc<String>` might be easier to implement since you don't have to propagate references through everything that currently accesses field names, and because `Rc<String>` impls `Deref` to `String`, any usage of what was previously `String` should remain unchanged.
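To make the trade-off concrete, here is a minimal sketch of both options. The types `SharedField` and `CowField` and the helper functions are hypothetical and are not taken from avro-rs; they only illustrate that `Rc<String>` needs no lifetime parameter, while `Cow<'schema, str>` puts one on every type that holds a name.

```rust
use std::borrow::Cow;
use std::rc::Rc;

/// Hypothetical helper standing in for existing code that reads a field name.
fn name_len(name: &str) -> usize {
    name.len()
}

/// Hypothetical field whose name is shared through a reference count.
struct SharedField {
    name: Rc<String>,
    value: i64,
}

/// The same field using `Cow`; note the lifetime parameter on the type.
struct CowField<'schema> {
    name: Cow<'schema, str>,
    value: i64,
}

/// Builds one field per value; `Rc::clone` bumps a counter instead of
/// copying the string data.
fn shared_fields(name: &Rc<String>, values: &[i64]) -> Vec<SharedField> {
    values
        .iter()
        .map(|&value| SharedField { name: Rc::clone(name), value })
        .collect()
}

/// Builds one field per value; `Cow::Borrowed` stores only a reference.
fn cow_fields<'schema>(name: &'schema str, values: &[i64]) -> Vec<CowField<'schema>> {
    values
        .iter()
        .map(|&value| CowField { name: Cow::Borrowed(name), value })
        .collect()
}
```

In both cases a call such as `name_len(&field.name)` compiles unchanged thanks to deref coercion, which is the point made above about existing `String` usages. Code that mutates or takes ownership of the name would still need adjusting.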
@ali2992 Can you provide the schema that you used to benchmark?
As a general concept, though, I am not that sure that making the interface more complicated for the users is the right move, even if it is in the name of, don't get me wrong, very important performance gains. If you started using this library, I hope it was also because it was fairly easy to use ;)
@poros I started using this library not because it was easy to use, but because I wanted something performant for large data sets (simulation results) :/ so, hearing about the performance issues is a bit of a let down.
In terms of ease of usage, it has been a bit on the tougher side, but probably because I am somewhat new to Rust, and the docs are still a little light.
@bzm3r I've been working on a new format, NoProto, that has a majority of the useful features of Apache Avro with a simple API and significantly better performance.
Legend: Ops / Millisecond, higher is better
Library | Encode | Decode All | Decode 1 | Update 1 | Size (bytes) | Size (Zlib) |
---|---|---|---|---|---|---|
NoProto | 1057 | 1437 | 47619 | 12195 | 208 | 166 |
Apache Avro | 138 | 51 | 52 | 37 | 702 | 336 |
FlexBuffers | 401 | 855 | 23256 | 264 | 490 | 309 |
JSON | 550 | 438 | 544 | 396 | 439 | 184 |
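- Encode: serialize the test data into a `Vec<u8>`.
- Decode All: deserialize the `Vec<u8>` into all fields.
- Decode 1: deserialize the `Vec<u8>` into one field.
- Update 1: update a single field and serialize back into a `Vec<u8>`.

Benchmark source code is here.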
If you can get away with compiling your data types then something like BinCode or Flatbuffers would actually be more performant than either Apache Avro or NoProto. Good luck with your simulations!
I've been working with very large Avro datasets and comparing performance against the existing Java implementation. Originally performance was very poor: over 3x slower for the same workload. I implemented zero-copy deserialisation, which helped substantially, but the Rust implementation was still 2x slower. I have since modified my fork of avro-rs to treat Record field names as `Rc<String>` to avoid a string allocation for each record row. This has provided a substantial performance boost, to the point that the Rust implementation is now 30% faster than the Java one for my workload.
Is `Rc<String>` the best approach for this? My logic was that record field names are fixed at Schema load, so would it be possible to remove the `Rc` completely in favour of `&'schema str`?
And are there any other places where there's scope for performance improvements?
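As a rough illustration of the `&'schema str` idea, the sketch below uses a deliberately simplified, hypothetical `Schema` and `Record` (not the avro-rs types) and assumes every value is an `i64`. Field names are allocated once when the schema is built, and each decoded row only borrows them.

```rust
/// Hypothetical, simplified schema: field names are fixed once it is built.
struct Schema {
    field_names: Vec<String>,
}

/// Hypothetical decoded row that borrows its field names from the schema.
struct Record<'schema> {
    fields: Vec<(&'schema str, i64)>,
}

impl Schema {
    /// Allocates each field name exactly once, at schema load.
    fn new(names: &[&str]) -> Self {
        Schema {
            field_names: names.iter().map(|n| n.to_string()).collect(),
        }
    }

    /// Pairs each value with a borrowed name; no string is allocated per row.
    fn decode_row<'schema>(&'schema self, values: &[i64]) -> Record<'schema> {
        Record {
            fields: self
                .field_names
                .iter()
                .map(String::as_str)
                .zip(values.iter().copied())
                .collect(),
        }
    }
}
```

The cost of this design is that a `Record<'schema>` cannot outlive the `Schema` it was decoded from, and the lifetime has to appear on every type and function that carries a record. That is the reference propagation the `Rc<String>` approach avoids.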