aembke / redis-protocol.rs

A Rust implementation of RESP2 and RESP3
Apache License 2.0

Bytes parser performance #20

Closed · rukai closed this issue 6 months ago

rukai commented 2 years ago

This isn't really an issue with redis-protocol itself, but opening an issue here seemed like the easiest way to start a discussion on this topic.

I was experimenting with writing a parser for CQL (the Cassandra Query Language) using nom and https://github.com/aembke/redis-protocol.rs/blob/main/src/nom_bytes.rs. The experiment is available here: https://github.com/rukai/cqlparser

I did the following:

  1. Implemented the nom parser the simple way: just passing around an &[u8] and outputting an AST that allocates Strings everywhere.
  2. Wrote benchmarks for this.
  3. Ported the parser to use nom_bytes + Bytes.
  4. Ran the benchmarks against both the old and new implementations and noticed significant performance regressions.
```shell
git clone https://github.com/rukai/cqlparser
git checkout main
cargo bench
git checkout convert_to_bytes
cargo bench
```
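For concreteness, the "simple way" in step 1 looks roughly like this hand-rolled sketch (hypothetical, not code from cqlparser; the real parser uses nom combinators, but the shape is the same): take an &[u8], return the remaining input plus an owned String per AST node.

```rust
// Hypothetical sketch of step 1: parse over &[u8] and allocate a String
// for each AST node. The real cqlparser uses nom combinators; this just
// hand-rolls the same shape without the dependency.
fn identifier(input: &[u8]) -> Option<(&[u8], String)> {
    // Take leading [A-Za-z0-9_] bytes as the identifier.
    let end = input
        .iter()
        .position(|b| !b.is_ascii_alphanumeric() && *b != b'_')
        .unwrap_or(input.len());
    if end == 0 {
        return None; // no identifier at this position
    }
    // The allocation: every node copies its bytes into a fresh String.
    let name = String::from_utf8(input[..end].to_vec()).ok()?;
    Some((&input[end..], name))
}

fn main() {
    let (rest, name) = identifier(b"my_table WHERE").unwrap();
    assert_eq!(name, "my_table");
    assert_eq!(rest, b" WHERE");
}
```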

[screenshot of cargo bench results showing the regressions]

I can kind of see why this would occur: a Bytes is more expensive to clone than an &[u8], and nom does a lot of cloning as input passes through all the combinators. Also, maybe the logic in Bytes's Clone impl is preventing optimizations that normally keep nom performant?
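That clone-cost intuition can be illustrated with std types alone (a sketch, using Arc<Vec<u8>> as a stand-in for Bytes, since both clone by bumping an atomic reference count):

```rust
use std::sync::Arc;

// Sketch: why cloning a Bytes-like handle costs more than copying a slice.
// Arc<Vec<u8>> stands in for Bytes here; both are shared, refcounted views.
fn main() {
    let data = vec![0u8; 1024];

    // An &[u8] is just a pointer + length; "cloning" it is a trivial Copy
    // that the optimizer can freely move around or eliminate.
    let slice: &[u8] = &data;
    let slice_copy = slice;
    assert_eq!(slice_copy.len(), 1024);

    // Cloning an Arc (like Bytes) performs an atomic increment. nom's
    // combinators clone the input repeatedly, so this cost adds up, and
    // atomic operations also act as barriers to some optimizations.
    let shared: Arc<Vec<u8>> = Arc::new(data);
    let shared_clone = Arc::clone(&shared);
    assert_eq!(Arc::strong_count(&shared), 2);
    drop(shared_clone);
    assert_eq!(Arc::strong_count(&shared), 1);
}
```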

Did you benchmark Bytes and non-Bytes equivalents of the parser before settling on a Bytes parser? It might be worth investigating if you haven't already.

If you did observe performance issues when combining nom and Bytes and found workarounds, I would be interested in hearing them. I noticed that redis-protocol tends to manually implement some things that nom provides for you; is that related to this problem?

aembke commented 2 years ago

Hi @rukai,

Yeah, this is something I noticed too. In fact, the decode_mut parser function you're using in shotover (and the same one I use in fred) doesn't use Bytes at all with the nom interface, for a few reasons, this being one of them.

The "main" decode_mut parser function, which is the one I'm interested in using and supporting down the road (with BytesMut as the input), does the following:

  1. Immediately call as_bytes() on the BytesMut to get a slice view into the buffer.
  2. Call the nom functions with inputs of (&[u8], usize) where the second value tracks the parsing offset into the slice.
  3. Create fake frame structs that just hold the usize offsets (start, end) into the buffer when a frame is detected. However, when it parses primitive values that can live entirely on the stack (i64, etc.), it stores those directly in the "fake frame".
  4. At the very end, once the final ending offset into the original BytesMut is known, perform one split and freeze on the BytesMut.
  5. Do a depth-first traversal of the nested "fake" frames plus the now-frozen Bytes to convert them to the "real" frames. At this point we're operating on one Bytes view that holds the entire set of nested frames in the buffer, so slicing it is a zero-copy operation. That slice operation is perhaps still slower than slicing an actual slice, but it avoids moving or copying the data, which is my real concern.
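The offset-tracking scheme in steps 2-5 can be sketched with std types standing in for BytesMut/Bytes (a simplified, hypothetical example, not the actual redis-protocol code; it handles only integer and bulk-string frames):

```rust
use std::ops::Range;

// "Fake" frames record only offsets into the buffer, or stack primitives.
#[derive(Debug)]
enum FakeFrame {
    Bulk(Range<usize>), // (start, end) offsets into the input buffer
    Int(i64),           // primitives live entirely on the stack
}

// "Real" frames borrow from the frozen buffer; with Bytes this slicing
// would produce zero-copy refcounted views instead of borrows.
#[derive(Debug, PartialEq)]
enum Frame<'a> {
    Bulk(&'a [u8]),
    Int(i64),
}

// Phase 1: parse with (&[u8], usize)-style offset tracking. Returns the
// fake frame plus the new offset. Only ":<int>\r\n" and "$<len>\r\n<data>\r\n"
// are handled in this sketch.
fn parse_fake(buf: &[u8], pos: usize) -> Option<(FakeFrame, usize)> {
    let line_end = pos + buf[pos..].windows(2).position(|w| w == b"\r\n")?;
    let header = std::str::from_utf8(&buf[pos + 1..line_end]).ok()?;
    match buf[pos] {
        b':' => Some((FakeFrame::Int(header.parse().ok()?), line_end + 2)),
        b'$' => {
            let len: usize = header.parse().ok()?;
            let start = line_end + 2;
            Some((FakeFrame::Bulk(start..start + len), start + len + 2))
        }
        _ => None,
    }
}

// Phase 2: after the final offset is known (this is where the BytesMut
// would be split and frozen), convert fake frames to real ones by
// slicing the buffer.
fn to_real<'a>(fake: &FakeFrame, frozen: &'a [u8]) -> Frame<'a> {
    match fake {
        FakeFrame::Bulk(r) => Frame::Bulk(&frozen[r.clone()]),
        FakeFrame::Int(i) => Frame::Int(*i),
    }
}

fn main() {
    let buf: &[u8] = b"$3\r\nfoo\r\n:42\r\n";
    let (f1, next) = parse_fake(buf, 0).unwrap();
    let (f2, end) = parse_fake(buf, next).unwrap();
    assert_eq!(end, buf.len());
    assert_eq!(to_real(&f1, buf), Frame::Bulk(&b"foo"[..]));
    assert_eq!(to_real(&f2, buf), Frame::Int(42));
}
```

In the real implementation the buffer is a BytesMut that gets one split/freeze at the end, so the Phase 2 slices are cheap refcounted Bytes views rather than borrows.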

This approach was a lot more work to implement (I essentially had to write the parser twice, though I probably could have abstracted it better with the nom traits), but I think it strikes a good balance. It allows zero-copy parsing and doesn't require using Bytes with nom, avoiding the added overhead of the Bytes clone like you pointed out.

When I originally wrote 4.x I did the NomBytes interface first, then implemented the decode_mut feature interfaces separately from the ground up. I've been thinking about this more, and I might get rid of the non-decode_mut interfaces, both for the performance reasons you mentioned and because I don't think anybody really needs them. From what I've seen it's pretty ubiquitous to have BytesMut as the input to parsing logic, but I could be wrong there. At a minimum I'll probably swap out the underlying implementation of the Bytes parsing logic to use &[u8] under the hood. Either way, I'm not sure anybody uses that interface (you and I both use the decode_mut one, at least), so we'll see.

At the moment, though, I really need to focus on fred, so this is all stuff I likely won't get to for a few months.

rukai commented 2 years ago

Thanks for the in-depth response! It sounds like such an approach would be even trickier for parsing CQL (a text language rather than a binary protocol), so I'm going to stick with allocating Strings for now, but I'll definitely investigate your approach or something similar if parsing shows up in profiling.