Open pascalkuthe opened 7 months ago
So, first off, I don't plan to accept PRs that introduce additional dependencies. If anything, I want to reduce dependencies (e.g. remove smallvec
). I apologize for any wasted effort due to that. I'll add that to the contributing section of the readme to make it clear for future contributors.
I don't think this PR is completely wasted effort, however.
The lack of optimization on reverse line iteration is already known, but it's good to see some hard numbers on how much faster it could be with SIMD. The way I'd like to go about optimizing that case is to add functions to the str_indices
crate that handle reverse (from-the-end) index conversion. I just haven't been motivated enough myself to tackle that yet.
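As a rough illustration of what such a from-the-end helper might look like (the name and shape here are my own, not the str_indices API), a scalar fallback could scan backward for the last LF; a real implementation in str_indices would do this chunk-wise with SIMD:

```rust
// Hypothetical sketch: byte index where the last line starts,
// found by scanning backward for the last b'\n'. std-only; ignores
// Unicode line breaks for simplicity.
fn last_line_start(text: &str) -> usize {
    match text.as_bytes().iter().rposition(|&b| b == b'\n') {
        Some(i) => i + 1, // line starts just after the newline
        None => 0,        // no newline: the whole text is one line
    }
}
```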
The forward search does use SIMD (line_to_byte_idx), but these functions from str_indices are more optimized for counting/larger counts and don't do quite as well in the latency department (when line_idx=1).
This is a great thing to identify, and like the reverse iteration case I'd like to see this tackled in str_indices rather than bringing in another dependency. We can probably add a special case in line_to_byte_idx for line_idx=1 to do something faster, since that's a common case that's likely worth optimizing for.
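To sketch what that special case could look like (this is an illustrative std-only stand-in, not the actual str_indices code), the line_idx=1 path reduces to finding the first line break instead of running the general counting loop:

```rust
// Hypothetical fast path for line_to_byte_idx(text, 1): the byte index
// where line 1 starts is simply one past the first b'\n' (clamped to
// the text length when there is no second line). LF-only for brevity.
fn line1_to_byte_idx(text: &str) -> usize {
    match text.as_bytes().iter().position(|&b| b == b'\n') {
        Some(i) => i + 1,
        None => text.len(),
    }
}
```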
Alright, I can respect that decision. I probably won't have time to implement that right now, since writing vectorization routines is a lot more work than what I did here. But I guess this serves as a proof of concept, and some of the smaller optimizations I made here don't require memchr and could be reused with the new routines once they are added to str_indices. For now I will leave this up as a draft PR that can be referenced in future work.
Slightly related: the reason I actually got around to this is that I was benchmarking bytecount (for computing char offsets, so similar but not the same thing) against str_indices and found that bytecount had both lower latency and faster throughput. In particular, if you turn on runtime dispatch, the latency was a bit higher (but still lower than str_indices) while the throughput was up to 50% higher. I suspect (but didn't get around to benchmarking) that their line counting functions are also a lot faster, since that is the main claim to fame of their crate. They can only count a single byte, so this wouldn't work for Unicode lines, but again it could be a lot faster for the non-Unicode case where you only need to count \n. So once I do get around to looking at str_indices, maybe we can try to close the gap there as well.
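For context, the single-byte counting that bytecount-style crates accelerate amounts to this (a naive scalar sketch; the crates replace the loop with SIMD population counts):

```rust
// Count LF-delimited line breaks by counting b'\n' bytes. This misses
// Unicode line breaks (U+0085, U+2028, ...), so it only matches the
// non-Unicode line-endings configuration discussed above.
fn count_lf_breaks(text: &str) -> usize {
    text.bytes().filter(|&b| b == b'\n').count()
}
```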
Ropey's line iterator doesn't do so well compared to other ropes, and I always wondered why. My recent work on a streaming regex engine made it clear to me: the string search routines we use for finding lines are not well optimized.
There are a couple of reasons I found (there may be more):
- The functions from str_indices are more optimized for counting/larger counts and don't do quite as well in the latency department (when line_idx=1).

I optimized the str utilities used by the line iterator to address the first two points and somewhat the last (nothing else uses these, so nothing else is affected). memchr can be used to accelerate both forward and reverse searches. This was sadly not possible for the unicode lines case, because memchr can only search for a set of up to three bytes. The aho-corasick crate contains vectorized implementations that would allow accelerating these cases, but I didn't want to pull that in.

Some data; run a benchmark with:

    cargo bench _lines --no-default-features --features "simd"

on the current master branch, and

    cargo bench _lines --no-default-features --features "simd,memchr"

on this branch. Results:
memchr is behind a feature flag because it feels unfortunate to have a dependency that isn't used by the default feature set. Sadly, you can't make a feature flag turn off a dependency, so this hack will have to do. I think unicode line endings are one of those things that are pedantically correct, but I suspect the majority of users will turn them off and benefit from these optimizations.
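For reference, the feature-flag hack can be sketched in Cargo.toml like this (version number illustrative, not taken from the PR); an optional dependency is compiled only when enabled, and it implicitly defines a feature with the same name:

```toml
[dependencies]
# Optional: only compiled when the "memchr" feature is enabled.
memchr = { version = "2", optional = true }

[features]
default = ["simd"]
simd = []
```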