Open pascalkuthe opened 7 months ago
So, first off, I don't plan to accept PRs that introduce additional dependencies. If anything, I want to reduce dependencies (e.g. remove smallvec
). I apologize for any wasted effort due to that. I'll add that to the contributing section of the readme to make it clear for future contributors.
I don't think this PR is completely wasted effort, however.
The lack of optimization on reverse line iteration is already known, but it's good to see some hard numbers on how much faster it could be with SIMD. The way I'd like to go about optimizing that case is to add functions to the str_indices
crate that handle reverse (from-the-end) index conversion. I just haven't been motivated enough myself to tackle that yet.
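As a rough illustration of what such a from-the-end helper might look like (the name and shape here are my own, not the str_indices API), a scalar fallback could scan backward for the last LF; a real implementation in str_indices would do this chunk-wise with SIMD:

```rust
// Hypothetical sketch: byte index where the last line starts,
// found by scanning backward for the last b'\n'. std-only; ignores
// Unicode line breaks for simplicity.
fn last_line_start(text: &str) -> usize {
    match text.as_bytes().iter().rposition(|&b| b == b'\n') {
        Some(i) => i + 1, // line starts just after the newline
        None => 0,        // no newline: the whole text is one line
    }
}
```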
The forward search does use SIMD (line_to_byte_idx), but these functions from str_indices are more optimized for counting/larger counts and don't do quite as well in the latency department (when line_idx=1).
This is a great thing to identify, and like the reverse iteration case I'd like to see this tackled in str_indices rather than bringing in another dependency. We can probably add a special case in line_to_byte_idx for line_idx=1 to do something faster, since that's a common case that's likely worth optimizing for.
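To sketch what that special case could look like (this is an illustrative std-only stand-in, not the actual str_indices code), the line_idx=1 path reduces to finding the first line break instead of running the general counting loop:

```rust
// Hypothetical fast path for line_to_byte_idx(text, 1): the byte index
// where line 1 starts is simply one past the first b'\n' (clamped to
// the text length when there is no second line). LF-only for brevity.
fn line1_to_byte_idx(text: &str) -> usize {
    match text.as_bytes().iter().position(|&b| b == b'\n') {
        Some(i) => i + 1,
        None => text.len(),
    }
}
```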
Alright, I can respect that decision. I probably won't have time to implement that right now, since writing vectorization routines is a lot more work than what I did here. But I guess this serves as a proof of concept, and some of the smaller optimizations I made here don't require memchr and could be reused with the new routines once they are added to str_indices. For now I will leave this up as a draft PR that can be referenced in future work.
Slightly related: the reason I actually got around to this is that I was benchmarking bytecount (for computing char offsets, so similar but not the same thing) against str_indices and found that bytecount had both lower latency and faster throughput. In particular, if you turn on runtime dispatch, the latency was a bit higher (but still lower than str_indices) while the throughput was up to 50% higher. I suspect (but didn't get around to benchmarking) that their line counting functions are also a lot faster, since that is the main claim to fame of their crate. They can only count a single byte, so this wouldn't work for Unicode lines, but again it could be a lot faster for the non-Unicode case where you only need to count \n. So once I do get around to looking at str_indices, maybe we can try to close the gap there as well.
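For context, the single-byte counting that bytecount-style crates accelerate amounts to this (a naive scalar sketch; the crates replace the loop with SIMD population counts):

```rust
// Count LF-delimited line breaks by counting b'\n' bytes. This misses
// Unicode line breaks (U+0085, U+2028, ...), so it only matches the
// non-Unicode line-endings configuration discussed above.
fn count_lf_breaks(text: &str) -> usize {
    text.bytes().filter(|&b| b == b'\n').count()
}
```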
Ropey's line iterator doesn't do so well compared to other ropes, and I always wondered why. My recent work on a streaming regex engine made it clear to me: the string search routines we use for finding lines are not well optimized.
There are a couple of reasons I found (there may be more):
- The functions from str_indices are more optimized for counting/larger counts and don't do quite as well in the latency department (when line_idx=1).

I optimized the str utilities used by the line iterator to address the first two points and somewhat the last (nothing else uses these, so nothing else is affected). memchr can be used to accelerate both forward and reverse searches. This was sadly not possible for the unicode lines case, because memchr can only search for a set of up to three bytes. The aho-corasick crate contains vectorized implementations that would allow accelerating these cases, but I didn't want to pull that in.

Some data; run a benchmark with:

    cargo bench _lines --no-default-features --features "simd"

on the current master branch, and

    cargo bench _lines --no-default-features --features "simd,memchr"

on this branch. Results:
memchr is behind a feature flag because it feels unfortunate to have a dependency that isn't used by the default feature set. Sadly, you can't make a feature flag turn off a dependency, so this hack will have to do. I think unicode line endings are one of those things that are pedantically correct, but I suspect the majority of users will turn them off and benefit from these optimizations.
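For reference, the feature-flag hack can be sketched in Cargo.toml like this (version number illustrative, not taken from the PR); an optional dependency is compiled only when enabled, and it implicitly defines a feature with the same name:

```toml
[dependencies]
# Optional: only compiled when the "memchr" feature is enabled.
memchr = { version = "2", optional = true }

[features]
default = ["simd"]
simd = []
```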