dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.27k stars 4.73k forks source link

Improve vectorization of MemoryExtensions.IndexOf(..., span) #60866

Closed stephentoub closed 2 years ago

stephentoub commented 3 years ago

Today MemoryExtensions.IndexOf(..., ReadOnlySpan<char>) is implemented essentially as a loop around an IndexOf for the first character in the pattern and then a SequenceEqual at that position for that pattern: https://github.com/dotnet/runtime/blob/c742923991c68808d55a9f7cfcd7796644fea44f/src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Char.cs#L17-L57 This is good but not great. In particular it ends up breaking out of the vectorized search loop potentially much more frequently than is necessary and paying the startup overhead for the SequenceEqual each time.

There are better options. For example, the simple algorithm described in Algorithm 1: Generic SIMD searches concurrently for two characters rather than just one, which makes it much less likely to break out of the vectorized loop to early (this is what the Rust sliceslice crate uses; it's also what the Rust memchr crate uses (which is what's now used by its regex engine), though that memchr doesn't just use the first and last characters but instead picks two characters expected to have the lowest frequency of occurrence).

We could also special-case various sizes. The aforementioned memchr, for example, uses Rabin-Karp for small input strings and Two-Way for larger search strings.

cc: @tannergooding, @GrabYourPitchforks

ghost commented 3 years ago

Tagging subscribers to this area: @GrabYourPitchForks, @dotnet/area-system-memory See info in area-owners.md if you want to be subscribed.

Issue Details
Today `MemoryExtensions.IndexOf(..., ReadOnlySpan)` is implemented: https://github.com/dotnet/runtime/blob/c742923991c68808d55a9f7cfcd7796644fea44f/src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Char.cs#L17-L57 essentially as a loop around an IndexOf for the first character in the pattern and then a SequenceEqual at that position for that pattern. This is good but not great. In particular it ends up breaking out of the vectorized search loop potentially much more frequently than is necessary and paying the startup overhead for the SequenceEqual each time. There are better options. For example, the simple algorithm described in [Algorithm 1: Generic SIMD](http://0x80.pl/articles/simd-strfind.html#algorithm-1-generic-simd) searches concurrently for two characters rather than just one, which makes it much less likely to break out of the vectorized loop to early (this is what the Rust sliceslice crate uses; it's also what the Rust memchr crate uses (which is what's now used by its regex engine), though that memchr doesn't just use the first and last characters but instead picks two characters expected to have the lowest frequency of occurrence). cc: @tannergooding, @GrabYourPitchforks
Author: stephentoub
Assignees: -
Labels: `area-System.Memory`, `tenet-performance`, `untriaged`
Milestone: 7.0.0
gfoidl commented 3 years ago

a loop around an IndexOf for the first character in the pattern and then a SequenceEqual at that position for that pattern

It's similar for byte, so while optimizing the char-case we should try to look at byte too.

En3Tho commented 3 years ago

Algorithm uri doesn't work for me @stephentoub

stephentoub commented 3 years ago

Algorithm uri doesn't work for me @stephentoub

Not sure why; it works ok for me.

vcsjones commented 3 years ago

@En3Tho

I'm guessing it's because the site is not available on HTTPS. Some browsers are doing trials of automatic HTTPS redirects. You can try a Web Archive copy of the page: https://web.archive.org/web/20210809014600/http://0x80.pl/articles/simd-strfind.html

EgorBo commented 2 years ago

Prototype: https://github.com/EgorBo/runtime-1/commit/2770de1141aa8b2f1329dc8d3ce10a18d0591ce3

Tested. Needs benchmarks and ARM64 support.

image

Performance is noticeably better in the worst case for the current algorithm, e.g. in this text it hits multiple words starting with 'a' till it finds the correct one while Wojciech's one easily avoids calling SequenceEqual for each of them and doesn't break SIMD-loop.