Explore deduplicating manual unrolling of IndexOf{Any} operations

stephentoub commented 1 year ago

Our IndexOf{Any} (and some LastIndexOf{Any}) implementations all have a scalar path that's used when vectorization can't be, either because the current platform doesn't support it, the target type doesn't support it, or the length being searched is too small. Manual of these manually unroll the loop, e.g.

                nuint offset = 0;

                while (length >= 8)
                {
                    length -= 8;

                    if (TNegator.NegateIfNeeded(Unsafe.Add(ref searchSpace, offset) == value)) goto Found;
                    if (TNegator.NegateIfNeeded(Unsafe.Add(ref searchSpace, offset + 1) == value)) goto Found1;
                    if (TNegator.NegateIfNeeded(Unsafe.Add(ref searchSpace, offset + 2) == value)) goto Found2;
                    if (TNegator.NegateIfNeeded(Unsafe.Add(ref searchSpace, offset + 3) == value)) goto Found3;
                    if (TNegator.NegateIfNeeded(Unsafe.Add(ref searchSpace, offset + 4) == value)) goto Found4;
                    if (TNegator.NegateIfNeeded(Unsafe.Add(ref searchSpace, offset + 5) == value)) goto Found5;
                    if (TNegator.NegateIfNeeded(Unsafe.Add(ref searchSpace, offset + 6) == value)) goto Found6;
                    if (TNegator.NegateIfNeeded(Unsafe.Add(ref searchSpace, offset + 7) == value)) goto Found7;

                    offset += 8;
                }

                if (length >= 4)
                {
                    length -= 4;

                    if (TNegator.NegateIfNeeded(Unsafe.Add(ref searchSpace, offset) == value)) goto Found;
                    if (TNegator.NegateIfNeeded(Unsafe.Add(ref searchSpace, offset + 1) == value)) goto Found1;
                    if (TNegator.NegateIfNeeded(Unsafe.Add(ref searchSpace, offset + 2) == value)) goto Found2;
                    if (TNegator.NegateIfNeeded(Unsafe.Add(ref searchSpace, offset + 3) == value)) goto Found3;

                    offset += 4;
                }

                while (length > 0)
                {
                    length -= 1;

                    if (TNegator.NegateIfNeeded(Unsafe.Add(ref searchSpace, offset) == value)) goto Found;

                    offset += 1;
                }
                return -1;
            Found7:
                return (int)(offset + 7);
            Found6:
                return (int)(offset + 6);
            Found5:
                return (int)(offset + 5);
            Found4:
                return (int)(offset + 4);
            Found3:
                return (int)(offset + 3);
            Found2:
                return (int)(offset + 2);
            Found1:
                return (int)(offset + 1);
            Found:
                return (int)(offset);

And as a result, we have a bunch of copies of code of that general structure.

We should be able to deduplicate many of them by using generic specialization, e.g. we create an interface like:

internal interface IIndexOfKernel<TValue>
{
    bool Equal(ref TValue haystack);
}

then write a shared routine for doing the operation based on this kernel:

internal static int IndexOf<TValue, TKernel>(TKernel kernel, ref TValue searchSpace, int length)
    where TKernel : IIndexOfKernel<TValue>
{
    ... // same as above, but using kernel.Equal
}

and then in the places we need it, creating a simple struct that provides its core operation as the implementation of Equal, e.g.

internal readonly struct SingleValueKernel<TValue> : IIndexOf<TValue>
{
    private readonly TValue _value;
    public SingleValueKernel(TValue value) => _value = value;
    public bool Equal(ref TValue haystack) => haystack == _value;
}

or something along those lines.

It would help us to deduplicate a bunch of code and also ensure we're consistently unrolling in the same manner across all of our various implementations.

We'd need to measure to understand the cost of this. There might be better ways to structure it as well, e.g. some operations could get away with static abstract interface methods, but others do need to carry around the state (e.g. for multiple values to compare against).

(Alternatively, maybe we could better teach the JIT to do this level of unrolling and eliminate it entirely from the code?)

ghost commented 1 year ago

Tagging subscribers to this area: @dotnet/area-system-runtime See info in area-owners.md if you want to be subscribed.

Issue Details

Our IndexOf{Any} (and some LastIndexOf{Any}) implementations all have a scalar path that's used when vectorization can't be, either because the current platform doesn't support it, the target type doesn't support it, or the length being searched is too small. Manual of these manually unroll the loop, e.g. ```C# nuint offset = 0; while (length >= 8) { length -= 8; if (TNegator.NegateIfNeeded(Unsafe.Add(ref searchSpace, offset) == value)) goto Found; if (TNegator.NegateIfNeeded(Unsafe.Add(ref searchSpace, offset + 1) == value)) goto Found1; if (TNegator.NegateIfNeeded(Unsafe.Add(ref searchSpace, offset + 2) == value)) goto Found2; if (TNegator.NegateIfNeeded(Unsafe.Add(ref searchSpace, offset + 3) == value)) goto Found3; if (TNegator.NegateIfNeeded(Unsafe.Add(ref searchSpace, offset + 4) == value)) goto Found4; if (TNegator.NegateIfNeeded(Unsafe.Add(ref searchSpace, offset + 5) == value)) goto Found5; if (TNegator.NegateIfNeeded(Unsafe.Add(ref searchSpace, offset + 6) == value)) goto Found6; if (TNegator.NegateIfNeeded(Unsafe.Add(ref searchSpace, offset + 7) == value)) goto Found7; offset += 8; } if (length >= 4) { length -= 4; if (TNegator.NegateIfNeeded(Unsafe.Add(ref searchSpace, offset) == value)) goto Found; if (TNegator.NegateIfNeeded(Unsafe.Add(ref searchSpace, offset + 1) == value)) goto Found1; if (TNegator.NegateIfNeeded(Unsafe.Add(ref searchSpace, offset + 2) == value)) goto Found2; if (TNegator.NegateIfNeeded(Unsafe.Add(ref searchSpace, offset + 3) == value)) goto Found3; offset += 4; } while (length > 0) { length -= 1; if (TNegator.NegateIfNeeded(Unsafe.Add(ref searchSpace, offset) == value)) goto Found; offset += 1; } return -1; Found7: return (int)(offset + 7); Found6: return (int)(offset + 6); Found5: return (int)(offset + 5); Found4: return (int)(offset + 4); Found3: return (int)(offset + 3); Found2: return (int)(offset + 2); Found1: return (int)(offset + 1); Found: return (int)(offset); ``` And as a result, we have a bunch of copies of code of that general structure. We should be able to deduplicate many of them by using generic specialization, e.g. we create an interface like: ```C# internal interface IIndexOfKernel { bool Equal(ref TValue haystack); } ``` then write a shared routine for doing the operation based on this kernel: ```C# internal static int IndexOf(TKernel kernel, ref TValue searchSpace, int length) where TKernel : IIndexOfKernel { ... // same as above, but using kernel.Equal } ``` and then in the places we need it, creating a simple struct that provides its core operation as the implementation of Equal, e.g. ```C# internal readonly struct SingleValueKernel : IIndexOf { private readonly TValue _value; public SingleValueKernel(TValue value) => _value = value; public bool Equal(ref TValue haystack) => haystack == _value; } ``` or something along those lines. It would help us to deduplicate a bunch of code and also ensure we're consistently unrolling in the same manner across all of our various implementations. We'd need to measure to understand the cost of this. There might be better ways to structure it as well, e.g. some operations could get away with static abstract interface methods, but others do need to carry around the state (e.g. for multiple values to compare against). (Alternatively, maybe we could better teach the JIT to do this level of unrolling and eliminate it entirely from the code?)

Author:	stephentoub
Assignees:	-
Labels:	`area-System.Runtime`, `untriaged`, `needs-area-label`
Milestone:	-

stephentoub commented 1 year ago

I'm thinking of something like this: https://github.com/dotnet/runtime/compare/main...stephentoub:runtime:consolidateunrolling

@jkotas, @tannergooding, @MihaZupan, @EgorBo, if perf tests end up looking ok, any concerns with doubling-down on generics like this even further?

If we go ahead with this, we could do something similar with many of the Vector128/Vector256 IndexOfXx code paths as well.

EgorBo commented 1 year ago

I like the fact the proposed change removes more code than it adds, but I probably need more coffee to start understanding what it does 😄

stephentoub commented 1 year ago

Basically we have a bunch of places implementing various IndexOf variations, some of which unroll manually, some of which don't, and when they do unroll there are often subtle differences to how it's done for little apparent reason. This is an attempt to make it consistent, with shared boilerplate they all use and just plug in the single equality check that differs between them.

(Of course, if there were a way for the JIT to just do similar unrolling automatically, we could instead delete all of this.)

EgorBo commented 1 year ago

Thanks, I understand the intention, I meant the actual implementation

(Of course, if there were a way for the JIT to just do similar unrolling automatically, we could instead delete all of this.)

jkotas commented 1 year ago

if perf tests end up looking ok, any concerns with doubling-down on generics like this even further?

What does perf tests looking ok mean? I expect that this is going to regress Mono performance (with interpreter in particular), reference type instantiations performance (limitations around shared generic code optimizations), and static footprint (more generic types instantiated for each T). The question is then going to be whether these regressions are worth taking to eliminate the code duplication.

dotnet / runtime

Explore deduplicating manual unrolling of IndexOf{Any} operations #83913