Proposal: Use JIT helper/intrinsic to implement SystemBuffer:Memmove

sdmaclea commented 7 years ago

Code in SystemBuffer:Memmove can be performance critical.

Current code performance is limited by PInvoke overhead, limited optimization of the Block16 and Block64 struct copies dotnet/runtime#8872, and general managed to native limitations

It may be better to move this to native code to get optimal performance

@dotnet/jit-contrib

fanoI commented 7 years ago

Instead to use Jit Intrisics or native code will be better to implement it using SSE / AVX copies directly using this work when it will be ready: https://github.com/dotnet/corefx/issues/22940

sdmaclea commented 7 years ago

better to implement it using SSE / AVX copies directly using this work when it will be ready: dotnet/corefx#22940

@fanoI That is a amd64 solution. I was hoping for a more generic memmove intrinsic. The memmove intrinsic could certainly be implemented in term of SSE / AX for intel, but It should be an abstract platform agnostic intrinsic here.

I think this would be consistent with @AndyAyersMS work to add [Intrinsic] support in CoreCLR

jkotas commented 7 years ago

What would this intrinsic expand into?

sdmaclea commented 7 years ago

I was think something like JIT_MemCpy but without the alignment constraints because it is known to have no GCRefs.

jkotas commented 7 years ago

That's not really intrinsic. It would be an FCall. E.g. it is how Buffer.Memcpy is implemented on Windows ARM for historic reasons (https://github.com/dotnet/coreclr/blob/master/src/vm/arm/memcpy.asm). The problem with asm implementation like these is:

They are constantly outdated. The C runtime memcpy - on Windows at least - gets constantly tweaked for latest processors
They tend to stall the GC for large blocks unless you insert GC probes into it - but then the implementation starts differing significantly from the C runtime version and it is pain to update.

sdmaclea commented 7 years ago

They are constantly outdated

I think to some extent System.Buffer:Memmove already has the same issue, but it is managed code, so you cannot easily tweak it.

Perhaps the alternative would be a templated block cpy intrinsic, it would solve many of the problems of System.Buffer:Memmove. Something like:

rawblockcpy<int sizeBytes, int ptrOffset, bool ptrIncrement>(*src, *dst).

You could limit the size to 64/128/1024 bytes or something reasonable.

It could expand to an enhanced version of the genCodeForCpBlkUnroll code. (Additions for offset and increment)

fanoI commented 7 years ago

Pleae note that https://github.com/dotnet/corefx/issues/22940 will be done for ARM and any CPU that CLR will support in future so one can think of memmove_sse(), memmove_avx(), memmove_neon() and so on...

sdmaclea commented 7 years ago

one can think of memmove_sse(), memmove_avx(), memmove_neon() and so on...

Exactly my concern. If they all could do the same thing using different implementations, then they should share the same name i.e. __memmove.

tannergooding commented 7 years ago

@sdmaclea, with the hardware intrinsic support, rather than having to code a __memmove intrinsic or FCALL directly into the runtime, we could do it directly in C#.

internal static void Memmove(byte* dest, byte* src, nuint len)
{
    if (Avx.IsSupported)
    {
        // x86 AVX Implementation
    }
    else if (Sse.IsSupported)
    {
        // x86 SSE Implementation
    }
    else if (Neon.IsSupported)
    {
        // ARM Neon Implementation
    }
    else
    {
        // Software Implementation
    }
}

This not only allows it to be tweaked and updated outside of a runtime update, it also has all of the code paths that aren't hit dropped by the JIT (so the SSE, Neon, and Software paths will get dropped as dead code on modern processors, and everything except for Neon will get dropped on ARM).

sdmaclea commented 7 years ago

| This not only allows it to be tweaked and updated outside of a runtime update, it also has all of the code paths that aren't hit dropped by the JIT (so the SSE, Neon, and Software paths will get dropped as dead code on modern processors, and everything except for Neon will get dropped on ARM).

Of course __memove() has all the same benefits except there is no burden on the C# programmer. The burden is on the intrinsic authors.

we could do it directly in C#

This is a blessing and a curse. If we do it in C# it should only be done once. Proliferation of this type of code in user space will kill portability of .Net. It should be limited to CoreCLR and maybe CoreFX.

It still may not generate code which could be hand crafted for __memmove. Of course it also means we do not have to hand craft it.

I will work with whatever is deisgned, but modern compilers are offering generic intrinsics. Consider gcc/clangs's __sync_add_and_fetch et. al.

@russellhadley Has agreed to discuss this with me offline, so this does not have to be resolved here.

tannergooding commented 7 years ago

This is a blessing and a curse. If we do it in C# it should only be done once.

I meant we could codify the algorithm in C# using the new System.Runtime.Intrinsics. The implementation would still live in CoreFX and would (likely) be centrally managed.

It still may not generate code which could be hand crafted for __memmove.

No, it might not be as ideal as hand-crafted assembly, but hand-crafted assembly requires it to live in CoreCLR, which means we can only get updates/improvements during the next Runtime release (or patch). It also means that we have to backport those fixes to other runtimes (Desktop, Mono, etc).

Of course __memove() has all the same benefits except there is no burden on the C# programmer. The burden is on the intrinsic authors

The burden of the implementation living in CoreFX or CoreCLR would fall on the same people regardless of where it is implemented. The difference is, are you coding using raw assembly or are you coding using intrinsic functions that map directly to SIMD instructions. Raw assembly might use movntdq, C/C++ would use _mm_stream_si128 instead, and C# would use Sse2.StoreAlignedNonTemporal.

jkotas commented 7 years ago

The implementation would still live in CoreFX

The implementation would live in CoreLib in the partition shared between CoreCLR and CoreRT repos. In future, there can OOB version in corefx as well, but it would only make sense once the new intrinsics are supported outside of .NET Core.

jkotas commented 7 years ago

modern compilers are offering generic intrinsics

Modern (C++) compilers are offering both generic intrinsics and platform specific intrinsics. We are on the same plan. E.g. System.Threading.Interlocked.Add is equivalent of __sync_add_and_fetch and it has intrinsic expansion.

dotnet / runtime

Proposal: Use JIT helper/intrinsic to implement SystemBuffer:Memmove #8903