Closed sdmaclea closed 3 years ago
Instead to use Jit Intrisics or native code will be better to implement it using SSE / AVX copies directly using this work when it will be ready: https://github.com/dotnet/corefx/issues/22940
better to implement it using SSE / AVX copies directly using this work when it will be ready: dotnet/corefx#22940
@fanoI That is a amd64 solution. I was hoping for a more generic memmove
intrinsic. The memmove
intrinsic could certainly be implemented in term of SSE / AX for intel, but It should be an abstract platform agnostic intrinsic here.
I think this would be consistent with @AndyAyersMS work to add [Intrinsic]
support in CoreCLR
What would this intrinsic expand into?
I was think something like JIT_MemCpy
but without the alignment constraints because it is known to have no GCRefs.
That's not really intrinsic. It would be an FCall. E.g. it is how Buffer.Memcpy is implemented on Windows ARM for historic reasons (https://github.com/dotnet/coreclr/blob/master/src/vm/arm/memcpy.asm). The problem with asm implementation like these is:
They are constantly outdated
I think to some extent System.Buffer:Memmove
already has the same issue, but it is managed code, so you cannot easily tweak it.
Perhaps the alternative would be a templated block cpy intrinsic, it would solve many of the problems of System.Buffer:Memmove
. Something like:
rawblockcpy<int sizeBytes, int ptrOffset, bool ptrIncrement>(*src, *dst)
.
You could limit the size to 64/128/1024 bytes or something reasonable.
It could expand to an enhanced version of the genCodeForCpBlkUnroll
code. (Additions for offset and increment)
Pleae note that https://github.com/dotnet/corefx/issues/22940 will be done for ARM and any CPU that CLR will support in future so one can think of memmove_sse(), memmove_avx(), memmove_neon() and so on...
one can think of memmove_sse(), memmove_avx(), memmove_neon() and so on...
Exactly my concern. If they all could do the same thing using different implementations, then they should share the same name i.e. __memmove
.
@sdmaclea, with the hardware intrinsic support, rather than having to code a __memmove
intrinsic or FCALL directly into the runtime, we could do it directly in C#.
internal static void Memmove(byte* dest, byte* src, nuint len)
{
if (Avx.IsSupported)
{
// x86 AVX Implementation
}
else if (Sse.IsSupported)
{
// x86 SSE Implementation
}
else if (Neon.IsSupported)
{
// ARM Neon Implementation
}
else
{
// Software Implementation
}
}
This not only allows it to be tweaked and updated outside of a runtime update, it also has all of the code paths that aren't hit dropped by the JIT (so the SSE, Neon, and Software paths will get dropped as dead code on modern processors, and everything except for Neon will get dropped on ARM).
| This not only allows it to be tweaked and updated outside of a runtime update, it also has all of the code paths that aren't hit dropped by the JIT (so the SSE, Neon, and Software paths will get dropped as dead code on modern processors, and everything except for Neon will get dropped on ARM).
Of course __memove()
has all the same benefits except there is no burden on the C# programmer. The burden is on the intrinsic authors.
we could do it directly in C#
This is a blessing and a curse. If we do it in C# it should only be done once. Proliferation of this type of code in user space will kill portability of .Net. It should be limited to CoreCLR and maybe CoreFX.
It still may not generate code which could be hand crafted for __memmove.
Of course it also means we do not have to hand craft it.
I will work with whatever is deisgned, but modern compilers are offering generic intrinsics. Consider
gcc/clangs's __sync_add_and_fetch
et. al.
@russellhadley Has agreed to discuss this with me offline, so this does not have to be resolved here.
This is a blessing and a curse. If we do it in C# it should only be done once.
I meant we could codify the algorithm in C# using the new System.Runtime.Intrinsics
. The implementation would still live in CoreFX and would (likely) be centrally managed.
It still may not generate code which could be hand crafted for __memmove.
No, it might not be as ideal as hand-crafted assembly, but hand-crafted assembly requires it to live in CoreCLR, which means we can only get updates/improvements during the next Runtime release (or patch). It also means that we have to backport those fixes to other runtimes (Desktop, Mono, etc).
Of course __memove() has all the same benefits except there is no burden on the C# programmer. The burden is on the intrinsic authors
The burden of the implementation living in CoreFX or CoreCLR would fall on the same people regardless of where it is implemented. The difference is, are you coding using raw assembly or are you coding using intrinsic functions that map directly to SIMD instructions. Raw assembly might use movntdq
, C/C++ would use _mm_stream_si128
instead, and C# would use Sse2.StoreAlignedNonTemporal
.
The implementation would still live in CoreFX
The implementation would live in CoreLib in the partition shared between CoreCLR and CoreRT repos. In future, there can OOB version in corefx as well, but it would only make sense once the new intrinsics are supported outside of .NET Core.
modern compilers are offering generic intrinsics
Modern (C++) compilers are offering both generic intrinsics and platform specific intrinsics. We are on the same plan. E.g. System.Threading.Interlocked.Add
is equivalent of __sync_add_and_fetch
and it has intrinsic expansion.
Code in
SystemBuffer:Memmove
can be performance critical.Current code performance is limited by PInvoke overhead, limited optimization of the
Block16
andBlock64
struct copies dotnet/runtime#8872, and general managed to native limitationsIt may be better to move this to native code to get optimal performance
@dotnet/jit-contrib