dotnet / runtimelab

This repo is for experimentation and exploring new ideas that may or may not make it into the main dotnet/runtime repo.
MIT License

[NativeAOT-LLVM] Vector.IsHardwareAccelerated true? #2515

Open jasonthorsness opened 7 months ago

jasonthorsness commented 7 months ago

I understand that most browsers support WebAssembly SIMD, and so does Emscripten.

I am seeing Vector.IsHardwareAccelerated return false from my app compiled with:

<PackageReference Include="Microsoft.DotNet.ILCompiler.LLVM; runtime.win-x64.Microsoft.DotNet.ILCompiler.LLVM" Version="9.0.0-*" />
dotnet publish -r browser-wasm -c Release /p:MSBuildEnableWorkloadResolver=false --self-contained /p:NativeDebugSymbols=false /p:EmccExtraArgs="-s EXPORTED_FUNCTIONS=""[_malloc,_Answer]"" -s EXPORTED_RUNTIME_METHODS=cwrap --post-js=run.js"

Is this expected at this time? Thanks!

SingleAccretion commented 7 months ago

> Is this expected at this time?

Yes, we don't support SIMD yet.

jasonthorsness commented 7 months ago

Well, I tried the same code in Blazor AOT, which supposedly supports SIMD, and it's not any faster, so it must not support these operations. On my system this takes 120 ms natively compiled, 2400 ms in Blazor, and 1600 ms in NativeAOT-LLVM.

    using System.Numerics;
    using System.Runtime.InteropServices;

    public class Class1
    {
        [UnmanagedCallersOnly(EntryPoint = "Alloc")]
        public static unsafe byte* Alloc(int length)
        {
            return (byte*)NativeMemory.AlignedAlloc((nuint)length, (nuint)Vector<byte>.Count);
        }

        [UnmanagedCallersOnly(EntryPoint = "Answer")]
        public static unsafe int Answer(byte* f, int l)
        {
            for (int i = 0; i < 10000; ++i)
            {
                for (byte* ptr = f; ptr != f + l; ptr += Vector<byte>.Count)
                {
                    (~Vector.LoadAligned(ptr)).StoreAligned(ptr);
                }
            }

            return Vector<byte>.Count + (Vector.IsHardwareAccelerated ? 100 : 1000);
        }
    }

Blazor used:

    <RunAOTCompilation>true</RunAOTCompilation>
    <WasmEnableSIMD>true</WasmEnableSIMD>

Would it be straightforward to link in a C or C++ file with SSE2 intrinsics and have Emscripten translate it? Are there any examples? (Sorry, this doesn't seem appropriate for an issue; I'm not sure where else to discuss or ask questions.)

SingleAccretion commented 7 months ago

> Would it be straightforward to link in a C or C++ file with SSE2 intrinsics and have Emscripten translate it? Any examples?

With NativeAOT-LLVM, you would first need to compile the native code into a native library. For the case of a single .c file, it can be as simple as:

; See https://emscripten.org/docs/porting/simd.html#compiling-simd-code-targeting-x86-sse-instruction-sets for SSE compatibility flags.
emcc -msimd128 -c lib.c -O2 -o lib.o

<NativeLibrary Include="lib.o" /> ; Statically linked code, use direct PInvoke to invoke it.

You do need to use a matching version of Emscripten, however.

https://learn.microsoft.com/en-us/aspnet/core/blazor/webassembly-native-dependencies?view=aspnetcore-8.0 is the documentation for how to do the same using the upstream toolchain - it supports compiling source files directly.
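
For the SSE2 route asked about above, here is a minimal sketch of what such a lib.c might look like. The function name `invert_bytes_sse2` is hypothetical, and the `__SSE2__` guard with a scalar fallback is an assumption added so the file still compiles where SSE2 is unavailable. Per the Emscripten SIMD page linked above, SSE2 code is expected to be built with `-msse2` in addition to `-msimd128` (for example `emcc -msimd128 -msse2 -c lib.c -O2 -o lib.o`); treat the exact flags as something to confirm against the Emscripten version in use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2 intrinsics */
#endif

/* Hypothetical kernel: inverts every byte of ptr[0..length) in place. */
void invert_bytes_sse2(uint8_t* ptr, int length) {
    assert(length >= 0);
    size_t n = (size_t)length;
    size_t i = 0;
#if defined(__SSE2__)
    /* Process whole 16-byte blocks; unaligned loads/stores keep this safe
       for any pointer. XOR with all-ones flips every bit. */
    const __m128i ones = _mm_set1_epi32(-1);
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i*)(ptr + i));
        _mm_storeu_si128((__m128i*)(ptr + i), _mm_xor_si128(v, ones));
    }
#endif
    /* Scalar loop handles the tail (or everything, without SSE2). */
    for (; i < n; ++i) {
        ptr[i] = (uint8_t)~ptr[i];
    }
}
```

Unlike a loop that only walks whole vectors, this sketch also inverts the trailing bytes when the length is not a multiple of 16.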

jasonthorsness commented 7 months ago

Just wanted to note that this works great. The same test as above, using the WASM SIMD functions directly, takes only ~330 ms, which seems expected: the natively compiled version is about twice as fast (likely because it gets to use 256-bit vectors on my machine instead of 128-bit), and the WASM SIMD version is roughly 4 times faster than the Vector version.

In case anyone sees this, I just put this in my project file:

  <ItemGroup>
    <DirectPInvoke Include="lib" />
    <NativeLibrary Include="lib.o" />
  </ItemGroup>

  <Target Name="CompileNativeLibrary" BeforeTargets="BeforeBuild">
    <Exec Command="emcc -msimd128 -c lib.c -O2 -o lib.o" />
  </Target>

Then in the code

        [LibraryImport("lib")]
        internal static unsafe partial void bar(byte* ptr, int n);

And for this test lib.c file is just this:

#include <stddef.h>
#include <stdint.h> /* uint8_t */
#include <wasm_simd128.h>

/* Inverts the buffer in place, one 128-bit vector (16 bytes) at a time.
   Any trailing length % 16 bytes are left untouched. */
void bar(uint8_t* ptr, int length) {
    v128_t* simd_ptr = (v128_t*)ptr;
    size_t num_vectors = length / sizeof(v128_t);
    v128_t ones = wasm_i32x4_splat(~0); /* all bits set */
    for (size_t i = 0; i < num_vectors; ++i) {
        v128_t current_vector = wasm_v128_load(simd_ptr + i);
        v128_t inverted_vector = wasm_v128_xor(current_vector, ones); /* x ^ 1...1 == ~x */
        wasm_v128_store(simd_ptr + i, inverted_vector);
    }
}
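
One way to sanity-check a kernel like `bar` is to compare its output against a plain scalar model. The sketch below is not part of the original test; `bar_reference` and `buffers_match` are hypothetical helper names, and the model deliberately mirrors the behaviour of the loop above by inverting only whole 16-byte blocks and leaving any tail bytes as they were.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of bar(): inverts only the whole 16-byte
   blocks of the buffer, leaving the trailing length % 16 bytes untouched. */
void bar_reference(uint8_t* ptr, int length) {
    assert(length >= 0);
    size_t whole = ((size_t)length / 16) * 16;
    for (size_t i = 0; i < whole; ++i) {
        ptr[i] = (uint8_t)~ptr[i];
    }
}

/* Hypothetical helper: returns 1 if the two buffers are byte-identical. */
int buffers_match(const uint8_t* a, const uint8_t* b, size_t n) {
    return memcmp(a, b, n) == 0;
}
```

Running `bar` on one copy of a buffer and `bar_reference` on another, then calling `buffers_match`, would flag any difference between the SIMD and scalar paths.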