dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Optimization for large loops. #74171

Closed TheBlackPlague closed 2 years ago

TheBlackPlague commented 2 years ago

Description

int[] x = new int[N];

for (int i = 0; i < x.Length; i++) x[i]++;

Consider the code above. With the current JITting, this code gets assembled to N iterations of the loop. However, given how simple the operation is, there is potential for two things (among others) to speed up this loop's performance: loop unrolling & vectorization.

Consider just the loop part, where N = 10000 and x = new int[N]:

for (int i = 0; i < N; i++) x[i]++;

The following assembly is generated by RyuJIT:

       push     rbp
       mov      rbp, rsp
       xor      eax, eax
       test     rdi, rdi
       je       SHORT G_M31042_IG05
       cmp      dword ptr [rdi+8], 0x2710
       jl       SHORT G_M31042_IG05
G_M31042_IG03:
       movsxd   rsi, eax
       lea      rsi, bword ptr [rdi+4*rsi+16]
       inc      dword ptr [rsi]
       inc      eax
       cmp      eax, 0x2710
       jl       SHORT G_M31042_IG03
       jmp      SHORT G_M31042_IG06
G_M31042_IG05:
       cmp      eax, dword ptr [rdi+8]
       jae      SHORT G_M31042_IG07
       movsxd   rsi, eax
       lea      rsi, bword ptr [rdi+4*rsi+16]
       inc      dword ptr [rsi]
       inc      eax
       cmp      eax, 0x2710
       jl       SHORT G_M31042_IG05
G_M31042_IG06:
       pop      rbp
       ret      
G_M31042_IG07:
       call     [CORINFO_HELP_RNGCHKFAIL]
       int3     

Same code in C++:

for (int i = 0; i < N; i++) x[i]++;

Assembly generated by GCC:

        movdqa  xmm1, XMMWORD PTR .LC0[rip]
        lea     rax, [rdi+40000]
.L2:
        movdqu  xmm0, XMMWORD PTR [rdi]
        add     rdi, 16
        paddd   xmm0, xmm1
        movups  XMMWORD PTR [rdi-16], xmm0
        cmp     rax, rdi
        jne     .L2
        ret
.LC0:
        .long   1
        .long   1
        .long   1
        .long   1

If I use x.Length instead of N...

C# code:

int length = x.Length;
for (int i = 0; i < length; i++) x[i]++;

Generated assembly:

       push     rbp
       mov      rbp, rsp
       mov      eax, dword ptr [rdi+8]
       xor      esi, esi
       test     eax, eax
       jle      SHORT G_M31042_IG06
       test     eax, eax
       jl       SHORT G_M31042_IG05
G_M31042_IG03:
       movsxd   rdx, esi
       lea      rdx, bword ptr [rdi+4*rdx+16]
       inc      dword ptr [rdx]
       inc      esi
       cmp      esi, eax
       jl       SHORT G_M31042_IG03
       jmp      SHORT G_M31042_IG06
G_M31042_IG05:
       movsxd   rdx, esi
       lea      rdx, bword ptr [rdi+4*rdx+16]
       inc      dword ptr [rdx]
       inc      esi
       cmp      esi, eax
       jl       SHORT G_M31042_IG05
G_M31042_IG06:
       pop      rbp
       ret      

Now, I understand that the .NET Runtime aims to be safer (ensuring that the index isn't out of range, etc.), but it seems that there is a clear performance winner. In both cases, the code is just as safe. We should be generating performant assembly for such loops. They're used everywhere, in almost every project. Optimized loops would mean a massive speed boost for all applications.
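For reference, the same operation can already be vectorized by hand today using the cross-platform Vector&lt;T&gt; API; a minimal sketch (the IncrementAll name is illustrative, not an existing API):

```csharp
using System;
using System.Numerics;

class Program
{
    // Hand-vectorized equivalent of: for (int i = 0; i < x.Length; i++) x[i]++;
    public static void IncrementAll(int[] x)
    {
        var ones = new Vector<int>(1);
        int i = 0;
        // Process Vector<int>.Count elements (e.g. 8 with AVX2) per iteration.
        for (; i <= x.Length - Vector<int>.Count; i += Vector<int>.Count)
        {
            var v = new Vector<int>(x, i);
            (v + ones).CopyTo(x, i);
        }
        // Scalar tail for any remaining elements.
        for (; i < x.Length; i++) x[i]++;
    }

    static void Main()
    {
        int[] x = new int[10000];
        IncrementAll(x);
        Console.WriteLine(x[0]);    // 1
        Console.WriteLine(x[9999]); // 1
    }
}
```

The JIT lowers the Vector&lt;T&gt; operations to SIMD instructions where the hardware supports them, which is the kind of code an auto-vectorizer would need to produce automatically.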

ghost commented 2 years ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch. See info in area-owners.md if you want to be subscribed.

Author: TheBlackPlague
Assignees: -
Labels: `tenet-performance`, `area-CodeGen-coreclr`
Milestone: -
JulieLeeMSFT commented 2 years ago

cc @BruceForstall @dotnet/jit-contrib.

EgorBo commented 2 years ago

The loops in your examples are already pretty efficient as is (though there are some things we might improve). For the case where you use 10000, the codegen looks massive, but if you look closer you will notice two loops (the "loop cloning" optimization kicked in), and one of them is pretty fast.

We have existing feature requests for partial unrolling and vectorization, but we're not really interested in them at this point - most of the BCL APIs are already vectorized by hand. Autovectorization is usually pretty fragile even in C++ compilers, and it easily stops working on anything more complicated than incrementing all elements. If you provide some insights/investigations on how many loops in real-world applications would be improved, and by how much, we might consider giving it higher priority.

BruceForstall commented 2 years ago

@TheBlackPlague Thanks for the issue. We know there are many classical compiler loop optimizations that we are missing, and we're trying to implement the most impactful ones (to "real-world" C# code) as resources allow. As EgorBo states, partial unrolling and auto-vectorization haven't met those criteria yet.

I'm going to go ahead and close this, since we're well aware of opportunities for improvement in this area.

TheBlackPlague commented 2 years ago

@BruceForstall & @EgorBo, I would like to discuss this further if possible. I will agree to the final decision made by you guys as you're the team behind this project.

Can we collectively agree that C# can be a valuable language for machine learning? I know there are languages like Python that have a lot of ML frameworks in them, but I imagine that Microsoft does intend to make C# a language that has potential for both model inference & training.

Can we collectively agree that machine learning in C# is "real-world" C# code? While it's super convenient to do training on the GPU, some models are written to be used on the CPU. For that reason, having access to SIMD is very important, as inference would be slow without it. We've all seen and heard that neural network models work with floats (float32/float64) to maintain their high accuracy; however, in some instances where complete accuracy isn't necessary, experienced ML researchers will use quantization. See: Medium Article on Quantization in Deep Learning

The rough idea is to convert floating point values to values like integers, shorts, and even sbytes. One good reason is that integer operations are usually faster on the CPU. Another good reason is the size of the data.
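That conversion can be sketched in a few lines; the single, symmetric scale factor below is purely illustrative (real quantization schemes are considerably more involved):

```csharp
using System;

class QuantizeDemo
{
    // Map float weights onto sbyte with one scale factor.
    // Dequantize later with: weight ≈ q / scale.
    public static (sbyte[] Quantized, float Scale) Quantize(float[] weights)
    {
        // Find the largest magnitude to map onto the sbyte range [-127, 127].
        float max = 0f;
        foreach (float w in weights) max = MathF.Max(max, MathF.Abs(w));
        float scale = max == 0f ? 1f : 127f / max;

        var q = new sbyte[weights.Length];
        for (int i = 0; i < weights.Length; i++)
            q[i] = (sbyte)MathF.Round(weights[i] * scale);
        return (q, scale);
    }
}
```

Once the weights are sbytes, four times as many of them fit into each SIMD register compared with float32, which is exactly where the wide integer operations discussed below pay off.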

On AVX2 (not going to talk about AVX-512 right now, as it isn't supported in .NET yet), the vector register size is 256 bits (or, for convenience: 32 bytes). Thus, in parallel, one can operate on 8 integers, 16 shorts (int16), or 32 sbytes (int8). The cost is roughly the same as doing the operation once on that type.
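These element counts can be confirmed directly from the fixed-width vector types (the counts are properties of the 256-bit type itself, independent of whether the hardware accelerates it):

```csharp
using System;
using System.Runtime.Intrinsics;

class VectorWidths
{
    static void Main()
    {
        // A 256-bit register holds 32 bytes, so:
        Console.WriteLine(Vector256<int>.Count);   // 8  (4 bytes each)
        Console.WriteLine(Vector256<short>.Count); // 16 (2 bytes each)
        Console.WriteLine(Vector256<sbyte>.Count); // 32 (1 byte each)
    }
}
```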

The point is that yes, one can write their own vectorization code, but not every ML researcher will know about vectorization and how to use it properly. Moreover, not everyone will know each architecture's (AVX or SSE) specific instructions well enough to further optimize their algorithms (or pipelines).

Other than that, many more cases, such as games in C#, could use vectorization. In certain games, immense amounts of calculations need to be done at the same time. We talk about how Unity should switch over to .NET Core and give users access to the new APIs, but why would they when .NET Core's JIT compiler performs worse than whatever they've built? Not to mention, not every aspiring game engine developer would want to work in C# and C++. Furthermore, not every new game developer will know about vectorization.

No offense, but I believe it's incredibly incorrect to think the only "real-world" use-case of C# is for databases and APIs. C# is a beautiful language with the best syntax I've seen in a language - precise, concise, and expressive. If you look outside the bubble of databases and APIs, you'll see people working tirelessly to create marvelous things in this language. All to at times be beaten by C++'s performance.

We should be working towards making C# as fast as C++, and if not C++ due to safety requirements, at least Rust should be a feasible target. I understand that, due to the way a JIT is supposed to run (portable, with minimal runtime overhead), it's hard to do the multiple passes that compilers like GCC do. It's why I applaud everyone who has worked on RyuJIT. It's amazing. However, with NativeAOT compilation right around the corner, isn't this something that could happen there, for those who require said performance improvements?

I urge you to consider this.

BruceForstall commented 2 years ago

@TheBlackPlague Thank you for sharing your thoughts. Let me be clear: we want to continue improving RyuJIT and .NET code generation in general to provide better and better performance, and to make continuous improvements in the core libraries and in C# libraries like ML.NET. If you haven't seen them before, check out Stephen Toub's per-release performance improvement summary blog posts, e.g., https://devblogs.microsoft.com/dotnet/performance-improvements-in-net-6/.

We'll soon be starting planning for the work we'll choose to invest in for .NET 8. Loop, vectorization, and vector intrinsics improvements will be top-of-mind in those discussions. (You might be interested to see the start of AVX-512 support appearing: https://github.com/dotnet/runtime/issues/73262.)

So far, with respect to vectorization, we've focused on providing the solid cross-platform underpinnings that allow us to generate these instructions and have optimized core libraries routines to use them. We expect performance-sensitive shared library writers will generally have more incentive to optimize their code in this way than general app developers. To make it easier for all, we have Vector<T>, and there have been some discussions about how to improve that by providing more generality and better performance for more scenarios.

TheBlackPlague commented 2 years ago

@BruceForstall Thank you very much for your response. I believe .NET 8 would be the right target for this. Knowing that it'll be one of the top things you consider for it is enough to satisfy me.

Are these discussions going to be internal or is there some place a fellow like myself can tune in and possibly participate?

Apologies if this is a bit off-topic.

danmoseley commented 2 years ago

it's incredibly incorrect to think the only "real-world" use-case of C# is for databases and APIs.

Nobody on this team is saying this.

Moreover, not everyone will know about each architecture (AVX or SSE)'s specific instructions to optimize their algorithms (or pipelines) further.

As mentioned above we continue to invest heavily in enabling cross platform vectorized operations, including most recently introducing Vector128 and Vector256, that allow you to write code for multiple architectures at the same time.
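As a sketch of what that cross-platform style looks like, here is the increment loop from the top of the thread written against Vector128&lt;T&gt; (this assumes the span-based Vector128 APIs introduced in .NET 7; on hardware without 128-bit SIMD the operations fall back to software):

```csharp
using System;
using System.Runtime.Intrinsics;

class V128Demo
{
    // Increment each element using 128-bit vectors; the same code path
    // runs on x64 (SSE), Arm64 (AdvSimd), or via software fallback.
    public static void IncrementAll(Span<int> x)
    {
        var ones = Vector128.Create(1);
        int i = 0;
        // Vector128<int>.Count is always 4 (128 bits / 4-byte ints).
        for (; i <= x.Length - Vector128<int>.Count; i += Vector128<int>.Count)
        {
            var v = Vector128.Create<int>(x.Slice(i));
            (v + ones).CopyTo(x.Slice(i));
        }
        // Scalar tail for leftover elements.
        for (; i < x.Length; i++) x[i]++;
    }
}
```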

TheBlackPlague commented 2 years ago

it's incredibly incorrect to think the only "real-world" use-case of C# is for databases and APIs.

Nobody on this team is saying this.

Moreover, not everyone will know about each architecture (AVX or SSE)'s specific instructions to optimize their algorithms (or pipelines) further.

As mentioned above we continue to invest heavily in enabling cross platform vectorized operations, including most recently introducing Vector128 and Vector256, that allow you to write code for multiple architectures at the same time.

Thank you for responding.

I apologize if that's not what was meant, but that's what I inferred from the text.

If not auto-vectorization, can there be consideration of adding vectorized addition methods to Span<T> if and when T is a value type? As in, something like int8, uint64, float64, etc.

While Vector64<T>/Vector128<T>/Vector256<T> are good, from what I can tell they're not truly cross-platform, at least not Vector256<T> (it is only hardware accelerated on platforms supporting 256-bit vectors), and being cross-platform is the main goal of .NET Core.

Vector<T>, while truly cross-platform, isn't variable-sized: it's limited to the one size available on the runtime machine. This leads to manual code being written for the logic rather than simplifying it away into easy intrinsic methods (which might even be faster).

Adding support to Span<T> with proper loop unrolling & vectorization would be amazing.

The APIs I'm proposing for Span<T> are something like the following method:

void Add<T>(Span<T> a, Span<T> b, Span<T> result) where T : struct
{
    // Code to unroll the loop over variable-sized spans and perform vectorized
    // addition. Possibly intrinsic and filled in by the JIT?
}

These APIs would, again, only work with the types I mentioned. All in all, it would be a great addition to the Span API and give us all we need for simple vectorization support. After all, auto-vectorization won't cover everything.
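For illustration, here is a rough user-code approximation of such an Add method, built on Vector&lt;T&gt; and .NET 7 generic math; the name and shape follow the proposal above and are not an existing BCL API:

```csharp
using System;
using System.Numerics;

static class SpanMath
{
    // Hypothetical element-wise addition, roughly as proposed above.
    // IAdditionOperators (.NET 7 generic math) supplies '+' for the scalar tail.
    public static void Add<T>(ReadOnlySpan<T> a, ReadOnlySpan<T> b, Span<T> result)
        where T : struct, IAdditionOperators<T, T, T>
    {
        if (a.Length != b.Length || a.Length != result.Length)
            throw new ArgumentException("Spans must have equal length.");

        int i = 0;
        // Vector<T>.Count is the SIMD width the JIT picked for this machine.
        for (; i <= a.Length - Vector<T>.Count; i += Vector<T>.Count)
        {
            var va = new Vector<T>(a.Slice(i));
            var vb = new Vector<T>(b.Slice(i));
            (va + vb).CopyTo(result.Slice(i));
        }
        // Scalar tail for leftover elements.
        for (; i < a.Length; i++)
            result[i] = a[i] + b[i];
    }
}
```

Vector&lt;T&gt; only supports the primitive numeric types, so at runtime this throws for other struct T; a BCL version could constrain or specialize this more precisely.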

danmoseley commented 2 years ago

@tannergooding

BruceForstall commented 2 years ago

Are these discussions going to be internal or is there some place a fellow like myself can tune in and possibly participate?

Our planning discussions are typically internal. We've thought about ways to make them external, but haven't done so. However, our plans are published as GitHub issues and/or projects.