AdamsLair / duality

a 2D Game Development Framework
https://adamslair.github.io/duality
MIT License

Investigate By-Ref Vector and Matrix Operators in C# 8.0 #600

Open ilexp opened 6 years ago

ilexp commented 6 years ago

Summary

As mentioned in issue #598, the new in parameter keyword is allowed on operators and also accepts literal values at the call site, so operator implementations can supersede the previous static by-ref alternatives to the operators.
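
For context, a minimal sketch of the two styles side by side, assuming a Vector3 struct with public float fields X, Y and Z (the static Add signature mirrors the existing by-ref helper pattern, but is shown here only as an illustration):

// Pre-C# 7.2 style: a static helper that avoids copying the operands by
// taking them by reference, at the cost of a clunkier call site.
public static void Add(ref Vector3 left, ref Vector3 right, out Vector3 result)
{
    result = new Vector3(left.X + right.X, left.Y + right.Y, left.Z + right.Z);
}

// C# 7.2+ style: the operator itself takes its operands as readonly
// references, so the natural "a + b" syntax avoids the copies as well.
// Unlike ref, in arguments may also be literals or temporaries.
public static Vector3 operator +(in Vector3 left, in Vector3 right)
{
    return new Vector3(left.X + right.X, left.Y + right.Y, left.Z + right.Z);
}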

Analysis

Barsonax commented 6 years ago

Also, I'm not sure if the current arithmetic implementation is the optimal one. There is quite a big difference in the IL being generated. Take for example the Vector3 + operator:

Source:

public static Vector3 operator +(Vector3 left, Vector3 right)
{
    left.X += right.X;
    left.Y += right.Y;
    left.Z += right.Z;
    return left;
}

IL:

  .method public hidebysig static specialname valuetype Duality.Vector3 
    op_Addition(
      valuetype Duality.Vector3 left, 
      valuetype Duality.Vector3 right
    ) cil managed 
  {
    .maxstack 8

    // [502 4 - 502 22]
    IL_0000: ldarga.s     left
    IL_0002: ldflda       float32 Duality.Vector3::X
    IL_0007: dup          
    IL_0008: ldind.r4     
    IL_0009: ldarg.1      // right
    IL_000a: ldfld        float32 Duality.Vector3::X
    IL_000f: add          
    IL_0010: stind.r4     

    // [503 4 - 503 22]
    IL_0011: ldarga.s     left
    IL_0013: ldflda       float32 Duality.Vector3::Y
    IL_0018: dup          
    IL_0019: ldind.r4     
    IL_001a: ldarg.1      // right
    IL_001b: ldfld        float32 Duality.Vector3::Y
    IL_0020: add          
    IL_0021: stind.r4     

    // [504 4 - 504 22]
    IL_0022: ldarga.s     left
    IL_0024: ldflda       float32 Duality.Vector3::Z
    IL_0029: dup          
    IL_002a: ldind.r4     
    IL_002b: ldarg.1      // right
    IL_002c: ldfld        float32 Duality.Vector3::Z
    IL_0031: add          
    IL_0032: stind.r4     

    // [505 4 - 505 16]
    IL_0033: ldarg.0      // left
    IL_0034: ret          

  } // end of method Vector3::op_Addition

Assembly:

 push        ebp  
 mov         ebp,esp  
 push        edi  
 push        esi  
 sub         esp,0Ch  
 lea         edx,[ebp-14h]  
 fld         dword ptr [ebp+14h]  
 fstp        dword ptr [edx]  
 fld         dword ptr [ebp+18h]  
 fstp        dword ptr [edx+4]  
 fld         dword ptr [ebp+1Ch]  
 fstp        dword ptr [edx+8]  
 lea         eax,[ebp-14h]  
 fld         dword ptr [ebp+8]  
 fadd        dword ptr [eax]  
 fstp        dword ptr [eax]  
 lea         eax,[ebp-10h]  
 fld         dword ptr [ebp+0Ch]  
 fadd        dword ptr [eax]  
 fstp        dword ptr [eax]  
 lea         eax,[ebp-0Ch]  
 fld         dword ptr [ebp+10h]  
 fadd        dword ptr [eax]  
 fstp        dword ptr [eax]  
 mov         edi,ecx  
 lea         esi,[ebp-14h]  
 movq        xmm0,mmword ptr [esi]  
 movq        mmword ptr [edi],xmm0  
 add         esi,8  
 add         edi,8  
 movs        dword ptr es:[edi],dword ptr [esi]  
 lea         esp,[ebp-8]  
 pop         esi  
 pop         edi  
 pop         ebp  
 ret         18h

And compare it to this:

Source:

public static Vector3 operator +(Vector3 left, Vector3 right)
{
    return new Vector3(left.X + right.X, left.Y + right.Y, left.Z + right.Z);
}

IL:

  .method public hidebysig static specialname valuetype Duality.Vector3 
    op_Addition(
      valuetype Duality.Vector3 left, 
      valuetype Duality.Vector3 right
    ) cil managed 
  {
    .maxstack 8

    // [502 4 - 502 76]
    IL_0000: ldarg.0      // left
    IL_0001: ldfld        float32 Duality.Vector3::X
    IL_0006: ldarg.1      // right
    IL_0007: ldfld        float32 Duality.Vector3::X
    IL_000c: add          
    IL_000d: ldarg.0      // left
    IL_000e: ldfld        float32 Duality.Vector3::Y
    IL_0013: ldarg.1      // right
    IL_0014: ldfld        float32 Duality.Vector3::Y
    IL_0019: add          
    IL_001a: ldarg.0      // left
    IL_001b: ldfld        float32 Duality.Vector3::Z
    IL_0020: ldarg.1      // right
    IL_0021: ldfld        float32 Duality.Vector3::Z
    IL_0026: add          
    IL_0027: newobj       instance void Duality.Vector3::.ctor(float32, float32, float32)
    IL_002c: ret          

  } // end of method Vector3::op_Addition

Assembly:

 push        ebp  
 mov         ebp,esp  
 push        eax  
 fld         dword ptr [ebp+14h]  
 fadd        dword ptr [ebp+8]  
 fstp        dword ptr [ebp-4]  
 fld         dword ptr [ebp-4]  
 fld         dword ptr [ebp+18h]  
 fadd        dword ptr [ebp+0Ch]  
 fstp        dword ptr [ebp-4]  
 fld         dword ptr [ebp-4]  
 fld         dword ptr [ebp+1Ch]  
 fadd        dword ptr [ebp+10h]  
 fstp        dword ptr [ebp-4]  
 fld         dword ptr [ebp-4]  
 fxch        st(2)  
 fstp        dword ptr [ecx]  
 fstp        dword ptr [ecx+4]  
 fstp        dword ptr [ecx+8]  
 mov         esp,ebp  
 pop         ebp  
 ret         18h 

BenchmarkDotNet also shows that creating a new vector is a lot faster than modifying the parameter vector in place. Creating a small test project in Duality with a stopwatch also confirms this. Might be worth looking into, and if you want to use the in keyword (readonly ref, C# 7.2) you would have to do this anyway.

ilexp commented 6 years ago

Can you provide the C# code for each of the IL code snippets?

To be really sure how this ends up on the CPU, we'll also need to compare the JITted assembly code, since a lot of optimizations only happen in the JIT stage.

BenchmarkDotNet also shows that creating a new vector is a lot faster than modifying the parameter vector in place.

What do you mean by creating a new vector vs. modifying one? Using the in keyword shouldn't change a lot in that regard. The result value is still a newly created vector either way?

Barsonax commented 6 years ago

Basically it's this:

The current version (vector is the vector you get through the parameters):

vector.X = someCalculatedValue1;
vector.Y = someCalculatedValue2;
// etc.
return vector;

vs this:

return new Vector(someCalculatedValue1, someCalculatedValue2, /* etc. */);

The last one is also compatible with the in keyword, since you do not modify the vector you get through the parameters (that is, you can add the in keyword to the parameter without having to change the body). It also seems that at least in IL it's a lot shorter, and measuring it shows it's faster. Haven't checked the JITted code.

EDIT: added the source to my previous post

ilexp commented 6 years ago

The last one is also compatible with the in keyword, since you do not modify the vector you get through the parameters (that is, you can add the in keyword to the parameter without having to change the body).

Ah, sure. For the in keyword, that aspect of the source will have to change, yes. It could still be either a temporary vector assigned element by element, or a return + constructor combo, though. The latter might be faster, though I would have assumed it shouldn't make a difference after all optimizations have fired.

Barsonax commented 6 years ago

Might be better memory locality, since it doesn't have to go back to that vector between the calculations? I don't know really, I just noticed this when I looked at the IL and decided to measure it. Could be a good idea to try to reproduce it and see whether that difference shows up on your machine as well?

ilexp commented 6 years ago

I don't know really, I just noticed this when I looked at the IL and decided to measure it. Could be a good idea to try to reproduce it and see whether that difference shows up on your machine as well?

Good call. Not sure I'll get around to benchmarking this myself, but maybe you can just post the C# file with your benchmarking code and the full BenchmarkDotNet results for reference? That should be enough of a statement for when we get back to this issue later.

Barsonax commented 6 years ago

I have added the assembly to the source and IL post. Here is the benchmark code:

using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using Duality;

class Program
{
    static void Main(string[] args)
    {
        BenchmarkRunner.Run<VectorOperationBench>();
        Console.ReadKey();
    }
}

public class VectorOperationBench
{
    private Vector3 left = new Vector3(4, 7, 8);
    private Vector3 right = new Vector3(1, 2, 3);

    [Benchmark]
    public Vector3 OldAdd()
    {
        return this.left + this.right;
    }

    [Benchmark]
    public Vector3 NewAdd()
    {
        return new Vector3(this.left.X + this.right.X, this.left.Y + this.right.Y, this.left.Z + this.right.Z);
    }
}

These are the results I get on an i7-4790K: [benchmark results screenshot]

On my laptop it's slightly slower in both cases (slower CPU), but overall the same result.

ilexp commented 6 years ago

Okay, that really looks like there's an obvious perf gain from restructuring the operators that way. Very nice find 👍 If you're up for it, this could be a worthwhile master PR until we get around to dealing with this issue / the in keyword stuff.

Barsonax commented 6 years ago

So first, to be really sure this wasn't just some exception to the rule, I added more benchmarks and measured the current performance: [benchmark results screenshot]

Then I changed those operators and benchmarked them again: [benchmark results screenshot]

So except for the division, which is more or less the same, it's a very big improvement. I wonder why the compiler doesn't optimize this.

Also, I fixed some trivial code duplication (for instance, vec * float and float * vec now use the same code instead of separate implementations that do the same thing). Will do a PR once I've done all the operators.

ilexp commented 6 years ago

Also, I fixed some trivial code duplication (for instance, vec * float and float * vec now use the same code instead of separate implementations that do the same thing). Will do a PR once I've done all the operators.

Are you 100% sure that you can still multiply both ways, e.g. 2.5f * vec and vec * 2.5f, even when defining only one of them? Please check before including this. I'm not sure the order of operands is reversed implicitly by the compiler here; I would honestly expect it not to be.

Barsonax commented 6 years ago

I didn't remove them; I changed the reversed variant to use the logic of the non-reversed variant in order to reduce code duplication.
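
For reference, the compiler does not generate the reversed operand order implicitly, so both overloads still have to exist; a minimal sketch of that forwarding (hypothetical operator bodies, not the exact code from the PR):

public static Vector3 operator *(Vector3 vec, float scale)
{
    return new Vector3(vec.X * scale, vec.Y * scale, vec.Z * scale);
}

// The reversed operand order must be declared separately, but it can simply
// forward to the variant above; the JIT is expected to inline the call.
public static Vector3 operator *(float scale, Vector3 vec)
{
    return vec * scale;
}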

ilexp commented 6 years ago

I didn't remove them; I changed the reversed variant to use the logic of the non-reversed variant in order to reduce code duplication.

Ah, makes sense. If this is at the cost of performance, I would value performance over avoiding code duplication in this case, otherwise good call.

Barsonax commented 6 years ago

Ah, makes sense. If this is at the cost of performance, I would value performance over avoiding code duplication in this case, otherwise good call.

No, I measured it and the performance is the same. It seems to inline them, so there is no difference in performance.

Barsonax commented 6 years ago

Now that my PR is merged, aside from the much better performance when using these operators, adding the in keyword is really trivial and will require no other code changes. Current:

public static Vector3 operator -(Vector3 left, Vector3 right)
{
    return new Vector3(
        left.X - right.X, 
        left.Y - right.Y, 
        left.Z - right.Z);
}

New:

public static Vector3 operator -(in Vector3 left, in Vector3 right)
{
    return new Vector3(
        left.X - right.X, 
        left.Y - right.Y, 
        left.Z - right.Z);
}

This issue can be done as soon as we switch to C# 7.2. Do remember to benchmark these to see if it improves performance; there is a chance it actually reduces performance for smaller vectors!

Barsonax commented 5 years ago

There are cases where defensive copies might occur when using the in keyword, which can reduce performance. Making the struct readonly would prevent this.

https://blogs.msdn.microsoft.com/seteplia/2018/03/07/the-in-modifier-and-the-readonly-structs-in-c/

However, since the X, Y, etc. fields are part of the public API, making the struct readonly would basically be a breaking change.
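
A sketch of where such a defensive copy could come from, assuming a plain mutable Vector3 with an illustrative Length property (names are only an example, not the exact Duality code):

public struct Vector3
{
    public float X, Y, Z;

    public Vector3(float x, float y, float z) { X = x; Y = y; Z = z; }

    public float Length
    {
        get { return (float)System.Math.Sqrt(X * X + Y * Y + Z * Z); }
    }
}

public static class Example
{
    public static float SumOfLengths(in Vector3 a, in Vector3 b)
    {
        // Because Vector3 is not a readonly struct, the compiler cannot prove
        // that the Length getter leaves the value unchanged, so each access
        // below operates on a hidden defensive copy of the argument, which can
        // eat up the savings from passing by 'in' in the first place.
        return a.Length + b.Length;
    }
}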

Barsonax commented 5 years ago

As a further enhancement, C# 8.0 now allows adding readonly to individual properties, methods, etc. to denote that they do not change state: https://docs.microsoft.com/en-us/dotnet/csharp/whats-new/csharp-8#readonly-members

(In C# 7.3 this was only possible at the struct level.)

This will enable the compiler to skip more defensive copies.
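
A short sketch of what that could look like on the same illustrative Length property:

public struct Vector3
{
    public float X, Y, Z;

    public Vector3(float x, float y, float z) { X = x; Y = y; Z = z; }

    // C# 8.0: marking the member itself readonly tells the compiler that it
    // does not mutate the struct, so calling it on an 'in' parameter no longer
    // forces a defensive copy, without making the whole struct readonly.
    public readonly float Length
    {
        get { return (float)System.Math.Sqrt(X * X + Y * Y + Z * Z); }
    }
}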

Barsonax commented 5 years ago

Now that Duality is using C# 7.3, this issue can get picked up at any time.

ilexp commented 5 years ago

Making vector or matrix types readonly would be quite a usability downgrade, since that would prevent field assignments. We can still avoid performance regressions though, if we use the new readonly members that C# 8.0 offers to flag all non-mutating properties and methods as readonly.

With that in mind, I think it makes sense to defer this until we're done with #698, right up to C# 8.0.

Barsonax commented 5 years ago

Making vector or matrix types readonly would be quite a usability downgrade, since that would prevent field assignments

Agreed. One could also consider adding methods like WithX(float value) that return a new vector with a different value for X. It's basically the same as what assigning the field would do, but it makes it clearer when copies happen, and we could then mark the vector itself as readonly.

So this:

var vector = new Vector3(1,2,3);
vector.X = 6;
vector.Z = 6;
// Do something with vector

Changes to:

var vector = new Vector3(1,2,3);
var modifiedVector = vector.WithX(6)
                           .WithZ(6);
// Do something with modifiedVector

It has a slightly different meaning, but this does allow marking the entire vector type as readonly.
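
For reference, such a helper could be a small copy-returning method on the vector type (a hypothetical sketch, not an existing Duality API):

public Vector3 WithX(float value)
{
    // Returns a copy with only the X component replaced;
    // WithY and WithZ would follow the same pattern.
    return new Vector3(value, this.Y, this.Z);
}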

ilexp commented 5 years ago

I acknowledge the technical possibility, but that just reads terrible and wrong 😄

(Also, ref / out support, additional copies and potentially other stuff I'm not even thinking about - just give me plain old data vectors)

Barsonax commented 5 years ago

I acknowledge the technical possibility, but that just reads terrible and wrong

Hmm, I don't think it's that bad though. However, another reason not to implement my suggestion (yup, I now disagree with myself :P) might be this: https://referencesource.microsoft.com/#System.Numerics/System/Numerics/Vector3_Intrinsics.cs,3e75f0b7820a3bf5

[screenshot of the System.Numerics Vector3 source]

There might be a chance that we would want to switch to these vectors at some point in the future, since the System.Numerics vectors are a lot faster due to things like SIMD optimizations. Better to try and match that public API if possible.
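
For comparison, the System.Numerics type already exposes a very similar surface (public X/Y/Z fields, constructors, arithmetic operators), with the JIT mapping the operations to SIMD instructions where the hardware supports them:

using System.Numerics;

Vector3 a = new Vector3(4, 7, 8);
Vector3 b = new Vector3(1, 2, 3);

// Same call-site syntax as Duality's vectors, but the operator is treated
// as a JIT intrinsic and can compile down to SIMD instructions.
Vector3 sum = a + b;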

ilexp commented 5 years ago

Yeah, I was eyeballing that possibility as well - definitely something to keep in mind.