AdamsLair / duality

a 2D Game Development Framework
https://adamslair.github.io/duality
MIT License
1.41k stars 287 forks source link

Improve Uniform Shader Variable Handling #566

Closed ilexp closed 7 years ago

ilexp commented 7 years ago

Summary

While investigating issue #219, it became apparent that the way Duality handles uniform shader variables on the core side is currently lacking. There are no efficient specialized data structures for this task, and there is no support for global shader variables besides a small set of builtin ones. The backend also updates variables too eagerly, causing redundant GL calls.

Analysis

ilexp commented 7 years ago

Progress

Immediate ToDo

ilexp commented 7 years ago

Progress

Immediate ToDo

ilexp commented 7 years ago

Progress

Immediate ToDo

ilexp commented 7 years ago

Progress

Immediate ToDo

ilexp commented 7 years ago

Current (relevant) API:

public void SetArray(string name, T[] value);
public void Set(string name, T value);

public T[] GetArray<T>(string name);
public T Get<T>(string name);

Potential API when going with a multi-overload design:

public void Set(string name, ContentRef<Texture> value);

public void Set(string name, float[] value);
public void Set(string name, Vector2[] value);
public void Set(string name, Vector3[] value);
public void Set(string name, Vector4[] value);
public void Set(string name, Matrix3[] value);
public void Set(string name, Matrix4[] value);
public void Set(string name, int[] value);
public void Set(string name, Point2[] value);
public void Set(string name, bool[] value);

public void Set(string name, float value);
public void Set(string name, Vector2 value);
public void Set(string name, Vector3 value);
public void Set(string name, Vector4 value);
public void Set(string name, Matrix3 value);
public void Set(string name, Matrix4 value);
public void Set(string name, int value);
public void Set(string name, Point2 value);
public void Set(string name, bool value);

public ContentRef<Texture> GetTexture(string name);

public float[] GetFloatArray(string name);
public Vector2[] GetVector2Array(string name);
public Vector3[] GetVector3Array(string name);
public Vector4[] GetVector4Array(string name);
public Matrix3[] GetMatrix3Array(string name);
public Matrix4[] GetMatrix4Array(string name);
public int[] GetIntArray(string name);
public Point2[] GetPoint2Array(string name);
public bool[] GetBoolArray(string name);

public float GetFloat(string name);
public Vector2 GetVector2(string name);
public Vector3 GetVector3(string name);
public Vector4 GetVector4(string name);
public Matrix3 GetMatrix3(string name);
public Matrix4 GetMatrix4(string name);
public int GetInt(string name);
public Point2 GetPoint2(string name);
public bool GetBool(string name);

Avoid, if it can be avoided. Worth investigating why the typeof(T) optimization doesn't kick in? Could also try to reduce API surface by removing all but the most relevant overloads.

Could be an opportunity for checking out System.Runtime.CompilerServices.Unsafe and see if it can be used in Duality without switching compilers or runtimes - this also means making sure the old NuGet package manager in Duality can handle it when it shows up as a dependency. Edit: On second thought, due to the potential to break compatibility, it might actually be better to do without this dependency for now. Try to solve this via oldschool optimization and API design.

Edit: More info on the typeof(T) optimization for further research as to why it isn't applied:

AndyAyersMS commented 7 years ago

@ilexp if you point me at a representative example where typeof(T) == typeof(X) left a lot of boxing behind, I'll take a deeper look.

ilexp commented 7 years ago

@AndyAyersMS Thanks for checking in! I'll try to keep it short:

The use case I'm trying to get optimized boils down to the following:

public T Get<T>(string key) where T : struct
{
    if (typeof(T) == typeof(float))
    {
        // Transform data
        float result = ...;
        return (T)(object)result;
    }
    else if (typeof(T) == typeof(...))
    {
        // ...
    }
    // ...
}

I would expect optimization for Get<float> to end up with something equivalent to:

public T Get<T>(string key) where T : struct
{
    // No type checks!
    float result = ...;
    // Crucial: No boxing either!
    return result;
}

However, this doesn't seem to happen - I get lots of boxing allocations instead that vanish when I use a specialized non-generic method overload. Edit: The unnecessary boxing part of this seems to trigger for certain structs only, definitely not for float as displayed above, see followup comment.

I'm currently investigating this in a separate project and have not managed to get the typeof(T) check optimization to kick in at all. I have uploaded my test project here if you want to take a look, but I'll summarize its contents below:

public static void Main(string[] args)
{
    Console.WriteLine("Begin");

    TestCaseA();
    TestCaseB();
    TestCaseC();
    TestCaseD<int>();
    TestCaseE<int>();

    Console.WriteLine("End");
}
[MethodImpl(MethodImplOptions.NoInlining)]
public static int TestCaseA()
{
    return 1;
}
[MethodImpl(MethodImplOptions.NoInlining)]
public static int TestCaseB()
{
    if (true)
    {
        return 1;
    }
    else
    {
        return 2;
    }
}
[MethodImpl(MethodImplOptions.NoInlining)]
public static int TestCaseC()
{
    if (typeof(int) == typeof(int))
    {
        return 1;
    }
    else
    {
        return 2;
    }
}
[MethodImpl(MethodImplOptions.NoInlining)]
public static int TestCaseD<T>() where T : struct
{
    if (typeof(T) == typeof(T))
    {
        return 1;
    }
    else
    {
        return 2;
    }
}
[MethodImpl(MethodImplOptions.NoInlining)]
public static int TestCaseE<T>() where T : struct
{
    if (typeof(T) == typeof(int))
    {
        return 1;
    }
    else
    {
        return 2;
    }
}

To retrieve a JITed disassembly of each method, I'm building the project in Release mode, run it in Visual Studio, step into each method and switch to Disassembly. "Enable Just My Code" is false and "Suppress JIT Optimizations" is false as well. I'm getting the following disassembly for each method:

TestCaseA:

 mov         eax,1  
 ret  

TestCaseB:

 mov         eax,1  
 ret 

TestCaseC:

 sub         esp,20h  
 mov         rcx,7FF9BEDD9288h  
 call        00007FF9C062BBF0  
 mov         rsi,rax  
 mov         rcx,7FF9BEDD9288h  
 call        00007FF9C062BBF0  
 cmp         rax,rsi  
 jne         00007FF961020556  
 mov         eax,1  
 add         rsp,20h  
 pop         rsi  
 ret  
 mov         eax,2  
 add         rsp,20h  
 pop         rsi  
 ret 

TestCaseD:

 sub         esp,20h  
 mov         rcx,7FF9BEDD9288h  
 call        00007FF9C062BBF0  
 mov         rsi,rax  
 mov         rcx,7FF9BEDD9288h  
 call        00007FF9C062BBF0  
 cmp         rax,rsi  
 jne         00007FF9610005B6  
 mov         eax,1  
 add         rsp,20h  
 pop         rsi  
 ret  
 mov         eax,2  
 add         rsp,20h  
 pop         rsi  
 ret  

TestCaseE:

 sub         esp,20h  
 mov         rcx,7FF9BEDD9288h  
 call        00007FF9C062BBF0  
 mov         rsi,rax  
 mov         rcx,7FF9BEDD9288h  
 call        00007FF9C062BBF0  
 cmp         rax,rsi  
 jne         00007FF961000616  
 mov         eax,1  
 add         rsp,20h  
 pop         rsi  
 ret  
 mov         eax,2  
 add         rsp,20h  
 pop         rsi  
 ret  

I'll have to add that I'm not at all used to reading assembly code or debugging JIT optimizations, so if I took a wrong turn at some point, or you need additional information, please let me know. For the last test, I'm using Visual Studio 2017 (Version 15.2 (26430.15)) on .NET Framework Version 4.7.02046.

Edit: Added two more test cases. C to E seem to be equal except for memory addresses.

ilexp commented 7 years ago

@AndyAyersMS I did a second test that focuses on the boxing itself, rather than omitting / resolving typeof(T) checks. Test project is here, code and disassembly below:

public struct SmallStruct
{
    public int Value;

    public SmallStruct(int v)
    {
        this.Value = v;
    }
}
public struct BigStruct
{
    public int Value;
    public long Other;

    public BigStruct(int v, long o)
    {
        this.Value = v;
        this.Other = o;
    }
}
public struct GenericStruct<T>
{
    public int Value;
    public T Other;

    public GenericStruct(int v, T o)
    {
        this.Value = v;
        this.Other = o;
    }
}
public class Program
{
    public static void Main(string[] args)
    {
        Console.WriteLine("Begin");

        TestCaseA<SmallStruct>();
        TestCaseB<BigStruct>();
        TestCaseC<GenericStruct<long>>();
        TestCaseD<GenericStruct<string>>();

        Console.WriteLine("End");
    }
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static T TestCaseA<T>() where T : struct
    {
        if (typeof(T) == typeof(SmallStruct))
        {
            return (T)(object)new SmallStruct(1);
        }
        else
        {
            return default(T);
        }
    }
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static T TestCaseB<T>() where T : struct
    {
        if (typeof(T) == typeof(BigStruct))
        {
            return (T)(object)new BigStruct(1, 2);
        }
        else
        {
            return default(T);
        }
    }
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static T TestCaseC<T>() where T : struct
    {
        if (typeof(T) == typeof(GenericStruct<long>))
        {
            return (T)(object)new GenericStruct<long>(1, 2);
        }
        else
        {
            return default(T);
        }
    }
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static T TestCaseD<T>() where T : struct
    {
        if (typeof(T) == typeof(GenericStruct<string>))
        {
            return (T)(object)new GenericStruct<string>(1, "Hello");
        }
        else
        {
            return default(T);
        }
    }
}

Test case D is the one that I was stumbling over in my project, and looking at the disassembly, it is also the only one where boxing is not optimized away:

TestCaseA: Aside from the typeof check that's still there, no boxing with the small struct.

 sub         esp,20h  
 mov         rcx,7FF960F05B18h  
 call        JIT_GetRuntimeType (07FF9C062BBF0h)  
 mov         rsi,rax  
 mov         rcx,7FF960F05B18h  
 call        JIT_GetRuntimeType (07FF9C062BBF0h)  
 cmp         rax,rsi  
 jne         00007FF961010536  
 mov         eax,1  
 add         rsp,20h  
 pop         rsi  
 ret  
 xor         eax,eax  
 add         rsp,20h  
 pop         rsi  
 ret 

TestCaseB: No boxing with the big struct either

 push        rdi  
 push        rsi  
 sub         rsp,28h  
 mov         rsi,rcx  
 mov         rcx,7FF960F05C40h  
 call        JIT_GetRuntimeType (07FF9C062BBF0h)  
 mov         rdi,rax  
 mov         rcx,7FF960F05C40h  
 call        JIT_GetRuntimeType (07FF9C062BBF0h)  
 cmp         rax,rdi  
 jne         00007FF9610105A9  
 mov         eax,1  
 mov         edx,2  
 mov         dword ptr [rsi],eax  
 mov         qword ptr [rsi+8],rdx  
 mov         rax,rsi  
 add         rsp,28h  
 pop         rsi  
 pop         rdi  
 ret  
 xor         eax,eax  
 xor         edx,edx  
 mov         dword ptr [rsi],eax  
 mov         qword ptr [rsi+8],rdx  
 mov         rax,rsi  
 add         rsp,28h  
 pop         rsi  
 pop         rdi  
 ret  

TestCaseC: The generic struct with long generates code equal to the big struct code

 push        rdi  
 push        rsi  
 sub         rsp,28h  
 mov         rsi,rcx  
 mov         rcx,7FF960F05E50h  
 call        JIT_GetRuntimeType (07FF9C062BBF0h)  
 mov         rdi,rax  
 mov         rcx,7FF960F05E50h  
 call        JIT_GetRuntimeType (07FF9C062BBF0h)  
 cmp         rax,rdi  
 jne         00007FF961010629  
 mov         eax,1  
 mov         edx,2  
 mov         dword ptr [rsi],eax  
 mov         qword ptr [rsi+8],rdx  
 mov         rax,rsi  
 add         rsp,28h  
 pop         rsi  
 pop         rdi  
 ret  
 xor         eax,eax  
 xor         edx,edx  
 mov         dword ptr [rsi],eax  
 mov         qword ptr [rsi+8],rdx  
 mov         rax,rsi  
 add         rsp,28h  
 pop         rsi  
 pop         rdi  
 ret  

TestCaseD: The generic struct with string generates this, boxing and "other things (?)"

 push        r15  
 push        r14  
 push        rdi  
 push        rsi  
 push        rbp  
 push        rbx  
 sub         rsp,28h  
 mov         qword ptr [rsp+20h],rdx  
 mov         rbx,rcx  
 mov         rsi,qword ptr [rdx+10h]  
 mov         rcx,qword ptr [rsi]  
 test        ecx,1  
 jne         00007FF961010685  
 jmp         00007FF96101068C  
 mov         rcx,qword ptr [rcx-1]  
 call        JIT_GetRuntimeType (07FF9C062BBF0h)  
 mov         rdi,rax  
 mov         rcx,7FF960F06058h  
 call        JIT_GetRuntimeType (07FF9C062BBF0h)  
 cmp         rax,rdi  
 jne         00007FF961010739  
 mov         edi,1  
 mov         rcx,28990003580h  
 mov         rbp,qword ptr [rcx]  
 mov         rcx,7FF960F06058h  
 call        JIT_TrialAllocSFastMP_InlineGetThread (07FF9C0622510h)  
 mov         r14,rax  
 lea         r15,[r14+8]  
 lea         rcx,[r15]  
 mov         rdx,rbp  
 call        JIT_CheckedWriteBarrier (07FF9C0623E00h)  
 mov         dword ptr [r15+8],edi  
 mov         rcx,qword ptr [rsi]  
 test        ecx,1  
 jne         00007FF9610106F0  
 jmp         00007FF9610106F7  
 mov         rcx,qword ptr [rcx-1]  
 mov         rdx,qword ptr [rsi]  
 test        edx,1  
 jne         00007FF961010704  
 jmp         00007FF96101070B  
 mov         rdx,qword ptr [rdx-1]  
 cmp         qword ptr [r14],rcx  
 je          00007FF96101071B  
 mov         rcx,rdx  
 mov         rdx,r14  
 call        JIT_Unbox (07FF9C06B88C0h)  
 lea         rsi,[r14+8]  
 mov         rdi,rbx  
 call        JIT_ByRefWriteBarrier (07FF9C0623F20h)  
 movs        qword ptr [rdi],qword ptr [rsi]  
 mov         rax,rbx  
 add         rsp,28h  
 pop         rbx  
 pop         rbp  
 pop         rsi  
 pop         rdi  
 pop         r14  
 pop         r15  
 ret  
 xor         eax,eax  
 xor         edx,edx  
 mov         qword ptr [rbx],rax  
 mov         dword ptr [rbx+8],edx  
 mov         rax,rbx  
 add         rsp,28h  
 pop         rbx  
 pop         rbp  
 pop         rsi  
 pop         rdi  
 pop         r14  
 pop         r15  
 ret  
ilexp commented 7 years ago

Progress

Immediate ToDo

AndyAyersMS commented 7 years ago

The jit in 4.7.02046 (aka desktop CLR version 4.7) had a bug fix that inhibited some cases of the type equality optimizations. These were fixed in 4.7.1 which is available as a preview and which should be officially released soon.

On latest CoreCLR I get the following code for your second test case examples. I would expect similar results from 4.7.1 too. If you get a chance to try 4.7.1 and you don't get good results, please let me know.

TestCaseD is clearly not getting optimized. There are challenges here because the CLR will share method implementations over all ref types in TestCaseD, so the type of T is not know to the jit. However once the outcome of the type equality check is known it seems like we ought to be able to do better. I've opened dotnet/coreclr#14256 for this.

;;; Test case A
G_M22036_IG02:
       B801000000           mov      eax, 1

G_M22036_IG03:
       C3                   ret

;;; Test case B
G_M60708_IG02:
       B801000000           mov      eax, 1
       BA02000000           mov      edx, 2
       8901                 mov      dword ptr [rcx], eax
       48895108             mov      qword ptr [rcx+8], rdx
       488BC1               mov      rax, rcx

G_M60708_IG03:
       C3                   ret

;;; Test case C
G_M4495_IG02:
       B801000000           mov      eax, 1
       BA02000000           mov      edx, 2
       8901                 mov      dword ptr [rcx], eax
       48895108             mov      qword ptr [rcx+8], rdx
       488BC1               mov      rax, rcx

G_M4495_IG03:
       C3                   ret

;;; Test case D
G_M4522_IG01:
       4157                 push     r15
       4156                 push     r14
       57                   push     rdi
       56                   push     rsi
       55                   push     rbp
       53                   push     rbx
       4883EC28             sub      rsp, 40
       4889542420           mov      qword ptr [rsp+20H], rdx
       488BD9               mov      rbx, rcx
       488BF2               mov      rsi, rdx

G_M4522_IG02:
       ;; This bit of code is looking up the handle for the actual type of T
       488B4E38             mov      rcx, qword ptr [rsi+56]
       488B09               mov      rcx, qword ptr [rcx]
       F6C101               test     cl, 1
       7404                 je       SHORT G_M4522_IG03
       488B49FF             mov      rcx, qword ptr [rcx-1]

G_M4522_IG03:
       ;; Here's the check if typeof(T) == typeof(GenericStruct<string>)
       48B8309598C6FC7F0000 mov      rax, 0x7FFCC6989530
       483BC8               cmp      rcx, rax
       ;;; branch to the default(T) case if we don't have that type
       0F8581000000         jne      G_M4522_IG08
       ;;; all the rest of this down to the ret is from 
       ;;;     return (T)(object)new GenericStruct<string>(1, "Hello");
       ;;; we should be able to avoid the box / unbox / copy
       BF01000000           mov      edi, 1
       48B9C8300090A8020000 mov      rcx, 0x2A8900030C8
       488B29               mov      rbp, gword ptr [rcx]
       48B9309598C6FC7F0000 mov      rcx, 0x7FFCC6989530
       E88508855F           call     CORINFO_HELP_NEWSFAST
       4C8BF0               mov      r14, rax
       4D8D7E08             lea      r15, bword ptr [r14+8]
       498D0F               lea      rcx, bword ptr [r15]
       488BD5               mov      rdx, rbp
       E873161F5F           call     CORINFO_HELP_CHECKED_ASSIGN_REF
       41897F08             mov      dword ptr [r15+8], edi
       488B4E38             mov      rcx, qword ptr [rsi+56]
       488B09               mov      rcx, qword ptr [rcx]
       488BD1               mov      rdx, rcx
       8BC2                 mov      eax, edx
       83E001               and      eax, 1
       85C0                 test     eax, eax
       7404                 je       SHORT G_M4522_IG04
       488B52FF             mov      rdx, qword ptr [rdx-1]

G_M4522_IG04:
       85C0                 test     eax, eax
       7404                 je       SHORT G_M4522_IG05
       488B49FF             mov      rcx, qword ptr [rcx-1]

G_M4522_IG05:
       493916               cmp      qword ptr [r14], rdx
       7408                 je       SHORT G_M4522_IG06
       498BD6               mov      rdx, r14
       E823D13A5F           call     CORINFO_HELP_UNBOX

G_M4522_IG06:
       498D7608             lea      rsi, bword ptr [r14+8]
       488BFB               mov      rdi, rbx
       E887171F5F           call     CORINFO_HELP_ASSIGN_BYREF
       48A5                 movsq
       488BC3               mov      rax, rbx

G_M4522_IG07:
       4883C428             add      rsp, 40
       5B                   pop      rbx
       5D                   pop      rbp
       5E                   pop      rsi
       5F                   pop      rdi
       415E                 pop      r14
       415F                 pop      r15
       C3                   ret

G_M4522_IG08:
       33C0                 xor      rax, rax
       33D2                 xor      edx, edx
       488903               mov      gword ptr [rbx], rax
       895308               mov      dword ptr [rbx+8], edx
       488BC3               mov      rax, rbx

G_M4522_IG09:
       4883C428             add      rsp, 40
       5B                   pop      rbx
       5D                   pop      rbp
       5E                   pop      rsi
       5F                   pop      rdi
       415E                 pop      r14
       415F                 pop      r15
       C3                   ret
ilexp commented 7 years ago

@AndyAyersMS Thanks for your insight on this, and glad to hear it's not just me :) I'll check out the new desktop 4.7.1 CLR after it's released and see if that improves things.

TestCaseD is clearly not getting optimized. There are challenges here because the CLR will share method implementations over all ref types in TestCaseD, so the type of T is not know to the jit. However once the outcome of the type equality check is known it seems like we ought to be able to do better.

Ahh, I see. That does sound a bit trickier than I originally thought. Will keep an eye on the issue you opened! Thanks again, really appreciated 👍

ilexp commented 7 years ago

Progress

Immediate ToDo

ilexp commented 7 years ago

Progress

Immediate ToDo

ilexp commented 7 years ago

Progress

Immediate ToDo

ilexp commented 7 years ago

Progress

Immediate ToDo

ilexp commented 7 years ago

Progress