Dav1dde / gl3n

OpenGL Maths for D (not glm for D).
http://dav1dde.github.com/gl3n/

core.simd and gl3n #75

Open mkalte666 opened 7 years ago

mkalte666 commented 7 years ago

Greetings. I wondered if it would be useful to use https://dlang.org/spec/simd.html in gl3n. This could come in handy if you use gl3n a lot in collision detection or something similar.

Would that make sense to do?

Dav1dde commented 7 years ago

Yes, that would make a lot of sense. Back when I wrote gl3n and SIMD came up, I was waiting on std.simd, but that never happened ...

mkalte666 commented 7 years ago

I made a fork and will try to see if I can implement that somehow. I wouldn't count on me though, I don't have too much time :/

mkalte666 commented 7 years ago

So I think a problem will start to exist if SIMD is used to replace the non-SIMD math, because stuff might become slower:

performance, hasSimd = true...
Lots of ops took: 0.10047s

vs

performance, hasSimd = false...
Lots of ops took: 0.0750042s

The measured op was

for (int i = 0; i < 1_000_000; i++) {
    vec4 a = 43223.0;
    vec4 b = 1234.0;
    a += b;
}

That slowdown becomes even worse if I use automatic array vectorization: vector[] += r.vector[] takes 0.15s, three times as much time as when using float4 (in that case).

Now I changed the code a bit:

vec4 a = 43223.0;
vec4 b = 1234.0;
for (int i = 0; i < 1_000_000; i++) {
    a += b;
}

That results in the expected (albeit tiny) speedup.

performance, hasSimd = true...
Lots of ops took: 0.0097749s

vs

performance, hasSimd = false...
Lots of ops took: 0.0159956s

These differences exist because the vectors first need to be loaded into the SIMD registers. So operations on the same set of vectors will speed up a lot, while general use will slow down a lot. I therefore think that, to implement this, there would need to be a separate set of functions that take advantage of SIMD, because otherwise it would just be a slowdown.
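
To make the cost concrete, here is an illustrative sketch (not gl3n code) of roughly what a memory-backed += has to do on every call:

import core.simd;

// What a memory-backed "+=" effectively costs per call:
// two loads and one store around a single ADDPS.
void addAssign(ref float[4] a, ref const(float[4]) b)
{
    float4 va, vb;
    foreach (i; 0 .. 4)
    {
        va.array[i] = a[i]; // load left operand
        vb.array[i] = b[i]; // load right operand
    }
    va += vb;               // the one ADDPS we actually wanted
    foreach (i; 0 .. 4)
        a[i] = va.array[i]; // store the result back
}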

Dav1dde commented 7 years ago

Having a separate struct for Vector/Matrix/Quaternion could make sense, depending on how different it is from the code right now; otherwise just a flag passed to the constructor of the structs, making it possible to have both versions.

Code which wants to accept both versions of vectors needs to use something like foo(T)(T vec) if (is_vector!T), that's why I want compile-time interfaces ...
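
For illustration, such a constraint could look roughly like this (a sketch assuming gl3n.util's is_vector trait, and that the trait is taught about the SIMD-enabled variant):

import gl3n.linalg : vec4;
import gl3n.util : is_vector; // gl3n's compile-time trait

// accepts every vector flavor, plain or SIMD-backed,
// as long as the trait says it is a vector
T doubled(T)(T v) if (is_vector!T)
{
    return v + v;
}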

mkalte666 commented 7 years ago

> Having a separate struct for Vector/Matrix/Quaternion could make sense, depending on how different it is from the code right now; otherwise just a flag passed to the constructor of the structs, making it possible to have both versions.

I tried to integrate it into the normal vector classes via a template argument, and in itself that works fine. However, I am not able to notice any speedup at all (tbh I only implemented the basic operations and tested them) - and I suspect the implementation of both the core.simd.Vector types and the __simd magic causes a lot of copying around of data, which is kinda logical because the instructions run on the xmm/ymm registers.

Vector!(float, 4, true) a, b, c;
a = b = c = 1234.23234;
for (long i = 0; i < 1_000_000; i++) {
    a += b;
    a += c;
    a += a;
}

In my understanding, this should run faster if a, b and c use core.simd.Vector!(float[4]) - however, it always ran slower than I expected.

It would be nice to work with data within these registers like you can with the Intel/C++ compiler intrinsics (the _mm_add_ss-like functions that take __m128 and __m256 types). So I'd go so far as to separate the normal vector/matrix types from the SIMD acceleration completely. You would then do something like:

vec4 a = vec4(123, 434, 124, 123);
vec4 b = vec4(434, 342, 323, 434);
simdVec!vec4 areg = a;
simdVec!vec4 breg = b;
for (int i = 0; i < 1_000_000; i++) {
    areg += breg; // ADDPS
    breg += areg; // ADDPS
}
float magnitude = areg.magnitude; // can be done with DPPS and SQRTPS
a = areg.toVec();
b = breg.toVec();

The main difference would be that the SIMD type (which ideally would map directly to an XMM register) allows no direct access to the memory, to avoid any copying.
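
Concretely, a minimal core.simd sketch of such a wrapper could look like this (hypothetical code, not gl3n API; it assumes gl3n's vec4 with its .vector storage, and scalar loads/stores only happen at the boundaries):

import core.simd;
import std.math : sqrt;
import gl3n.linalg : vec4;

// Hypothetical register-only wrapper: the payload is a float4 value,
// so arithmetic stays in XMM registers; memory is only touched on
// construction and on toVec().
struct SimdVec4
{
    private float4 reg;

    this(in vec4 v)
    {
        foreach (i; 0 .. 4)
            reg.array[i] = v.vector[i]; // one-time load
    }

    SimdVec4 opOpAssign(string op : "+")(in SimdVec4 rhs)
    {
        reg += rhs.reg; // ADDPS, no round trip through memory
        return this;
    }

    float magnitude() const
    {
        float4 sq = reg * reg; // MULPS
        float s = 0.0f;
        foreach (i; 0 .. 4)
            s += sq.array[i];  // horizontal add
        return sqrt(s);
    }

    vec4 toVec() const
    {
        vec4 r;
        foreach (i; 0 .. 4)
            r.vector[i] = reg.array[i]; // one-time store back
        return r;
    }
}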

Also, I'm kinda missing the AVX (256-bit YMM registers, double[4] stuff) support in core.simd.__simd. Hmm.

Am I thinking right or am I blubbering complete bullshit? O.o

EDIT: I might have found the reason: this code:

import core.simd;

void doStuff()
{
    float4 x = [1.0, 0.4, 1234.0, 124.0];
    float4 y = [1.0, 0.4, 1234.0, 124.0];
    float4 z = [1.0, 0.4, 1234.0, 123.0];
    for (long i = 0; i < 1_000_000; i++) {
        x += y;
        x += z;
        z += x;
    }
}

can be split into two parts. The first one is the assignment:

movaps xmm0,XMMWORD PTR [rip+0x0]        # f <void example.doStuff()+0xf>
movaps XMMWORD PTR [rbp-0x40],xmm0
movaps xmm1,XMMWORD PTR [rip+0x0]        # 1a <void example.doStuff()+0x1a>
movaps XMMWORD PTR [rbp-0x30],xmm1
movaps xmm2,XMMWORD PTR [rip+0x0]        # 25 <void example.doStuff()+0x25>
movaps XMMWORD PTR [rbp-0x20],xmm2

Well, ok, it also copies the values onto the stack? Meh. Now the math in the loop:

 movaps xmm3,XMMWORD PTR [rbp-0x30]
 movaps xmm4,XMMWORD PTR [rbp-0x40]
 addps  xmm4,xmm3
 movaps XMMWORD PTR [rbp-0x40],xmm4
 movaps xmm0,XMMWORD PTR [rbp-0x20]
 movaps xmm1,XMMWORD PTR [rbp-0x40]
 addps  xmm1,xmm0
 movaps XMMWORD PTR [rbp-0x40],xmm1
 movaps xmm2,XMMWORD PTR [rbp-0x40]
 movaps xmm3,XMMWORD PTR [rbp-0x20]
 addps  xmm3,xmm2
 movaps XMMWORD PTR [rbp-0x20],xmm3

OUCH! This should simply be

addps xmm0,xmm1
addps xmm0,xmm2
addps xmm2,xmm0

I guess I should report that as a compiler bug? https://issues.dlang.org/show_bug.cgi?id=16605

Dav1dde commented 7 years ago

Thanks for looking into all of this.

I can't really help you here since my knowledge of SSE/SIMD instructions is very limited. You might want to ask in #D on freenode; there are some very smart people with compiler insight who can probably help you in a timely manner.

mkalte666 commented 7 years ago

No problem, I enjoy this kind of stuff :)

I'm gonna head there, because I'm still not sure if my knowledge of SSE/SIMD is enough to come to the right conclusions. Let's see where this is headed!

mkalte666 commented 7 years ago

It was me who was the fool! "-release" != "-O -release -boundscheck=off". Now that looks like something!

Running ./gl3nspeed
Doing tests with SIMD=false and LC=10000000
Speed of the += operator on float (vec4+=vec4)
took: 0.140215s!
Doing tests with SIMD=true and LC=10000000
Speed of the += operator on float (vec4+=vec4)
took: 0.050686s!

Dav1dde commented 7 years ago

That's almost 3 times faster!

mkalte666 commented 7 years ago

> That's almost 3 times faster!

It gets better!

Enter loop count
10000000
Doing tests with SIMD=false and LC=10000000
Speed of the += operator on float (vec4+=vec4)
took: 0.139876s!
Speed of the magnitude operation on float |vec4|
took: 5.30766s! 
Doing tests with SIMD=true and LC=10000000
Speed of the += operator on float (vec4+=vec4)
took: 0.0507099s!
Speed of the magnitude operation on float |vec4|
took: 1.02721s! 

I'm gonna clean this up a bit and push it to my fork so you can take a look at it - if it fits the guidelines/the way you want stuff to be done for gl3n.

mkalte666 commented 7 years ago

Here are my changes so far: https://github.com/Dav1dde/gl3n/compare/master...mkalte666:master

I know that this is missing tests etc. I will write these as soon as I can, I guess.

The speed test tool i used is https://github.com/mkalte666/gl3nspeed

You have to compile gl3n/gl3nspeed with DFLAGS="-release -O -boundscheck=off" dub. Or tell me how I can get dub to use -O xD
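
(Aside: if I'm not mistaken, dub's built-in release build type should cover those flags:)

dub build -b release   # build type "release" enables the releaseMode, optimize and inline build options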

Dav1dde commented 7 years ago

Looks good, minor style things but in general I like how it is done!

You gonna look into matrices as well?

mkalte666 commented 7 years ago

> Looks good, minor style things but in general I like how it is done!

Thanks, I'm trying ^^

> You gonna look into matrices as well?

If I find the time. I'm not sure how well that can be done and what instructions already exist that could help out. Also, I still want to look into #68, and I guess that could be combined.

Thinking about speed (and not about time management on my side), this would be a massive improvement, however: "4x4 matrix multiplication is 64 multiplications and 48 additions. Using SSE this can be reduced to 16 multiplications and 12 additions (and 16 broadcasts)" http://stackoverflow.com/questions/18499971/efficient-4x4-matrix-multiplication-c-vs-assembly
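
A rough core.simd version of that trick could look like this (a hypothetical sketch, column-major layout; it assumes core.simd's scalar-to-vector broadcast initialization):

import core.simd;

// 4x4 matrix as four float4 columns (column-major, like OpenGL)
struct Mat4Simd
{
    float4[4] cols;
}

Mat4Simd mul(in Mat4Simd a, in Mat4Simd b)
{
    Mat4Simd r;
    foreach (j; 0 .. 4)                        // each result column
    {
        float4 acc = 0.0f;                     // scalar init broadcasts to all lanes
        foreach (k; 0 .. 4)
        {
            float4 broad = b.cols[j].array[k]; // broadcast element (k, j) of b
            acc += a.cols[k] * broad;          // MULPS + ADDPS
        }
        r.cols[j] = acc;
    }
    return r;
}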

mkalte666 commented 7 years ago

One thing I wonder is whether operations with scalars (vec3 * float etc.) should be vectorized. While the operation itself would speed up, as long as the numerical value is not constant the resulting code would almost always be slower, because the scalar would have to be loaded into a vector beforehand.

The speedy way of doing a multiplication (or any operation) would be to hold a (const?) vector somewhere and then do the operations. So doing

Vector!(float, 4, true) scalar = 4.0;
Vector!(float, 4, true) foo = 1234.01234;
Vector!(float, 4, true) bar = 13.2434;
foo *= scalar;
bar *= scalar;
// ... probably do this many times

would almost always result in faster code than

foo *= 4.0;
bar *= 4.0; 

because the operator doesn't know whether it operates on a constant value or a variable. If there is a way to tell them apart (detecting whether a value is known at compile time), then it could be done - I don't know how, though.
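
One possible way to tell them apart (an illustrative sketch, not thread code): pass the scalar as a template value parameter, which only works when it is known at compile time:

import core.simd;

// Scalar known at compile time: the splat can be constant-folded.
float4 scaledCT(float s)(in float4 v)
{
    float4 splat = s; // s is a compile-time constant; this folds away
    return v * splat;
}

// Scalar only known at run time: the broadcast happens on every call.
float4 scaledRT(in float4 v, float s)
{
    float4 splat = s; // runtime broadcast into an XMM register
    return v * splat;
}

scaledCT!(4.0f)(v) bakes the broadcast in at compile time; anything only known at run time still pays the per-call broadcast, which is exactly why keeping an explicit pre-broadcast vector around stays the faster pattern.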