Closed by SRSSoftware 8 years ago
Hi. I've been working on this independently ;-)
I think I have most of it working. I'll merge yours, then apply my changes, also adding Float128/Float64.
I don't know much about what is being done here or what it offers... but I'm glad that someone with the knowledge is doing some supportive work. Nice :-)
Cool! I made a little app that can output all of the SSE functions, so there's no need to waste time manually typing them in. It takes 5 seconds to run and it spits out all of the files needed. If you let me know when Float64 is in, shall I do another pull request with the complete set of Float64, Float128 and Double128 functions to make it easier?
GWRon It's for optimization. It allows you to do vectorized math and much more (bit shifting / masking).
A basic example: you can store 4 x 32-bit floats/ints in a Float128/Int128 and, using the SSE functions, the CPU can do regular math on all 4 components for the cost of 1 instruction. The alternative is that the CPU has to load a variable from memory into a register, do the math, and store the value back to memory 4 times to get the same result! With planning you can get incredible speedups (because all values can stay in registers and you work on 4 at once) for heavy-duty math work, especially useful in CPU-side 32-bit pixel manipulation and CPU-side 3D math (vectors and matrices).
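To make that concrete, here is a minimal sketch in plain C using the xmmintrin.h intrinsics (the header the pub.xmmintrin module is named after). This is only an illustration of the underlying SSE idea, not the BlitzMax API exposed by the module; the variable names and the example values are just for demonstration.

```c
/* Minimal SSE sketch: add four packed 32-bit floats with one instruction. */
#include <stdio.h>
#include <xmmintrin.h>  /* SSE intrinsics */

int main(void)
{
    /* Pack four floats into each 128-bit register
       (note: _mm_set_ps takes its arguments highest lane first). */
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
    __m128 b = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f);

    /* A single addps instruction adds all four lanes in parallel. */
    __m128 sum = _mm_add_ps(a, b);

    /* Store the result back to memory and print it. */
    float out[4];
    _mm_storeu_ps(out, sum);
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);  /* 11 22 33 44 */
    return 0;
}
```

The four additions happen in one addps instruction instead of four separate scalar load/add/store round trips, which is where the speedup described above comes from.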
Thanks for the elaboration. It seems like something useful for frameworks... but as long as I don't see a bottleneck I will stay away from this kind of "witchcraft" :-)
Also, I assumed that GCC and the like already optimize such things; otherwise e.g. the matrix manipulation (brucey's mojo port) could benefit from it.
I hope that I've done it right this time :p
I've renamed pub.intrinsic to pub.xmmintrin to be more in line with its functionality. Feel free to change it if you want to. The latest bcc commit doesn't compile, so the Double128 intrinsics aren't included in here at the moment.