Open Ericson2314 opened 11 years ago
I'm pretending I didn't see this one.
:) It actually shouldn't be too bad. Just a lot of tedious stuff. I am going to do it without dummy constraints first and just use hard-coded clobbers
a7594c38361a3ae354771a7728cdd5e1b2308931
Just a quick note: if you want the C version of these functions to run faster, use this inverse square root:
inline float f_rsqrt( const float number )
{
    long i;
    float x2, y;
    const float threehalfs = 1.5F;

    x2 = number * 0.5F;
    y  = number;
    i  = * ( long * ) &y;
    i  = 0x5f3759df - ( i >> 1 );
    y  = * ( float * ) &i;
    y  = y * ( threehalfs - ( x2 * y * y ) ); // 1st iteration
//  y  = y * ( threehalfs - ( x2 * y * y ) ); // 2nd iteration, can be removed if you don't need the accuracy
    return y;
}

then change this line in vrendz/hrendz:

*(float *)(p0+i) = (float)c0->dist/rsqrt(dirx*dirx+diry*diry);

to

*(float *)(p0+i) = (float)c0->dist*f_rsqrt(dirx*dirx+diry*diry);
It makes the C version run at about 50% of the speed of the asm version, which is an adequate improvement. I've nearly finished converting the renderer to intrinsics, so this issue is almost complete.
You don't need the second iteration of the Newton-Raphson approximation. One iteration is adequate in the renderer, as the inputs are quantized from an original full-precision sqrt and stored in a lookup table in Ken's original code.
Ah, this is http://en.wikipedia.org/wiki/Fast_inverse_square_root I assume? There must also be an intrinsic which uses the single instruction to do this I'd hope.
That's right, it's the infamous code from Quake that has had whole articles written about it. The intrinsic is the reciprocal sqrt, which you will find in the v/hrend(z)sse part of the renderer; it operates on 4 values at a time. I've analysed the renderer in AMD CodeAnalyst, and it's completely memory constrained when using SSE. The C version doesn't have the same issues, as it plods through the data in lockstep with the memory fetches anyway. The only way to make that bit faster is to refactor the castdat structure so that color and distance are not stored next to each other, or possibly to get rid of the lookup altogether and just calculate in registers. I'll put that on the back burner as an experiment for future meanderings.
Wait, so is it the instruction itself or the intrinsic that is bad with the memory-access bound? Also, could you push your work?
I'll tidy up a bit and push so you can have a look. When I say memory bound, in this instance: the renderer is trying to take advantage of lookup tables (the angstart table in this case), which is an integration of angles made by vline/hline. These fetches and lookups may be redundant on newer CPUs, because they can calculate sincosf faster than a memory access, hence making the function memory bound. Did I explain that correctly?
As far as intrinsics go, any intrinsics that use the __m64 datatype are not supported on x86_64; that's not to say that you can't use MMX registers in assembly. It can all be mitigated with ifdefs, so there can still be a non-fatbin version of the executable which is coalesced at compile time. It's just something to be aware of. Take a look at the Intel optimization manual for gotchas. It's related to emms as well, because the MMX registers are shared with the FPU's 80-bit registers. All of x86/87 is a kludge, because of backwards compatibility. Just like Windows, the price you pay for a general solution is complexity. Compared to most chipsets x86 is a frankenmonster ;)
Ha, I thought you were going to say "Just like Windows, the price you pay for backwards compatibility, is kludge".
OK, yeah, I didn't make the connection between no __m64 on x86-64 and intrinsics. Yeah, the fatbin stuff just makes it harder to think; I wouldn't mind having dedicated binaries. Ideally our builds will mostly be MinGW anyway, where intrinsics and x86-64 work together fine.
Certainly use "p" (link-time constant) constraints for pointers to global variables. Ideally, use "dummy variables" too, to avoid hard-coding any intermediate constants.