Open Ericson2314 opened 11 years ago
I'm pretending I didn't see this one.
:) It actually shouldn't be too bad. Just a lot of tedious stuff. I am going to do it without dummy constraints first and just use hard-coded clobbers
a7594c38361a3ae354771a7728cdd5e1b2308931
Just a quick note: if you want the C version of these functions to run faster, use this inverse square root:
inline float f_rsqrt( const float number )
{
    long i;
    float x2, y;
    const float threehalfs = 1.5F;

    x2 = number * 0.5F;
    y  = number;
    i  = * ( long * ) &y;
    i  = 0x5f3759df - ( i >> 1 );
    y  = * ( float * ) &i;
    y  = y * ( threehalfs - ( x2 * y * y ) ); // 1st iteration
//  y  = y * ( threehalfs - ( x2 * y * y ) ); // 2nd iteration, can be removed if you don't need the accuracy
    return y;
}

then change this line in vrendz/hrendz:

*(float *)(p0+i) = (float)c0->dist/rsqrt(dirx*dirx+diry*diry);

to

*(float *)(p0+i) = (float)c0->dist*f_rsqrt(dirx*dirx+diry*diry);
It makes the C version run at about 50% of the speed of the asm version, which is an adequate improvement. I've nearly finished converting the renderer to intrinsics, so this issue is almost complete.
You don't need the second iteration of the Newton-Raphson approximation. One iteration is adequate in the renderer, as the inputs are quantized from an original full-precision sqrt and stored in a lookup table in Ken's original code.
Ah, this is http://en.wikipedia.org/wiki/Fast_inverse_square_root I assume? There must also be an intrinsic which uses the single instruction to do this I'd hope.
That's right, it's the infamous code from Quake that has had whole articles written about it. The intrinsic is the reciprocal sqrt, which you will find in the v/hrend(z)sse part of the renderer; it operates on 4 values at a time. I've analysed the renderer in AMD CodeAnalyst, and it's completely memory constrained when using SSE. The C version doesn't have the same issues, as it plods through the data in lockstep with the memory fetches anyway. The only way to make that bit faster is to refactor the castdat structure so that color and distance are not stored next to each other, or possibly to get rid of the lookup altogether and just calculate in registers. I'll put that on the back burner as an experiment for future meanderings.
Wait, so is it the instruction itself or the intrinsic that is bad with the memory-access bound? Also, could you push your work?
I'll tidy up a bit and push so you can have a look. When I say memory bound, in this instance: the renderer is trying to take advantage of lookup tables (the angstart table in this case), which is an integration of angles made by vline/hline. These fetches and lookups may be redundant on newer CPUs, because they can calculate sincosf faster than a memory access, hence making the function memory bound. Did I explain that correctly?
As far as intrinsics go, any intrinsics that use the __m64 datatype are not supported on x86_64; that's not to say that you can't use MMX registers in assembly. It can all be mitigated with ifdefs, so there can still be a non-fatbin version of the executable which is coalesced at compile time. It's just something to be aware of. Take a look at the Intel optimization manual for gotchas. It's related to emms as well, because the MMX registers are shared with the FPU's 80-bit registers. All of x86/87 is a kludge, because of backwards compatibility. Just like Windows, the price you pay for a general solution is complexity. Compared to most chipsets x86 is a frankenmonster ;)
Ha, I thought you were going to say "Just like Windows, the price you pay for backwards compatibility, is kludge".
OK, yeah, I didn't make the connection between no __m64 on x86-64 and intrinsics. Yeah, the fatbin stuff just makes it harder to think; I wouldn't mind having dedicated binaries. Ideally our builds will mostly be MinGW anyway, where intrinsics and x86-64 work together fine.
Certainly use "p" (link-time constant) constraints for pointers to global variables. Ideally, use "dummy variables" too, to avoid hard-coding any intermediate constants.