RobertBeckebans / RBDOOM-3-BFG

Doom 3 BFG Edition source port with updated DX12 / Vulkan renderer and modern game engine features
https://www.moddb.com/mods/rbdoom-3-bfg
GNU General Public License v3.0

dmap / testmap broken under linux (Ubuntu 19.04) #436

Closed IceReaper closed 2 years ago

IceReaper commented 4 years ago

When compiling a map under linux, some brush triangles are missing, causing the map to leak. Compiling the same map under windows works flawlessly.

I'll try to provide a doom3-compatible testmap showcasing the problem later, as I am using a custom game using RBDOOM3-BFG as its engine.

ericwomer commented 4 years ago

How are you coming along on that custom test map? It doesn't need to be fancy, just something to demonstrate the issue at hand. Also not to be discouraging, but is there a reason you're not using an engine like UE4, Unity, or Godot? They would be more adept at giving you what you would need and want in an engine for a custom game.

BielBdeLuna commented 2 years ago

I have the same errors: dmap skips some random triangles, and I've encountered some portaling errors too. I've made a video that shows it

DanielGibson commented 2 years ago

there's a similar bug report for dhewm3: https://github.com/dhewm/dhewm3/issues/147 I can't reproduce the problem though (with the test map linked in the first post), maybe it has been fixed accidentally.

As a lucky guess, try removing all instances of -ffast-math from CMakeLists.txt. If that doesn't help, replace -O3 with -O2. If that still doesn't help, it's probably not caused by overzealous compiler optimizations and I have no idea what fixed it.

It would probably help if you shared your testmap and not only the video ;)

BielBdeLuna commented 2 years ago

Unfortunately none of those changes to CMakeLists.txt solved the issue. What do -O3 and -O2 do? Are those levels of optimization?

DanielGibson commented 2 years ago

are those levels of optimization?

Yes, -O3 is a higher level of optimization and tends to do more "dangerous" optimizations that can break code that is not 100% standard-conformant (spoiler: no code is).

Could you share your testmap? Otherwise no one can debug this… (and I'd like to test if dhewm3 is affected)

BielBdeLuna commented 2 years ago

Could you share your testmap?

https://github.com/BielBdeLuna/OTE_testmaps/tree/vilcabamba

DanielGibson commented 2 years ago

I can't really reproduce your problems there, but I noticed

  1. I need to use noclip a lot even if there were no obvious barriers (I haven't looked at the map in an editor at all, maybe there's clip brushes or something?)
  2. It doesn't work on dhewm3 at all, unfortunately (everything is black, flashlight doesn't help), even after converting the pngs to tga
  3. there were lots of dmap warnings like `WARNING: brush has multiple area portal sides at 1088 1504 320`. No idea what those warnings mean exactly and if they're actually bad, but as you mentioned portal bugs it might be relevant
BielBdeLuna commented 2 years ago
1. I need to use noclip a lot even if there were no obvious barriers (I haven't looked at the map in an editor at all, maybe there's clip brushes or something?)
3. there were lots of dmap warnings like `WARNING: brush has multiple area portal sides at 1088 1504 320`
   No idea what those warnings mean exactly and if they're actually bad, but as you mentioned portal bugs it might be relevant

it seems that dmap fails at detecting the visportals; the visportals seem not to be registered as such, so the material is invisible but movement through them seems to be impeded. edit the visportals work; those errors seem to be due to the faces of the visportal not creating the visportal, causing such an error

2. It doesn't work on dhewm3 at all, unfortunately (everything is black, flashlight doesn't help), even after converting the pngs to tga

wait, didn't I put the needed TGAs? edit I checked, all the necessary TGAs are there

it seems that everything is failing. If you "reloadDecls", what sort of errors do you get from the materials? Did you put the folder "vilcabamba" as a mod folder, or did you put "OTE_testmaps/vilcabamba" as a mod folder?

edit I downloaded the map on another computer and with RBDoom3BFG it loads correctly; the level compiles (with those errors) and loads correctly (with all the textures working), but still unfortunately with the errors displayed in the video.

DanielGibson commented 2 years ago

wait, didn't I put the needed TGA's? edit I checked all the necessary tgas are there

As everything was black (in dhewm3) I tried adding all the other OTE assets (which used PNGs which I converted) but that didn't help either

But in latest RBDoom3BFG git (master branch) I didn't get those glitches seen in your video. No idea what's going on here, maybe uninitialized variables that for some reason have a less broken random value than on your system? I don't know.

BielBdeLuna commented 2 years ago

all the other OTE assets

ah, ok, but the map wasn't meant to be used with all the assets from OTE, just as a single test map

I pushed a map without visportals; it displays the same triangular errors in the same places on my end when compiled, minus the visportal-related compile warnings.

I wonder, are you using Intel? Both of my computers are Ryzen-based; I wonder if there is something wrong with optimizations on AMD? I don't have an Intel machine to test this possibility.

DanielGibson commented 2 years ago

I'm using a Ryzen 2700X (with a Geforce GTX 970), so AMD vs Intel can be ruled out.

I got your map to work with dhewm3 now, by adjusting the filenames (png => tga) in the materials as well. I still can't reproduce the bug though (neither in dhewm3 nor in RBDoom3BFG) - can you test if it also happens with dhewm3 on your machine?

BielBdeLuna commented 2 years ago

ah, I found the culprit for the black textures: the gray texture that the level uses still references a PNG texture. I don't know why it worked in RBDoom3BFG as it doesn't use PNG for textures. edit that's not true, PNG IS supported. edit in dhewm3 it compiles without the errors in those newly created triangles in the brushes.

I pushed the changes now it should work

DanielGibson commented 2 years ago

So the same map works (meaning: looks correct) if compiled with dhewm3 but not if compiled with RBDoom3BFG?

I don't know why it worked in RBDoom3BFG as it doesn't use PNG for textures

RBDoom3BFG does support loading PNGs. They probably get converted to .bimage before being actually used, but PNG is supported as far as I can tell.

BielBdeLuna commented 2 years ago

yeah, something lies rotten within the source of dmap in RBDoom3BFG, unlike dhewm3. And it has to be in the code that generates the optimized triangles, because I don't see triangles that follow the full brushes I used to create the map, but ones that meet where other brushes collide, so those are optimized triangles failing

RBDoom3BFG does support loading PNGs.

ah, it's true, now I see the readme... sorry :)

BielBdeLuna commented 2 years ago

oh, also OTE compiles that map fine, it's RBDoom3BFG that has some kind of problem.

I'm using the master branch (last commit) and the OpenGL backend of RBDoom3BFG. ...compiling Vulkan to see if there is any luck, but I don't think it will make any difference to the issue.

DanielGibson commented 2 years ago

Damn, just realized that I can reproduce the bug after all, in both RBDoom3BFG and dhewm3 - if I enable ONATIVE in CMake.

So it seems like it's related to optimization after all

BielBdeLuna commented 2 years ago

certainly, with ONATIVE off the map compiles correctly! :) So, looking at the following cmake:

    if(NOT CMAKE_CROSSCOMPILING AND ONATIVE)
        if(CMAKE_SYSTEM_PROCESSOR MATCHES "((powerpc|ppc)64le)|(mips64)")
            add_definitions(-mcpu=native)
        else()
            add_definitions(-march=native)
        endif()
    endif()

the culprit is:

add_definitions(-march=native)

isn't it?

BielBdeLuna commented 2 years ago

I wonder, can we tell the compiler not to optimize some part of the code? some specific source file?

BielBdeLuna commented 2 years ago

first I'll try to see if ONATIVE ON but -O0 clears the error; if so, then I will try adding:

#pragma GCC push_options
#pragma GCC optimize("O0")

to the beginning of every file in dmap and:

#pragma GCC pop_options

at their end.

edit Effectively, -O0 solves the issue for all the compiled code in the engine. Now I'll try to add the pragma commands to the dmap code files in order to optimize everything with the normal settings except the dmap code
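[Editor's note: the pragma mechanism described above can be sketched as a minimal standalone file. This is GCC-specific and the `cross_z` function is a hypothetical stand-in, not the actual dmap code; clang may warn that these pragmas are unsupported.]

```cpp
#pragma GCC push_options
#pragma GCC optimize("O0")   // everything down to pop_options is compiled at -O0

// hypothetical stand-in for a dmap calculation we want un-optimized
static float cross_z(float ax, float ay, float bx, float by)
{
    return ax * by - ay * bx;   // no FMA contraction at -O0
}

#pragma GCC pop_options       // back to the optimization level from the command line
```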

DanielGibson commented 2 years ago

it's something in dmap/optimize.cpp, more specifically in idlib/math/Vector.h included there

haven't narrowed it down further yet

BielBdeLuna commented 2 years ago

in order to clarify and narrow down my test bench:

I'm using GCC 10.3.0 on Ubuntu 21.04

DanielGibson commented 2 years ago

I'm debugging this in dhewm3. I found out the following:

It's somewhere in `bool PointInTri( const idVec3 &p, const mapTri_t *tri, optIsland_t *island )` (from tools/compilers/dmap/optimize.cpp), in combination with `#include "idlib/math/Vector.h"`.

I moved that function into a separate source file (I called it optimize2.cpp) which looks like:

```c++
#pragma GCC push_options
#pragma GCC optimize("O0") // commenting this line out breaks the build
// moving this include above #pragma GCC optimize("O0") breaks the build
// (usually it's included implicitly by dmap.h, but including it first allows
//  to use different optimization settings on it)
#include "idlib/math/Vector.h"
// restore normal optimization for the rest of the file
// => still works as long as Vector.h isn't optimized
// (while only optimizing Vector.h and not PointInTri() breaks it)
#pragma GCC pop_options

#include "tools/compilers/dmap/dmap.h"

bool PointInTri( const idVec3 &p, const mapTri_t *tri, optIsland_t *island ) {
    idVec3 d1, d2, normal;

    // the normal[2] == 0 case is not uncommon when a square is triangulated in
    // the opposite manner to the original

    d1 = tri->optVert[0]->pv - p;
    d2 = tri->optVert[1]->pv - p;
    normal = d1.Cross( d2 );
    if ( normal[2] < 0 ) {
        return false;
    }

    d1 = tri->optVert[1]->pv - p;
    d2 = tri->optVert[2]->pv - p;
    normal = d1.Cross( d2 );
    if ( normal[2] < 0 ) {
        return false;
    }

    d1 = tri->optVert[2]->pv - p;
    d2 = tri->optVert[0]->pv - p;
    normal = d1.Cross( d2 );
    if ( normal[2] < 0 ) {
        return false;
    }

    return true;
}
```

in the original optimize.cpp I just commented out that function and added a prototype for it, like `extern bool PointInTri( const idVec3 &p, const mapTri_t *tri, optIsland_t *island );` so it can be found by the functions using it.

So it appears like something in Vector.h, quite likely idVec3's cross product (just a hunch, haven't checked yet!), gets optimized in a way that breaks the PointInTri() function which causes some faces to be left out

DanielGibson commented 2 years ago

Ok, turns out it's indeed the cross product, can be easily verified without even a second source file: Right before static bool PointInTri( const idVec3 &p, const mapTri_t *tri, optIsland_t *island ) { ... add:

#pragma GCC push_options
#pragma GCC optimize("O0") // commenting this out triggers the bug
// copy of idVec3::Cross() as standalone function
static idVec3 myCross( const idVec3& v1, const idVec3 &v2 ) {
    return idVec3( v1.y * v2.z - v1.z * v2.y, v1.z * v2.x - v1.x * v2.z, v1.x * v2.y - v1.y * v2.x );
}
#pragma GCC pop_options // restore default optimizations

then in PointInTri() replace all three instances of normal = d1.Cross( d2 ); with normal = myCross(d1, d2);

After recompiling and running dmap on the testmap, it should look fine. If you comment out the #pragma GCC optimize("O0") line and run dmap again, it's broken again.

I guess I'll have to look at disassembly next to figure out what's really going wrong.. not that I'm any good at reading assembler

UPDATE: I stripped this further down: The whole cross product isn't even needed, as only the .z component is used like this: if ( normal[2] < 0 ) { (and the optimized code doesn't calculate the rest). So I wrote an even simpler function that does exactly this - and made it non-inline - luckily I could still reproduce the bug with it. It looks like:

#pragma GCC push_options
#pragma GCC optimize("O1") // O0 and O1 work, O2 causes the glitches

static bool norm2lt0( const idVec3& v1, const idVec3 &v2 ) __attribute__((noinline));
static bool norm2lt0( const idVec3& v1, const idVec3 &v2 ) {
    float z = v1.x * v2.y - v1.y * v2.x;
    bool ret = z < 0.0f;
    return ret;
}
#pragma GCC pop_options // restore default optimizations
// followed by PointInTri() implementation

and in PointInTri() I replaced normal = myCross(d1, d2); if ( normal[2] < 0 ) { with if( norm2lt0(d1, d2) ) {

I also found out that with -O1 optimization (like with -O0) the bug doesn't happen but with -O2 it does (at least on my system with GCC 7.5.0)

Here's the disassembly in case anyone can read that (I haven't looked very hard at it yet, I need to look up every single instruction..):

```asm
# broken (-O2)
0000000000000810 <_ZL8norm2lt0RK6idVec3S1_.isra.3>:
    float z = v1.x * v2.y - v1.y * v2.x;
 810:  c5 f2 59 d2             vmulss xmm2,xmm1,xmm2
    bool ret = z < 0;
 814:  c5 f0 57 c9             vxorps xmm1,xmm1,xmm1
static bool norm2lt0( const idVec3& v1, const idVec3 &v2 ) {
 818:  55                      push   rbp
 819:  48 89 e5                mov    rbp,rsp
}
 81c:  5d                      pop    rbp
    float z = v1.x * v2.y - v1.y * v2.x;
 81d:  c4 e2 69 9b c3          vfmsub132ss xmm0,xmm2,xmm3
    bool ret = z < 0;
 822:  c5 f8 2e c8             vucomiss xmm1,xmm0
 826:  0f 97 c0                seta   al
}
 829:  c3                      ret
 82a:  66 0f 1f 44 00 00       nop    WORD PTR [rax+rax*1+0x0]

# working (-O1)
static bool norm2lt0( const idVec3& v1, const idVec3 &v2 ) __attribute__((noinline));
static bool norm2lt0( const idVec3& v1, const idVec3 &v2 ) {
 2b0:  55                      push   rbp
 2b1:  48 89 e5                mov    rbp,rsp
    float z = v1.x * v2.y - v1.y * v2.x;
 2b4:  c5 fa 10 07             vmovss xmm0,DWORD PTR [rdi]
 2b8:  c5 fa 59 46 04          vmulss xmm0,xmm0,DWORD PTR [rsi+0x4]
 2bd:  c5 fa 10 4f 04          vmovss xmm1,DWORD PTR [rdi+0x4]
 2c2:  c5 f2 59 0e             vmulss xmm1,xmm1,DWORD PTR [rsi]
 2c6:  c5 fa 5c c1             vsubss xmm0,xmm0,xmm1
    bool ret = z < 0;
 2ca:  c5 f0 57 c9             vxorps xmm1,xmm1,xmm1
 2ce:  c5 f8 2e c8             vucomiss xmm1,xmm0
 2d2:  0f 97 c0                seta   al
    return ret;
}
 2d5:  5d                      pop    rbp
 2d6:  c3                      ret
 2d7:  66 0f 1f 84 00 00 00    nop    WORD PTR [rax+rax*1+0x0]
 2de:  00 00
```
coldtobi commented 2 years ago

does this happen for certain input values or generally? Do you have some test vectors and expected results?

FWIW: I've been playing with the function at https://godbolt.org/z/4GnzrYsfW ; the code it generates, however, is different, more like the working example, for all O>=1. I've also tried different compiler versions… Also tried clang in the hope that the clang diagnostics have some hints…

BielBdeLuna commented 2 years ago

I can confirm the -O1 doesn't display the bug in RBDoom3BFG. -O2 does.

I've found that page that explains very basically some methods on how to stop the optimization: https://programfan.github.io/blog/2015/04/27/prevent-gcc-optimize-away-code/

maybe we could use this method from the page in order to do the cross operations:

template <typename T> inline void doNotOptimizeAway(T&& datum) {
    asm volatile ("" : "+r" (datum));
}

and then maybe:

static idVec3 myCross( const idVec3& v1, const idVec3 &v2 ) {

    //  v1.y * v2.z - v1.z * v2.y
    float x1;
    doNotOptimizeAway( x1 = v1.y );
    doNotOptimizeAway( x1 *= v2.z );
    float x2;
    doNotOptimizeAway( x2 = v1.z );
    doNotOptimizeAway( x2 *= v2.y );
    float final_x;
    doNotOptimizeAway( final_x = x1 );
    doNotOptimizeAway( final_x -= x2 );

    // v1.z * v2.x - v1.x * v2.z
    float y1;
    doNotOptimizeAway( y1 = v1.z );
    doNotOptimizeAway( y1 *= v2.x );
    float y2;
    doNotOptimizeAway( y2 = v1.x );
    doNotOptimizeAway( y2 *= v2.z );
    float final_y;
    doNotOptimizeAway( final_y = y1 );
    doNotOptimizeAway( final_y -= y2 );

    // v1.x * v2.y - v1.y * v2.x
    float z1;
    doNotOptimizeAway( z1 = v1.x );
    doNotOptimizeAway( z1 *= v2.y );
    float z2;
    doNotOptimizeAway( z2 = v1.y );
    doNotOptimizeAway( z2 *= v2.x );
    float final_z;
    doNotOptimizeAway( final_z = z1 );
    doNotOptimizeAway( final_z -= z2 );

    return idVec3( final_x, final_y, final_z );
    //return idVec3( v1.y * v2.z - v1.z * v2.y, v1.z * v2.x - v1.x * v2.z, v1.x * v2.y - v1.y * v2.x );
}

as explained on that page?

edit I see that my proposed solution somewhat resembles the assembly operations in -O1 for every dimension of the result, only making them "volatile"

BielBdeLuna commented 2 years ago

so does the error appear everywhere the cross product is used, or only in dmap's optimization use of it?

coldtobi commented 2 years ago

My experience says that optimization-or-not is likely a red herring… IMHO it is a subtle bug in the code or (unlikely) a compiler problem. So I'm skeptical about trying to trick the compiler…

One idea: it could be the order of execution: with optimization the compiler is (among other things) allowed to reorder instructions. I've seen situations where this might be problematic, especially if the individual floats use significantly different exponents, so that for example a-b is basically a, if b is so small that the difference cannot be represented… (lacking the words to express myself, sorry if that confuses more than it helps)

one observation: the (broken) assembly snippet from Daniel seems to have omitted the load instructions: it solely works on registers. That could mean that the function has been inlined and gcc ignored the attribute. As noinline is only a hint for the compiler, it is not bound to obey, so this is possible (see also https://gcc.gnu.org/onlinedocs/gcc-4.7.2/gcc/Function-Attributes.html). That could be a hint that the problem is somewhere in the calling code.

PS: https://godbolt.org/z/csfqW56Kh -- observe that test() is not calling norm2lt0() in the assembly code AFAICS (at O3)… If you go O1 it does… So the compiler thinks it can calculate the result without the help of the function…

BielBdeLuna commented 2 years ago

I've got it solved! It seems that my previous solution doesn't work, but Daniel's works perfectly! :)

Optimized at -O3 as per normal, but adding Daniel's function before PointInTri() with a slight change (and also the due replacements in PointInTri()):

static bool norm2lt0( const idVec3& v1, const idVec3 &v2 ) __attribute__((optimize("O0")));
static bool norm2lt0( const idVec3& v1, const idVec3 &v2 ) {

    float z = v1.x * v2.y - v1.y * v2.x;
    return z < 0.0f;
}

no need for the preprocessor pragmas if it's just that function

edit made a pull request with the changes: https://github.com/RobertBeckebans/RBDOOM-3-BFG/pull/573

Maybe we should put norm2lt0 in the general vector file? I propose renaming norm2lt0 to NormalCrossIsNegative ("only true if the resulting cross product's normal is negative")

BielBdeLuna commented 2 years ago

I added a "CrossZisNegative" function to Vector.h, inside idVec3 below the Cross function:

    bool            CrossZisNegative( const idVec3& a ) const __attribute__((optimize("O0")));

afterwards I could inline it after the inlined cross function:

ID_INLINE bool idVec3::CrossZisNegative( const idVec3& a ) const {
    float cross_z = x * a.y - y * a.x;
    return cross_z < 0.0f;
}

and eventually I updated the three cases in PointInTri to use CrossZisNegative:

    //normal = d1.Cross( d2 );
    if( d1.CrossZisNegative( d2 ) )
    {
        return false;
    }

also now in PointInTri normal is no longer needed

it works fine.

I haven't made this change in the pull request yet. Do you think we need this function in Vector.h, or is it too specific to this problem to be worth generalizing? In fact, I couldn't find another instance of normal[2] < 0 anywhere else in the whole /neo/ folder

DanielGibson commented 2 years ago

one obersavtion: the (broken) assembly snippet from Daniel seems to have omitted the load instructions: It soley works on registers. That could mean that the function has been inlined and gcc ignoring the attribute.

I think that the x86_64 calling convention (on Unices) passes suitable arguments in those xmmN registers, so it should be fine to operate directly on them instead of doing loads. In my -O2 code that function indeed was called if the noinline attribute was set.

That could be the order of execution: With optimization the compiler is (beside other things) allowed to reorder instructions.

That's likely - note that the -O2 version uses a multiplication (vmulss) and the vfmsub132ss instruction (which multiplies and then subtracts), while the -O1 version does two multiplications and then a subtraction like one would expect.

One set of values that triggers the bug is:

```
d1: {x = -277.333313, y = -69.3333282, z = 0} or {0xc38aaaaa 0xc28aaaaa 0x0}
d2: {x = -341.333313, y = -85.3333282, z = 0} or {0xc3aaaaaa 0xc2aaaaaa 0x0}
```

The floating point values are not exact, I think, but those hex values are bit-wise copies of the floats, i.e. exact, so they can be converted like this: `uint32_t x = 0xc38aaaaa; float res; memcpy(&res, &x, 4);`

The cross product's .z (or [2]) value of d1 and d2 is very close to zero.

I'll go on investigating later today, I just got up..

It's possible that a better solution would be to replace if ( normal[2] < 0 ) with if ( normal[2] < -0.00001f ) (or some similar value), i.e. using a little bit of tolerance
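[Editor's note: the tolerance idea can be sketched like this; the function name and epsilon value are illustrative only, not code from the engine.]

```cpp
// Hedged sketch: treat a z value that is only slightly negative (possibly
// just FMA rounding error) as "not negative". Epsilon is arbitrary here.
static bool crossZNegativeWithTolerance(float z)
{
    const float kEpsilon = 0.00001f;  // the -0.00001f suggested above
    return z < -kEpsilon;
}
```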

BielBdeLuna commented 2 years ago

I tried if ( normal[2] < -0.00001f ) and it's still being "optimized", but with if ( normal[2] < -0.01f ) it works. So it's not an optimization problem? It's a precision problem? Maybe the optimization discards some precision from that value?

DanielGibson commented 2 years ago

here's a compiler explorer link with testcode: https://gcc.godbolt.org/z/r1983Edrq

TBH I'm surprised that the results are that different - 0.0 vs -0.0007595485 is not just a super small deviation..

DanielGibson commented 2 years ago

Ok, I think I know why this happens and I think it's a compiler bug in GCC - clang does not behave like this, unless you use -ffast-math in which case you should expect broken math code. RBDoom3BFG uses it (though with an additional flag or two to prevent some kinds of problems), dhewm3 does not.

As a reminder, the calculation going wrong is float z = v1.x * v2.y - v1.y * v2.x. If v1.x * v2.y and v1.y * v2.x are identical, z should obviously be 0.0. This isn't the case here - even if v1 and v2 are identical, it doesn't return 0, when compiling with GCC, enabling the FMA x86 extension (like with -march=znver1) and using -O2 or above.

From the disassembly of the broken version:

vmulss xmm2,xmm1,xmm2 // this should be the `v1.y * v2.x` part
...
vfmsub132ss xmm0,xmm2,xmm3 // this should be the `v1.x * v2.y - former_result` part

(I'll ignore that this is operating on __m128 4-float vectors; that's not relevant here, somehow it works out)

The problem is that vfmsub132ss instruction. Its documentation says:

Multiplies the low packed single-precision floating-point value from the first source operand to the low packed single-precision floating-point value in the third source operand. From the infinite precision intermediate result, subtracts the low packed single-precision floating-point values in the second source operand, performs rounding and stores the resulting packed single-precision floating-point value to the destination operand (first source operand).

Now I'm not exactly sure what "infinite precision" is supposed to mean here, but it'll certainly be more than 32bit precision, let's just assume 64bit (double) precision. This means that our v1.x * v2.y - v1.y * v2.x is basically transformed to

float a = v1.y * v2.x; // vmulss xmm2,xmm1,xmm2 - stored in single precision floats

double intermediate = (double)v1.x * (double)v2.y; // first step of vfmsub132ss - "infinite precision intermediate"

float result = intermediate - (double)a; // second step of vfmsub132ss

So if v1.x * v2.y and v1.y * v2.x are identical, result will contain the precision loss you'd get from doing the calculation in double and then storing it in a single-precision float. Now keep in mind that floats have relative precision - the bigger the number (absolute, negative or not doesn't matter), the smaller the precision in number of decimals, see https://blog.demofox.org/2017/11/21/floating-point-precision/ for details.

In the compiler explorer testcase I linked above, v1 = (-277.3333129883, -69.3333282471, 0.000000) and v2 = (-341.3333129883, -85.3333282471, 0.000000) and v1.x*v2.y is about 23 665.775, according to my calculator. Between 16 384 and 32 768 32bit floats have a precision of about 0.001953125, while 64bit doubles have a precision of about 0.0000000000036379788, so the difference between a and intermediate in the pseudo-code above could be up to around 0.00195 even if the values should be identical - so the actual deviation of -0.0007595485 isn't surprising anymore. (Note that -0.0007595485 can easily be stored in a single-precision float, because the precision around 0 is much better, for example for 0.25 it's 0.000000029802322 and for even smaller values it's even better. So that difference can easily be stored in a float)
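[Editor's note: the rounding argument above can be reproduced in isolation with std::fmaf, which performs the same fused multiply-subtract as vfmsub132ss. This is a sketch, not engine code; the function names are made up, and the bit patterns are the ones Daniel posted earlier in the thread.]

```cpp
#include <cmath>    // std::fmaf
#include <cstdint>
#include <cstring>  // std::memcpy

// build a float from the exact bit pattern quoted earlier in the thread
static float from_bits(std::uint32_t bits)
{
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// z of the cross product with two separately rounded products, like the
// working -O1 code; volatile keeps the compiler from fusing into an FMA
static float cross_z_strict(float ax, float ay, float bx, float by)
{
    volatile float p1 = ax * by;
    volatile float p2 = ay * bx;
    return p1 - p2;  // both products rounded the same way => exactly 0 here
}

// what vfmsub132ss effectively does: first product kept at higher
// ("infinite") precision, second product rounded to float first
static float cross_z_fma(float ax, float ay, float bx, float by)
{
    return std::fmaf(ax, by, -(ay * bx));
}
```

With the values from the thread, `cross_z_strict` returns exactly 0.0f while `cross_z_fma` returns a small nonzero value (the rounding error of the second product), matching the observed breakage.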

The working ASM code (when using -O0 or -O1 or if the FMA CPU extension isn't used) just does two multiplications (storing the results in a 32bit floats) and then subtracts the results, so it returns exactly 0.0 for these kind of cases.

I don't think compilers should do an "optimization" like that. I'm not familiar with the relevant standards (probably IEEE float standard and the C++ standard?), but the GCC manpage itself has this entry:

-fassociative-math Allow re-association of operands in series of floating-point operations. This violates the ISO C and C++ language standard by possibly changing computation result. NOTE: re-ordering may change the sign of zero as well as ignore NaNs and inhibit or create underflow or overflow (and thus cannot be used on code that relies on rounding behavior like (x + 2**52) - 2**52. May also reorder floating-point comparisons and thus may not be used when ordered comparisons are required.

Which suggests that floating point operations can't be arbitrarily reordered or rounding behavior changed without violating the standard - I'd argue that what's happening here is very similar (this option is implicitly enabled by -ffast-math)

BielBdeLuna commented 2 years ago

I guess -ffast-math speeds up math for the renderer, doesn't it? I mean, we can go without it, can't we?

what if we keep this function out of the optimization and look for a way to stop the optimization under MSVC too, or a workaround for MSVC, like a function specific for Windows?

also:

float z = v1.x * v2.y - v1.y * v2.x. If v1.x * v2.y and v1.y * v2.x are identical, z should obviously be 0.0.

maybe we can compute those results separately and, via code, make z 0.0?

DanielGibson commented 2 years ago

I guess -ffast-math speeds up math for the renderer, isn't it? I mean we can go without it, isn't it?

It should be disabled, it's garbage.

But even without -ffast-math GCC miscompiles that function.

I don't think MSVC has that problem, if we used some GCC-specific pragma or function annotation as a workaround that must be guarded by #ifdef __GNUC__ or something like that

BielBdeLuna commented 2 years ago

But even without -ffast-math GCC miscompiles that function.

but at -O0 it doesn't display the error; is it still being miscompiled then?

DanielGibson commented 2 years ago

Did you read my posts? GCC does this if the FMA CPU extension is enabled and if -O2 or above are used. Not with -O0, not with -O1.

BielBdeLuna commented 2 years ago

Did you read my posts?

yes but it's not an easy issue to follow for me, that's why I'm asking.

by FMA CPU you mean the Fused Multiply-Add extension, right? Is that extension enabled by -ffast-math, or are those two things different? -march=znver1 is never passed in the CMakeLists.txt; is it implied by -ffast-math?

DanielGibson commented 2 years ago

I updated the post to make things a bit clearer without having to keep all the former posts (that investigated which compiler settings trigger it) in mind.

by FMA CPU you mean Fused Multiply-Add extensions?

yes

are those extension enabled with the -ffast-math

No, it's enabled with -march=native (if the host CPU supports it), or -march=znver1 or some other architecture that supports it, or probably -mfma.

Probably using that extension isn't a problem in general, but using it to optimize this specific kind of operation (basically a*b - c*d) is a bad idea. Clang only uses it in this case if -ffast-math is set (which is ok, -ffast-math means "you're free to fuck up my math code in non-standard ways as long as we can pretend it gets faster"), GCC also does it without -ffast-math if -O2 (or higher) is set (which is not ok, because just -O2 or -O3 is not supposed to violate the standard, and at least -O2 should avoid "dangerous" optimizations in general). There might be a specific -f option that's implicitly enabled by -O2 which controls this with GCC, I don't know.

UPDATE: Apparently the flag that enables this (bad) optimization is -fexpensive-optimizations - which unfortunately is very unspecific and might include other optimizations that are generally desirable, so it's not good as a global option. Luckily, #pragma GCC optimize("-fno-expensive-optimizations") instead of #pragma GCC optimize("O1") seems to work
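[Editor's note: the per-function escape hatch mentioned in the UPDATE can be sketched as a function attribute instead of a pragma. GCC-specific; the macro and function names here are hypothetical, not from the engine.]

```cpp
// GCC-only sketch: disable just -fexpensive-optimizations for one function
// instead of dropping the whole file to -O1/-O0 (per GCC's optimize attribute)
#if defined(__GNUC__) && !defined(__clang__)
#define NO_EXPENSIVE_OPTS __attribute__((optimize("-fno-expensive-optimizations")))
#else
#define NO_EXPENSIVE_OPTS
#endif

// hypothetical stand-in for the cross product z component
static float NO_EXPENSIVE_OPTS cross_z(float ax, float ay, float bx, float by)
{
    return ax * by - ay * bx;
}
```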


I'm still not sure how to properly fix this issue. At first I thought adding some tolerance for the comparison is a good idea, but OTOH then some points that even with non-broken code would be considered outside the triangle (=> small negative numbers) suddenly get treated differently. As this is (IMO) a compiler bug, maybe using pragmas or function attributes to reduce optimization to -O1 (if compiling with GCC) for the cross products might be a better solution after all. Either way, -ffast-math should be completely removed; see dhewm3's CMakeLists.txt for what related options should be safe.

BielBdeLuna commented 2 years ago

ah, ok, znver1 is the extension set of Zen CPUs (version 1, whatever that is)

So if you attribute the cross product function with -O1 or -O0, then even with GCC it's not optimized, right? As you said, with clang you don't hit the problem, so maybe we could use -O0 just for GCC as you pointed out already, and then we would only need to find a way to not optimize it for Windows too. I think it's done with the /O options (I don't know anything about MSVC, sorry)

maybe we could use -O2 as the standard instead of -O3 and test how much slower the engine gets without -ffast-math.

edit so, establishing an -O2 compile without -ffast-math and no other changes still displays the error in dmap. It's only when you go -O1 that the issue is not displayed.

edit 2 changed the cross function in idVec3 in vector.h:

#ifdef __GNUC__
    idVec3          Cross( const idVec3& a ) const __attribute__((optimize("O1")));
#else
    idVec3          Cross( const idVec3& a ) const;
#endif

and compiled with -O2 and without -ffast-math without further changes. it doesn't display the issue.

shouldn't this be it?

DanielGibson commented 2 years ago

See https://github.com/dhewm/dhewm3/commit/320c15f63ad7949e475c2203355d74ddec385871 for a fix

DanielGibson commented 2 years ago

GCC bugreport for this: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100839

BielBdeLuna commented 2 years ago

so it's -ffp-contract=off; have you tested it in dhewm3?

edit let me tell you, we have a winner here! -O3 and -ffp-contract=off and the issue is gone! No other changes, just changes in the CMakeLists.txt!

how do you check the assembly the compiler generates? and how can you be sure that all the vmulss and vfmsub132ss hocus pocus in the generated code are the right instructions? :)

coldtobi commented 2 years ago

I've used Daniel's Compiler Explorer setup to try something: splitting the products into separate statements seems to hint the compiler; at least Compiler Explorer says it works with -O2 and Daniel's test vectors too.

static float crossZ_O2( const idVec3& v1, const idVec3& v2 ) {
    float intermediate1 = v1.x * v2.y;
    float intermediate2 = v1.x * v2.y;
    float z = intermediate1 - intermediate2;
    return z;
}

(the Compiler Explorer version has some extra lines, commented out, that I used to print the intermediate values; I've omitted them here)

Just wanted to share this, I don't know if this is suitable.

DanielGibson commented 2 years ago

Nice idea (and I think I even read somewhere that this kind of optimization shouldn't be done in that case, at least in C?), but float intermediate2 = v1.x * v2.y; should've been v1.y * v2.x - and with that change it doesn't seem to work: the same ASM is generated as with the original code :-/

I'm annoyed that the GCC developers don't even seem to see any problem with this behavior. They must really hate their users..

coldtobi commented 2 years ago

thanks for seeing my damn copy+paste f*up, Daniel. Yes, the gcc folks are difficult… /me wonders if this bug triggers on other math stuff in the game as well.

I'll try another thing: using strict C++11 mode… (Daniel did that already, I just saw it in the gcc bug report…) No need to repeat that.

coldtobi commented 2 years ago

Another idea: in the gcc bugreport they said "C is just different here from C++ :)" … and if I didn't do something stupid this time, it worked: wrapping the function in extern "C" { … }:

https://godbolt.org/z/1brM46bjc

BielBdeLuna commented 2 years ago

but if other functions might be corrupted by contract optimization too, wouldn't it make more sense to just use -ffp-contract=off instead of going instruction by instruction, testing whether the math is correct?

I think their idea is: "you asked for optimization and that's what you got; you weren't forced to optimize your code, so it's to be expected that optimization might break it" - something along the lines of "if you want speedy code, write better code rather than relying on the compiler to optimize it", isn't it?

I think it's more a problem of the compiler's documentation failing to explain this behaviour and how to work around it (with that single flag) than of the gcc devs hating the gcc users.

so this might also mean that these same optimizations could break more stuff, and we might have to scan the documentation for more flags to stop optimization troubles elsewhere.

DanielGibson commented 2 years ago

wrapping the function in extern "C" { … } :

oh my god. nice that it works but..

I still don't get that the GCC people think that it's completely normal and fine that you can't implement a simple cross product with their "compiler" without using obscure compilerflags, pragmas or hacks. This is completely unacceptable - it's worse than -fstrict-aliasing, because this definitely is valid code and -O2 is not a flag expected to break code. This will cause a lot of trouble (and countless wasted developer hours from debugging) once more people enable FMA support - I think the only reason this hasn't blown up spectacularly/publicly (yet) is that by default on x86_64 it's not used.

I'm thinking about ditching GCC and explicitly removing support for it from dhewm3 (with build errors or something) and tell people to use clang instead. It's time GCC loses the last relevance it still has.

BielBdeLuna commented 2 years ago

in our case we are not just asking the compiler to do a simple cross product, we are also asking it to optimize it.

what do you think? Should I make a pull request with the simple changes to CMakeLists.txt?