Wrong code with optimization on i386 FreeBSD

Quuxplusone commented 5 years ago


Bugzilla Link	PR41224
Status	NEW
Importance	P normal
Reported by	Steve Kargl (sgk@troutmask.apl.washington.edu)
Reported on	2019-03-25 11:04:28 -0700
Last modified on	2019-04-02 13:20:32 -0700
Version	7.0
Hardware	PC FreeBSD
CC	andrew.kaylor@intel.com, craig.topper@gmail.com, dimitry@andric.com, emaste@freebsd.org, htmldeveloper@gmail.com, llvm-bugs@lists.llvm.org
Fixed by commit(s)
Attachments	`a.c` (2092 bytes, text/x-csrc) `without_volitale.s` (8180 bytes, text/plain) `with_volitale.s` (8179 bytes, text/plain)
Blocks
Blocked by
See also

Created attachment 21664
Program that demonstrates the issue

The attached testcase, a.c, demonstrates a code generation issue on FreeBSD
running on an i686 class hardware (i.e., 32-bit i386/387).  FreeBSD sets the
i387 FPU to a 53-bit precision when the FPU is first accessed.  clang or llvm
seems to have no knowledge of this setting and unconditionally assumes a 64-bit
precision.  This leads to wrong for floating point codes that use the 32-bit
float type when optimization is used.  Consider,

gcc8 (FreeBSD Ports Collection) 8.3.0

gcc8 -fno-builtin -O0 -o z a.c -lm && ./z
gcc8 -fno-builtin -O1 -o z a.c -lm && ./z
gcc8 -fno-builtin -O2 -o z a.c -lm && ./z
gcc8 -fno-builtin -O3 -o z a.c -lm && ./z

The above command lines yield

  Maximum ULP: 2.297073
# of ULP > 21: 0

This is the expected result.

gcc8 -fno-builtin -O0 -DKLUDGE -o z a.c -lm && ./z
gcc8 -fno-builtin -O1 -DKLUDGE -o z a.c -lm && ./z
gcc8 -fno-builtin -O2 -DKLUDGE -o z a.c -lm && ./z
gcc8 -fno-builtin -O3 -DKLUDGE -o z a.c -lm && ./z

The above command lines yield

  Maximum ULP: 2.297073
# of ULP > 21: 0

This is the expected result.

Now, consider

FreeBSD clang version 7.0.1 (tags/RELEASE_701/final 349250)
(based on LLVM 7.0.1)
Target: i386-unknown-freebsd13.0
Thread model: posix

/usr/bin/clang -fno-builtin -O0 -o z a.c -lm && ./z

The above command line yields

  Maximum ULP: 2.297073
# of ULP > 21: 0

This is the expected result.

/usr/bin/clang -fno-builtin -O1 -o z a.c -lm && ./z
/usr/bin/clang -fno-builtin -O2 -o z a.c -lm && ./z
/usr/bin/clang -fno-builtin -O3 -o z a.c -lm && ./z

The above command lines yield

  Maximum ULP: 23.061242
# of ULP > 21: 39

This is not the expected result.  In fact, in my numerical testsuite I have
observed 6 digit Max ULP estimates (i.e., only a single digit is correct).

/usr/bin/clang -fno-builtin -O0 -DKLUDGE -o z a.c -lm && ./z
/usr/bin/clang -fno-builtin -O1 -DKLUDGE -o z a.c -lm && ./z
/usr/bin/clang -fno-builtin -O2 -DKLUDGE -o z a.c -lm && ./z
/usr/bin/clang -fno-builtin -O3 -DKLUDGE -o z a.c -lm && ./z

The above command lines yield

  Maximum ULP: 2.297073
# of ULP > 21: 0

which is again the expected results.  The -DKLUDGE option causes
the source to use 'volatile float x, y' instead of just 'float x, y'.
AFAICT, from the generated asm (see attachments), the use of volatile
forces clang to spill/reload x, y (thus, using the correct precision
for the type).

Quuxplusone commented 5 years ago

Attached a.c (2092 bytes, text/x-csrc): Program that demonstrates the issue

Quuxplusone commented 5 years ago

Attached without_volitale.s (8180 bytes, text/plain): Assembly when a.c is compiled without -DKLUDGE

Quuxplusone commented 5 years ago

Attached with_volitale.s (8179 bytes, text/plain): Assembly when a.c is compiled with -DKLUDGE

Quuxplusone commented 5 years ago

Changed the "Importance" field to "normal".

Quuxplusone commented 5 years ago

A discussion of the bug can be found in the FreeBSD toolchain mailing list archive at

https://lists.freebsd.org/pipermail/freebsd-toolchain/2019-March/004458.html

Quuxplusone commented 5 years ago

Myself and Andy Kaylor here at Intel spent a great deal of time playing around with this today using both clang and gcc. I don't believe the precision being set to 53 is necessary to hit the issue. I was able to get the test to report errors on the normal linux configuration.

As I think we mentioned in the freebsd mailing list, we are passing x and y to dp_csinh without rounding to float after the additions. We spill them around the call to csinhf using a 64 bit stack slot. This is due to the X87 codegen treating an fp_extend as nothing more than a copy from float register class to double register class. And the register coalescer rewriting some of the machine IR to use the double register class. This causes spill slot size to be calculated as 64-bits.

Using volatile circumvents this and forces a store as float instead of the spill. This causes the value to be rounded to float before being extended to double.

gcc's codegen seems to be affected by -std=c11 vs -std=gnu11. I believe the difference is really -fexcess-precision=standard vs -fexcess-precision=fast. When -fexcess-precision=fast is in effect, gcc's maximum ULP gets worse if dp_csinh is called before csinhf. Though it does not exceed the limit of 21.

It looks as though -fexcess-precision=standard causes gcc to insert intermediate casts to long double in their intermediate representation. At least on linux. That's what I observed from dumping the output of the 004t.original in godbolt. They might be casts to double on freebsd when the precision is 53 bits. I assume this has some effect on how instructions are generated later.

Another interesting note, I noticed clang does set FLT_EVAL_METHOD to 1 instead of 2 on some versions of NetBSD without SSE. But not for FreeBSD and we don't seem to do anything with the information other than set the define.

Quuxplusone commented 5 years ago

(In reply to Craig Topper from comment #5)
> Myself and Andy Kaylor here at Intel spent a great deal of time playing
> around with this today using both clang and gcc. I don't believe the
> precision being set to 53 is necessary to hit the issue. I was able to get
> the test to report errors on the normal linux configuration.

Craig,

Thanks for taking a look at this issue.  FreeBSD on i386/387
has been using 53-bit precision for likely more than 2 decades.
To save you some code spelunking, one finds (sorry about long url)

https://svnweb.freebsd.org/base/head/sys/x86/include/fpu.h?revision=314436view=markup

lines 186-208 are

/*
* The hardware default control word for i387's and later coprocessors is
* 0x37F, giving:
*
*      round to nearest
*      64-bit precision
*      all exceptions masked.
*
* FreeBSD/i386 uses 53 bit precision for things like fadd/fsub/fsqrt etc
* because of the difference between memory and fpu register stack arguments.
* If its using an intermediate fpu register, it has 80/64 bits to work
* with.  If it uses memory, it has 64/53 bits to work with.  However,
* gcc is aware of this and goes to a fair bit of trouble to make the
* best use of it.
*
* This is mostly academic for AMD64, because the ABI prefers the use
* SSE2 based math.  For FreeBSD/amd64, we go with the default settings.
*/
#define __INITIAL_FPUCW__       0x037F
#define __INITIAL_FPUCW_I386__  0x127F
#define __INITIAL_NPXCW__       __INITIAL_FPUCW_I386__
#define __INITIAL_MXCSR__       0x1F80
#define __INITIAL_MXCSR_MASK__  0xFFBF

The comment above that refers to GCC knowing about this setting can
be traced to

https://gcc.gnu.org/viewcvs/gcc/trunk/gcc/config/i386/freebsd.h?revision=267494&view=markup

lines 120-123 are

/* FreeBSD sets the rounding precision of the FPU to 53 bits.  Let the
   compiler get the contents of <float.h> and std::numeric_limits correct. */
#undef TARGET_96_ROUND_53_LONG_DOUBLE
#define TARGET_96_ROUND_53_LONG_DOUBLE (!TARGET_64BIT)

I've looked through GCC sources, and know the above effects the
processing of gcc/config/i386/i386-modes.def, but that's as far
as I'm able to go at the moment.

--
steve

Quuxplusone / LLVMBugzillaTest

Wrong code with optimization on i386 FreeBSD #40193