Quuxplusone / LLVMBugzillaTest


__FLOAT128__ macro incorrectly defined for CUDA target #46528

Open Quuxplusone opened 4 years ago

Quuxplusone commented 4 years ago
Bugzilla Link PR47559
Status CONFIRMED
Importance P normal
Reported by Fabian Knorr (bugs-llvm@fabian-knorr.info)
Reported on 2020-09-17 01:52:29 -0700
Last modified on 2021-01-20 09:52:19 -0800
Version 10.0
Hardware PC Linux
CC llvm-bugs@lists.llvm.org, tra@google.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also
The CUDA target does not support __float128, but Clang defines __FLOAT128__
regardless. This breaks feature detection in libstdc++, causing this trivial
invocation to fail:

> /usr/bin/clang++ -x cuda --cuda-gpu-arch=sm_75 -c -std=gnu++17 /dev/null
Output:

> In file included from <built-in>:1:
> In file included from /usr/lib/clang/10.0.1/include/__clang_cuda_runtime_wrapper.h:36:
> In file included from /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/10.2.0/../../../../include/c++/10.2.0/cmath:47:
> /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/10.2.0/../../../../include/c++/10.2.0/bits/std_abs.h:103:7: error: __float128 is not supported on this target
>   abs(__float128 __x)

It appears that a check is missing in clang/lib/Basic/Targets/OSTargets.h: HasFloat128 should not be set when compiling for CUDA.
Quuxplusone commented 4 years ago
Which CUDA version are you using?
Which compilation do you have in mind when you say 'compiling for CUDA'?
Host compilation? Device? Both?

I cannot reproduce the issue with a recent clang:

% bin/clang++ -x cuda --cuda-path=$HOME/local/cuda-10.1 --cuda-gpu-arch=sm_75 -c -std=gnu++17 /dev/null
% bin/clang --version
clang version 12.0.0
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /usr/local/build/release/bin
Quuxplusone commented 3 years ago
The error message for the command line in the OP might depend on the standard
library version installed. The essence of the problem can be reproduced with
this code:

    #ifdef __FLOAT128__
    extern __float128 f;
    #endif

Compiling this with clang -x cuda produces

    /tmp/float128.cu:2:8: error: __float128 is not supported on this target
    extern __float128 f;
        ^
    1 error generated when compiling for sm_61.

The declaration should have been removed by the preprocessor, but __FLOAT128__ and __SIZEOF_FLOAT128__ are incorrectly defined:

    $ clang -x cuda -dM -E /dev/null --cuda-gpu-arch=sm_75 | grep FLOAT128 | sort | uniq
    #define _GLIBCXX_USE_FLOAT128 1
    #define __FLOAT128__ 1
    #define __HAVE_DISTINCT_FLOAT128 0
    #define __HAVE_DISTINCT_FLOAT128X __HAVE_FLOAT128X
    #define __HAVE_FLOAT128 0
    #define __HAVE_FLOAT128X 0
    #define __HAVE_FLOAT128_UNLIKE_LDBL (__HAVE_DISTINCT_FLOAT128 && __LDBL_MANT_DIG__ != 113)
    #define __LDOUBLE_REDIRECTS_TO_FLOAT128_ABI 0
    #define __SIZEOF_FLOAT128__ 16

Not sure if it affects both host and device compilation.

Clang:

    clang version 11.0.1
    Target: x86_64-pc-linux-gnu
    Thread model: posix

CUDA: Both 10.1 (latest officially supported by Clang) as well as 11.2.
Quuxplusone commented 3 years ago
TL;DR: Everyone compiling projects with CUDA may need to use `-mno-float128`.

The question is: how should it all work in the end?

Given:
* Host-side compilation does support __float128.
* GPU-side compilation does not support __float128.
* We want preprocessed code to look the same for host/GPU compilations of the
same TU.
* We want to maintain a consistent set of features across all host TUs in the
build, including consistency with C++ compilations.

IMO, the only ways to fix the problem are either to add support for float128 on
the GPU or to disable float128 for all compilations at the build level. The
former is unlikely to happen any time soon. The latter has to be done by the
individual end user in their build.

If we only disable float128 for CUDA compilations, that would help maintain
host/GPU consistency, but it would leave us open to issues in mixed C++/CUDA
apps unless the build is changed to make sure float128 is disabled everywhere.

Another downside to disabling float128 in clang for CUDA only is that it would
make it very hard to spot and deal with issues caused by a C++/CUDA mismatch in
supported features. I'd rather fail in a painfully obvious way than make
compilation work at the price of hard-to-debug issues in other places. Chasing
ODR violations is not fun at all.
Quuxplusone commented 3 years ago
This might be a separate bug, but the -mno-float128 flag is ignored entirely
(for C, C++, and CUDA):

    $ clang -mno-float128 -x c -dM -E /dev/null | grep FLOAT128__ | sort | uniq
    $ clang -mno-float128 -x c++ -dM -E /dev/null | grep FLOAT128__ | sort | uniq
    $ clang -mno-float128 -x cuda -dM -E /dev/null | grep FLOAT128__ | sort | uniq

all print

    clang-11: warning: argument unused during compilation: '-mno-float128' [-Wunused-command-line-argument]
    #define __FLOAT128__ 1
    #define __SIZEOF_FLOAT128__ 16

and the example from above still fails. The only reliable workaround I've found
is

    -U__FLOAT128__ -U__SIZEOF_FLOAT128__

But that, of course, has the ODR problems you mentioned.
Quuxplusone commented 3 years ago
Depending on the implementation complexity, a way around this might be to
support the __float128 representation in device code, but not the actual
operations. __int128 sets an (unintentional?) precedent for this: loading,
storing, and adding __int128 values works, but attempting a division fails in
the backend:

    fatal error: error in backend: Undefined external symbol "__divti3"
    clang-11: error: clang frontend command failed with exit code 70 (use -v to see invocation)

The error message is of course less descriptive than "__int128 is not
supported", but I think this partial support is actually useful. Imagine a
kernel sorting an array of structs containing __float128 members, where the
order does not depend on those floats: the swap operations would still be
well-defined.
Quuxplusone commented 3 years ago
A storage-only type might be a reasonable compromise. We would not be able to
generate working code for everything (as in your example with __int128), but
it would at least allow us to avoid divergence from the host during parsing.

I'll look into it when I have time, but it probably will not happen for a while.