Quuxplusone opened this issue 4 years ago
Bugzilla Link | PR47559
Status | CONFIRMED |
Importance | P normal |
Reported by | Fabian Knorr (bugs-llvm@fabian-knorr.info) |
Reported on | 2020-09-17 01:52:29 -0700 |
Last modified on | 2021-01-20 09:52:19 -0800 |
Version | 10.0 |
Hardware | PC Linux |
CC | llvm-bugs@lists.llvm.org, tra@google.com |
Which CUDA version are you using?
Which compilation do you have in mind when you say 'compiling for CUDA'?
Host compilation? Device? Both?
I cannot reproduce the issue with a recent clang:
% bin/clang++ -x cuda --cuda-path=$HOME/local/cuda-10.1 --cuda-gpu-arch=sm_75 -c -std=gnu++17 /dev/null
% bin/clang --version
clang version 12.0.0
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /usr/local/build/release/bin
The error message for the command line in the OP might depend on the standard
library version installed. The essence of the problem can be reproduced with
this code:
#ifdef __FLOAT128__
extern __float128 f;
#endif
Compiling this with clang -x cuda produces
/tmp/float128.cu:2:8: error: __float128 is not supported on this target
extern __float128 f;
^
1 error generated when compiling for sm_61.
The declaration should have been excluded by the preprocessor guard, but
__FLOAT128__ and __SIZEOF_FLOAT128__ are incorrectly defined:
$ clang -x cuda -dM -E /dev/null --cuda-gpu-arch=sm_75 | grep FLOAT128 | sort | uniq
#define _GLIBCXX_USE_FLOAT128 1
#define __FLOAT128__ 1
#define __HAVE_DISTINCT_FLOAT128 0
#define __HAVE_DISTINCT_FLOAT128X __HAVE_FLOAT128X
#define __HAVE_FLOAT128 0
#define __HAVE_FLOAT128X 0
#define __HAVE_FLOAT128_UNLIKE_LDBL (__HAVE_DISTINCT_FLOAT128 && __LDBL_MANT_DIG__ != 113)
#define __LDOUBLE_REDIRECTS_TO_FLOAT128_ABI 0
#define __SIZEOF_FLOAT128__ 16
I'm not sure whether this affects both host and device compilation.
Clang:
clang version 11.0.1
Target: x86_64-pc-linux-gnu
Thread model: posix
CUDA: both 10.1 (the latest officially supported by Clang) and 11.2.
TL;DR: Everyone compiling projects with CUDA may need to use `-mno-float128`.
The question is: how should it all work in the end?
Given:
* Host-side compilation does support float128.
* GPU-side compilation does not support float128.
* We want preprocessed code to look the same for host and GPU compilations of the same TU.
* We want to maintain a consistent set of features across all host TUs in the build, and that includes consistency with C++ compilations (the sketch below illustrates what goes wrong otherwise).
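To make the consistency concern concrete, here is a hypothetical TU (not from the report) where defining __FLOAT128__ for the host pass only would change a type's layout between the two halves of the same compilation:

struct Value {
#ifdef __FLOAT128__
  __float128 wide;  // present in the host pass only
#else
  double wide;      // device fallback
#endif
};
// Host and device now disagree about sizeof(Value), so passing a Value*
// to a kernel becomes a silent ABI/ODR mismatch.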
IMO, the only ways to fix the problem are either to add support for float128 on
the GPU or to disable float128 for all compilations at the build level. The
former is unlikely to happen any time soon. The latter has to be done by the
individual end user in their build.
If we only disable float128 for CUDA compilations, that would help maintain
host/GPU consistency, but it would leave us open to issues in mixed C++/CUDA
apps, unless the build is changed to make sure float128 is disabled everywhere.
Another downside to disabling float128 in clang for CUDA only is that it would
make it very hard to spot and deal with issues caused by a C++/CUDA mismatch in
supported features. I'd rather fail in a painfully obvious way than make
compilation work at the price of hard-to-debug issues elsewhere. Chasing ODR
violations is not fun at all.
This might be a separate bug, but the -mno-float128 flag is ignored entirely
(C, C++ and CUDA):
$ clang -mno-float128 -x c -dM -E /dev/null | grep FLOAT128__ | sort | uniq
$ clang -mno-float128 -x c++ -dM -E /dev/null | grep FLOAT128__ | sort | uniq
$ clang -mno-float128 -x cuda -dM -E /dev/null | grep FLOAT128__ | sort | uniq
all print
clang-11: warning: argument unused during compilation: '-mno-float128' [-Wunused-command-line-argument]
#define __FLOAT128__ 1
#define __SIZEOF_FLOAT128__ 16
and the example from above still fails. The only reliable workaround I've found
is:
-U__FLOAT128__ -U__SIZEOF_FLOAT128__
But that, of course, has the ODR problems you mentioned.
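For reference, applying that workaround to the float128.cu repro from earlier in the thread (same path and GPU arch as above) looks like:

$ clang -x cuda --cuda-gpu-arch=sm_75 -U__FLOAT128__ -U__SIZEOF_FLOAT128__ -c /tmp/float128.cu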
Depending on the implementation complexity, a way around this might be to
support the __float128 representation in device code, but not the actual
operations. __int128 sets an (unintentional?) precedent for this: loading,
storing, and adding __int128s work, but attempting a division fails in the
backend:
fatal error: error in backend: Undefined external symbol "__divti3"
clang-11: error: clang frontend command failed with exit code 70 (use -v to see invocation)
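A minimal sketch of that precedent (the kernel names are mine, not from the report): the addition compiles because the backend can expand it inline, while the division requires the __divti3 runtime function, which does not exist for the GPU target:

__global__ void add128(__int128 *out, const __int128 *a, const __int128 *b) {
  *out = *a + *b;  // works: lowered to inline 128-bit add instructions
}

__global__ void div128(__int128 *out, const __int128 *a, const __int128 *b) {
  *out = *a / *b;  // fails in the backend: undefined external symbol "__divti3"
}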
Now the error message is of course less descriptive than "__int128 is not
supported", but I think this partial support is actually useful. Imagine a
kernel sorting an array of structs containing __float128 members, where the
order does not depend on those floats: the swap operations would still be
well-defined (see the sketch below).
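A sketch of that use case, assuming a hypothetical storage-only __float128 in device code (the names are mine):

struct Record {
  int key;             // the sort compares only this field
  __float128 payload;  // only ever loaded and stored, never computed with
};

__device__ void swap_records(Record &a, Record &b) {
  Record tmp = a;  // member-wise copies: plain loads and stores of the payload
  a = b;
  b = tmp;
}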
A storage-only type might be a reasonable compromise. We would not be able to
generate working code for everything (as in your example with __int128), but
it would at least allow us to avoid divergence from the host during parsing.
I'll look into it when I have time, but it probably will not happen for a while.