Open Quuxplusone opened 6 years ago
Bugzilla Link | PR38198 |
Status | NEW |
Importance | P enhancement |
Reported by | Eric Schweitz (eschweitz@nvidia.com) |
Reported on | 2018-07-17 11:47:58 -0700 |
Last modified on | 2018-08-02 10:13:49 -0700 |
Version | trunk |
Hardware | PC Linux |
CC | hfinkel@anl.gov, hideki.saito@intel.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk |
Fixed by commit(s) | |
Attachments | |
Blocks | |
Blocked by | |
See also |
See also this conversation on the mailing list: http://lists.llvm.org/pipermail/llvm-dev/2018-June/124374.html
The X86 vector function ABI might need a clarification on how data narrower than an XMM register needs to be laid out. In the case of <2 x i32> in an XMM register, it needs to be legalized to <4 x i32> and occupy the lower half of that, as opposed to being legalized to <2 x i64>.
From my perspective, this is more of a problem for LLVM's vector type legalizer. Legalizing <2 x i32> to <4 x i32> is a no-op if you have x86-like packed SIMD. Legalizing <2 x i32> to <2 x i64> is not a no-op. The vector type legalizer should take that target-dependent difference into account.
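To make the difference concrete, here is a minimal sketch in C that models a 16-byte XMM register as a byte array. The type and function names (`xmm_image`, `widen_to_4xi32`, `promote_to_2xi64`, `read_i32_at`) are hypothetical and exist only for illustration; this is not LLVM code. It assumes a little-endian target such as x86, zeroes the unused upper half (which would be unspecified in practice), and uses sign extension for the promotion just to have a definite value.

```c
#include <stdint.h>
#include <string.h>

/* A 16-byte stand-in for the contents of an XMM register. */
typedef struct { unsigned char bytes[16]; } xmm_image;

/* Widen <2 x i32> to <4 x i32>: both elements keep their byte offsets
 * (0 and 4). The upper half is zeroed here only for determinism. */
static xmm_image widen_to_4xi32(const int32_t v[2]) {
    xmm_image r;
    memset(r.bytes, 0, sizeof r.bytes);
    memcpy(r.bytes, v, 2 * sizeof(int32_t));
    return r;
}

/* Promote <2 x i32> to <2 x i64>: each element becomes 64 bits wide,
 * so element 1 moves from byte offset 4 to byte offset 8. */
static xmm_image promote_to_2xi64(const int32_t v[2]) {
    xmm_image r;
    int64_t wide[2] = { v[0], v[1] };
    memcpy(r.bytes, wide, sizeof wide);
    return r;
}

/* Read the 32-bit value stored at a given byte offset. */
static int32_t read_i32_at(const xmm_image *r, size_t offset) {
    int32_t x;
    memcpy(&x, r->bytes + offset, sizeof x);
    return x;
}
```

With the widened form, a callee that expects the two i32 values in the lower half finds them at offsets 0 and 4. With the promoted form, the second value sits at offset 8, so caller and callee disagree unless a shuffle is inserted.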
Hope this helps.
Thanks for your comment, Hideki.
I'll add that the <2 x i32> case was our first issue with potential ABI mismatches. We've since run into questions about how a type such as <4 x double> gets legalized on different X86 subtargets. Is it a pair of XMM registers, a single YMM, both, depends, neither? It appears this is an ongoing discussion on the mailing list.
That's the thread I started. :) Being one of the co-authors of this ABI and
part of the team that implemented this in ICC, I never had an issue in
interpreting the ABI. My question was more about how we need to break things
up, and where in LLVM we should be doing that.
The case that goes beyond one vector register is clearly defined in the vector function ABI.
https://software.intel.com/sites/default/files/managed/b4/c8/Intel-Vector-Function-ABI.pdf
It is based on the ISA class you are using.
GCC doesn't strictly follow the ISA class mangling scheme (and some other
things), but the ISA class concept still exists.
Using the following simple example,
#pragma omp declare simd simdlen(4)
double foo(double x) {
  return x + 1;
}
GCC, and ICC with -vecabi=gcc, generate
_ZGVdN4v_foo
_ZGVdM4v_foo
_ZGVcN4v_foo
_ZGVcM4v_foo
_ZGVbN4v_foo
_ZGVbM4v_foo
"M" are masked versions. Let's ignore them for simplicity.
_ZGVbN4v_foo() uses two XMMs.
# --- foo..bN4v(double)
_ZGVbN4v_foo:
# parameter 1: %xmm0
# parameter 2: %xmm1
_ZGVcN4v_foo() and _ZGVdN4v_foo() each use one YMM.
# --- foo..cN4v(double)
_ZGVcN4v_foo:
# parameter 1: %ymm0
# --- foo..dN4v(double)
_ZGVdN4v_foo:
# parameter 1: %ymm0
So, assuming that your target supports YMM, both the two-XMM and the one-YMM interfaces are legal.
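The relationship between the mangled name and the register count can be sketched as follows. This is a rough illustration in C, not the ABI's full rule set: it only reads the ISA class letter, the mask letter, and the vector length out of a name of the form _ZGV<isa><mask><vlen>..., and rounds up to whole registers. The function names are hypothetical, the element width must be supplied by the caller because the "v" parameter code does not carry the element type, and only the three ISA classes shown in the listings above are handled.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Vector register width in bits for an ISA class letter, 0 if unknown.
 * Only the classes that appear in the listings above are covered. */
static unsigned isa_register_bits(char isa) {
    switch (isa) {
    case 'b': return 128; /* XMM */
    case 'c': return 256; /* YMM */
    case 'd': return 256; /* YMM */
    default:  return 0;
    }
}

/* Number of vector registers one "v" parameter occupies, given the
 * mangled name and the element width in bits. Returns 0 if the name
 * does not look like "_ZGV<isa><N|M><vlen>...". */
static unsigned registers_per_param(const char *mangled,
                                    unsigned element_bits) {
    if (strncmp(mangled, "_ZGV", 4) != 0)
        return 0;
    unsigned reg_bits = isa_register_bits(mangled[4]);
    if (reg_bits == 0)
        return 0;
    if (mangled[5] != 'N' && mangled[5] != 'M')
        return 0;
    if (!isdigit((unsigned char)mangled[6]))
        return 0;
    unsigned long vlen = strtoul(mangled + 6, NULL, 10);
    if (vlen == 0)
        return 0;
    unsigned long total_bits = vlen * element_bits;
    /* Round up: a value narrower than one register still uses one. */
    return (unsigned)((total_bits + reg_bits - 1) / reg_bits);
}
```

For foo(double) with simdlen(4), this gives two registers for the "b" variant and one for the "c" and "d" variants, matching the parameter comments in the listings.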
From the caller side, GCC's vector function ABI says both XMM and YMM
interfaces are available. Call whichever is appropriate and make sure to use
the appropriately mangled name.
From Intel's SVML perspective (one of the Veclibs), we need to extend the
veclib table such that it includes the availability of each entry point based
on the target ISA.
The conclusion of the RFC mail thread was that the vectorizer needs to legalize
the call. As such, Veclib needs to provide enough info about what kind of entry
points are available for what target and what class of ISA is used for each
entry. That's sufficient to legalize SVML, and thus that's what we are looking
into.
You might need something similar.
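As a rough sketch of what such an extended table could look like, the C fragment below tags each entry with its ISA class and lets the caller pick the best entry its target can use. Everything here is hypothetical (`veclib_entry`, `kTable`, `isa_rank`, `pick_entry`); it is not LLVM's actual veclib table. It also assumes that a target supporting a later class can call entries of an earlier one, ordered b, c, d as in the example above.

```c
#include <stddef.h>
#include <string.h>

/* One vector entry point, tagged with the ISA class it was built for. */
typedef struct {
    const char *scalar_name;
    const char *vector_name;
    unsigned    vlen;
    char        isa_class;
} veclib_entry;

/* Hypothetical table using the names from the example above. */
static const veclib_entry kTable[] = {
    { "foo", "_ZGVbN4v_foo", 4, 'b' },
    { "foo", "_ZGVcN4v_foo", 4, 'c' },
    { "foo", "_ZGVdN4v_foo", 4, 'd' },
};

/* Ordering assumption: b < c < d, 0 for anything unknown. */
static int isa_rank(char isa) {
    switch (isa) {
    case 'b': return 1;
    case 'c': return 2;
    case 'd': return 3;
    default:  return 0;
    }
}

/* Return the mangled name of the highest-ranked entry that matches the
 * scalar name and vector length and is usable on target_isa, or NULL. */
static const char *pick_entry(const char *scalar, unsigned vlen,
                              char target_isa) {
    const veclib_entry *best = NULL;
    int target = isa_rank(target_isa);
    for (size_t i = 0; i < sizeof kTable / sizeof kTable[0]; ++i) {
        const veclib_entry *e = &kTable[i];
        int rank = isa_rank(e->isa_class);
        if (strcmp(e->scalar_name, scalar) != 0 || e->vlen != vlen)
            continue;
        if (rank == 0 || rank > target)
            continue;
        if (best == NULL || rank > isa_rank(best->isa_class))
            best = e;
    }
    return best ? best->vector_name : NULL;
}
```

A caller targeting class "b" gets the two-XMM entry, while a caller targeting "d" gets the one-YMM entry. A NULL result tells the vectorizer that no entry exists for that vector length and the call has to be split or left scalar.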
Given that the RFC concluded that vector type legalization of veclib calls is the vectorizer's responsibility, the vectorizer should also take care of legalizing <2 x i32> into <4 x i32>, without depending on a fix to CG's legalizer for that. Thought I should clarify that. CG's legalizer should still be fixed for performance reasons, however.