Quuxplusone / LLVMBugzillaTest


X86 calling convention issue #37171

Open Quuxplusone opened 6 years ago

Quuxplusone commented 6 years ago
Bugzilla Link PR38198
Status NEW
Importance P enhancement
Reported by Eric Schweitz (eschweitz@nvidia.com)
Reported on 2018-07-17 11:47:58 -0700
Last modified on 2018-08-02 10:13:49 -0700
Version trunk
Hardware PC Linux
CC hfinkel@anl.gov, hideki.saito@intel.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also
First, it's not clear if this is a bug or some design choice. We're seeking
some clarity on the issue.

We have a pure function that we'd like to add to LLVM as a vectorizable
intrinsic. The operation has a signature of

  double opn(double, i32);

The LLVM vectorizer then creates an operation with a signature of

  <2 x double> opnv(<2 x double>, <2 x i32>);

which we've added to the tables supporting the -fveclib option, etc.  So far,
so good.

In GCC, one might write this vector code rather directly as (see below for a
more complete example):

  typedef double v2f64 __attribute__((vector_size (16)));
  typedef int v2i32 __attribute__((vector_size (8)));
  v2f64 opnv(v2f64 d, v2i32 i);

In clang (and gcc), we observe that the second argument i on x86 is declared as
"double" (as can be seen in the LLVM IR dump).

  declare <2 x double> @opnv(<2 x double>, double);

The vector of two i32 values is coerced to a double and passed in half of the
xmm register, e.g.,

  xmm = <i32_1, i32_2, xxx, xxx>

However, the vectorizer prefers the signature (as shown above)

  declare <2 x double> opnv(<2 x double>, <2 x i32>)

In this case, depending on the -mcpu target, it seems we can get a PSHUFD or
VPSHUFD instruction coercing the <2 x i32> to a <2 x i64>. In these cases we
may see either of the following lane assignments:

  xmm = <i32_1, i32_2, i32_2, xxx>
  xmm = <i32_1,  xxx,  i32_2, xxx>

In the first case, the lane assignments are compatible with the clang code for
the argument i. Unfortunately, in the latter case, it appears the <2 x i32>
vector is promoted to <2 x i64> and the called routine can access the wrong
lane for the 2nd i32 value.

Here is a simple LLVM IR to reproduce the code gen.

---
target datalayout = "e-p:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

@a = global <2 x double> <double 4.0, double 5.0>
@b = global <2 x i32> <i32 27, i32 28>

define void @fun() {
L.entry:
        %a = load <2 x double>, <2 x double>* @a, align 8
        %b = load <2 x i32>, <2 x i32>* @b, align 8
        %c = call <2 x double> @whatever(<2 x double> %a, <2 x i32> %b)
        ret void
}

declare <2 x double> @whatever(<2 x double>, <2 x i32>)
---

% llc -mcpu=prescott -O2 veci32.ll
% grep xmm1 veci32.s
        movq    b(%rip), %xmm1          # xmm1 = mem[0],zero
        pshufd  $212, %xmm1, %xmm1      # xmm1 = xmm1[0,1,1,3]
% llc -mcpu=sandybridge -O2 veci32.ll
% grep xmm1 veci32.s
        vpmovzxdq       b(%rip), %xmm1  # xmm1 = mem[0],zero,mem[1],zero

And lastly, here is a C example using the GCC vector extension to pass a
<2 x i32>.

---
typedef int v2si __attribute__ ((vector_size (8)));

typedef double v2d __attribute__ ((vector_size (16)));

v2d do_op(v2si i);

v2d test(v2si e) {
  return do_op(e);
}
---

% clang -O2 -emit-llvm -S gccvecext.c
% grep do_op gccvecext.ll
  %call = tail call <2 x double> @do_op(double %e.coerce) #2
declare dso_local <2 x double> @do_op(double) local_unnamed_addr #1
Quuxplusone commented 6 years ago

See also this conversation on the mailing list: http://lists.llvm.org/pipermail/llvm-dev/2018-June/124374.html

Quuxplusone commented 6 years ago

The X86 vector function ABI might need a clarification on how data shorter than an XMM register needs to be populated. In the case of <2 x i32> on XMM, it needs to be legalized to <4 x i32> and occupy the lower half of that, as opposed to being legalized to <2 x i64>.

From my perspective, this is more the LLVM vector type legalizer's problem. Legalizing <2 x i32> to <4 x i32> is a no-op if you have x86-like packed SIMD. Legalizing <2 x i32> to <2 x i64> is not a no-op. The vector type legalizer should take that target-based aspect into account.

Hope this helps.

Quuxplusone commented 6 years ago

Thanks for your comment, Hideki.

To add to that: the <2 x i32> case was just our first encounter with potential ABI mismatches. We've since run into questions about how a type such as <4 x double> gets legalized on different X86 subtargets. Is it a pair of XMM registers, a single YMM register, both, or neither, depending on the target? It appears this is an ongoing discussion on the mailing list.

Quuxplusone commented 6 years ago
That's the thread I started. :) Being one of the co-authors of this ABI and
part of the team that implemented it in ICC, I never had an issue in
interpreting the ABI. My question was more about how we need to break up the
call, and where in LLVM we should be breaking it up.

What happens beyond one vector register is clearly defined in the vector function ABI:
https://software.intel.com/sites/default/files/managed/b4/c8/Intel-Vector-Function-ABI.pdf
It is based on the ISA class you are using.

GCC doesn't strictly follow the ISA class mangling scheme (and some other
things), but the ISA class concept still exists.

Using the following simple example,

#pragma omp declare simd simdlen(4)
double foo(double x) {
  return x + 1;
}

GCC and ICC with -vecabi=gcc generates

_ZGVdN4v_foo
_ZGVdM4v_foo
_ZGVcN4v_foo
_ZGVcM4v_foo
_ZGVbN4v_foo
_ZGVbM4v_foo

The "M" variants are masked versions. Let's ignore them for simplicity.

_ZGVbN4v_foo() uses two XMMs.

# --- foo..bN4v(double)
_ZGVbN4v_foo:
# parameter 1: %xmm0
# parameter 2: %xmm1

_ZGVcN4v_foo() and _ZGVdN4v_foo() use one YMM.

# --- foo..cN4v(double)
_ZGVcN4v_foo:
# parameter 1: %ymm0

# --- foo..dN4v(double)
_ZGVdN4v_foo:
# parameter 1: %ymm0

So, assuming your target supports YMM, both the two-XMM and the one-YMM forms
are legal. From the caller side, GCC's vector function ABI says both the XMM
and YMM interfaces are available. Call whichever is appropriate and make sure
to use the appropriately mangled name.

From Intel's SVML perspective (one of the Veclibs), we need to extend the
veclib table such that it includes the availability of entry points based on
the target ISA.

The conclusion of the RFC mail thread was that the vectorizer needs to
legalize the call. As such, Veclib needs to provide enough info about what
kinds of entry points are available for which targets and what class of ISA is
used for each entry. That's sufficient to legalize SVML, and that's what we
are looking into.

You might need something similar.
Quuxplusone commented 6 years ago

Given that the RFC concluded that vector type legalization of veclib calls is the vectorizer's responsibility, the vectorizer should also take care of legalizing <2 x i32> into <4 x i32>, rather than depending on a fix in CodeGen's legalizer. Thought I should clarify that. CodeGen's legalizer should still be fixed for performance reasons, however.