Quuxplusone / LLVMBugzillaTest


Creating vector from casted operations is not vectorized #13912

Status: closed by Quuxplusone 6 years ago

Quuxplusone commented 12 years ago
Bugzilla Link: PR13837
Status: RESOLVED FIXED
Importance: P enhancement
Reported by: Weiming Zhao (weimingz@codeaurora.org)
Reported on: 2012-09-13 13:17:55 -0700
Last modified on: 2018-01-23 14:06:28 -0800
Version: trunk
Hardware: PC Windows NT
CC: llvm-bugs@lists.llvm.org, spatel+llvm@rotateright.com
See also: PR35732

When a vector is built by casting each element of another vector, LLVM fails
to generate a vectorized cast. For example:

int4 conv4i(float4 in)
{
  int4 out={ (int)in.x, (int)in.y, (int)in.z, (int)in.w};
  return out;
}
where int4 and float4 are vector data types.

LLVM generates scalar casts. On ARM, for example, the code looks like this:
conv4i:
    vmov    d1, r2, r3
    vmov    d0, r0, r1
    vcvt.s32.f32    s7, s3
    vcvt.s32.f32    s6, s2
    vcvt.s32.f32    s5, s1
    vcvt.s32.f32    s4, s0
    vmov    r0, r1, d2
    vmov    r2, r3, d3
    bx  lr
instead of:
conv4i:
    vmov    d17, r2, r3
    vmov    d16, r0, r1
    vcvt.s32.f32    q8, q8
    vmov    r0, r1, d16
    vmov    r2, r3, d17
    bx  lr

The reason is that DAGCombine, when visiting a BUILD_VECTOR node, does not
check whether all of its elements come from the same kind of cast. If they do,
it should perform the BUILD_VECTOR first and then apply a single vector cast.

Take the C code above as an example. Before DAGCombine, the DAG looks like
this:
                    <float x 4>
      /             |        |          \
 extract_0    extract_1  extract_2  extract_3
     |              |        |          |
   fp2int          fp2int  fp2int      fp2int
     \              |        |          /
                 BUILD_VECTOR(int x 4)

It should be folded into:
                   <float x 4>
      /             |        |          \
 extract_0    extract_1  extract_2  extract_3
      \              |        |          /
               BUILD_VECTOR(float x 4)
                       |
                     fp2int
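
A minimal sketch of this hoisting step, in C on a hypothetical toy node representation (the Node struct, the OP_* opcodes and hoist_cast are illustrative names, not LLVM's SelectionDAG API). It checks that every BUILD_VECTOR operand is an fp2int node and, if so, builds a new BUILD_VECTOR of the cast inputs wrapped in one vector cast. It assumes all casts share the same source and result types; a real combine would presumably also have to check the types and whether the vector cast is legal for the target.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy node kinds; these are NOT LLVM's ISD opcodes. */
typedef enum { OP_VECTOR, OP_EXTRACT_ELT, OP_FP_TO_SINT, OP_BUILD_VECTOR } Opcode;

typedef struct Node {
    Opcode op;
    const struct Node *operands[4]; /* only the first num_operands are used */
    int num_operands;
    int index;                      /* lane number, used by OP_EXTRACT_ELT */
} Node;

/* True if bv is a non-empty BUILD_VECTOR whose operands all have opcode cast. */
static int all_operands_same_cast(const Node *bv, Opcode cast) {
    if (bv->op != OP_BUILD_VECTOR || bv->num_operands == 0)
        return 0;
    for (int i = 0; i < bv->num_operands; i++)
        if (bv->operands[i]->op != cast)
            return 0;
    return 1;
}

/* Rewrite BUILD_VECTOR(fp2int a0, ..., fp2int aN) as
   fp2int(BUILD_VECTOR(a0, ..., aN)). The caller supplies storage for the
   two new nodes. Returns the new root, or NULL if the pattern does not match. */
static const Node *hoist_cast(const Node *bv, Node *new_bv, Node *new_cast) {
    if (!all_operands_same_cast(bv, OP_FP_TO_SINT))
        return NULL;
    assert(bv->num_operands <= 4);

    new_bv->op = OP_BUILD_VECTOR;
    new_bv->num_operands = bv->num_operands;
    new_bv->index = -1;
    for (int i = 0; i < bv->num_operands; i++)
        new_bv->operands[i] = bv->operands[i]->operands[0]; /* drop the scalar cast */

    new_cast->op = OP_FP_TO_SINT;   /* one cast applied to the whole vector */
    new_cast->num_operands = 1;
    new_cast->operands[0] = new_bv;
    new_cast->index = -1;
    return new_cast;
}
```

For the conv4i DAG, passing the BUILD_VECTOR of the four fp2int(extract_i) nodes returns a cast whose operand is a BUILD_VECTOR of the four extracts, which is the shape of the folded diagram.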

Later, the extract nodes and the BUILD_VECTOR cancel each other out, leaving a
single vector fp2int applied to the original <float x 4>.
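
The cancellation can be sketched the same way, again on hypothetical toy nodes rather than LLVM's API (Node, the OP_* opcodes and fold_identity_build_vector are illustrative names). A BUILD_VECTOR whose lane i is extract_i of one source vector, in order and covering every lane of that source, is simply the source vector.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy node kinds; these are NOT LLVM's ISD opcodes. */
typedef enum { OP_VECTOR, OP_EXTRACT_ELT, OP_BUILD_VECTOR } Opcode;

typedef struct Node {
    Opcode op;
    const struct Node *operands[4]; /* only the first num_operands are used */
    int num_operands;
    int index;                      /* lane number, used by OP_EXTRACT_ELT */
    int num_lanes;                  /* lane count of this node's value */
} Node;

/* If lane i of bv is extract(V, i) for one V and every i, and V has exactly
   as many lanes as bv, return V; otherwise return NULL (no fold). */
static const Node *fold_identity_build_vector(const Node *bv) {
    if (bv->op != OP_BUILD_VECTOR || bv->num_operands == 0)
        return NULL;
    assert(bv->num_operands <= 4);

    const Node *src = NULL;
    for (int i = 0; i < bv->num_operands; i++) {
        const Node *e = bv->operands[i];
        if (e->op != OP_EXTRACT_ELT || e->index != i)
            return NULL;            /* wrong node kind or lanes out of order */
        if (src == NULL)
            src = e->operands[0];
        else if (e->operands[0] != src)
            return NULL;            /* lanes come from different vectors */
    }
    return src->num_lanes == bv->num_operands ? src : NULL;
}
```

Applied to the BUILD_VECTOR produced by hoisting the cast, this leaves fp2int operating directly on the original <float x 4>, which matches the single vcvt.s32.f32 q8, q8 in the expected ARM output.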
Quuxplusone commented 6 years ago
Can we resolve this bug? This is handled in IR by SLP vectorization now.
Without SLP, we have:

define <4 x i32> @conv4i(<4 x float> %in) {
entry:
  %0 = extractelement <4 x float> %in, i64 0
  %conv = fptosi float %0 to i32
  %vecinit = insertelement <4 x i32> undef, i32 %conv, i32 0
  %1 = extractelement <4 x float> %in, i64 1
  %conv1 = fptosi float %1 to i32
  %vecinit2 = insertelement <4 x i32> %vecinit, i32 %conv1, i32 1
  %2 = extractelement <4 x float> %in, i64 2
  %conv3 = fptosi float %2 to i32
  %vecinit4 = insertelement <4 x i32> %vecinit2, i32 %conv3, i32 2
  %3 = extractelement <4 x float> %in, i64 3
  %conv5 = fptosi float %3 to i32
  %vecinit6 = insertelement <4 x i32> %vecinit4, i32 %conv5, i32 3
  ret <4 x i32> %vecinit6
}

And after running the SLP vectorizer:

$ ./opt -slp-vectorizer 13837.ll -S |grep fptosi
  %0 = fptosi <4 x float> %in to <4 x i32>
Quuxplusone commented 6 years ago

The bug was reported five years ago. Since it's not an issue now, we can close it.

Quuxplusone commented 6 years ago
I added a test for this example to prevent regression:
https://reviews.llvm.org/rL323269

Feel free to adjust the target for the test if I got that wrong.