Loop-idiom recognition for memset in the inner-loop of a nested-loop interferes with vectorization #31826

Open Quuxplusone opened 7 years ago

Quuxplusone commented 7 years ago
Bugzilla Link PR32854
Status NEW
Importance P enhancement
Reported by Bryce Adelstein Lelbach aka wash (brycelelbach@gmail.com)
Reported on 2017-04-28 22:18:24 -0700
Last modified on 2017-04-29 14:48:30 -0700
Version trunk
Hardware PC Linux
CC hfinkel@anl.gov, llvm-bugs@lists.llvm.org, spatel+llvm@rotateright.com
Fixed by commit(s)
Attachments llvm_memset_loop_idiom_vectorization_interference.cpp (2637 bytes, text/x-c++src)
llvm_memset_loop_idiom_vectorization_interference.ir (10844 bytes, text/plain)
llvm_memset_loop_idiom_vectorization_interference_irc_discussion.txt (3491 bytes, text/plain)
Blocks
Blocked by
See also
Created attachment 18382
Reduced Test Case

Compilation options, build environment, etc. are documented in the attached file
and here:

https://wandbox.org/permlink/o06VeIxCKC1qIhUh

Summary: We have a nested loop like this (where A is a double* __restrict__):

    for (ptrdiff_t j = 0; j != N; ++j)
        for (ptrdiff_t i = 0; i != N; ++i)
            A[i + j * N] = 0.0F;

Loop-idiom recognition determines that it can replace the inner loop with
memset, turning the code into:

    for (ptrdiff_t j = 0; j != N; ++j)
        std::memset(A + j * N, 0, sizeof(double) * N); // e.g. @llvm.memset

Later, the vectorizer sees this code and decides to bail out because it cannot
vectorize the inserted call to @llvm.memset.
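
To make the rewrite concrete, here is a minimal sketch that compares the two forms at the source level. The function names (`zero_nested`, `zero_rowwise_memset`, `forms_agree`) are made up for illustration, and the second function only imitates in C++ what loop-idiom recognition does in the IR. It also assumes that all-zero bytes represent 0.0, which holds for IEEE 754 doubles.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Original form: a plain nested loop storing 0.0 to every element.
void zero_nested(double* __restrict__ A, std::ptrdiff_t N)
{
    for (std::ptrdiff_t j = 0; j != N; ++j)
        for (std::ptrdiff_t i = 0; i != N; ++i)
            A[i + j * N] = 0.0;
}

// Source-level imitation of the code after loop-idiom recognition:
// the inner loop has become one memset call per row.
void zero_rowwise_memset(double* __restrict__ A, std::ptrdiff_t N)
{
    for (std::ptrdiff_t j = 0; j != N; ++j)
        std::memset(A + j * N, 0, sizeof(double) * static_cast<std::size_t>(N));
}

// Hypothetical helper: runs both forms on identical N x N inputs and
// reports whether the resulting buffers are byte-for-byte equal.
bool forms_agree(std::ptrdiff_t N)
{
    assert(N > 0);
    const std::size_t count = static_cast<std::size_t>(N) * static_cast<std::size_t>(N);
    std::vector<double> a(count, 1.5);
    std::vector<double> b(count, 1.5);
    zero_nested(a.data(), N);
    zero_rowwise_memset(b.data(), N);
    return std::memcmp(a.data(), b.data(), sizeof(double) * count) == 0;
}
```

Calling `forms_agree(N)` for any positive `N` should return true. The two forms differ only in how the stores are expressed, which is what makes the outer loop opaque to the vectorizer.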

I have so many questions here :)

0.) The diagnostic from the vectorizer's pass remarks is not very helpful:
'call instruction cannot be vectorized', BUT the source location it points to
isn't a call - it's the user's original code. Many users may not divine the fact
that loop-idiom replacement occurred and end up fruitlessly trying to figure out
why an assignment to a double (the source location pointed to) is a call that
cannot be vectorized. At the very least, the pass remark (emitted from here:
https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/Vectorize/LoopVectorize.cpp#L5422)
could give the name of the function in the call that could not be vectorized
(which I assume would be something like "memset" or "@llvm.memset" in this
case).

1.) Why is there not a vector version of @llvm.memset in addition to the scalar
version? Is this a problem with the underlying C library on my target (x86
Linux)?

2.) Why does the vectorizer give up when it encounters a scalar function call?
If the function is noexcept, it should be able to take something like this:

    // Assume A is a cache-line-aligned double* __restrict__
    // and N is divisible by some nice number, say 32.
    for (ptrdiff_t i = 0; i != N; ++i)
    {
        double tmp = scalar_noexcept_f(i);
        A[i] += B[i] * tmp;
    }

And turn it into something like this:

    // Assume A is a cache-line-aligned double* __restrict__
    // and N is divisible by some nice number, say 32.
    for (ptrdiff_t i = 0; i != N; i += 8)
    {
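        // Vectorize "around" the scalar call. _mm512_setr_pd takes its
        // arguments lowest lane first, so lane k holds scalar_noexcept_f(i + k).
        __m512d tmp = _mm512_setr_pd(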
            scalar_noexcept_f(i)
          , scalar_noexcept_f(i+1)
          , scalar_noexcept_f(i+2)
          , scalar_noexcept_f(i+3)
          , scalar_noexcept_f(i+4)
          , scalar_noexcept_f(i+5)
          , scalar_noexcept_f(i+6)
          , scalar_noexcept_f(i+7)
        );

        _mm512_store_pd(
            A + i
          , _mm512_fmadd_pd(
                // fmadd(a, b, c) computes a * b + c, i.e. B[i] * tmp + A[i].
                _mm512_load_pd(B + i)
              , tmp
              , _mm512_load_pd(A + i)
            )
        );
    }
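
The intrinsics above need AVX-512 hardware. The same idea can be sketched portably: strip-mine the loop by 8, make the scalar calls into a small buffer, and leave a call-free loop that a vectorizer could in principle handle. The body of `scalar_noexcept_f` and the names `update_scalar`, `update_blocked`, and `blocked_matches_scalar` are assumptions made up for this example, and whether a given compiler actually vectorizes the inner loop is not guaranteed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the opaque scalar function in the report.
double scalar_noexcept_f(std::ptrdiff_t i) noexcept
{
    return 0.5 * static_cast<double>(i) + 1.0;
}

// Reference: the original loop with the call inside the loop body.
void update_scalar(double* __restrict__ A, const double* __restrict__ B,
                   std::ptrdiff_t N)
{
    for (std::ptrdiff_t i = 0; i != N; ++i)
        A[i] += B[i] * scalar_noexcept_f(i);
}

// Strip-mined form: the scalar calls are split from the arithmetic.
void update_blocked(double* __restrict__ A, const double* __restrict__ B,
                    std::ptrdiff_t N)
{
    constexpr std::ptrdiff_t W = 8;   // block width, matching 8 doubles per __m512d
    assert(N % W == 0);               // same divisibility assumption as above
    for (std::ptrdiff_t i = 0; i != N; i += W)
    {
        double tmp[W];
        for (std::ptrdiff_t k = 0; k != W; ++k)   // scalar calls only
            tmp[k] = scalar_noexcept_f(i + k);
        for (std::ptrdiff_t k = 0; k != W; ++k)   // call-free arithmetic
            A[i + k] += B[i + k] * tmp[k];
    }
}

// Runs both versions on the same input and compares the results.
// The inputs are small values with exact double representations, so the
// comparison does not depend on whether the compiler fuses the multiply-add.
bool blocked_matches_scalar(std::ptrdiff_t N)
{
    const std::size_t n = static_cast<std::size_t>(N);
    std::vector<double> B(n);
    for (std::size_t i = 0; i != n; ++i)
        B[i] = static_cast<double>(i);
    std::vector<double> a_ref(n, 2.0);
    std::vector<double> a_blk(n, 2.0);
    update_scalar(a_ref.data(), B.data(), N);
    update_blocked(a_blk.data(), B.data(), N);
    return a_ref == a_blk;
}
```

For `N` a multiple of 8 (and small enough that the values stay exact), `blocked_matches_scalar(N)` should return true. The calls still execute one at a time, but in this sketch the loads, multiply-add, and stores no longer share a loop body with them.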

3.) Why isn't loop-idiom recognition "nested loop aware"? In this case, my
nested loops could be turned into a single memset.
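
A sketch of what a nested-loop-aware rewrite could produce for this case is given here. The names `zero_single_memset` and `single_memset_zeroes_all` are hypothetical. The collapse is valid because row j starts at A + j * N and holds exactly N elements, so the rows tile the range [A, A + N * N) with no gaps or overlap.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// One memset covering all N * N elements, instead of one call per row.
void zero_single_memset(double* A, std::ptrdiff_t N)
{
    assert(N >= 0);
    const std::size_t count = static_cast<std::size_t>(N) * static_cast<std::size_t>(N);
    if (count != 0)
        std::memset(A, 0, sizeof(double) * count);
}

// Hypothetical check: after the call, every element compares equal to 0.0.
bool single_memset_zeroes_all(std::ptrdiff_t N)
{
    assert(N > 0);
    const std::size_t count = static_cast<std::size_t>(N) * static_cast<std::size_t>(N);
    std::vector<double> a(count, 3.25);
    zero_single_memset(a.data(), N);
    for (double x : a)
        if (x != 0.0)
            return false;
    return true;
}
```

With a single call there is no outer loop left to vectorize, so the question of the vectorizer bailing out on @llvm.memset would not arise for this particular pattern.
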
Quuxplusone commented 7 years ago

Attached llvm_memset_loop_idiom_vectorization_interference.cpp (2637 bytes, text/x-c++src): Reduced Test Case

Quuxplusone commented 7 years ago

Attached llvm_memset_loop_idiom_vectorization_interference.ir (10844 bytes, text/plain): Reduced Test Case - Generated IR

Quuxplusone commented 7 years ago

Attached llvm_memset_loop_idiom_vectorization_interference_irc_discussion.txt (3491 bytes, text/plain): IRC discussion of this issue