Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

Scalar dependence on induction variable when lower bound is not synthesizable #28866

Open Quuxplusone opened 8 years ago

Quuxplusone commented 8 years ago
Bugzilla Link PR28871
Status NEW
Importance P normal
Reported by Michael Kruse (llvm@meinersbur.de)
Reported on 2016-08-05 10:35:35 -0700
Last modified on 2017-07-16 03:17:53 -0700
Version unspecified
Hardware PC Windows NT
CC fitzpatrick@silexica.com, llvm-bugs@lists.llvm.org, singapuram.sanjay@gmail.com, tobias@grosser.es
Fixed by commit(s)
Attachments yes_par_canSynth (4773 bytes, text/plain)
no_par_canSynth (7099 bytes, text/plain)
no_par.preopt.ll (3624 bytes, text/plain)
yes_par.c (270 bytes, text/x-csrc)
patch (1651 bytes, text/plain)
no_par.postopt.ll.ll (18395 bytes, text/plain)
yes_par.postopt.ll.ll (5286 bytes, text/plain)
no_par.O3.ll (7783 bytes, text/plain)
yes_par.O3.ll (5275 bytes, text/LLVM-IR)
yes_par.c (270 bytes, text/x-csrc)
no_par.c (265 bytes, text/x-csrc)
Blocks
Blocked by
See also
This inner loops is not detected as parallel (-mllvm -polly-process-
unprofitable -mllvm -polly-parallel):

for (int j = 1; j <= lastrow - firstrow + 1; j++) {
    int e = rowstr[j + 1];
    int b = rowstr[j];
     for (int k = b; k < e; k++) {
         colidx[k] = colidx[k] - firstcol + 1;
     }
}

AST output:
      Stmt_for_body();
      if (p_1_loaded_from_rowstr >= p_0_loaded_from_rowstr + 1)
        Stmt_for_body7_lr_ph();
      #pragma minimal dependence distance: 1
      for (int c0 = 0; c0 < -p_0_loaded_from_rowstr + p_1_loaded_from_rowstr; c0 += 1)
        Stmt_for_body7(c0);

Whereas this one is:

for (int j = 1; j <= lastrow - firstrow + 1; j++) {
    int e = rowstr[j + 1];
    int b = rowstr[j];
     for (int k = 0; k < e-b; k++)
     {
         colidx[k+b] = colidx[k+b] - firstcol + 1;
     }
}

AST output:
      Stmt_for_body();
      if (p_1_loaded_from_rowstr >= p_0_loaded_from_rowstr + 1)
        Stmt_for_body8_lr_ph();
      #pragma simd
      #pragma known-parallel
      for (int c0 = 0; c0 < -p_0_loaded_from_rowstr + p_1_loaded_from_rowstr; c0 += 1)
        Stmt_for_body8(c0);

This is cause by the lower bound ('b') appearing in the lower bound of the SCEV
for the induction variable. It is not considered synthesizable albeit load-
hoisted. Thus causing scalar dependences created for its PHINode.
Quuxplusone commented 8 years ago

Thanks to Miguel for reporting.

Quuxplusone commented 7 years ago

_Bug 32532 has been marked as a duplicate of this bug._

Quuxplusone commented 7 years ago

Hello Michael,

Would've Polly synthesized a parallel loop if it had known that the "b" isn't reassigned within its scope, and therefore is a constant ? Is there a problem with Polly not being able to infer this ?

Quuxplusone commented 7 years ago

The issue seems to stem from that b is defined within the SCoP. Also see the duplicate bug bugs.llvm.org/PR32532 where the lower bound is also defined within the SCoP, but should have been load-hoisted.

The culprit in both cases is that the lower bound is not a synthesizable value (see function canSynthesize). Load-hoisted values are currently not seen as synthesizable.

So yes, it would have been parallel if the lower bound is defined outside of the SCoP (which would make it synthesizable).

Interestingly, there seems to be no issue with upper bounds, only with lower bounds.

Quuxplusone commented 7 years ago

Had the SCoP associated with the inner for-loop not removed during SCoP expansion, would the SCEV for "b" considered synthesizeable when the loop is processed from within this smaller SCoP ?

Quuxplusone commented 7 years ago

Had the SCoP that contained just the inner for-loop not been removed during SCoP expansion, would the SCEV for the lowerbound be synthesizable if the loop were processed from within this smaller SCoP ?

Quuxplusone commented 7 years ago

Thie SCoP containing the inner for-loop is contained within a larger SCoP (which also contains the outer for-loop). It is not removed, just not a maximal SCoP.

You can try yourself what happens without the outer for-loop: Remove the loop and assign some or parameter to j. Because it is constant (or a parameter), it is considered synthesizable in this case.

Quuxplusone commented 7 years ago
(In reply to Michael Kruse from comment #7)
> Thie SCoP containing the inner for-loop is contained within a larger SCoP
> (which also contains the outer for-loop). It is not removed, just not a
> maximal SCoP.
Then in that case, when the inner for-loop is processed when it's a part of the
SCoP containing just the inner for-loop wouldn't the lower bound's SCEV,
including the 'b' variable which isn't defined in the inner for-loop and hence
a constant, be considered a synthesizable ?
> You can try yourself what happens without the outer for-loop: Remove the
> loop and assign some or parameter to j. Because it is constant (or a
> parameter), it is considered synthesizable in this case.

There doesn't seem to be a big difference. The following are the commands and
their outputs that lead me to this conclusion.
$ cat no_par.c
void func(int lastrow, int firstrow, int firstcol, int rowstr[], int colidx[])
{
    for (int j = 1; j <= lastrow - firstrow + 1; j++) {
        int e = rowstr[j + 1];
        int b = rowstr[j];
        for (int k = b; k < e; k++)
        {
            colidx[k] = colidx[k] - firstcol + 1;
        }
    }
}
$ cat no_par_onlyInner.c
void func(int lastrow, int firstrow, int firstcol, int rowstr[], int colidx[])
{
    //for (int j = 1; j <= lastrow - firstrow + 1; j++) {
    int j=1;
        int e = rowstr[j + 1];
        int b = rowstr[j];
        for (int k = b; k < e; k++)
        {
            colidx[k] = colidx[k] - firstcol + 1;
        }
    //}
}
$ clang -S -emit-llvm no_par.c
$ clang -S -emit-llvm no_par_onlyInner.c
$ opt-polly -O3 -polly-parallel -polly-process-unprofitable -polly-ast -analyze
no_par.ll
Printing analysis 'Polly - Generate an AST from the SCoP (isl)' for region:
'.lr.ph => .loopexit.loopexit' in function 'func':
Printing analysis 'Polly - Generate an AST from the SCoP (isl)' for region:
'.lr.ph => .loopexit' in function 'func':
:: isl ast :: func :: %.lr.ph---%.loopexit

if (1)

    #pragma simd
    #pragma known-parallel
    for (int c0 = 0; c0 < -indvars_iv_ph + p_1; c0 += 1)
      Stmt__lr_ph(c0);

else
    {  /* original code */ }

Printing analysis 'Polly - Generate an AST from the SCoP (isl)' for region:
'middle.block.unr-lcssa => middle.block' in function 'func':
Printing analysis 'Polly - Generate an AST from the SCoP (isl)' for region:
'vector.body => middle.block.unr-lcssa.loopexit' in function 'func':
Printing analysis 'Polly - Generate an AST from the SCoP (isl)' for region:
'vector.ph => middle.block.unr-lcssa' in function 'func':
:: isl ast :: func :: %vector.ph---%middle.block.unr-lcssa

if (1 && 0 == p_1 + 7 >= p_0)

    {
      Stmt_vector_ph();
      #pragma minimal dependence distance: 1
      for (int c0 = 0; c0 < floord(p_0 - p_1, 16); c0 += 1)
        Stmt_vector_body(c0);
      if (p_0 >= p_1 + 16)
        Stmt_middle_block_unr-lcssa_loopexit();
    }

else
    {  /* original code */ }

Printing analysis 'Polly - Generate an AST from the SCoP (isl)' for region:
'.lr.ph.preheader => .loopexit' in function 'func':
Printing analysis 'Polly - Generate an AST from the SCoP (isl)' for region:
'%12 => .loopexit' in function 'func':
Printing analysis 'Polly - Generate an AST from the SCoP (isl)' for region:
'%12 => ._crit_edge.loopexit' in function 'func':
Printing analysis 'Polly - Generate an AST from the SCoP (isl)' for region: '%5
=> ._crit_edge' in function 'func':
Printing analysis 'Polly - Generate an AST from the SCoP (isl)' for region: '%5
=> <Function Return>' in function 'func':
$ opt-polly -O3 -polly-parallel -polly-process-unprofitable -polly-ast -analyze
no_par_onlyInner.preopt.ll
Printing analysis 'Polly - Generate an AST from the SCoP (isl)' for region:
'scalar.ph => ._crit_edge.loopexit' in function 'func':
Printing analysis 'Polly - Generate an AST from the SCoP (isl)' for region:
'scalar.ph => ._crit_edge' in function 'func':
:: isl ast :: func :: %scalar.ph---%._crit_edge

if (1)

    #pragma simd
    #pragma known-parallel
    for (int c0 = 0; c0 < -indvars_iv_ph + p_1; c0 += 1)
      Stmt_scalar_ph(c0);

else
    {  /* original code */ }

Printing analysis 'Polly - Generate an AST from the SCoP (isl)' for region:
'middle.block.unr-lcssa => middle.block' in function 'func':
Printing analysis 'Polly - Generate an AST from the SCoP (isl)' for region:
'vector.body => middle.block.unr-lcssa.loopexit' in function 'func':
Printing analysis 'Polly - Generate an AST from the SCoP (isl)' for region:
'vector.ph => middle.block.unr-lcssa' in function 'func':
:: isl ast :: func :: %vector.ph---%middle.block.unr-lcssa

if (1 && 0 == p_1 + 7 >= p_0)

    {
      Stmt_vector_ph();
      #pragma minimal dependence distance: 1
      for (int c0 = 0; c0 < floord(p_0 - p_1, 16); c0 += 1)
        Stmt_vector_body(c0);
      if (p_0 >= p_1 + 16)
        Stmt_middle_block_unr-lcssa_loopexit();
    }

else
    {  /* original code */ }

Printing analysis 'Polly - Generate an AST from the SCoP (isl)' for region:
'.lr.ph => ._crit_edge' in function 'func':
Printing analysis 'Polly - Generate an AST from the SCoP (isl)' for region: '%5
=> ._crit_edge' in function 'func':
Printing analysis 'Polly - Generate an AST from the SCoP (isl)' for region: '%5
=> <Function Return>' in function 'func':
Quuxplusone commented 7 years ago

Then I might have been wrong.

When b is not a parameter then, what is it?

Quuxplusone commented 7 years ago

Attached no_par.preopt.ll (3624 bytes, text/plain): pre-optimization file generated by clang -emit-llvm -S no_par.c

Quuxplusone commented 7 years ago

Attached yes_par.c (270 bytes, text/x-csrc): pre-optimization file generated by clang -emit-llvm -S yes_par.c

Quuxplusone commented 7 years ago

Attached patch (1651 bytes, text/plain): canSynthesize.polly.patch

Quuxplusone commented 7 years ago

Attached no_par.postopt.ll.ll (18395 bytes, text/plain): no_par.postoptpolly.ll : Polly optimized no_par.preopt.ll

Quuxplusone commented 7 years ago

Attached yes_par.postopt.ll.ll (5286 bytes, text/plain): yes_par.postoptpolly.ll : Polly optimized yes_par.preopt.ll

Quuxplusone commented 7 years ago

Attached yes_par_canSynth (4773 bytes, text/plain): Output of opt -O3 -polly -polly-parallel -polly-process-unprofitable -polly-ast -analyze yes_par.preopt.ll with modif to canSynthesize

Quuxplusone commented 7 years ago

Attached no_par_canSynth (7099 bytes, text/plain): Output of opt -O3 -polly -polly-parallel -polly-process-unprofitable -polly-ast -analyze no_par.preopt.ll with modif to canSynthesize

Quuxplusone commented 7 years ago

Attached no_par.O3.ll (7783 bytes, text/plain): -O3 optimized no_par.preopt.ll

Quuxplusone commented 7 years ago

Attached yes_par.O3.ll (5275 bytes, text/LLVM-IR): -O3 optimized yes_par.preopt.ll

Quuxplusone commented 7 years ago

Attached yes_par.c (270 bytes, text/x-csrc): File with parallel loops detected by Polly

Quuxplusone commented 7 years ago

Attached no_par.c (265 bytes, text/x-csrc): File with parallel loops not detected by Polly

Quuxplusone commented 7 years ago
Even if PollyPosition is POSITION_EARLY by default, Polly works on the IR
generated by passes in ``opt -O3``. This can be confirmed by ``diff
yes_par.O3.ll yes_par.postopt.ll``. This also confirms that Polly isn't
modifying the code even when it detects parallelism, in the case of yes_par.c.

(In reply to Michael Kruse from comment #0)
> This inner loops is not detected as parallel (-mllvm
> -polly-process-unprofitable -mllvm -polly-parallel)

Given that Polly works on the code generated by -O3 passes, it's not just the
code that determines if Polly is able to detect parallelism.

I'd like to highlight few oddities I had observed,
1. no_par.preopt.ll was actually (minimally) modified by Polly yet
yes_par.preopt.ll isn't touched even when Polly detects parallelism
2. no_par.O3.ll had its inner loop unrolled (and vectorized) to steps of 16 (4
instructions working on 4 element vectors) whereas yes_par.O3.ll was only
unrolled to steps of 8 ( 2 insts working on 4 element vectors)
3. Polly identifies a SCoP between .lr.ph and .loopexit and declares it as simd
and know-parallel in both no_par and yes_par.
Printing analysis 'Polly - Generate an AST from the SCoP (isl)' for region:
'.lr.ph => .loopexit' in function 'func':
:: isl ast :: func :: %.lr.ph---%.loopexit

 if (1 && 0 == ((p_2 >= p_1 + 2147483648 && p_2 >= indvars_iv_ph + p_1 + 1) || (p_2 >= indvars_iv_ph + p_1 + 1 && p_1 >= p_2 + 1) || (p_1 == -2147483648 && p_2 + 2147483647 >= indvars_iv_ph)))

     #pragma simd
     #pragma known-parallel
     for (int c0 = 0; c0 < -indvars_iv_ph - p_1 + p_2; c0 += 1)
       Stmt__lr_ph(c0);

 else
     {  /* original code */ }
 .loopexit happens to be at line 25 and .lr.ph at 138 (in no_par.O3.ll) or lines close by. There's loop at .lr.ph which indirectly jumps to .loopexit when the exit condition evaluates to true. Does this mean that SCoP needn't be continuous in text ?