Open Quuxplusone opened 8 years ago
Thanks to Miguel for reporting.
_Bug 32532 has been marked as a duplicate of this bug._
Hello Michael,
Would've Polly synthesized a parallel loop if it had known that the "b" isn't reassigned within its scope, and therefore is a constant ? Is there a problem with Polly not being able to infer this ?
The issue seems to stem from that b is defined within the SCoP. Also see the duplicate bug bugs.llvm.org/PR32532 where the lower bound is also defined within the SCoP, but should have been load-hoisted.
The culprit in both cases is that the lower bound is not a synthesizable value (see function canSynthesize). Load-hoisted values are currently not seen as synthesizable.
So yes, it would have been parallel if the lower bound is defined outside of the SCoP (which would make it synthesizable).
Interestingly, there seems to be no issue with upper bounds, only with lower bounds.
Had the SCoP associated with the inner for-loop not removed during SCoP expansion, would the SCEV for "b" considered synthesizeable when the loop is processed from within this smaller SCoP ?
Had the SCoP that contained just the inner for-loop not been removed during SCoP expansion, would the SCEV for the lowerbound be synthesizable if the loop were processed from within this smaller SCoP ?
Thie SCoP containing the inner for-loop is contained within a larger SCoP (which also contains the outer for-loop). It is not removed, just not a maximal SCoP.
You can try yourself what happens without the outer for-loop: Remove the loop and assign some or parameter to j. Because it is constant (or a parameter), it is considered synthesizable in this case.
(In reply to Michael Kruse from comment #7)
> Thie SCoP containing the inner for-loop is contained within a larger SCoP
> (which also contains the outer for-loop). It is not removed, just not a
> maximal SCoP.
Then in that case, when the inner for-loop is processed when it's a part of the
SCoP containing just the inner for-loop wouldn't the lower bound's SCEV,
including the 'b' variable which isn't defined in the inner for-loop and hence
a constant, be considered a synthesizable ?
> You can try yourself what happens without the outer for-loop: Remove the
> loop and assign some or parameter to j. Because it is constant (or a
> parameter), it is considered synthesizable in this case.
There doesn't seem to be a big difference. The following are the commands and
their outputs that lead me to this conclusion.
$ cat no_par.c
void func(int lastrow, int firstrow, int firstcol, int rowstr[], int colidx[])
{
for (int j = 1; j <= lastrow - firstrow + 1; j++) {
int e = rowstr[j + 1];
int b = rowstr[j];
for (int k = b; k < e; k++)
{
colidx[k] = colidx[k] - firstcol + 1;
}
}
}
$ cat no_par_onlyInner.c
void func(int lastrow, int firstrow, int firstcol, int rowstr[], int colidx[])
{
//for (int j = 1; j <= lastrow - firstrow + 1; j++) {
int j=1;
int e = rowstr[j + 1];
int b = rowstr[j];
for (int k = b; k < e; k++)
{
colidx[k] = colidx[k] - firstcol + 1;
}
//}
}
$ clang -S -emit-llvm no_par.c
$ clang -S -emit-llvm no_par_onlyInner.c
$ opt-polly -O3 -polly-parallel -polly-process-unprofitable -polly-ast -analyze
no_par.ll
Printing analysis 'Polly - Generate an AST from the SCoP (isl)' for region:
'.lr.ph => .loopexit.loopexit' in function 'func':
Printing analysis 'Polly - Generate an AST from the SCoP (isl)' for region:
'.lr.ph => .loopexit' in function 'func':
:: isl ast :: func :: %.lr.ph---%.loopexit
if (1)
#pragma simd
#pragma known-parallel
for (int c0 = 0; c0 < -indvars_iv_ph + p_1; c0 += 1)
Stmt__lr_ph(c0);
else
{ /* original code */ }
Printing analysis 'Polly - Generate an AST from the SCoP (isl)' for region:
'middle.block.unr-lcssa => middle.block' in function 'func':
Printing analysis 'Polly - Generate an AST from the SCoP (isl)' for region:
'vector.body => middle.block.unr-lcssa.loopexit' in function 'func':
Printing analysis 'Polly - Generate an AST from the SCoP (isl)' for region:
'vector.ph => middle.block.unr-lcssa' in function 'func':
:: isl ast :: func :: %vector.ph---%middle.block.unr-lcssa
if (1 && 0 == p_1 + 7 >= p_0)
{
Stmt_vector_ph();
#pragma minimal dependence distance: 1
for (int c0 = 0; c0 < floord(p_0 - p_1, 16); c0 += 1)
Stmt_vector_body(c0);
if (p_0 >= p_1 + 16)
Stmt_middle_block_unr-lcssa_loopexit();
}
else
{ /* original code */ }
Printing analysis 'Polly - Generate an AST from the SCoP (isl)' for region:
'.lr.ph.preheader => .loopexit' in function 'func':
Printing analysis 'Polly - Generate an AST from the SCoP (isl)' for region:
'%12 => .loopexit' in function 'func':
Printing analysis 'Polly - Generate an AST from the SCoP (isl)' for region:
'%12 => ._crit_edge.loopexit' in function 'func':
Printing analysis 'Polly - Generate an AST from the SCoP (isl)' for region: '%5
=> ._crit_edge' in function 'func':
Printing analysis 'Polly - Generate an AST from the SCoP (isl)' for region: '%5
=> <Function Return>' in function 'func':
$ opt-polly -O3 -polly-parallel -polly-process-unprofitable -polly-ast -analyze
no_par_onlyInner.preopt.ll
Printing analysis 'Polly - Generate an AST from the SCoP (isl)' for region:
'scalar.ph => ._crit_edge.loopexit' in function 'func':
Printing analysis 'Polly - Generate an AST from the SCoP (isl)' for region:
'scalar.ph => ._crit_edge' in function 'func':
:: isl ast :: func :: %scalar.ph---%._crit_edge
if (1)
#pragma simd
#pragma known-parallel
for (int c0 = 0; c0 < -indvars_iv_ph + p_1; c0 += 1)
Stmt_scalar_ph(c0);
else
{ /* original code */ }
Printing analysis 'Polly - Generate an AST from the SCoP (isl)' for region:
'middle.block.unr-lcssa => middle.block' in function 'func':
Printing analysis 'Polly - Generate an AST from the SCoP (isl)' for region:
'vector.body => middle.block.unr-lcssa.loopexit' in function 'func':
Printing analysis 'Polly - Generate an AST from the SCoP (isl)' for region:
'vector.ph => middle.block.unr-lcssa' in function 'func':
:: isl ast :: func :: %vector.ph---%middle.block.unr-lcssa
if (1 && 0 == p_1 + 7 >= p_0)
{
Stmt_vector_ph();
#pragma minimal dependence distance: 1
for (int c0 = 0; c0 < floord(p_0 - p_1, 16); c0 += 1)
Stmt_vector_body(c0);
if (p_0 >= p_1 + 16)
Stmt_middle_block_unr-lcssa_loopexit();
}
else
{ /* original code */ }
Printing analysis 'Polly - Generate an AST from the SCoP (isl)' for region:
'.lr.ph => ._crit_edge' in function 'func':
Printing analysis 'Polly - Generate an AST from the SCoP (isl)' for region: '%5
=> ._crit_edge' in function 'func':
Printing analysis 'Polly - Generate an AST from the SCoP (isl)' for region: '%5
=> <Function Return>' in function 'func':
Then I might have been wrong.
When b
is not a parameter then, what is it?
Attached no_par.preopt.ll
(3624 bytes, text/plain): pre-optimization file generated by clang -emit-llvm -S no_par.c
Attached yes_par.c
(270 bytes, text/x-csrc): pre-optimization file generated by clang -emit-llvm -S yes_par.c
Attached patch
(1651 bytes, text/plain): canSynthesize.polly.patch
Attached no_par.postopt.ll.ll
(18395 bytes, text/plain): no_par.postoptpolly.ll : Polly optimized no_par.preopt.ll
Attached yes_par.postopt.ll.ll
(5286 bytes, text/plain): yes_par.postoptpolly.ll : Polly optimized yes_par.preopt.ll
Attached yes_par_canSynth
(4773 bytes, text/plain): Output of opt -O3 -polly -polly-parallel -polly-process-unprofitable -polly-ast -analyze yes_par.preopt.ll
with modif to canSynthesize
Attached no_par_canSynth
(7099 bytes, text/plain): Output of opt -O3 -polly -polly-parallel -polly-process-unprofitable -polly-ast -analyze no_par.preopt.ll
with modif to canSynthesize
Attached no_par.O3.ll
(7783 bytes, text/plain): -O3 optimized no_par.preopt.ll
Attached yes_par.O3.ll
(5275 bytes, text/LLVM-IR): -O3 optimized yes_par.preopt.ll
Attached yes_par.c
(270 bytes, text/x-csrc): File with parallel loops detected by Polly
Attached no_par.c
(265 bytes, text/x-csrc): File with parallel loops not detected by Polly
Even if PollyPosition is POSITION_EARLY by default, Polly works on the IR
generated by passes in ``opt -O3``. This can be confirmed by ``diff
yes_par.O3.ll yes_par.postopt.ll``. This also confirms that Polly isn't
modifying the code even when it detects parallelism, in the case of yes_par.c.
(In reply to Michael Kruse from comment #0)
> This inner loops is not detected as parallel (-mllvm
> -polly-process-unprofitable -mllvm -polly-parallel)
Given that Polly works on the code generated by -O3 passes, it's not just the
code that determines if Polly is able to detect parallelism.
I'd like to highlight few oddities I had observed,
1. no_par.preopt.ll was actually (minimally) modified by Polly yet
yes_par.preopt.ll isn't touched even when Polly detects parallelism
2. no_par.O3.ll had its inner loop unrolled (and vectorized) to steps of 16 (4
instructions working on 4 element vectors) whereas yes_par.O3.ll was only
unrolled to steps of 8 ( 2 insts working on 4 element vectors)
3. Polly identifies a SCoP between .lr.ph and .loopexit and declares it as simd
and know-parallel in both no_par and yes_par.
Printing analysis 'Polly - Generate an AST from the SCoP (isl)' for region:
'.lr.ph => .loopexit' in function 'func':
:: isl ast :: func :: %.lr.ph---%.loopexit
if (1 && 0 == ((p_2 >= p_1 + 2147483648 && p_2 >= indvars_iv_ph + p_1 + 1) || (p_2 >= indvars_iv_ph + p_1 + 1 && p_1 >= p_2 + 1) || (p_1 == -2147483648 && p_2 + 2147483647 >= indvars_iv_ph)))
#pragma simd
#pragma known-parallel
for (int c0 = 0; c0 < -indvars_iv_ph - p_1 + p_2; c0 += 1)
Stmt__lr_ph(c0);
else
{ /* original code */ }
.loopexit happens to be at line 25 and .lr.ph at 138 (in no_par.O3.ll) or lines close by. There's loop at .lr.ph which indirectly jumps to .loopexit when the exit condition evaluates to true. Does this mean that SCoP needn't be continuous in text ?
yes_par_canSynth
(4773 bytes, text/plain)no_par_canSynth
(7099 bytes, text/plain)no_par.preopt.ll
(3624 bytes, text/plain)yes_par.c
(270 bytes, text/x-csrc)patch
(1651 bytes, text/plain)no_par.postopt.ll.ll
(18395 bytes, text/plain)yes_par.postopt.ll.ll
(5286 bytes, text/plain)no_par.O3.ll
(7783 bytes, text/plain)yes_par.O3.ll
(5275 bytes, text/LLVM-IR)yes_par.c
(270 bytes, text/x-csrc)no_par.c
(265 bytes, text/x-csrc)