Quuxplusone / LLVMBugzillaTest


Long compile time (infinite) due to isl coalescing #25715

Open Quuxplusone opened 8 years ago

Quuxplusone commented 8 years ago
Bugzilla Link PR25716
Status NEW
Importance P normal
Reported by Chris Jenneisch (chrisj@codeaurora.org)
Reported on 2015-12-02 12:35:07 -0800
Last modified on 2016-03-02 10:06:02 -0800
Version unspecified
Hardware PC Windows NT
CC jdoerfert@anl.gov, johannes@jdoerfert.de, llvm-bugs@lists.llvm.org, llvm@meinersbur.de, tobias@grosser.es
Fixed by commit(s)
Attachments compile_time.ll (10468 bytes, application/octet-stream)
compile_time.ll (9158 bytes, application/octet-stream)
Blocks
Blocked by
See also
Created attachment 15381
reduced test

Command:

opt -polly-process-unprofitable -polly-code-generator=isl -polly-scops -analyze compile_time.ll

Found that the quota is being hit in Scop::buildDomainsWithBranchConstraints, in the coalescing of SuccDomain.

..
SuccDomain = isl_set_coalesce(SuccDomain);
..
Quuxplusone commented 8 years ago

Attached compile_time.ll (10468 bytes, application/octet-stream): reduced test

Quuxplusone commented 8 years ago

Maybe we could add a wrapper for the coalescing operations with a guard on the quota.
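A minimal sketch of such a guard, written as Python pseudocode rather than against the isl C API (the QuotaGuard class, budget, and cost parameters are all hypothetical names, not Polly or isl identifiers): once an operation budget is exhausted, the expensive simplification is skipped and the unsimplified input is kept.

```python
# Hypothetical sketch (not isl/Polly API): skip an expensive simplification
# once an operation budget is exhausted, returning a fallback value
# instead of letting compile time explode.
class QuotaGuard:
    def __init__(self, budget):
        self.budget = budget

    def run(self, op, cost, fallback):
        """Run op() if enough budget remains; otherwise return fallback."""
        if cost > self.budget:
            return fallback          # over quota: keep the unsimplified set
        self.budget -= cost
        return op()

guard = QuotaGuard(budget=100)
print(guard.run(lambda: sum(range(10)), cost=10, fallback=None))   # 45
print(guard.run(lambda: sum(range(10)), cost=200, fallback="raw")) # raw
```

The key design point is that the guard degrades gracefully: going over quota loses only simplification quality, never correctness.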

Quuxplusone commented 8 years ago

This is not a problem of isl_set_coalesce itself (other than that its runtime is quadratic in the number of basic sets), but of the number of basic sets. While debugging it I encountered up to 3584.

buildDomainsWithBranchConstraints calls isl_set_intersect and isl_set_coalesce in a loop. isl_set_intersect may increase the number of basic sets while isl_set_coalesce tries to reduce it. In this example, isl_set_intersect occasionally doubles the number of basic sets in a way that isl_set_coalesce cannot undo. Although isl_set_intersect and isl_set_coalesce are both O(n^2) in the number of basic sets, coalescing is more noticeable because it does more work per pair of basic sets.

The subregion causing trouble (sw.default => for.inc32) has only 10 basic blocks.

I already encountered this pattern of repeated intersections with exploding number of pieces when developing Molly. IMHO we need some algorithmic improvement here.
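To make the blowup concrete, here is a small, hypothetical Python illustration (the names are mine, not Polly's). With a symbolic parameter A[k], each constraint i != A[k] is the union of two half-spaces, so a chain of k such constraints expands into up to 2^k basic sets before coalescing:

```python
from itertools import product

# The domain of the deepest block in the condition chain is
#   i != A[0] and i != A[1] and ... and i != A[7].
# With a symbolic parameter A[k], each inequality splits into
# (i < A[k]) or (i > A[k]); the conjunction of 8 such 2-way unions
# expands into one basic set per choice of sign.
k = 8
disjuncts = [
    " and ".join(f"i {sign} A[{j}]" for j, sign in enumerate(signs))
    for signs in product("<>", repeat=k)
]
print(len(disjuncts))  # 2**8 = 256 basic sets before coalescing
```

Since the A[k] are unordered symbolic values, none of these sign choices can be ruled out a priori, which is why coalescing cannot collapse them.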

Quuxplusone commented 8 years ago

Attached compile_time.ll (9158 bytes, application/octet-stream): Further reduced test case

Quuxplusone commented 8 years ago

Without looking at the test case, we could use dominance and post-dominance to handle quite a few domain-building cases more cheaply and only fall back to propagation if the CFG is not structured nicely.

Quuxplusone commented 8 years ago
(In reply to comment #4)
> Without looking at the test case, we could use dominance and post-dominance
> to handle quite a few domain-building cases more cheaply and only fall back
> to propagation if the CFG is not structured nicely.

Right. However, to use post-dominance we may need to verify that the post-dominance tree is properly maintained.
Quuxplusone commented 8 years ago
C code that would compile to such IR (it might well be the result of unrolling):

for (int i=0; i<255; i+=1) {
  int x;
  if (i==A[0])
    x = 0;           /* sw.default */
  else if (i==A[1])
    x = 1;           /* for.cond18 */
  else if (i==A[2])
    x = 2;           /* for.cond18.1 */
  else if (i==A[3])
    x = 3;           /* for.cond18.2 */
  else if (i==A[4])
    x = 4;           /* for.cond18.3 */
  else if (i==A[5])
    x = 5;           /* for.cond18.4 */
  else if (i==A[6])
    x = 6;           /* for.cond18.5 */
  else if (i==A[7])
    x = 7;           /* for.cond18.6 */
  else if (i==A[8])
    x = 8;           /* for.cond18.7 */
  else
    continue;
  B[i] = AA[x];      /* if.then */
}

To share some ideas: I found three properties that we could exploit:

A) The BBs with complicated domains are dominated by BBs with less complicated
domains.

B) All 'for.cond18' BBs write the same data.

C) All conditions are affine.

Idea 1)
Using fact B) we can include the domains of the predecessors and execute the
predecessor afterwards. If the predecessor would follow the true branch,
execute it to overwrite the "wrong" result of its successor, i.e. invert the
execution chain. Each statement has only its own icmp in its domain. The domain
of 'if.then' is sufficiently simple as it's on the 'eq' side of the conditions.
Unfortunately assumption B) is specific for this example and we'd still have
the same problem for other cases.

Idea 2)
A complexity limit

Idea 2a)
Once the domain gets too complicated (e.g. the number of basic sets after
coalescing), fall back to non-affine subregions with all ancestors and all
other blocks required to build such a region.
Unfortunately coalescing takes place in ScopInfo while ScopDetection decides
over non-affine subregions, modifying regions post-ScopDetection might be
complicated. It also does not really handle this case.

Idea 2b)
Teach ScopDetection a maximum number of affine conditions chained-up.

Idea 3)
Introduce a predicate that causes the statement to execute only if its dominator
statement has executed (exploiting property A), making it unnecessary to
include the dominator's execution condition in the domain. The difficult part
is how to tell ISL about this relation.

Idea 3a)
Add new API to ISL to handle such constraints.

Idea 3b)
When the dominator executes, set some flag. Statements can use this flag as a
predicate in their domains and only execute when the flag is set. This adds
another scalar dependence between the statements. In this example, there is
already an output (WAW) dependence between them, but in general this would stop
instances from running in parallel.
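Idea 3b) can be illustrated with a small, hypothetical Python simulation (the values of A, AA, and B are arbitrary example data, not taken from the test case): each level only tests a flag set by its dominator instead of carrying the full conjunction of all earlier conditions in its domain.

```python
# Hypothetical simulation of Idea 3b: 'flag' is the scalar the dominator
# sets; each for.cond18 level guards on it instead of re-deriving the
# whole chain of earlier conditions.
A  = [3, 7, 1, 9, 2, 0, 5, 4, 6]   # example data
AA = list(range(9))
B  = [0] * 255

for i in range(255):
    x = 0                          # sw.default always executes
    flag = True
    for k in range(9):
        flag = flag and (i != A[k])
        if flag and k < 8:
            x = k + 1              # for.cond18.k runs only if its flag is set
    if not flag:                   # some i == A[k] held: if.then executes
        B[i] = AA[x]
```

This reproduces the behavior of the original if/else chain while each statement's guard is a single scalar test, at the price of the extra scalar dependence described above.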

Idea 3c)
Because of property C) we can compute the predicates in advance as soon as all
indvars they depend on are available, i.e. most of the time in the header of the
surrounding loop. We still have scalar dependences that make parallelization
impossible, but the BBs could be moved freely within the loop when no other
dependences stop them from doing so.

Idea 3d)
Use arrays of predicates instead, indexed by the indvars of the surrounding
loops; there is no guarantee that those could be optimized away afterwards.

Unfortunately property A) also does not hold universally for exponential blowup
in domains. E.g. the control-flow graph:

        BB   BB
        | \ / |
        BB   BB
        | \ / |
        BB   BB
        | \ / |
          ...

doesn't really have dominators and the chain can be arbitrarily long with
exponential blowup, making DoS-"attacks" possible.
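A hypothetical back-of-the-envelope count for the ladder above (the function name is mine): every block of a layer has both blocks of the previous layer as predecessors, so the number of control-flow paths, and with it the potential number of domain disjuncts, doubles per layer.

```python
# Count paths reaching the bottom of the ladder CFG sketched above:
# both blocks of each layer merge both blocks of the previous layer,
# so the path counts double with every layer.
def ladder_paths(depth):
    left = right = 1                       # one path into each top block
    for _ in range(depth):
        left, right = left + right, left + right
    return left + right

print(ladder_paths(10))  # 2048 = 2**(10+1): exponential in the chain length
```

This is why a purely structural criterion like dominance is not enough and some complexity limit or approximation is still needed.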

Idea 4)
Overapproximate the domain (e.g. with the convex hull). Code generation
re-inserts the exact condition chain before executing the BB. This might also
be an option for some kinds of non-affine conditions.
Quuxplusone commented 8 years ago
(In reply to comment #6)
> C-code that would compile to such IR (Might well be the result of unrolling):
>
> for (int i=0; i<255; i+=1) {
>   int x;

There seems to be at least one unconditional store to x.

>        if (i==A[0])
>     x = 0;           /* sw.default */
>   else if (i==A[1])
>     x = 1;           /* for.cond18 */
>   else if (i==A[2])
>     x = 2;           /* for.cond18.1 */
>   else if (i==A[3])
>     x = 3;           /* for.cond18.2 */
>   else if (i==A[4])
>     x = 4;           /* for.cond18.3 */
>   else if (i==A[5])
>     x = 5;           /* for.cond18.4 */
>   else if (i==A[6])
>     x = 6;           /* for.cond18.5 */
>   else if (i==A[7])
>     x = 7;           /* for.cond18.6 */
>   else if (i==A[8])
>     x = 8;           /* for.cond18.7 */

I do not think this is an if/else chain. If at every condition the false branch
is taken, the code writes a different value into x each time. Here always
exactly one value is written.

>   else
>     continue;
>   B[i] = AA[x];      /* if.then */
> }

if.then is never executed in your example.

Did you generate code for this and see if it reproduces the issue?

> To share some ideas: I found three properties that we could exploit:
>
> A) The BBs with complicated domains are dominated by BBs with less
> complicated domains.
>
> B) All 'for.cond18' BBs write the same data.
>
> C) All conditions are affine.
>
>
> Idea 1)
> Using fact B) we can include the domains of the predecessors and execute the
> predecessor afterwards. If the predecessor would follow the true branch,
> execute it to overwrite the "wrong" result of its successor, i.e. invert the
> execution chain. Each statement has only its own icmp in its domain. The
> domain of 'if.then' is sufficiently simple as its on the 'eq' side of the
> conditions. Unfortunately assumption B) is specific for this example and
> we'd still have the same problem for other cases.

This seems to be a rather specialized optimization. I think we can solve this
with a more generic approach.

>
> Idea 2)
> A complexity limit

I think we should add this in any case.

> Idea 2a)
> Once the domain gets too complicated (eg. number of basic sets after
> coalescing), fall back to non-affine subregions with all ancestors and all
> other blocks required to build such a region.
> Unfortunately coalescing takes place in ScopInfo while ScopDetection decides
> over non-affine subregions, modifying regions post-ScopDetection might be
> complicated. It also does not really handle this case.
>
> Idea 2b)
> Teach ScopDetection a maximum number of affine conditions chained-up.

Zino is planning to propose a patch to fall back to non-affine regions after a
maximal number of affine conditions.

> Idea 3)
> Introduce a predicate that causes the statment to execute only if its
> dominator statement has executed (exploiting property A), making it
> unnecessary to include the dominator's execution condition into the domain.
> The difficult part is, how to tell ISL about this relation.

This would result in changing isl's internal representation. Sven has been
thinking of this at some point, but this is probably a long-term thing. Let's
forget it for now.

> Idea 3a)
> Add new API to ISL to handle such constraints.
>
> Idea 3b)
> When the dominator excutes, set some flag. Statements can use this flag as
> predicate in its domain and only execute when the flag is set. This adds
> another scalar dependence between the statements. In this example, there is
> already an output (WAW) dependence between them, but generally this would
> stop instances to run in parallel.

> Idea 3c)
> Because of property C) we can compute the predicates in advance as soon as
> all indvars they depend on are available, ie. most of the time header of the
> surrounding loop. We still have scalar dependences that make parallelization
> impossible, but the BBs could be moved freely within the loop when no other
> dependencies stops them from doing so.
>
> Idea 3d)
> Use arrays of predicates instead, indexed by the indvar of the surrounding
> loops; no guarantee the those could be optimized away afterwards.
>
>
> Unfortunately property A) is also not universal for exponential blowup in
> domains. E.g. the control graph:
>
>         BB   BB
>         | \ / |
>         BB   BB
>         | \ / |
>         BB   BB
>         | \ / |
>           ...
>
> doesn't really have dominators and the chain can be arbitrarily long with
> exponential blowup, making DoS-"attacks" possible.

Right. For such cases (we should write a test case), it is probably important
to just bail out/approximate. However, I doubt they arise often in practice.

> Idea 4)
> Overapproximate the domain (eg. convex hull). Code generation re-inserts the
> exact condition chain before executing the BB. This might also be an option
> for some kinds of non-affine conditions.

Zino mentioned this as well. Over-approximation is likely to cause problems, due
to possibly covering dependences in our data-flow analysis or to additional
assumptions being triggered by the overly large iteration space. It is good
to keep this in mind, but I would like to avoid it.

Even though Michael's example code seems not fully correct, having such a
C snippet is a great start for discussing this topic. I think the first step
would be to manually derive the domains and write them in a minimal
representation. The result should be something like:

sw.default:   true
for.cond18:   i != A[0]
for.cond18.1: i != A[0] and i != A[1]
...
for.cond18.7: i != A[0] and i != A[1] and ... and i != A[7]
if.then:      i == A[0] or i == A[1] or ... or i == A[8]
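As a sanity check, the derived domains can be compared against a direct simulation of the if-chain. This is a hypothetical Python sketch with arbitrary example values for A (none of the names below come from Polly):

```python
A = [3, 7, 1, 9, 2, 0, 5, 4, 6]   # arbitrary example values

NAMES = ["for.cond18"] + [f"for.cond18.{k}" for k in range(1, 8)]

def simulate(i):
    """Trace the if-chain directly; return the executed block labels."""
    blocks = ["sw.default"]
    for k in range(9):
        if i == A[k]:
            blocks.append("if.then")       # fell out of the chain
            return blocks
        if k < 8:
            blocks.append(NAMES[k])        # guard i != A[k] held
    return blocks                          # all nine held: continue

def domains(i):
    """Evaluate the manually derived domains for iteration i."""
    d = ["sw.default"]
    for k in range(8):
        if all(i != A[j] for j in range(k + 1)):
            d.append(NAMES[k])
    if any(i == A[k] for k in range(9)):
        d.append("if.then")
    return d

assert all(simulate(i) == domains(i) for i in range(255))
```

The check passing for every iteration supports that the minimal representation above captures exactly which blocks execute.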
Quuxplusone commented 8 years ago

In r256123 I committed a "fix" that detects (at scopinfo construction time) overly complex scops and just bails out. This should avoid the compile-time issues you are seeing.

It might still be interesting if for the test case provided here, we can generate the domains more efficiently. Also, as Zino mentioned, we may want to over-approximate such scops in the scop detection to be able to handle them anyhow.

Quuxplusone commented 8 years ago

Move bugs to Polly product.

Quuxplusone commented 8 years ago
(In reply to comment #7)
> (In reply to comment #6)
> > C-code that would compile to such IR (Might well be the result of unrolling):
> >
> > for (int i=0; i<255; i+=1) {
> >   int x;
>
> There seems to be at least one unconditional store
> to x.

It's in sw.default which will be overwritten in any of the for.cond18 blocks.

> >        if (i==A[0])
> >     x = 0;           /* sw.default */
> >   else if (i==A[1])
> >     x = 1;           /* for.cond18 */
> >   else if (i==A[2])
> >     x = 2;           /* for.cond18.1 */
> >   else if (i==A[3])
> >     x = 3;           /* for.cond18.2 */
> >   else if (i==A[4])
> >     x = 4;           /* for.cond18.3 */
> >   else if (i==A[5])
> >     x = 5;           /* for.cond18.4 */
> >   else if (i==A[6])
> >     x = 6;           /* for.cond18.5 */
> >   else if (i==A[7])
> >     x = 7;           /* for.cond18.6 */
> >   else if (i==A[8])
> >     x = 8;           /* for.cond18.7 */
>
> I do not think this is an if/else chain. If at every condition the false
> branch is taken the code writes each time another value into x. Here always
> exactly one value is written.
>
> >   else
> >     continue;
> >   B[i] = AA[x];      /* if.then */
> > }
>
> if.then is never executed in your example.

Sure it is. For every case except the last, which has a continue.

>
> Did you generate code for this and see if it reproduces the issue?

One could write the same code as

void func(long long A[static const restrict 9],
          long long AA[static const restrict 9],
          long long B[static const restrict 255]) {
  for (long long i=0; i<255; i+=1) {
    long long x;

    x = 0;         /* sw.default */
    if (i!=A[0]) {
      x = 1;       /* for.cond18 */
    if (i!=A[1]) {
      x = 2;       /* for.cond18.1 */
    if (i!=A[2]) {
      x = 3;       /* for.cond18.2 */
    if (i!=A[3]) {
      x = 4;       /* for.cond18.3 */
    if (i!=A[4]) {
      x = 5;       /* for.cond18.4 */
    if (i!=A[5]) {
      x = 6;       /* for.cond18.5 */
    if (i!=A[6]) {
      x = 7;       /* for.cond18.6 */
    if (i!=A[7]) {
      x = 8;       /* for.cond18.7 */
    if (i!=A[8]) {
      continue;
    }}}}}}}}}

    B[i] = AA[x];      /* if.then */
  }
}

While searching direct dependencies, this makes no difference. It is the
conditions that cause the problem, not what is executed in the block. My
example was meant to show what the code is doing (AFAIK they are functionally
equivalent), not to generate exactly the same IR. I could reproduce the
compile-time blow-up with both versions:

$ clang ctime.c -S -emit-llvm -std=c99
$ opt ctime.ll -gvn -licm -S  -o ctimeopt.ll
$ opt ctimeopt.ll -polly-process-unprofitable -polly-scops -analyze -polly-allow-nonaffine