[ppc] crandc of two cmps is much slower than two cmp/br

Quuxplusone commented 7 years ago


Bugzilla Link	PR32320
Status	NEW
Importance	P enhancement
Reported by	Carrot (carrot@google.com)
Reported on	2017-03-16 16:58:05 -0700
Last modified on	2017-03-20 16:29:54 -0700
Version	trunk
Hardware	PC All
CC	echristo@gmail.com, hfinkel@anl.gov, kit.barton@gmail.com, llvm-bugs@lists.llvm.org, nemanja.i.ibm@gmail.com
Fixed by commit(s)
Attachments	`cond_test_v1.tar.gz` (3095 bytes, application/gzip) `cond_test2_v1.tar.gz` (3791 bytes, application/gzip) `cond_test3_v1.tar.gz` (4003 bytes, application/gzip)
Blocks
Blocked by
See also

Compile following source code with options -mvsx -mcpu=power8 -O2

struct S{
  struct S* next;
};

struct S* foo(struct S* p, struct S* e)
{
  struct S* n;
  for (; (n = p->next) !=0 && n < e; p = n);
  return n;
}

LLVM generates:

foo:                                    # @foo
.Lfunc_begin0:
# BB#0:                                 # %entry
    .p2align    5
.LBB0_1:                                # %for.cond
                                        # =>This Inner Loop Header: Depth=1
    ld 3, 0(3)
    cmpdi    3, 0            // (n = p->next) !=0
    cmpld 1, 3, 4            // n < e
    crandc 20, 4, 2          //    &&
    bc 12, 20, .LBB0_1
# BB#2:                                 # %for.end
    blr

It uses crandc to combine the previous two cmp results, and then branch. It is
much slower than 2 simple cmp/br on power8, because crandc has data dependence
on previous cmp instructions, they have 5 cycles latency. In an internal
bigtable benchmark, the crandc version has only half of the performance of the
simple cmp/br version, with a similar code pattern.

Quuxplusone commented 7 years ago

Took a bit of a look at this tonight, it ends up being interesting to turn this off and into branches. Basically you need to disable most of the condition code optimizations in selection dag. It might be worth it though on platforms that don't have fast compare and condition code registers.

Hal: IIRC the BQ was very fast here, are there any other similar platforms you know about? Ultimately, I guess, think this will do what I expect? :)

Right now I'm turning it off unless we've asked for code size optimizations (because it's still useful there) and I'll benchmark it in the morning.

Quuxplusone commented 7 years ago

(Or, I guess, just conditionally turn off crbits...)

Quuxplusone commented 7 years ago

(In reply to Eric Christopher from comment #2)
> (Or, I guess, just conditionally turn off crbits...)

Yea. On embedded-style hardware this can make sense, although in this case the
early return might help there too (if you're likely to take it). This is not
the first time we've seen this kind of thing (at least on the P8). It seems
pretty clear that crbits are not useful here for two conditions anded/ored
together before a branch. What we need to understand is how universal this is --
 for example, if I have 10 conditions anded together, do I still want the branches, or does the branch predictor, etc. start to have problems after awhile?

To make a more-general point, one the reasons crbits help is that, if you need
to keep booleans in registers, then you reduce register pressure by keeping the
values in the cr register. This does not apply here (you wouldn't be keeping
the boolean values in registers regardless). Another reason is that it keeps
otherwise-larger basic blocks together and that helps with instruction
selection and scheduling. This also does not apply here (this basic block is
small anyway - and as it turns out at least, whether or not the block is split
does not really affect instruction selection).

Quuxplusone commented 7 years ago

(In reply to Hal Finkel from comment #3)
> (In reply to Eric Christopher from comment #2)
> > (Or, I guess, just conditionally turn off crbits...)
>
> Yea. On embedded-style hardware this can make sense, although in this case
> the early return might help there too (if you're likely to take it). This is
> not the first time we've seen this kind of thing (at least on the P8). It
> seems pretty clear that crbits are not useful here for two conditions
> anded/ored together before a branch. What we need to understand is how
> universal this is -- for example, if I have 10 conditions anded together, do
> I still want the branches, or does the branch predictor, etc. start to have
> problems after awhile?
>
> To make a more-general point, one the reasons crbits help is that, if you
> need to keep booleans in registers, then you reduce register pressure by
> keeping the values in the cr register. This does not apply here (you
> wouldn't be keeping the boolean values in registers regardless). Another
> reason is that it keeps otherwise-larger basic blocks together and that
> helps with instruction selection and scheduling. This also does not apply
> here (this basic block is small anyway - and as it turns out at least,
> whether or not the block is split does not really affect instruction
> selection).

So there are 2 situations cr manipulation instructions can help.
  1 the branches are unpredictable
  2 there will be large basic block after using cr manipulation instructions, so there are many instructions can be scheduled between cr instructions to hide latency of cr operations, usually it means the large basic block contains the first cmp instruction, all later ones are simple cmp/br blocks.

But the common sense is:
  1 most branches are predictable, that is the reason of hardware branch predictor.
  2 the average basic block size is only 5, the number of instructions that can be freely scheduled between cr operations is much smaller, if the target is P8 class multi issue processor, we need more instructions to feed the functional units.

So it looks the better default behavior should be disable the cr manipulation
instructions, and enable it only when compiler finds optimization opportunity.

Quuxplusone commented 7 years ago

I took a look at just turning off crbits for power8 last night and it mostly was a win - except for one testcase that regressed horribly. I'll take a look there and see what I can do.

Quuxplusone commented 7 years ago

(In reply to Carrot from comment #4)
> (In reply to Hal Finkel from comment #3)
> > (In reply to Eric Christopher from comment #2)
> > > (Or, I guess, just conditionally turn off crbits...)
> >
> > Yea. On embedded-style hardware this can make sense, although in this case
> > the early return might help there too (if you're likely to take it). This is
> > not the first time we've seen this kind of thing (at least on the P8). It
> > seems pretty clear that crbits are not useful here for two conditions
> > anded/ored together before a branch. What we need to understand is how
> > universal this is -- for example, if I have 10 conditions anded together, do
> > I still want the branches, or does the branch predictor, etc. start to have
> > problems after awhile?
> >
> > To make a more-general point, one the reasons crbits help is that, if you
> > need to keep booleans in registers, then you reduce register pressure by
> > keeping the values in the cr register. This does not apply here (you
> > wouldn't be keeping the boolean values in registers regardless). Another
> > reason is that it keeps otherwise-larger basic blocks together and that
> > helps with instruction selection and scheduling. This also does not apply
> > here (this basic block is small anyway - and as it turns out at least,
> > whether or not the block is split does not really affect instruction
> > selection).
>
> So there are 2 situations cr manipulation instructions can help.
>   1 the branches are unpredictable
>   2 there will be large basic block after using cr manipulation
> instructions, so there are many instructions can be scheduled between cr
> instructions to hide latency of cr operations, usually it means the large
> basic block contains the first cmp instruction, all later ones are simple
> cmp/br blocks.
>
> But the common sense is:
>   1 most branches are predictable, that is the reason of hardware branch
> predictor.
>   2 the average basic block size is only 5, the number of instructions that
> can be freely scheduled between cr operations is much smaller, if the target
> is P8 class multi issue processor, we need more instructions to feed the
> functional units.

FWIW, it is not the average BB size that is relevant, but the average hot BB,
and this is more workload dependent than the unweighted average.

>
> So it looks the better default behavior should be disable the cr
> manipulation instructions, and enable it only when compiler finds
> optimization opportunity.

When I've looked at this in the past, I've seen a variety of regressions from
just disabling crbits completely. However, I think we need to take a serious
look at making the expansion logic in SelectionDAGBuilder::visitBr fire more
often (I recall Nemanja was looking at this). There are clearly common cases
where there's no expected benefits from using crbits, and we might be able to
recognize them relatively early.

Quuxplusone commented 7 years ago

(In reply to Hal Finkel from comment #6)
> (In reply to Carrot from comment #4)
> > (In reply to Hal Finkel from comment #3)
> > > (In reply to Eric Christopher from comment #2)
> > > > (Or, I guess, just conditionally turn off crbits...)
> > >
> > > Yea. On embedded-style hardware this can make sense, although in this case
> > > the early return might help there too (if you're likely to take it). This
is
> > > not the first time we've seen this kind of thing (at least on the P8). It
> > > seems pretty clear that crbits are not useful here for two conditions
> > > anded/ored together before a branch. What we need to understand is how
> > > universal this is -- for example, if I have 10 conditions anded together,
do
> > > I still want the branches, or does the branch predictor, etc. start to
have
> > > problems after awhile?
> > >
> > > To make a more-general point, one the reasons crbits help is that, if you
> > > need to keep booleans in registers, then you reduce register pressure by
> > > keeping the values in the cr register. This does not apply here (you
> > > wouldn't be keeping the boolean values in registers regardless). Another
> > > reason is that it keeps otherwise-larger basic blocks together and that
> > > helps with instruction selection and scheduling. This also does not apply
> > > here (this basic block is small anyway - and as it turns out at least,
> > > whether or not the block is split does not really affect instruction
> > > selection).
> >
> > So there are 2 situations cr manipulation instructions can help.
> >   1 the branches are unpredictable
> >   2 there will be large basic block after using cr manipulation
> > instructions, so there are many instructions can be scheduled between cr
> > instructions to hide latency of cr operations, usually it means the large
> > basic block contains the first cmp instruction, all later ones are simple
> > cmp/br blocks.
> >
> > But the common sense is:
> >   1 most branches are predictable, that is the reason of hardware branch
> > predictor.
> >   2 the average basic block size is only 5, the number of instructions that
> > can be freely scheduled between cr operations is much smaller, if the target
> > is P8 class multi issue processor, we need more instructions to feed the
> > functional units.
>
> FWIW, it is not the average BB size that is relevant, but the average hot
> BB, and this is more workload dependent than the unweighted average.
>
> >
> > So it looks the better default behavior should be disable the cr
> > manipulation instructions, and enable it only when compiler finds
> > optimization opportunity.
>
> When I've looked at this in the past, I've seen a variety of regressions
> from just disabling crbits completely. However, I think we need to take a
> serious look at making the expansion logic in SelectionDAGBuilder::visitBr
> fire more often (I recall Nemanja was looking at this). There are clearly
> common cases where there's no expected benefits from using crbits, and we
> might be able to recognize them relatively early.

We should also look at how TLI->hasMultipleConditionRegisters() is used in CGP
to gate SinkCmpExpression. I suspect that this might not be desirable on the P8.

Quuxplusone commented 7 years ago

(In reply to Eric Christopher from comment #5)
> I took a look at just turning off crbits for power8 last night and it mostly
> was a win - except for one testcase that regressed horribly. I'll take a
> look there and see what I can do.

I think that we need to conduct a more systematic investigation and really try
to understand how to model what's going on here. Here's my attempt at starting
this...

I've written a small program generation which generates functions like this:

void test(int n, TYPE *restrict out, TYPE *restrict a, TYPE *restrict b, TYPE
*restrict c, TYPE *restrict d, TYPE *restrict e, TYPE *restrict f, TYPE
*restrict g) {
  for (int i = 0, j = 0; i < n; ++i) {
    if (<condition>)
      continue;
    out[j++] = a[i];
  }
}

where the arrays a through g are either filled with uniform values (to make the
branches in <condition> predictable) or random values (to make the branches in
<condition> unpredictable). I then generated random conditions of varying
logical depth. I then also varied whether crbits were enabled and whether loop
unrolling was enabled (vectorization was always disabled). Here are some
results (run on a P8 system):

depth   seed    type    default
(pred)  nocrbits    nounrl  nocrbits_nounrl unpred  nocrbits_unpred nounrl_unpred   nocrbits_nounrl_unpred  best_pred   best_unpred
2   1   int 1.71    1.70    3.00    3.06    6.29    6.29    6.30    6.30    nocrbits    default
2   1   float   2.46    2.86    2.78    3.45    8.84    10.16   9.34    11.05   default default
2   2   int 1.47    1.47    1.55    1.55    6.17    6.17    6.50    6.51    default default
2   2   float   1.35    1.35    1.56    1.56    7.75    7.76    7.95    7.95    default default
2   3   int 1.38    1.39    1.83    1.85    6.25    6.25    6.57    6.57    default default
2   3   float   1.75    2.21    1.85    2.70    9.26    11.00   9.57    11.22   default default
2   4   int 1.39    1.39    1.85    1.85    6.31    6.30    6.45    6.45    default nocrbits
2   4   float   1.54    1.54    1.85    1.85    7.70    7.71    7.96    7.95    default default
2   5   int 1.54    1.53    2.15    2.16    6.27    6.28    6.63    6.63    nocrbits    default
2   5   float   2.09    2.60    2.15    2.87    9.21    11.01   9.60    11.22   default default
3   1   int 2.01    2.01    2.51    2.50    8.34    8.35    8.49    8.48    default default
3   1   float   3.80    4.76    3.72    4.44    11.45   13.13   11.77   13.21   nounrl  default
3   2   int 3.54    3.55    3.08    3.08    10.49   10.50   10.32   10.33   nounrl  nounrl
3   2   float   3.07    3.10    3.07    3.13    13.88   14.39   13.97   14.31   default default
3   3   int 1.38    1.38    1.84    1.85    5.06    5.05    5.45    5.44    default nocrbits
3   3   float   1.75    2.15    1.85    2.70    7.53    9.67    7.52    9.69    default nounrl
3   4   int 1.54    1.55    1.84    1.84    7.37    8.21    7.52    8.40    default default
3   4   float   1.54    1.68    1.85    1.85    9.01    10.71   9.50    11.01   default default
3   5   int 1.76    1.77    1.68    1.68    6.89    6.86    7.00    7.00    nounrl  nocrbits
3   5   float   3.45    4.78    3.43    4.22    9.95    13.01   10.20   12.70   nounrl  default
4   1   int 2.52    4.01    2.60    4.00    10.42   11.18   10.40   11.12   default nounrl
4   1   float   3.42    5.04    3.42    5.03    16.60   19.96   16.59   20.04   default nounrl
4   2   int 0.00    0.00    0.00    0.00    0.30    0.30    0.33    0.28    default nocrbits_nounrl
4   2   float   4.60    5.53    4.69    5.88    19.48   23.08   19.48   23.06   default default
4   3   int 3.88    4.30    3.89    4.62    11.37   12.26   11.38   12.26   default default
4   3   float   8.72    8.21    8.73    8.21    15.88   18.84   15.86   18.84   nocrbits    nounrl
4   4   int 4.04    3.74    4.05    3.74    10.16   13.96   10.17   13.95   nocrbits    default
4   4   float   5.63    5.26    5.63    5.26    15.90   19.56   15.73   19.57   nocrbits    nounrl
4   5   int 0.00    0.00    0.00    0.00    0.31    0.31    0.31    0.30    default nocrbits_nounrl
4   5   float   3.08    3.39    3.08    3.49    15.27   19.07   15.28   19.07   default default

For example, in d=2, n=1, <condition> is:

  ((f[i] <= c[i]) && (a[i] > b[i]))

and in d=4, n=5, <condition> is:

  ((((d[i] < g[i]) || (d[i] < b[i])) && ((d[i] >= f[i]) || (b[i] <= e[i]))) || (((c[i] < e[i]) || (c[i] >= g[i])) || ((g[i] <= d[i]) || (c[i] >= e[i]))))

The take-away message is that, in general, crbits produce better performance
for both predictable and unpredictable branches (there are few places where
'nocrbits' occurs in the last two columns indicating the best configuration,
and even fewer where the difference is particularly significant). There are
also some places where our unrolling does not help, and that's something else
we should understand.

Quuxplusone commented 7 years ago

Attached cond_test_v1.tar.gz (3095 bytes, application/gzip): Test program generator, run script, and generated programs.

Quuxplusone commented 7 years ago

Agreed with the need for a principled look - worth looking if it was a universal loss, but nothing is ever that easy :)

Quuxplusone commented 7 years ago

The pass that I implemented in https://reviews.llvm.org/D30431 does late block
splitting on CR logical operations. As currently implemented, it always splits
the blocks if it is able to.
Interestingly enough, this appears to be performance-neutral on a number of
workloads that I ran it on. However, I think that when we have a better
understanding of when to use the CR logicals and when to split, this pass can
be improved (for example, we can trivially extend this pass not to split blocks
larger than size N).
Furthermore, I've attempted to set up the pass to be extensible so that we can
do other optimizations on these CR-logical operations in MachineSSA (if we
think of any).
Personally, I believe an approach like this may give us the "best of both
worlds" type of solution - the ability to use CR bits for things they're good
for (like reducing register pressure for booleans) and removing them when
they're likely to hurt performance.

Also, I'd love to get some feedback on the patch :).

Quuxplusone commented 7 years ago

I meant to remark that the pass will predictably split the test case originally
noted in the bug into this:

.LBB0_1:                                # %for.cond
                                        # =>This Inner Loop Header: Depth=1
        ld 3, 0(3)
        cmpdi    3, 0
        bclr 12, 2, 0
# BB#2:                                 # %for.cond
                                        #   in Loop: Header=BB0_1 Depth=1
        cmpld    3, 4
        bc 12, 0, .LBB0_1
# BB#3:                                 # %for.end
        blr

Quuxplusone commented 7 years ago

To make life more interesting, here's a second set of generated benchmarks.
These were created to be more like the test case in this bug. From this, I'd
draw the opposite conclusion compared to my last set of tests ;)

depth   seed    type    default
(pred)  nocrbits    nounrl  nocrbits_nounrl unpred  nocrbits_unpred nounrl_unpred   nocrbits_nounrl_unpred  best_pred   best_unpred
0   1   int 2.47    2.34    2.42    2.33    2.74    2.64    2.76    2.69    nocrbits_nounrl nocrbits
0   1   float   3.23    2.37    3.03    2.37    2.68    2.73    3.23    2.65    nocrbits    nocrbits_nounrl
0   2   int 3.16    2.35    2.44    2.34    2.75    2.67    2.72    2.68    nocrbits_nounrl nocrbits
0   2   float   2.45    2.35    2.43    2.74    2.75    2.68    2.75    2.64    nocrbits    nocrbits_nounrl
0   3   int 2.46    2.36    2.43    2.33    2.73    2.65    2.74    2.64    nocrbits_nounrl nocrbits_nounrl
0   3   float   2.47    2.35    2.42    2.33    2.74    2.68    2.74    3.07    nocrbits_nounrl nocrbits
0   4   int 2.45    2.87    2.42    2.98    2.79    3.20    2.79    2.68    nounrl  nocrbits_nounrl
0   4   float   2.42    2.82    2.42    2.92    2.73    2.72    2.77    2.68    default nocrbits_nounrl
0   5   int 2.44    2.33    2.45    2.36    2.73    2.64    2.74    2.87    nocrbits    nocrbits
0   5   float   2.43    2.34    2.45    2.35    2.73    2.68    2.73    2.77    nocrbits    nocrbits
1   1   int 2.74    2.69    2.72    2.79    3.02    3.56    3.02    3.38    nocrbits    default
1   1   float   2.78    2.71    2.75    2.70    3.06    2.98    3.07    2.98    nocrbits_nounrl nocrbits
1   2   int 3.42    2.68    3.06    2.68    3.04    2.97    3.14    3.02    nocrbits    nocrbits
1   2   float   3.34    3.21    3.36    3.22    3.66    3.58    4.33    3.54    nocrbits    nocrbits_nounrl
1   3   int 3.17    2.66    3.24    2.66    3.53    3.33    3.04    2.97    nocrbits    nocrbits_nounrl
1   3   float   3.99    3.19    3.34    3.20    3.65    3.50    3.65    3.49    nocrbits    nocrbits_nounrl
1   4   int 2.71    2.68    2.66    2.70    3.05    3.01    3.06    3.01    nounrl  nocrbits
1   4   float   2.77    2.74    3.04    2.73    3.66    3.10    3.10    3.55    nocrbits_nounrl nocrbits
1   5   int 2.71    2.68    2.62    2.71    3.01    3.60    3.42    3.00    nounrl  nocrbits_nounrl
1   5   float   2.77    2.68    2.76    2.68    3.06    2.99    3.04    3.02    nocrbits    nocrbits
2   1   int 3.16    2.72    3.17    2.75    7.23    5.03    7.29    5.02    nocrbits    nocrbits_nounrl
2   1   float   3.77    3.46    3.86    3.48    7.87    8.65    7.88    8.65    nocrbits    default
2   2   int 2.84    2.69    2.83    2.67    6.74    6.67    6.46    6.39    nocrbits_nounrl nocrbits_nounrl
2   2   float   2.83    2.69    2.81    2.71    6.59    6.48    6.60    6.48    nocrbits    nocrbits
2   3   int 2.77    2.83    2.73    2.74    6.85    6.78    6.51    6.45    nounrl  nocrbits_nounrl
2   3   float   3.18    3.06    3.15    3.07    7.87    9.49    7.88    9.49    nocrbits    default
2   4   int 2.73    2.73    2.66    2.72    6.85    6.77    6.72    6.43    nounrl  nocrbits_nounrl
2   4   float   2.72    2.69    2.67    2.67    6.57    6.47    6.57    6.48    nounrl  nocrbits
2   5   int 2.85    2.68    2.93    2.70    6.84    6.73    6.14    6.21    nocrbits    nounrl
2   5   float   3.63    3.39    3.58    3.39    7.89    9.50    7.88    9.50    nocrbits    nounrl
3   1   int 3.28    2.95    3.26    2.92    8.92    7.75    8.69    7.78    nocrbits_nounrl nocrbits
3   1   float   5.34    5.34    5.34    5.34    12.16   13.28   12.16   13.28   default default
3   2   int 3.16    3.43    3.18    3.42    11.65   8.95    11.23   9.03    default nocrbits
3   2   float   3.61    3.46    3.60    3.68    12.82   12.96   12.84   12.96   nocrbits    default
3   3   int 2.74    2.72    2.72    2.74    5.65    6.07    5.63    5.88    nocrbits    nounrl
3   3   float   3.14    3.36    3.16    3.05    8.39    10.11   8.40    10.13   nocrbits_nounrl default
3   4   int 2.69    2.73    2.79    2.70    7.60    8.28    7.60    8.30    default default
3   4   float   2.73    2.72    2.72    2.66    9.44    10.47   9.44    10.44   nocrbits_nounrl default
3   5   int 3.02    2.98    3.00    2.70    7.77    7.16    8.22    7.35    nocrbits_nounrl nocrbits
3   5   float   4.78    4.76    4.78    5.07    11.16   12.84   11.15   12.83   nocrbits    nounrl
4   1   int 3.90    3.16    3.70    3.14    10.92   10.67   10.90   10.83   nocrbits_nounrl nocrbits
4   1   float   5.12    4.74    4.88    4.75    17.03   19.85   17.01   19.79   nocrbits    nounrl
4   2   int 2.70    2.63    2.69    2.66    3.00    2.98    3.00    3.01    nocrbits    nocrbits
4   2   float   5.82    5.51    5.82    5.53    19.48   22.90   19.46   22.79   nocrbits    nounrl
4   3   int 5.17    4.35    5.19    4.35    13.18   12.38   13.14   12.52   nocrbits    nocrbits
4   3   float   9.57    8.19    9.57    8.30    17.28   18.72   17.26   18.72   nocrbits    nounrl
4   4   int 3.76    3.43    3.76    3.45    12.75   13.77   12.64   13.70   nocrbits    nounrl
4   4   float   6.08    4.95    6.07    4.88    16.59   19.02   16.58   19.03   nocrbits_nounrl nounrl
4   5   int 2.71    2.64    2.71    2.64    3.00    2.97    2.98    2.95    nocrbits    nocrbits_nounrl
4   5   float   3.41    3.28    3.43    3.51    16.21   19.39   16.12   19.39   nocrbits    nounrl

One thing that is interesting to note is that all of the depth 0 loops are the
same, so these numbers are pretty noisy, but the overall trend is clear. Here
nocrbits is clearly better (except for depth 3 and above, where the crbits
looks okay when the branches are unpredictable).

The depth 0 tests look very much like the reported case:

struct S *test(struct S * restrict p, struct S * restrict end, struct S **
restrict lp) {
  struct S *n;
  for (; (n = p->next) !=0 && n < end; p = n);
  return n;
}

whereas the others have been made a bit more complicated; so example d=4, n=5
is:

struct S {
  struct S *next;
  TYPE a, b, c, d, e, f, g;
};
struct S *test(struct S * restrict p, struct S * restrict end, struct S **
restrict lp) {
  struct S *n;
  for (; (n = p->next) !=0 && n < end; p = n)  {
    if (((((p->d < p->g) || (p->d < p->b)) && ((p->d >= p->f) || (p->b <= p->e))) || (((p->c < p->e) || (p->c >= p->g)) || ((p->g <= p->d) || (p->c >= p->e))))) *lp = p;
  }
  return n;
}

Quuxplusone commented 7 years ago

Attached cond_test2_v1.tar.gz (3791 bytes, application/gzip): Test program generator, run script, and generated programs - 2

Quuxplusone commented 7 years ago

To try another variant, I generated a bunch of kernels like this:

struct S {
  struct S *next;
  TYPE a, b, c, d, e, f, g;
  TYPE out;
};
struct S *test(struct S * restrict p, struct S * restrict end) {
  struct S *n;
  for (; (n = p->next) !=0 && n < end; p = n)  {
    if (((((p->d < p->g) || (p->d < p->b)) && ((p->d >= p->f) || (p->b <= p->e))) || (((p->c < p->e) || (p->c >= p->g)) || ((p->g <= p->d) || (p->c >= p->e))))) p->out = p->a;
  }
  return n;
}

I thought maybe adding a store into the loop might change something. It did not
really (results still favor nocrbits):

depth   seed    type    default (pred)  nocrbits        nounrl  nocrbits_nounrl
unpred  nocrbits_unpred nounrl_unpred   nocrbits_nounrl_unpred  best_p
red     best_unpred
0       1       int     2.19    2.09    3.19    2.26    3.28    2.39    2.50
2.60    nocrbits        nocrbits
0       1       float   2.19    2.10    2.22    2.06    3.28    2.49    2.52
2.84    nocrbits_nounrl nocrbits
0       2       int     2.21    2.88    2.18    2.38    2.50    3.20    2.64
2.37    nounrl  nocrbits_nounrl
0       2       float   2.70    2.55    3.04    2.08    2.92    2.37    2.48
3.26    nocrbits_nounrl nocrbits
0       3       int     2.51    2.12    2.21    2.09    2.49    2.38    2.50
2.40    nocrbits_nounrl nocrbits
0       3       float   2.21    2.69    2.22    3.14    2.48    2.38    3.07
2.40    default nocrbits
0       4       int     3.04    3.62    3.14    2.09    3.16    3.77    2.58
2.39    nocrbits_nounrl nocrbits_nounrl
0       4       float   3.07    2.09    2.29    2.72    2.89    2.41    2.51
2.39    nocrbits        nocrbits_nounrl
0       5       int     2.26    2.18    3.17    3.53    3.04    2.40    3.28
2.41    nocrbits        nocrbits
0       5       float   3.10    2.09    2.20    2.08    3.19    2.38    2.88
3.19    nocrbits_nounrl nocrbits
1       1       int     3.22    2.53    2.50    3.03    6.14    5.82    5.72
5.94    nounrl  nounrl
1       1       float   2.61    2.52    2.59    2.55    6.77    6.70    6.82
6.74    nocrbits        nocrbits
1       2       int     2.52    2.53    2.54    2.52    5.64    5.39    5.47
5.30    default nocrbits_nounrl
1       2       float   3.09    3.28    3.07    3.59    7.08    8.33    7.13
8.34    nounrl  default
1       3       int     3.40    2.90    3.19    2.52    5.94    5.88    5.72
5.71    nocrbits_nounrl nocrbits_nounrl
1       3       float   3.09    3.08    3.49    3.45    7.92    9.50    7.95
9.51    nocrbits        default
1       4       int     3.01    2.58    2.57    3.04    3.53    3.34    3.62
3.42    nounrl  nocrbits
1       4       float   2.62    3.36    2.66    2.53    6.02    6.07    5.91
6.03    nocrbits_nounrl nounrl
1       5       int     2.62    2.59    2.51    2.53    5.85    5.87    5.84
5.84    nounrl  nounrl
1       5       float   2.60    2.54    2.49    3.27    6.85    6.68    6.81
6.68    nounrl  nocrbits
2       1       int     4.32    3.59    4.17    3.54    7.24    5.11    7.31
5.15    nocrbits_nounrl nocrbits
2       1       float   4.67    4.43    4.46    4.51    9.99    10.70   10.00
10.70   nocrbits        default
2       2       int     2.71    2.56    3.22    2.62    7.16    7.01    7.06
7.08    nocrbits        nocrbits
2       2       float   3.34    2.56    3.25    2.56    8.21    8.14    8.30
8.17    nocrbits        nocrbits
2       3       int     3.53    3.52    3.71    3.63    7.14    7.18    7.17
7.06    nocrbits        nocrbits_nounrl
2       3       float   4.03    4.05    4.19    3.87    9.85    11.25   9.86
11.28   nocrbits_nounrl default
2       4       int     3.83    3.51    3.87    3.53    7.16    7.11    6.75
6.80    nocrbits        nounrl
2       4       float   3.50    3.57    3.72    3.66    8.30    8.29    8.42
8.28    default nocrbits_nounrl
2       5       int     3.96    3.51    3.97    3.46    7.04    7.26    7.00
7.19    nocrbits_nounrl nounrl
2       5       float   4.60    4.13    4.03    4.50    9.80    11.17   9.86
11.23   nounrl  default
3       1       int     4.79    3.74    4.52    3.83    8.35    7.80    8.22
7.84    nocrbits        nocrbits
3       1       float   7.08    6.24    7.04    6.33    12.26   13.31   12.26
13.31   nocrbits        default
3       2       int     3.59    3.18    3.40    3.37    11.66   10.32   11.30
10.42   nocrbits        nocrbits
3       2       float   4.10    3.29    3.53    4.04    14.67   14.52   14.68
14.53   nocrbits        nocrbits
3       3       int     3.63    3.53    3.54    3.89    6.12    6.27    6.00
6.18    nocrbits        nounrl
3       3       float   4.41    3.75    4.06    3.80    8.68    9.94    8.77
9.99    nocrbits        default
3       4       int     3.77    3.52    3.80    3.61    8.22    8.73    8.20
8.77    nocrbits        nounrl
3       4       float   3.71    3.61    3.55    3.42    9.66    10.95   9.60
11.01   nocrbits_nounrl nounrl
3       5       int     3.55    3.04    3.03    2.64    7.95    7.32    7.96
7.49    nocrbits_nounrl nocrbits
3       5       float   4.99    4.72    5.01    4.84    11.21   12.86   11.13
12.87   nocrbits        nounrl
4       1       int     3.80    3.40    3.66    3.48    10.94   10.67   10.93
10.69   nocrbits        nocrbits
4       1       float   5.06    4.73    5.13    4.92    16.74   19.98   16.73
19.97   nocrbits        nounrl
4       2       int     3.40    3.59    3.73    3.78    3.79    3.87    4.10
3.81    default default
4       2       float   6.53    6.10    6.48    5.84    19.47   22.83   19.47
22.85   nocrbits_nounrl default
4       3       int     5.17    4.62    5.70    4.32    13.09   12.38   13.09
12.42   nocrbits_nounrl nocrbits
4       3       float   10.13   8.35    9.61    8.28    17.05   18.65   17.01
18.65   nocrbits_nounrl nounrl
4       4       int     3.81    3.40    4.18    3.40    12.78   13.82   12.68
13.78   nocrbits        nounrl
4       4       float   6.61    4.94    6.59    4.96    17.19   18.99   17.14
18.99   nocrbits        nounrl
4       5       int     3.47    3.74    3.40    3.58    4.09    3.83    3.88
4.10    nounrl  nocrbits
4       5       float   4.61    4.37    4.61    4.21    16.18   19.21   16.22
19.25   nocrbits_nounrl default

I'm still not sure how much I trust the actual numbers, but the trend seems
relatively clear. Except for depth 0 (which is the originally-reported kernel),
crbits are preferred for unpredictable branches and crbits are better when the
results are predictable.

Quuxplusone commented 7 years ago

Attached cond_test3_v1.tar.gz (4003 bytes, application/gzip): Test program generator, run script, and generated programs - 3

Quuxplusone / LLVMBugzillaTest

[ppc] crandc of two cmps is much slower than two cmp/br #31292