FDO: don't compile cold branches

GoogleCodeExporter commented 9 years ago

The feedback gathering code added in r778 allows us to track whether 
branches are taken or not. We should use this to avoid generating LLVM IR for 
cold branches, bailing back to the interpreter in these cases. Benefits from 
this, in no particular order:

a) less IR for LLVM to analyze/optimize/compile (faster compilation).
b) lower memory consumption.
c) omitting cold branches may expose additional optimizations or allow 
known optimizations to have a wider effect (constant/type propagation comes 
to mind). This is similar to the way that Self's machine code bails on int 
overflow so that the hot path can assume ints all the way through.

Original issue reported on code.google.com by collinw on 25 Jul 2009 at 9:02

GoogleCodeExporter commented 9 years ago

Original comment by collinw on 29 Sep 2009 at 5:57

Added labels: Release-2009Q4

GoogleCodeExporter commented 9 years ago

Patch sent for review: http://codereview.appspot.com/131041

Original comment by collinw on 12 Oct 2009 at 8:32

Changed state: Started

GoogleCodeExporter commented 9 years ago

Committed as r875.

### django -n 100

django:
Min: 0.940084 -> 0.925403: 1.59% faster
Avg: 0.942485 -> 0.927391: 1.63% faster
Significant (t=27.065775, a=0.95)
Stddev: 0.00409 -> 0.00380: 7.62% smaller

Mem max: 27172.000 -> 26240.000: 3.55% smaller
Usage over time: http://tinyurl.com/yzxh3yw

Before:

    Bails 0 times

    LLVM IR size in lines (n=19):
    Min: 162
    Median: 793
    Mean: 1307
    Max: 4708
    Sum: 24851

    Native code size in bytes (n=19):
    Min: 157
    Median: 1548
    Mean: 2826
    Max: 14394
    Sum: 53699

After:

    Bails 0 times

    LLVM IR size in lines (n=19):
    Min: 162
    Median: 657
    Mean: 1196
    Max: 4456
    Sum: 22733

    Native code size in bytes (n=19):
    Min: 157
    Median: 1474
    Mean: 2623
    Max: 13458
    Sum: 49842

    Conditional branch optimization:
        Total cond branches: 60
        Optimized branches: 24
        Insufficient data: 31
        Inconsistent branches: 5

### 2to3 -f all lib/2to3

2to3:
Min: 32.253097 -> 32.278093: 0.08% slower
Avg: 32.299290 -> 32.330485: 0.10% slower
Not significant
Stddev: 0.06193 -> 0.05312: 16.59% smaller

Before:

    Bailed to the interpreter 0 times

    LLVM IR size in lines (n=37):
    Min: 222
    Median: 863
    Mean: 1199
    Max: 5956
    Sum: 44388

    Native code size in bytes (n=37):
    Min: 157
    Median: 2186
    Mean: 3144
    Max: 15543
    Sum: 116350

After:

    Bailed to the interpreter 2 times
        GUARD_FAIL: 2

    LLVM IR size in lines (n=37):
    Min: 222
    Median: 871
    Mean: 1099
    Max: 5081
    Sum: 40692

    Native code size in bytes (n=37):
    Min: 157
    Median: 2103
    Mean: 2864
    Max: 13297
    Sum: 106002

    Conditional branch optimization:
        Total cond branches: 108
        Optimized branches: 36
        Insufficient data: 15
        Inconsistent branches: 57

### rietveld -n 100

rietveld:
Min: 0.514955 -> 0.514480: 0.09% faster
Avg: 0.618000 -> 0.605701: 2.03% faster
Not significant
Stddev: 0.17461 -> 0.14563: 19.90% smaller

Mem max: 55916.000 -> 53064.000: 5.37% smaller
Usage over time: http://tinyurl.com/yh6jwdg

Before:

    Bailed to the interpreter 1 times
        FATAL_GUARD_FAIL: 1

    LLVM IR size in lines (n=197):
    Min: 122
    Median: 493
    Mean: 980
    Max: 5885
    Sum: 193167

    Native code size in bytes (n=197):
    Min: 157
    Median: 1395
    Mean: 2786
    Max: 15728
    Sum: 548987

After:

    Bailed to the interpreter 1 times
        FATAL_GUARD_FAIL: 1

    LLVM IR size in lines (n=197):
    Min: 122
    Median: 426
    Mean: 819
    Max: 5885
    Sum: 161374

    Native code size in bytes (n=197):
    Min: 157
    Median: 1163
    Mean: 2307
    Max: 15368
    Sum: 454581

    Conditional branch optimization:
        Total cond branches: 419
        Optimized branches: 213
        Insufficient data: 145
        Inconsistent branches: 61

Original comment by collinw on 23 Oct 2009 at 10:07

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

Reopening to track an enhancement on top of the already-implemented support:

Currently, if we have enough feedback, we can optimize non-taken branches away. 
However, we should be able to carry that bias 
through more aggressively. We currently check truthiness like this:

if (x == &PyTrue)
  goto true_branch
else if (x == &PyFalse)
  goto false_branch
py_true = PyObject_IsTrue(x)
if (py_true == -1)
  goto propagate_exception
else if (py_true == 0)
  goto false_branch
else if (py_true == 1)
  goto true_branch

We already have data about whether the input to the conditional branch is True, 
False or something we're having to coerce with 
PyObject_IsTrue. We should use that data to replace the PyObject_IsTrue check 
with a bail to the interpreter if it's cold.

Original comment by collinw on 28 Oct 2009 at 8:32

Changed state: Accepted
Added labels: Priority-Low
Removed labels: Priority-Medium

GoogleCodeExporter commented 9 years ago

Original comment by collinw on 6 Jan 2010 at 11:43

Removed labels: Release-2009Q4

Daetalus / unladen-swallow

FDO: don't compile cold branches #72