Offload JIT compilation to secondary threads

GoogleCodeExporter commented 8 years ago

Execution should not block on compiling a function with LLVM or
reoptimizing it with new data. We should send these work units to separate
worker threads, allowing the main threads to carry on unimpeded.

The first implementation could/should probably just be a simple FIFO queue
of work items with a single worker thread. We can add heuristics and
additional worker threads as the data warrants.
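
A minimal sketch of the single-worker FIFO idea, written in modern Python 3 purely for illustration. The `compile_function` helper and the `compiled` dict are hypothetical stand-ins for the LLVM work; the real implementation would live in the interpreter's C/C++ code.

```python
import queue
import threading

def compile_function(name):
    # Hypothetical stand-in for the expensive LLVM compilation step.
    return "native code for " + name

work_queue = queue.Queue()   # FIFO: items are compiled in submission order
compiled = {}                # results, written only by the worker thread
_STOP = object()             # sentinel that tells the worker to exit

def worker():
    while True:
        item = work_queue.get()
        if item is _STOP:
            break
        compiled[item] = compile_function(item)

thread = threading.Thread(target=worker)
thread.start()

# The main thread only enqueues work and keeps running.
for name in ["f", "g", "h"]:
    work_queue.put(name)

# Shut the worker down so the example terminates deterministically.
work_queue.put(_STOP)
thread.join()
print(list(compiled))
```

With one worker and one FIFO queue, functions are compiled in the order they were submitted, which keeps the first version easy to reason about before any heuristics are added.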

Original issue reported on code.google.com by collinw on 28 May 2009 at 9:57

GoogleCodeExporter commented 8 years ago

Shifting to Reid, since he expressed interest in working on this.

Original comment by collinw on 25 Jul 2009 at 9:02

Added labels: Priority-High
Removed labels: Priority-Medium

GoogleCodeExporter commented 8 years ago

Here's a status update on this.  I have a very large change that's been out for
a while: http://codereview.appspot.com/101047 .  It turns out that it's very
hard to implement this properly on top of pythreads in the presence of os.fork.
The basic design is that for each function we wish to compile, we create a job
that goes on an input queue to a background thread that does compilation while
not holding the GIL.  When compilation is finished, the background thread puts
the finished jobs on the output queue, which the foreground threads poll
periodically.
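
The input-queue/output-queue design described above can be sketched roughly as follows. This is an illustration in Python 3, not the code from the review: `CompileJob`, `background_compiler`, and `poll_finished_jobs` are hypothetical names, and the string result stands in for real compiled code.

```python
import queue
import threading

class CompileJob:
    """Hypothetical job record; the real job type is not shown in this thread."""
    def __init__(self, func_name):
        self.func_name = func_name
        self.result = None

input_queue = queue.Queue()    # foreground -> background
output_queue = queue.Queue()   # background -> foreground

def background_compiler():
    while True:
        job = input_queue.get()
        if job is None:          # None is used here as a shutdown sentinel
            break
        job.result = "compiled:" + job.func_name   # stand-in for LLVM work
        output_queue.put(job)

def poll_finished_jobs(installed):
    """Called periodically by a foreground thread; never blocks."""
    while True:
        try:
            job = output_queue.get_nowait()
        except queue.Empty:
            return
        installed[job.func_name] = job.result

worker = threading.Thread(target=background_compiler)
worker.start()
input_queue.put(CompileJob("hot_function"))
input_queue.put(None)
worker.join()          # joined only so the example is deterministic

installed = {}
poll_finished_jobs(installed)
print(installed)
```

The key property is that the foreground side uses a non-blocking `get_nowait()`, so polling costs very little when no compilation has finished.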

After beating on this for a while, the main problem that I'm having now is with
feedback-directed optimization (FDO).  Previously, compiling bytecode to LLVM IR
required no external Python state.  Now, with the LOAD_GLOBAL FDO and others
coming soon, it seems like we're going to want to access a lot of Python state.
We need to come up with a way to do that safely.  Some options:

- Have the background thread acquire the GIL whenever it needs to access state.
This creates less GIL contention, but it's extremely tricky because state can
change arbitrarily between GIL acquires.  For example, the assumed globals dict
could be deleted.
- Have the background thread acquire the GIL for the LLVM IR generation phase.
This creates more contention, while still offloading LLVM optimization and code
generation.  It's possible this will create a GIL battle, but we can't know
without trying it.
- Have the foreground thread requesting compilation acquire a lock on the LLVM
data and generate the IR while holding the GIL.  This can block the eval loop,
but will probably avoid a GIL battle.  Instead, it will create contention for
the LLVM data lock.  For example, if the background thread is doing native
codegen for a large function and a foreground thread finds a hot function, the
eval loop will wait for the LLVM lock, and then do IR generation.
- Alternatively, we find a way to magically solve the GIL problem by switching
to object-level locks on dicts and code objects.
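
To make the hazard in the first option concrete, here is a toy Python 3 sketch of "state can change between GIL acquires". Everything in it is an assumption for illustration: an ordinary `threading.Lock` stands in for the GIL, `Namespace` is a made-up globals dict with a version counter, and the `mutate_between` callback simulates another thread running while the lock is released. It does not cover the harder case mentioned above, where the dict itself is deleted.

```python
import threading

gil = threading.Lock()          # stand-in for the GIL, not the real thing

class Namespace:
    """Toy globals dict with a version counter bumped on every write."""
    def __init__(self, **values):
        self.values = dict(values)
        self.version = 0

    def set(self, key, value):
        self.values[key] = value
        self.version += 1

def compile_with_assumption(ns, key, mutate_between=None):
    # First acquire: snapshot the state the compiled code will assume.
    with gil:
        assumed_value = ns.values[key]
        assumed_version = ns.version

    # Lock released: the slow compilation happens here, and other threads
    # may change the namespace in the meantime (simulated by the callback).
    if mutate_between is not None:
        mutate_between()
    code = "code specialised for %s=%r" % (key, assumed_value)

    # Second acquire: re-check the assumption before publishing the result.
    with gil:
        if ns.version != assumed_version:
            return None      # assumption is stale; caller must retry or bail
        return code

ns = Namespace(x=1)
print(compile_with_assumption(ns, "x"))

def rebind_x():
    with gil:
        ns.set("x", 2)

print(compile_with_assumption(ns, "x", mutate_between=rebind_x))
```

The second call returns `None` because the snapshot was invalidated while the lock was released, which is the kind of re-validation the first option would need at every point where it touches Python state.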

Original comment by reid.kle...@gmail.com on 10 Sep 2009 at 6:28

GoogleCodeExporter commented 8 years ago

Original comment by collinw on 26 Sep 2009 at 12:05

Added labels: Release-2009Q4

GoogleCodeExporter commented 8 years ago

Original comment by collinw on 6 Jan 2010 at 11:41

Added labels: Release-Merger
Removed labels: Release-2009Q4

GoogleCodeExporter commented 8 years ago

Issue 110 has been merged into this issue.

Original comment by reid.kle...@gmail.com on 10 Jan 2010 at 8:59

GoogleCodeExporter commented 8 years ago

Work is proceeding on this in the
http://code.google.com/p/unladen-swallow/source/browse/branches/background-thread
branch.

Reid, can you give a quick status update on the branch?

Original comment by collinw on 11 Jan 2010 at 7:54

Changed state: Started
Added labels: Priority-Critical
Removed labels: Priority-High

GoogleCodeExporter commented 8 years ago

So everything is basically done, but like any change that may introduce
concurrency bugs, it needs to be tested thoroughly.  My problem is that I
haven't had a single block of time where I can sit down, run the tests, and
chase down any bugs I find.

Hopefully, over these next three weeks of no classes, I'll be able to find such
a block of time.  When are you planning to do the Q4 release?  End of January?

Original comment by reid.kle...@gmail.com on 12 Jan 2010 at 1:11

GoogleCodeExporter commented 8 years ago

As of last night, the following tests fail for me on my x86_64 Linux desktop
with 6 GB of RAM in the branch under -j always:
    test_asynchat test_asyncore test_importhooks test_ioctl

On my Macbook with 4 GB of RAM, it runs out of memory while resizing an
AbstractTypeUser array in LLVM to be about 256 MB.  I don't think that's
related to the background thread, though, but it's worth noting when trying to
track down extra memory usage later.

Everything passes without -j always.  I assume that when I merge r1010 into the
branch, test_importhooks will pass again.

Notably, the test_multiprocessing failure seems to have disappeared after I
merged in 2.6.4, which is awesome.

None of the test failures for asynchat, asyncore, or ioctl are reproducible
when run by themselves under -j always.  The async tests have some kind of
timing dependency that's being disturbed by having the foreground thread
release the GIL to wait for compilation.  I don't know what the deal is for the
ioctl tests.  Perhaps they could be made to pass by using
set_jit_control("never") in setUp and tearDown.
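
A rough sketch of what that setUp/tearDown workaround might look like, written in Python 3 so it runs on a stock interpreter. The `set_jit_control` and `get_jit_control` functions below are local stand-ins defined only for this example (the getter in particular is an assumption), and restoring the previous mode in tearDown is one possible reading of the suggestion, not necessarily what the branch does.

```python
import unittest

# Stand-ins so the sketch runs without an Unladen Swallow build.
# In the real interpreter the JIT mode is controlled by the runtime.
_jit_mode = "always"

def get_jit_control():
    return _jit_mode

def set_jit_control(mode):
    global _jit_mode
    _jit_mode = mode

class TimingSensitiveTests(unittest.TestCase):
    def setUp(self):
        # Remember the current mode, then disable the JIT for this test.
        self._saved_mode = get_jit_control()
        set_jit_control("never")

    def tearDown(self):
        # Put the mode back so later tests still run under -j always.
        set_jit_control(self._saved_mode)

    def test_runs_without_jit(self):
        self.assertEqual(get_jit_control(), "never")

suite = unittest.defaultTestLoader.loadTestsFromTestCase(TimingSensitiveTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Saving and restoring the mode keeps the change local to the affected test cases instead of silently turning the JIT off for the rest of the run.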

Are these test failures under -j always enough to block the merger?  Is there
anything else we need to do before merging this?  Finally, since jyasskin has
already combed over it a few times and probably doesn't want to again, do we
want to do any kind of code review for merging the branch?

Original comment by reid.kle...@gmail.com on 20 Jan 2010 at 6:36

GoogleCodeExporter commented 8 years ago

If set_jit_control("never") makes those tests pass under -j always, I'm fine
with making that change. Just add a note and create a low-priority bug
mentioning the test failures.

The biggest item before merger is benchmarking. Can you benchmark your branch
against trunk and against CPython 2.6.4 from vendor/? I'm most interested in
the results from `perf.py -r default,apps,startup`.

Original comment by collinw on 20 Jan 2010 at 9:07

GoogleCodeExporter commented 8 years ago

The regression tests all pass now with the background thread, except for
test_jit_gdb, which hangs until I kill gdb, after which the run continues.
I'll have to attach to gdb with gdb to see what's going on there.  :)

Here are the benchmark results for default:

[reid@muikyl unladen-test]$ ./perf.py -r ../unladen-trunk/python ../unladen-threaded/python

Report on Linux muikyl 2.6.31-17-generic #54-Ubuntu SMP Thu Dec 10 17:01:44 UTC 2009 x86_64
Total CPU cores: 8

### 2to3 ###
Min: 17.020000 -> 17.150000: 1.0076x slower
Avg: 17.140000 -> 17.246000: 1.0062x slower
Not significant
Stddev: 0.09138 -> 0.11082: 1.2127x larger
Timeline: http://tinyurl.com/yebcuxq

### django ###
Min: 0.556600 -> 0.557424: 1.0015x slower
Avg: 0.560892 -> 0.561046: 1.0003x slower
Not significant
Stddev: 0.00279 -> 0.00132: 2.1156x smaller
Timeline: http://tinyurl.com/y8bp4cy

### nbody ###
Min: 0.159952 -> 0.154391: 1.0360x faster
Avg: 0.164053 -> 0.156924: 1.0454x faster
Significant (t=15.205140, a=0.95)
Stddev: 0.00423 -> 0.00202: 2.0958x smaller
Timeline: http://tinyurl.com/yg2h32l

### slowpickle ###
Min: 0.357190 -> 0.386922: 1.0832x slower
Avg: 0.359126 -> 0.387653: 1.0794x slower
Significant (t=-236.009266, a=0.95)
Stddev: 0.00114 -> 0.00041: 2.7503x smaller
Timeline: http://tinyurl.com/yhjcugv

### slowspitfire ###
Min: 0.349260 -> 0.379709: 1.0872x slower
Avg: 0.358986 -> 0.382109: 1.0644x slower
Significant (t=-29.946725, a=0.95)
Stddev: 0.00663 -> 0.00396: 1.6731x smaller
Timeline: http://tinyurl.com/ydc664k

### slowunpickle ###
Min: 0.179297 -> 0.173357: 1.0343x faster
Avg: 0.183116 -> 0.174025: 1.0522x faster
Significant (t=4.697301, a=0.95)
Stddev: 0.01935 -> 0.00052: 37.4681x smaller
Timeline: http://tinyurl.com/ye5jt5l

### spambayes ###
Min: 0.184013 -> 0.179496: 1.0252x faster
Avg: 0.184262 -> 0.179841: 1.0246x faster
Significant (t=81.579194, a=0.95)
Stddev: 0.00042 -> 0.00034: 1.2604x smaller
Timeline: http://tinyurl.com/y8ln57t
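
For readers checking the figures, the "Nx faster/slower" values in the report appear consistent with dividing the larger time by the smaller one. The helper below is a hypothetical reconstruction in Python 3 for illustration only; it is not taken from perf.py.

```python
def describe_change(old, new):
    """Format a timing change in the style of the report lines above."""
    if old == 0 or new == 0:
        # Mirrors the wording the report uses when a ratio is undefined.
        return "incomparable (one result was zero)"
    if new > old:
        return "%.4fx slower" % (new / old)
    return "%.4fx faster" % (old / new)

# 2to3 Min: 17.020000 -> 17.150000
print(describe_change(17.020000, 17.150000))
# nbody Min: 0.159952 -> 0.154391
print(describe_change(0.159952, 0.154391))
```

The two calls reproduce the "1.0076x slower" and "1.0360x faster" figures reported for the 2to3 and nbody minimums.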

I'm rerunning with the addition of the apps and startup benchmarks against
trunk tonight.

Original comment by reid.kle...@gmail.com on 21 Jan 2010 at 7:25

GoogleCodeExporter commented 8 years ago

I think a reasonable way to see this change in action would be to tune the
hotness threshold very low and run the benchmarks.  The threaded JIT should
trounce the JIT that blocks on compilation.

More complete benchmark results below.  I made these while using my desktop, so
they may not be accurate.  They don't show a degradation, so I think we're good
to go.

[reid@muikyl unladen-test-orig]$ ./perf.py -r -b default,apps,startup ../unladen-trunk/python ../unladen-threaded/python
Running 2to3...
INFO:root:Running ../unladen-threaded/python ./lib/2to3/2to3 -f all ./lib/2to3
INFO:root:Running `['../unladen-threaded/python', './lib/2to3/2to3', '-f', 'all', './lib/2to3']` 5 times
INFO:root:Running ../unladen-trunk/python ./lib/2to3/2to3 -f all ./lib/2to3
INFO:root:Running `['../unladen-trunk/python', './lib/2to3/2to3', '-f', 'all', './lib/2to3']` 5 times
Running bzr_startup...
INFO:root:Running ../unladen-threaded/python ./lib/bazaar/bzr help
INFO:root:Running `['../unladen-threaded/python', './lib/bazaar/bzr', 'help']` 200 times
INFO:root:Running ../unladen-trunk/python ./lib/bazaar/bzr help
INFO:root:Running `['../unladen-trunk/python', './lib/bazaar/bzr', 'help']` 200 times
Running django...
INFO:root:Running ../unladen-threaded/python ./performance/bm_django.py -n 100
INFO:root:Running ../unladen-trunk/python ./performance/bm_django.py -n 100
Running hg_startup...
INFO:root:Running ../unladen-threaded/python ./lib/mercurial/hg help
INFO:root:Running `['../unladen-threaded/python', './lib/mercurial/hg', 'help']` 1000 times
INFO:root:Running ../unladen-trunk/python ./lib/mercurial/hg help
INFO:root:Running `['../unladen-trunk/python', './lib/mercurial/hg', 'help']` 1000 times
Running html5lib...
INFO:root:Running ../unladen-threaded/python ./performance/bm_html5lib.py -n 10
INFO:root:Running ../unladen-trunk/python ./performance/bm_html5lib.py -n 10
Running nbody...
INFO:root:Running ../unladen-threaded/python ./performance/bm_nbody.py -n 100
INFO:root:Running ../unladen-trunk/python ./performance/bm_nbody.py -n 100
Running normal_startup...
INFO:root:Running `['../unladen-threaded/python', '-c', '']` 2000 times
INFO:root:Running `['../unladen-trunk/python', '-c', '']` 2000 times
Running rietveld...
INFO:root:Running ../unladen-threaded/python ./performance/bm_rietveld.py -n 100
INFO:root:Running ../unladen-trunk/python ./performance/bm_rietveld.py -n 100
Running slowpickle...
INFO:root:Running ../unladen-threaded/python ./performance/bm_pickle.py -n 100 pickle
INFO:root:Running ../unladen-trunk/python ./performance/bm_pickle.py -n 100 pickle
Running slowspitfire...
INFO:root:Running ../unladen-threaded/python ./performance/bm_spitfire.py -n 100 --disable_psyco
INFO:root:Running ../unladen-trunk/python ./performance/bm_spitfire.py -n 100 --disable_psyco
Running slowunpickle...
INFO:root:Running ../unladen-threaded/python ./performance/bm_pickle.py -n 100 unpickle
INFO:root:Running ../unladen-trunk/python ./performance/bm_pickle.py -n 100 unpickle
Running spambayes...
INFO:root:Running ../unladen-threaded/python ./performance/bm_spambayes.py -n 100
INFO:root:Running ../unladen-trunk/python ./performance/bm_spambayes.py -n 100
Running startup_nosite...
INFO:root:Running `['../unladen-threaded/python', '-S', '-c', '']` 4000 times
INFO:root:Running `['../unladen-trunk/python', '-S', '-c', '']` 4000 times

Report on Linux muikyl 2.6.31-17-generic #54-Ubuntu SMP Thu Dec 10 17:01:44 UTC 2009 x86_64
Total CPU cores: 8

### 2to3 ###
Min: 17.120000 -> 17.210000: 1.0053x slower
Avg: 17.604000 -> 17.434000: 1.0098x faster
Not significant
Stddev: 0.42477 -> 0.26633: 1.5949x smaller
Timeline: http://tinyurl.com/y8ozbze

### bzr_startup ###
Min: 0.020000 -> 0.030000: 1.5000x slower
Avg: 0.064150 -> 0.067850: 1.0577x slower
Significant (t=-2.299637, a=0.95)
Stddev: 0.01501 -> 0.01710: 1.1388x larger
Timeline: http://tinyurl.com/ybm5oo7

### django ###
Min: 0.598146 -> 0.592612: 1.0093x faster
Avg: 0.602385 -> 0.598769: 1.0060x faster
Not significant
Stddev: 0.00522 -> 0.02636: 5.0520x larger
Timeline: http://tinyurl.com/yz2egqh

### hg_startup ###
Min: 0.000000 -> 0.000000: incomparable (one result was zero)
Avg: 0.035000 -> 0.035310: 1.0089x slower
Not significant
Stddev: 0.01077 -> 0.01029: 1.0467x smaller
Timeline: http://tinyurl.com/yc9wbgd

### html5lib ###
Min: 9.725041 -> 9.653511: 1.0074x faster
Avg: 10.315137 -> 10.613154: 1.0289x slower
Not significant
Stddev: 0.89449 -> 1.28503: 1.4366x larger
Timeline: http://tinyurl.com/ye8rrop

### nbody ###
Min: 0.173363 -> 0.149776: 1.1575x faster
Avg: 0.182891 -> 0.167447: 1.0922x faster
Significant (t=2.343296, a=0.95)
Stddev: 0.02046 -> 0.06265: 3.0628x larger
Timeline: http://tinyurl.com/ygos3rr

### normal_startup ###
Min: 0.232907 -> 0.284100: 1.2198x slower
Avg: 0.303747 -> 0.320406: 1.0548x slower
Significant (t=-3.899464, a=0.95)
Stddev: 0.03895 -> 0.01754: 2.2207x smaller
Timeline: http://tinyurl.com/yexk3uq

### rietveld ###
Min: 0.376464 -> 0.375565: 1.0024x faster
Avg: 0.451970 -> 0.436437: 1.0356x faster
Not significant
Stddev: 0.13374 -> 0.08827: 1.5152x smaller
Timeline: http://tinyurl.com/yaehud9

### slowpickle ###
Min: 0.372783 -> 0.351706: 1.0599x faster
Avg: 0.384880 -> 0.360988: 1.0662x faster
Significant (t=3.814272, a=0.95)
Stddev: 0.05241 -> 0.03431: 1.5276x smaller
Timeline: http://tinyurl.com/ydqptn9

### slowspitfire ###
Min: 0.341122 -> 0.342288: 1.0034x slower
Avg: 0.342483 -> 0.345408: 1.0085x slower
Significant (t=-4.420360, a=0.95)
Stddev: 0.00398 -> 0.00529: 1.3300x larger
Timeline: http://tinyurl.com/yzgsxo7

### slowunpickle ###
Min: 0.175040 -> 0.172964: 1.0120x faster
Avg: 0.187032 -> 0.185783: 1.0067x faster
Not significant
Stddev: 0.03330 -> 0.03227: 1.0319x smaller
Timeline: http://tinyurl.com/ya373u8

### spambayes ###
Min: 0.182354 -> 0.183657: 1.0071x slower
Avg: 0.225245 -> 0.200242: 1.1249x faster
Not significant
Stddev: 0.18383 -> 0.02427: 7.5754x smaller
Timeline: http://tinyurl.com/yazqcm5

### startup_nosite ###
Min: 0.175908 -> 0.175201: 1.0040x faster
Avg: 0.223720 -> 0.216601: 1.0329x faster
Significant (t=3.568175, a=0.95)
Stddev: 0.02089 -> 0.01896: 1.1017x smaller
Timeline: http://tinyurl.com/ycyjljl

Original comment by reid.kle...@gmail.com on 21 Jan 2010 at 6:58

GoogleCodeExporter commented 8 years ago

Ping?  I'm getting tired of merging this with trunk and rerunning all the tests
and benchmarks.  :-/

Original comment by reid.kle...@gmail.com on 22 Feb 2010 at 4:56
