The current theory is that this comes from keeping the LLVM IR around after we've compiled the function to native code. Some preliminary measurements indicate that loading Django under -j always consumes over 65 MB in llvm::Functions and their child objects.
Original comment by collinw on 17 Jul 2009 at 5:17
How difficult would it be to rig up a C++ heap checker? I'm not familiar with any tools like this, but I think we need to do a couple of things: make sure we have no leaks, and figure out which structures are eating all of our memory. Are there any tools out there for doing that?
Original comment by reid.kle...@gmail.com on 19 Jul 2009 at 4:49
On OSX, I've used the MallocDebug tool to track down large memory users. Massif (http://valgrind.org/docs/manual/ms-manual.html) looks even better for Linux. I got a recommendation for KCacheGrind a while ago, but it looks more like a speed tool than a memory tool.
Original comment by jyass...@gmail.com on 19 Jul 2009 at 5:36
Purify (now IBM Rational Purify, http://www-01.ibm.com/software/awdtools/purify/) used to be an excellent tool for finding issues in memory management. It didn't need source code instrumentation; in fact, it didn't need source code at all. It changed the assembly, injecting code around memory access instructions to trap and monitor memory accesses, so it could be used on third-party code as well.
Original comment by qx0mons...@googlemail.com on 20 Jul 2009 at 8:24
Original comment by collinw on 21 Jul 2009 at 5:50
The theory about LLVM is correct: r775 makes -j always compile all executed functions, rather than all functions present in the system. This has a significant effect on memory usage. Measuring using `time -l` on Darwin, max RSS for bm_django.py -n 1 with -j always goes from 165MB to 70MB.
Analysis using a prototype _llvm.dump_stats():
# Old -j always (at definition)
LLVM functions: 1833
LLVM basic blocks: 229867
LLVM instructions: 1321540
LLVM data size: 63159048 bytes
# New -j always (at execution)
LLVM functions: 342
LLVM basic blocks: 39870
LLVM instructions: 267272
LLVM data size: 12479360 bytes
There's still plenty of headroom here, and I have a CL in progress to further reduce memory usage. I need to rerun perf.py --track_memory to see the full effect on all benchmarks.
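For context, the compile-at-execution idea can be sketched roughly as follows (the types and names here are illustrative stand-ins, not Unladen Swallow's actual internals):

```cpp
// Sketch only: under -j always, JIT-compile a code object the first time it is
// executed rather than when it is defined, so functions that are never run
// never have LLVM IR built for them.
struct CodeObject {
  void *native_code;               // NULL until this code object is compiled
};

void *JitCompile(CodeObject *co);  // builds IR and emits native code (not shown)

// Hypothetical hook called from the eval loop when a code object is entered.
void *GetOrCompileNativeCode(CodeObject *co) {
  if (co->native_code == NULL)
    co->native_code = JitCompile(co);
  return co->native_code;
}
```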
Original comment by collinw on 24 Jul 2009 at 4:22
Benchmarking the memory usage of r774 vs r775 with --track_memory on Linux 2.6.18.5:
perf.py command: ./perf.py --track_memory --args "-j always" -b django,slowspitfire,slowpickle,slowunpickle,2to3 ../a/python ../b/python
Note that in the graphs, "../a/python" is the control binary (r774) and "../b/python" is the experiment binary (r775). This confirms a significant reduction in memory usage for -j always.
2to3:
59.246994 -> 57.129315: 3.71% faster
Mem max: 141136.000 -> 85224.000: 65.61% smaller
Usage over time: http://tinyurl.com/nlkf6g
django:
Min: 1.122529 -> 1.075815: 4.34% faster
Avg: 1.128422 -> 1.077450: 4.73% faster
Significant (t=45.483821, a=0.95)
Stddev: 0.00721 -> 0.00330: 118.44% smaller
Mem max: 158816.000 -> 67560.000: 135.07% smaller
Usage over time: http://tinyurl.com/m7urd2
slowpickle:
Min: 0.706008 -> 0.702213: 0.54% faster
Avg: 0.708631 -> 0.703975: 0.66% faster
Significant (t=5.179222, a=0.95)
Stddev: 0.00575 -> 0.00271: 112.51% smaller
Mem max: 96592.000 -> 68360.000: 41.30% smaller
Usage over time: http://tinyurl.com/l4lk7s
slowspitfire:
Min: 0.966542 -> 0.796440: 21.36% faster
Avg: 0.981962 -> 0.802562: 22.35% faster
Significant (t=119.996555, a=0.95)
Stddev: 0.00879 -> 0.00588: 49.42% smaller
Mem max: 212400.000 -> 132880.000: 59.84% smaller
Usage over time: http://tinyurl.com/lncr8a
slowunpickle:
Min: 0.309612 -> 0.308172: 0.47% faster
Avg: 0.310206 -> 0.308482: 0.56% faster
Significant (t=29.988479, a=0.95)
Stddev: 0.00036 -> 0.00018: 98.98% smaller
Mem max: 96592.000 -> 68360.000: 41.30% smaller
Usage over time: http://tinyurl.com/npo6rs
Note too that the performance improvement for slowspitfire doesn't hold when rerun without -j always and with --rigorous.
Original comment by collinw on 24 Jul 2009 at 10:34
Mailed out a change for review that will throw away the LLVM IR after native code generation. This looks like a promising reduction for -j always.
Original comment by collinw on 4 Aug 2009 at 1:15
With the new hotness model introduced in r862, memory usage has decreased significantly vs the Q2 release with -j always:
$ ./perf.py -r -b slowspitfire --args "-j always," --track_memory ../q2/python ../q3/python
Mem max: 212344.000 -> 96884.000: 119.17% smaller
Usage over time: http://tinyurl.com/yfy3w3p
$ ./perf.py -r -b ai --args "-j always," --track_memory ../q2/python ../q3/python
Mem max: 95012.000 -> 14020.000: 577.69% smaller
Usage over time: http://tinyurl.com/yz7v4xj
$ ./perf.py -r -b slowpickle --args "-j always," --track_memory ../q2/python ../q3/python
Mem max: 96876.000 -> 18996.000: 409.98% smaller
Usage over time: http://tinyurl.com/yf4a3sj
$ ./perf.py -r -b slowunpickle --args "-j always," --track_memory ../q2/python ../q3/python
Mem max: 96876.000 -> 14076.000: 588.24% smaller
Usage over time: http://tinyurl.com/yfzv2mn
$ ./perf.py -r -b django --args "-j always," --track_memory ../q2/python ../q3/python
Mem max: 159064.000 -> 27160.000: 485.66% smaller
Usage over time: http://tinyurl.com/ykdmdml
$ ./perf.py -r -b rietveld --args "-j always," --track_memory ../q2/python ../q3/python
Mem max: 575116.000 -> 55952.000: 927.87% smaller
Usage over time: http://tinyurl.com/yf3rcbb
We still have work to do to get closer to the 2.6 baseline (example):
$ ./perf.py -r -b rietveld --track_memory ../26/python ../q3/python
Mem max: 14948.000 -> 55952.000: 73.28% larger
Usage over time: http://tinyurl.com/ykcsbcs
That's not great, but it's way better than it used to be. Pushing future work on this to Q4.
Original comment by collinw on 12 Oct 2009 at 8:42
r876 deletes LLVM IR after compiling to native code. This is a ~5% drop in memory for most benchmarks, but Rietveld sees a 37% drop:
Before: >>> _llvm.dump_stats()
LLVM functions: 279
LLVM basic blocks: 25079
LLVM instructions: 136274
After: >>> _llvm.dump_stats()
LLVM functions: 279
LLVM basic blocks: 45
LLVM instructions: 209
Mem max: 55936.000 -> 40804.000: 37.08% smaller
Usage over time: http://tinyurl.com/yj4esk2
See http://codereview.appspot.com/99044 for more stats.
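The essence of the change can be sketched like this (a simplified illustration, not the actual r876 patch; the helper name is invented, and the header paths reflect the LLVM 2.x-era layout):

```cpp
#include "llvm/Function.h"                        // llvm::Function
#include "llvm/ExecutionEngine/ExecutionEngine.h" // llvm::ExecutionEngine

// Sketch: once native code has been emitted, the IR body (basic blocks and
// instructions) is no longer needed and can be freed; only the function
// declaration sticks around.
void *CompileAndDropIR(llvm::ExecutionEngine *engine, llvm::Function *fn) {
  void *native_code = engine->getPointerToFunction(fn);  // force codegen
  fn->deleteBody();  // drop the IR, keeping the function as a declaration
  return native_code;
}
```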
Rietveld vs Python 2.6.1 (./perf.py -r -b rietveld --track_memory ../26/python ../obj/python):
Mem max: 14884.000 -> 40040.000: 62.83% larger
Usage over time: http://tinyurl.com/yj2nm3u
That's heading in the right direction.
Original comment by collinw on 26 Oct 2009 at 10:17
Original comment by collinw on 6 Jan 2010 at 11:44
Most recent memory usage benchmarking (Q3 release vs trunk):
Biggest improvement:
### slowpickle ###
Mem max: 20048.000 -> 17732.000: 1.1306x smaller
Usage over time: http://tinyurl.com/y8cbek4
### spambayes ###
Mem max: 52200.000 -> 46632.000: 1.1194x smaller
Usage over time: http://tinyurl.com/ybu9jo8
Full results:
### 2to3 ###
Mem max: 46776.000 -> 45940.000: 1.0182x smaller
Usage over time: http://tinyurl.com/yavvxtx
### django ###
Mem max: 26324.000 -> 26496.000: 1.0065x larger
Usage over time: http://tinyurl.com/ye7wua8
### html5lib ###
Mem max: 178044.000 -> 173952.000: 1.0235x smaller
Usage over time: http://tinyurl.com/yaw2qs3
### nbody ###
Mem max: 15852.000 -> 17372.000: 1.0959x larger
Usage over time: http://tinyurl.com/ybgurwh
### slowpickle ###
Mem max: 20048.000 -> 17732.000: 1.1306x smaller
Usage over time: http://tinyurl.com/y8cbek4
### slowspitfire ###
Mem max: 96972.000 -> 97652.000: 1.0070x larger
Usage over time: http://tinyurl.com/ydx4q8z
### slowunpickle ###
Mem max: 15144.000 -> 14844.000: 1.0202x smaller
Usage over time: http://tinyurl.com/y8doccw
### spambayes ###
Mem max: 52200.000 -> 46632.000: 1.1194x smaller
Usage over time: http://tinyurl.com/ybu9jo8
Original comment by collinw on 11 Jan 2010 at 7:26
TODO: finish http://codereview.appspot.com/97120/show and commit it.
Original comment by collinw on 15 Jan 2010 at 4:54
r1034 lazily initializes the runtime feedback DenseMaps, which turns out to be a nice win on larger codebases:
$ ./perf.py -f -m ../unladen-trunk/python ../unladen-swallow/python
Report on Linux muikyl 2.6.31-17-generic #54-Ubuntu SMP Thu Dec 10 17:01:44 UTC 2009 x86_64
Total CPU cores: 8
### 2to3 ###
Mem max: 35596.000 -> 33584.000: 1.0599x smaller
Usage over time: http://tinyurl.com/ycn2n3u
### django ###
Mem max: 36844.000 -> 31936.000: 1.1537x smaller
Usage over time: http://tinyurl.com/yfl4x5g
### nbody ###
Mem max: 22336.000 -> 21516.000: 1.0381x smaller
Usage over time: http://tinyurl.com/yeghupn
### slowpickle ###
Mem max: 22060.000 -> 20920.000: 1.0545x smaller
Usage over time: http://tinyurl.com/y9cltou
### slowspitfire ###
Mem max: 148188.000 -> 145324.000: 1.0197x smaller
Usage over time: http://tinyurl.com/ygzvjjm
### slowunpickle ###
Mem max: 16048.000 -> 14992.000: 1.0704x smaller
Usage over time: http://tinyurl.com/ybrfyta
### spambayes ###
Mem max: 39160.000 -> 35464.000: 1.1042x smaller
Usage over time: http://tinyurl.com/yeuj4ga
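The lazy-initialization pattern behind r1034 looks roughly like this (an illustrative sketch; the struct and class names are stand-ins for the real feedback code):

```cpp
#include "llvm/ADT/DenseMap.h"

struct FeedbackRecord {
  FeedbackRecord() : count(0) {}
  int count;
};

// Sketch: rather than allocating a DenseMap for every code object up front,
// allocate it the first time feedback is actually recorded, so code that is
// never instrumented never pays for the map.
class RuntimeFeedback {
 public:
  RuntimeFeedback() : map_(NULL) {}
  ~RuntimeFeedback() { delete map_; }

  void Record(unsigned opcode_index) {
    if (map_ == NULL)
      map_ = new llvm::DenseMap<unsigned, FeedbackRecord>();
    (*map_)[opcode_index].count++;
  }

 private:
  llvm::DenseMap<unsigned, FeedbackRecord> *map_;  // NULL until first use
};
```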
Original comment by collinw on 25 Jan 2010 at 6:42
r1044 shaved ~350kb off memory usage across the board. Another low-hanging sliver down, more to go.
Original comment by collinw on 27 Jan 2010 at 1:52
Idea from Joerg: introduce a two-phase hotness model, where we only collect feedback after code objects have passed a certain minimum hotness threshold. This will take some careful tuning, since if the feedback threshold is too low, we may introduce more bail sites and degrade performance.
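A sketch of what that check might look like (the names and threshold are illustrative, not a landed change):

```cpp
// Sketch of the two-phase model: below the threshold we only bump the hotness
// counter; once a code object crosses it, we start gathering runtime feedback.
static const long kFeedbackThreshold = 50;  // or 100; needs tuning, per above

struct CodeObject {
  long co_hotness;  // incremented on calls and backward branches
};

void RecordTypeFeedback(CodeObject *co, unsigned opcode_index);  // phase-two work

void MaybeRecordFeedback(CodeObject *co, unsigned opcode_index) {
  if (co->co_hotness <= kFeedbackThreshold)
    return;                                // phase one: just warm up
  RecordTypeFeedback(co, opcode_index);    // phase two: collect feedback
}
```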
Original comment by collinw on 27 Jan 2010 at 7:34
I've been experimenting with a two-phase hotness model, where we only gather feedback if co_hotness > 50 (or 100, or some other low number). I've added instrumentation to gather the number of feedback maps created:
$ ./perf.py -b nqueens --diff_instrumentation
Feedback maps created:
../a-inst/python
N: 149
../b-inst/python
N: 49
$ ./perf.py -b django --diff_instrumentation
Feedback maps created:
../a-inst/python
N: 526
../b-inst/python
N: 110
So we're definitely reducing the number of feedback maps created. But it barely moves peak memory:
$ ./perf.py -b nqueens --track_memory ../a/python ../b/python
### nqueens ###
Mem max: 14224.000 -> 14100.000: 1.0088x smaller
Usage over time: http://tinyurl.com/ykzofp2
$ ./perf.py -b slowspitfire --track_memory ../a/python ../b/python
### slowspitfire ###
Mem max: 95540.000 -> 94464.000: 1.0114x smaller
Usage over time: http://tinyurl.com/ydzt7t8
$ ./perf.py -b django --track_memory ../a/python ../b/python
### django ###
Mem max: 23752.000 -> 23576.000: 1.0075x smaller
Usage over time: http://tinyurl.com/yeqgk5n
Reducing the amount of feedback data also results in bad branch predictions and other optimization degradations.
$ pprof --text ./python unladen-heap.0006.heap | head
Total: 6.0 MB
3.0 49.9% 49.9% 3.0 49.9% new_arena (inline)
1.6 26.3% 76.2% 1.6 26.3% llvm::DenseMap::init (inline)
We use DenseMap for feedback, but we're not the only client: LLVM uses it all over the place. It may be that some optimization passes are keeping data around longer than we need or want. I'm not sure our feedback maps are the culprit here.
Original comment by collinw on 4 Feb 2010 at 9:21
Follow-up from pprof about feedback maps:
$ pprof --text ./python /tmp/control-nqueens.hprof.0003.heap | grep FeedbackMap_New
0.0 0.1% 99.6% 0.0 4.6% PyFeedbackMap_New
$ pprof --text ./python /tmp/control-django.hprof.0002.heap | grep FeedbackMap_New
0.0 0.1% 99.3% 0.2 4.9% PyFeedbackMap_New
$ pprof --text ./python /tmp/control-2to3.hprof.0003.heap | grep FeedbackMap_New
0.0 0.0% 99.7% 0.2 2.4% PyFeedbackMap_New
At 200k max, I don't think that's the problem. I'm mining pprof/tcmalloc data for more leads.
Be sure to run Unladen with -g; that changes the profile.
Original comment by collinw on 5 Feb 2010 at 2:09
Original comment by collinw on 21 May 2010 at 6:58
Original issue reported on code.google.com by collinw on 15 Jul 2009 at 8:44