arvindm95 / unladen-swallow

Automatically exported from code.google.com/p/unladen-swallow

Memory usage increased significantly #68

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
I've seen some cases where 2009Q2 uses 10x as much memory as 2009Q1. We need to track this down and eliminate as much of this bloat as possible. Some increase is necessary since we're emitting machine code, but that doesn't account for a 10x blowup.

Things to look into:
- Memory blow up with ./python -j whenhot
- Memory blow up with ./python -j always
- Memory blow up just from linking in LLVM
- How much memory goes into the machine code vs the LLVM IR? The optimization passes?

perf.py's --track_memory option can help get rough estimates and keep track of macro-level progress.

Original issue reported on code.google.com by collinw on 15 Jul 2009 at 8:44

GoogleCodeExporter commented 8 years ago
The current theory is that this comes from keeping the LLVM IR around after we've compiled the function to native code. Some preliminary measurements indicate that loading Django under -j always consumes over 65 MB in llvm::Functions and their child objects.
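
For illustration, a minimal sketch of what discarding the IR after native code generation could look like with the LLVM C++ API; the header paths and the helper below are assumptions for the sketch, not code from the Unladen Swallow tree:

// Hypothetical helper: JIT-compile a function, then drop its IR body so only
// the generated machine code (plus the bare declaration) stays resident.
#include "llvm/Function.h"
#include "llvm/ExecutionEngine/ExecutionEngine.h"

void *CompileAndDropIR(llvm::ExecutionEngine *engine, llvm::Function *func) {
  // The execution engine caches the emitted machine code internally.
  void *native_code = engine->getPointerToFunction(func);

  // Delete the basic blocks and instructions; the declaration survives so
  // the symbol can still be resolved later.
  func->deleteBody();
  return native_code;
}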

Original comment by collinw on 17 Jul 2009 at 5:17

GoogleCodeExporter commented 8 years ago
How difficult would it be to rig up a C++ heap checker? I'm not familiar with any tools like this, but I think we need to do a couple of things: make sure we have no leaks, and figure out which structures are eating all of our memory. Are there any tools out there for doing that?

Original comment by reid.kle...@gmail.com on 19 Jul 2009 at 4:49

GoogleCodeExporter commented 8 years ago
On OSX, I've used the MallocDebug tool to track down large memory users. Massif (http://valgrind.org/docs/manual/ms-manual.html) looks even better for Linux. I got a recommendation for KCacheGrind a while ago, but it looks more like a speed tool than a memory tool.

Original comment by jyass...@gmail.com on 19 Jul 2009 at 5:36

GoogleCodeExporter commented 8 years ago
Purify (now IBM Rational Purify, http://www-01.ibm.com/software/awdtools/purify/) used to be an excellent tool for finding issues in memory management. It didn't need source code instrumentation; in fact, it didn't need source code at all. It changed the assembly, injecting code around memory access instructions to trap and monitor memory accesses, so it could be used on 3rd-party code as well.

Original comment by qx0mons...@googlemail.com on 20 Jul 2009 at 8:24

GoogleCodeExporter commented 8 years ago

Original comment by collinw on 21 Jul 2009 at 5:50

GoogleCodeExporter commented 8 years ago
The theory about LLVM is correct: r775 makes -j always compile all executed functions, rather than all functions present in the system. This has a significant effect on memory usage. Measuring using `time -l` on Darwin, max RSS for bm_django.py -n 1 with -j always goes from 165MB to 70MB.

Analysis using a prototype _llvm.dump_stats():
# Old -j always (at definition)
LLVM functions: 1833
LLVM basic blocks: 229867
LLVM instructions: 1321540
LLVM data size: 63159048 bytes

# New -j always (at execution)
LLVM functions: 342
LLVM basic blocks: 39870
LLVM instructions: 267272
LLVM data size: 12479360 bytes

There's still plenty of headroom here, and I have a CL in progress to further reduce memory usage. I need to rerun perf.py --track_memory to see the full effect on all benchmarks.
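
As a rough sketch of the kind of walk a dump_stats() helper could perform (the real _llvm.dump_stats() is only a prototype, so the function name and counters below are assumptions):

// Tally functions, basic blocks, and instructions in the module backing the
// JIT. Declaration-only functions are skipped since they carry no IR body.
#include <cstdio>
#include "llvm/Module.h"
#include "llvm/Function.h"
#include "llvm/BasicBlock.h"

void DumpModuleStats(const llvm::Module &module) {
  unsigned functions = 0, blocks = 0, instructions = 0;
  for (llvm::Module::const_iterator func = module.begin();
       func != module.end(); ++func) {
    if (func->isDeclaration())
      continue;
    ++functions;
    for (llvm::Function::const_iterator block = func->begin();
         block != func->end(); ++block) {
      ++blocks;
      instructions += block->size();
    }
  }
  printf("LLVM functions: %u\nLLVM basic blocks: %u\nLLVM instructions: %u\n",
         functions, blocks, instructions);
}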

Original comment by collinw on 24 Jul 2009 at 4:22

GoogleCodeExporter commented 8 years ago
Benchmarking the memory usage of r774 vs r775 with --track_memory on Linux 2.6.18.5:

perf.py command: ./perf.py --track_memory --args "-j always" -b django,slowspitfire,slowpickle,slowunpickle,2to3 ../a/python ../b/python

Note that in the graphs, "../a/python" is the control binary (r774) and "../b/python" is the experiment binary (r775). This confirms a significant reduction in memory usage for -j always.

2to3:
59.246994 -> 57.129315: 3.71% faster
Mem max: 141136.000 -> 85224.000: 65.61% smaller
Usage over time: http://tinyurl.com/nlkf6g

django:
Min: 1.122529 -> 1.075815: 4.34% faster
Avg: 1.128422 -> 1.077450: 4.73% faster
Significant (t=45.483821, a=0.95)
Stddev: 0.00721 -> 0.00330: 118.44% smaller

Mem max: 158816.000 -> 67560.000: 135.07% smaller
Usage over time: http://tinyurl.com/m7urd2

slowpickle:
Min: 0.706008 -> 0.702213: 0.54% faster
Avg: 0.708631 -> 0.703975: 0.66% faster
Significant (t=5.179222, a=0.95)
Stddev: 0.00575 -> 0.00271: 112.51% smaller

Mem max: 96592.000 -> 68360.000: 41.30% smaller
Usage over time: http://tinyurl.com/l4lk7s

slowspitfire:
Min: 0.966542 -> 0.796440: 21.36% faster
Avg: 0.981962 -> 0.802562: 22.35% faster
Significant (t=119.996555, a=0.95)
Stddev: 0.00879 -> 0.00588: 49.42% smaller

Mem max: 212400.000 -> 132880.000: 59.84% smaller
Usage over time: http://tinyurl.com/lncr8a

slowunpickle:
Min: 0.309612 -> 0.308172: 0.47% faster
Avg: 0.310206 -> 0.308482: 0.56% faster
Significant (t=29.988479, a=0.95)
Stddev: 0.00036 -> 0.00018: 98.98% smaller

Mem max: 96592.000 -> 68360.000: 41.30% smaller
Usage over time: http://tinyurl.com/npo6rs

Note too that the performance improvement for slowspitfire doesn't hold when rerun without -j always and with --rigorous.

Original comment by collinw on 24 Jul 2009 at 10:34

GoogleCodeExporter commented 8 years ago
Mailed out a change for review that will throw away the LLVM IR after native code generation. This looks like a promising reduction for -j always.

Original comment by collinw on 4 Aug 2009 at 1:15

GoogleCodeExporter commented 8 years ago
With the new hotness model introduced in r862, memory usage has decreased 
significantly vs the Q2 release with -j always:

$ ./perf.py -r -b slowspitfire --args "-j always," --track_memory ../q2/python ../q3/python
Mem max: 212344.000 -> 96884.000: 119.17% smaller
Usage over time: http://tinyurl.com/yfy3w3p

$ ./perf.py -r -b ai --args "-j always," --track_memory ../q2/python ../q3/python
Mem max: 95012.000 -> 14020.000: 577.69% smaller
Usage over time: http://tinyurl.com/yz7v4xj

$ ./perf.py -r -b slowpickle --args "-j always," --track_memory ../q2/python ../q3/python
Mem max: 96876.000 -> 18996.000: 409.98% smaller
Usage over time: http://tinyurl.com/yf4a3sj

$ ./perf.py -r -b slowunpickle --args "-j always," --track_memory ../q2/python ../q3/python
Mem max: 96876.000 -> 14076.000: 588.24% smaller
Usage over time: http://tinyurl.com/yfzv2mn

$ ./perf.py -r -b django --args "-j always," --track_memory ../q2/python ../q3/python
Mem max: 159064.000 -> 27160.000: 485.66% smaller
Usage over time: http://tinyurl.com/ykdmdml

$ ./perf.py -r -b rietveld --args "-j always," --track_memory ../q2/python ../q3/python
Mem max: 575116.000 -> 55952.000: 927.87% smaller
Usage over time: http://tinyurl.com/yf3rcbb

We still have work to do to get closer to the 2.6 baseline (example):

$ ./perf.py -r -b rietveld --track_memory ../26/python ../q3/python 
Mem max: 14948.000 -> 55952.000: 73.28% larger
Usage over time: http://tinyurl.com/ykcsbcs

That's not great, but it's way better than it used to be. Pushing future work 
on this to Q4.

Original comment by collinw on 12 Oct 2009 at 8:42

GoogleCodeExporter commented 8 years ago
r876 deletes LLVM IR after compiling to native code. This is a ~5% drop in 
memory for most benchmarks, but Rietveld sees a 37% drop:

Before: >>> _llvm.dump_stats()
LLVM functions: 279
LLVM basic blocks: 25079
LLVM instructions: 136274

After: >>> _llvm.dump_stats()
LLVM functions: 279
LLVM basic blocks: 45
LLVM instructions: 209

Mem max: 55936.000 -> 40804.000: 37.08% smaller
Usage over time: http://tinyurl.com/yj4esk2

See http://codereview.appspot.com/99044 for more stats.

Rietveld vs Python 2.6.1 (./perf.py -r -b rietveld --track_memory ../26/python ../obj/python):

Mem max: 14884.000 -> 40040.000: 62.83% larger
Usage over time: http://tinyurl.com/yj2nm3u

That's heading in the right direction.

Original comment by collinw on 26 Oct 2009 at 10:17

GoogleCodeExporter commented 8 years ago

Original comment by collinw on 6 Jan 2010 at 11:44

GoogleCodeExporter commented 8 years ago
Most recent memory usage benchmarking (Q3 release vs trunk):

Biggest improvement:

### slowpickle ###
Mem max: 20048.000 -> 17732.000: 1.1306x smaller
Usage over time: http://tinyurl.com/y8cbek4

### spambayes ###
Mem max: 52200.000 -> 46632.000: 1.1194x smaller
Usage over time: http://tinyurl.com/ybu9jo8

Total:

### 2to3 ###
Mem max: 46776.000 -> 45940.000: 1.0182x smaller
Usage over time: http://tinyurl.com/yavvxtx

### django ###
Mem max: 26324.000 -> 26496.000: 1.0065x larger
Usage over time: http://tinyurl.com/ye7wua8

### html5lib ###
Mem max: 178044.000 -> 173952.000: 1.0235x smaller
Usage over time: http://tinyurl.com/yaw2qs3

### nbody ###
Mem max: 15852.000 -> 17372.000: 1.0959x larger
Usage over time: http://tinyurl.com/ybgurwh

### slowpickle ###
Mem max: 20048.000 -> 17732.000: 1.1306x smaller
Usage over time: http://tinyurl.com/y8cbek4

### slowspitfire ###
Mem max: 96972.000 -> 97652.000: 1.0070x larger
Usage over time: http://tinyurl.com/ydx4q8z

### slowunpickle ###
Mem max: 15144.000 -> 14844.000: 1.0202x smaller
Usage over time: http://tinyurl.com/y8doccw

### spambayes ###
Mem max: 52200.000 -> 46632.000: 1.1194x smaller
Usage over time: http://tinyurl.com/ybu9jo8

Original comment by collinw on 11 Jan 2010 at 7:26

GoogleCodeExporter commented 8 years ago
TODO: finish http://codereview.appspot.com/97120/show, commit it.

Original comment by collinw on 15 Jan 2010 at 4:54

GoogleCodeExporter commented 8 years ago
r1034 lazily initializes the runtime feedback DenseMaps, which turns out to be a nice win on larger codebases:

$ ./perf.py -f -m ../unladen-trunk/python ../unladen-swallow/python

Report on Linux muikyl 2.6.31-17-generic #54-Ubuntu SMP Thu Dec 10 17:01:44 UTC 2009 x86_64
Total CPU cores: 8

### 2to3 ###
Mem max: 35596.000 -> 33584.000: 1.0599x smaller
Usage over time: http://tinyurl.com/ycn2n3u

### django ###
Mem max: 36844.000 -> 31936.000: 1.1537x smaller
Usage over time: http://tinyurl.com/yfl4x5g

### nbody ###
Mem max: 22336.000 -> 21516.000: 1.0381x smaller
Usage over time: http://tinyurl.com/yeghupn

### slowpickle ###
Mem max: 22060.000 -> 20920.000: 1.0545x smaller
Usage over time: http://tinyurl.com/y9cltou

### slowspitfire ###
Mem max: 148188.000 -> 145324.000: 1.0197x smaller
Usage over time: http://tinyurl.com/ygzvjjm

### slowunpickle ###
Mem max: 16048.000 -> 14992.000: 1.0704x smaller
Usage over time: http://tinyurl.com/ybrfyta

### spambayes ###
Mem max: 39160.000 -> 35464.000: 1.1042x smaller
Usage over time: http://tinyurl.com/yeuj4ga
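
For reference, a minimal sketch of the lazy-initialization pattern described above: keep the feedback map unallocated until the first piece of feedback is recorded, so code objects that never gather feedback pay nothing. The type and member names here are hypothetical, not the actual Unladen Swallow ones:

#include "llvm/ADT/DenseMap.h"

struct FeedbackRecord {
  unsigned counter;
  FeedbackRecord() : counter(0) {}
};

class CodeFeedback {
 public:
  CodeFeedback() : map_(NULL) {}
  ~CodeFeedback() { delete map_; }

  void RecordFeedback(unsigned bytecode_offset) {
    // Allocate the DenseMap only once feedback is actually needed.
    if (map_ == NULL)
      map_ = new llvm::DenseMap<unsigned, FeedbackRecord>();
    (*map_)[bytecode_offset].counter++;
  }

 private:
  llvm::DenseMap<unsigned, FeedbackRecord> *map_;
};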

Original comment by collinw on 25 Jan 2010 at 6:42

GoogleCodeExporter commented 8 years ago
r1044 shaved ~350kb off memory usage across the board. Another low-hanging sliver down, more to go.

Original comment by collinw on 27 Jan 2010 at 1:52

GoogleCodeExporter commented 8 years ago
Idea from Joerg: introduce a two-phase hotness model, where we only collect feedback after code objects have passed a certain minimum hotness threshold. This will take some careful tuning, since if the feedback threshold is too low, we may introduce more bail sites and degrade performance.
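
A minimal sketch of how that two-phase split might be expressed; the threshold values and names are made up for illustration, not tuned numbers from the tree:

// Phase selection for a code object based on its hotness counter.
enum ExecutionPhase {
  PHASE_COLD,             // Run the bytecode with no instrumentation.
  PHASE_GATHER_FEEDBACK,  // Record runtime feedback for later compilation.
  PHASE_COMPILE           // Hand the code object to LLVM.
};

static const long kFeedbackThreshold = 50;     // Hypothetical low threshold.
static const long kCompileThreshold = 100000;  // Hypothetical hot threshold.

ExecutionPhase PhaseForHotness(long co_hotness) {
  if (co_hotness < kFeedbackThreshold)
    return PHASE_COLD;
  if (co_hotness < kCompileThreshold)
    return PHASE_GATHER_FEEDBACK;
  return PHASE_COMPILE;
}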

Original comment by collinw on 27 Jan 2010 at 7:34

GoogleCodeExporter commented 8 years ago
I've been experimenting with a two-phase hotness model, where we only gather feedback if co_hotness > 50 (or 100, or some other low number). I've added instrumentation to gather the number of feedback maps created:

$ ./perf.py -b nqueens --diff_instrumentation
Feedback maps created:
../a-inst/python
N: 149

../b-inst/python
N: 49

$ ./perf.py -b django --diff_instrumentation
Feedback maps created:
../a-inst/python
N: 526

../b-inst/python
N: 110

So we're definitely reducing the number of feedback maps created. But it doesn't matter:

$ ./perf.py -b nqueens --track_memory ../a/python ../b/python 
### nqueens ###
Mem max: 14224.000 -> 14100.000: 1.0088x smaller
Usage over time: http://tinyurl.com/ykzofp2

$ ./perf.py -b slowspitfire --track_memory ../a/python ../b/python
### slowspitfire ###
Mem max: 95540.000 -> 94464.000: 1.0114x smaller
Usage over time: http://tinyurl.com/ydzt7t8

$ ./perf.py -b django --track_memory ../a/python ../b/python
### django ###
Mem max: 23752.000 -> 23576.000: 1.0075x smaller
Usage over time: http://tinyurl.com/yeqgk5n

Reducing the amount of feedback data also results in bad branch predictions and other optimization degradations.

$ pprof --text ./python unladen-heap.0006.heap | head
Total: 6.0 MB
    3.0  49.9%  49.9%      3.0  49.9% new_arena (inline)
    1.6  26.3%  76.2%      1.6  26.3% llvm::DenseMap::init (inline)

We use DenseMap for feedback, but we're not the only client: LLVM uses it all over the place. It may be that some optimization passes are keeping around data longer than we need/want them to. I'm not sure our feedback maps are the culprit here.

Original comment by collinw on 4 Feb 2010 at 9:21

GoogleCodeExporter commented 8 years ago
Follow up from pprof about feedback maps:

$ pprof --text ./python /tmp/control-nqueens.hprof.0003.heap | grep FeedbackMap_New
     0.0   0.1%  99.6%      0.0   4.6% PyFeedbackMap_New
$ pprof --text ./python /tmp/control-django.hprof.0002.heap | grep FeedbackMap_New
     0.0   0.1%  99.3%      0.2   4.9% PyFeedbackMap_New
$ pprof --text ./python /tmp/control-2to3.hprof.0003.heap | grep FeedbackMap_New
     0.0   0.0%  99.7%      0.2   2.4% PyFeedbackMap_New

At 200k max, I don't think that's the problem. I'm mining pprof/tcmalloc data for more leads.

Be sure to run Unladen with -g; that changes the profile.

Original comment by collinw on 5 Feb 2010 at 2:09

GoogleCodeExporter commented 8 years ago

Original comment by collinw on 21 May 2010 at 6:58