FDO: Python/Python inlining

GoogleCodeExporter commented 8 years ago

The current (as of r862) hotness heuristic is flagging lots of trivial 
functions as hot. These should be inlined into their callers.

Data from slowpickle:

python performance/bm_pickle.py -n 10 --profile pickle
================================================================================

         17576425 function calls (16332114 primitive calls) in 17.017 CPU seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
1248312/12003    4.575    0.000   16.801    0.001 pickle.py:269(save)
   588147    2.020    0.000    3.266    0.000 pickle.py:227(memoize)
   336084    1.556    0.000    2.115    0.000 pickle.py:1260(encode_long)
   528132    1.492    0.000    4.853    0.000 pickle.py:480(save_string)
    16004    1.175    0.000   14.736    0.001 pickle.py:655(_batch_setitems)
   336084    0.901    0.000    3.296    0.000 pickle.py:461(save_long)
  1924481    0.604    0.000    0.604    0.000 {method 'write' of 'cStringIO.StringO' objects}
  2224556    0.582    0.000    0.582    0.000 {method 'get' of 'dict' objects}
  1788447    0.573    0.000    0.573    0.000 {chr}
   588147    0.567    0.000    0.747    0.000 pickle.py:250(put)
  2436609    0.399    0.000    0.399    0.000 {id}
  1952488    0.376    0.000    0.376    0.000 {len}
16004/12003    0.344    0.000   12.386    0.001 pickle.py:608(_batch_appends)
   288072    0.279    0.000    0.374    0.000 pickle.py:260(get)
  1248312    0.250    0.000    0.250    0.000 pickle.py:333(persistent_id)

Heuristic results: 17 code objects deemed hot
        pickle.py:333 (persistent_id)   12483120
        pickle.py:269 (save)            12483120
        pickle.py:227 (memoize)          5881470
        pickle.py:250 (put)              5881470
        pickle.py:480 (save_string)      5281320
        pickle.py:461 (save_long)        3360840
        pickle.py:1260 (encode_long)     3360840
        pickle.py:260 (get)              2880720
        pickle.py:437 (save_int)          360090
        pickle.py:655 (_batch_setitems)   247073
        pickle.py:608 (_batch_appends)    226143
        pickle.py:591 (save_list)         160040
        pickle.py:640 (save_dict)         160040
        pickle.py:532 (save_tuple)        131794
        pickle.py:220 (dump)              120030
        pickle.py:173 (__init__)          120030
        pickle.py:1364 (dumps)            120030

The heuristic weights persistent_id() heavily because of the number of function 
calls, even though all those calls only account for .25 seconds out of 17 
seconds of 
total running time; same thing for get() and put().

Django has a similar problem:

python performance/bm_django.py -n 10 --profile
================================================================================

         17628021 function calls (17624409 primitive calls) in 23.681 CPU seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1895436    2.961    0.000    4.072    0.000 encoding.py:37(force_unicode)
  7853604    2.897    0.000    2.897    0.000 {isinstance}
  1812/12    2.476    0.001   23.652    1.971 defaulttags.py:115(render)
   271812    1.652    0.000    9.782    0.000 __init__.py:533(resolve)
   270000    1.638    0.000    3.032    0.000 defaultfilters.py:375(escape)
   270000    1.266    0.000   17.483    0.000 __init__.py:800(render)
   270000    1.258    0.000    5.929    0.000 defaultfilters.py:34(_dec)
  1824/12    1.254    0.001   23.678    1.973 __init__.py:764(render)
   270000    1.163    0.000    5.350    0.000 functional.py:246(wrapper)
  1350000    1.104    0.000    1.104    0.000 {method 'replace' of 'unicode' objects}
   270000    1.078    0.000    3.943    0.000 html.py:30(escape)
   271824    0.885    0.000    1.334    0.000 safestring.py:89(mark_safe)
   270000    0.882    0.000    1.394    0.000 safestring.py:104(mark_for_escaping)
   540000    0.670    0.000    0.670    0.000 {getattr}
   271812    0.552    0.000    0.904    0.000 __init__.py:690(_resolve_lookup)
   271812    0.396    0.000    1.300    0.000 __init__.py:672(resolve)
   273612    0.353    0.000    0.353    0.000 context.py:38(__getitem__)
  1630846    0.322    0.000    0.322    0.000 {method 'append' of 'list' objects}
   271812    0.264    0.000    0.264    0.000 {hasattr}
   543624    0.253    0.000    0.253    0.000 __init__.py:790(render)
   273612    0.182    0.000    0.182    0.000 context.py:34(__setitem__)

Heuristic results: 16 code objects deemed hot
        django/utils/encoding.py:37 (force_unicode)         18954370
        django/template/__init__.py:790 (render)             5436240
        django/template/context.py:38 (__getitem__)          2745266
        django/template/context.py:34 (__setitem__)          2736120
        django/template/__init__.py:690 (_resolve_lookup)    2727211
        django/template/__init__.py:533 (resolve)            2727154
        django/utils/safestring.py:89 (mark_safe)            2718240
        django/template/__init__.py:672 (resolve)            2718120
        django/utils/functional.py:246 (wrapper)             2709091
        django/template/__init__.py:800 (render)             2700000
        django/utils/safestring.py:104 (mark_for_escaping)   2700000
        django/utils/html.py:30 (escape)                     2700000
        django/template/defaultfilters.py:34 (_dec)          2700000
        django/template/defaultfilters.py:375 (escape)       2700000
        django/template/__init__.py:764 (render)              117246
        django/template/defaulttags.py:115 (render)           117120

The __init__.py:790 (render) function should not optimized independently, nor 
should context.py's __getitem__() and __setitem__() calls.

The proper way to optimize these leaf functions is to inline them into their 
parents. We need a way to create frames lazily to achieve maximum benefit, 
possibly as 
part of PyTraceback_Here() (or colocated machinery). As long as tracebacks 
contain all the right frames, users of sys._getframe() get what they deserve if 
they're 
lookup up the call stack and making assertions about it.

Original issue reported on code.google.com by collinw on 9 Oct 2009 at 10:03

Blocking: #105
Blocked on: #41

GoogleCodeExporter commented 8 years ago

Original comment by collinw on 6 Jan 2010 at 11:44

Removed labels: Release-2009Q4

GoogleCodeExporter commented 8 years ago

The html5lib benchmark also has a lot of opportunities for inlining.

Original comment by collinw on 12 Jan 2010 at 2:19

GoogleCodeExporter commented 8 years ago

Joerg Blank has started on this.

Original comment by collinw on 19 Jan 2010 at 10:43

Changed state: Started

GoogleCodeExporter commented 8 years ago

This is now blocked by Issue 41.

Running html5lib with inlineing enabled generates some callsites which become
polymorphic and cause a lot of bails.

Original comment by joerg...@gmail.com on 28 Jan 2010 at 11:01

GoogleCodeExporter commented 8 years ago

Can you elaborate on this? How does the inlining make callsites *more* 
polymorphic? I assume 
you're talking about callsites in the callee (function being inlined)? What 
kind of callsites? Which 
optimization is causing the bails?

Original comment by collinw on 28 Jan 2010 at 6:42

GoogleCodeExporter commented 8 years ago

Please excuse my bad wording.

A callsite is regarded as monomorphic and the called function gets subsequently
inlined. Later the same site becomes polymorphic and causes 100k bails in a 
single
round (Up from about 5k from cold branches).

Original comment by joerg...@gmail.com on 28 Jan 2010 at 9:41

arvindm95 / unladen-swallow

FDO: Python/Python inlining #86