JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License
45.72k stars 5.48k forks source link

Compilation time regression on nightly due to #41099 #50082

Closed benlorenz closed 1 year ago

benlorenz commented 1 year ago

I have been trying to debug a regression in the compilation time for many tests of Polymake.jl, it started with this PkgEval report https://s3.amazonaws.com/julialang-reports/nanosoldier/pkgeval/by_date/2023-04/27/report.html where the tests jumped from about 13 minutes to a timeout (>45min). The tests are mostly checking compatibility of our own vector and matrix types with the corresponding julia types (for different element types). These types are mostly defined via CxxWrap and a corresponding libcxxwrap module. In our own CI we see similar effects with nightly taking more than twice as long for the tests as 1.8 or 1.9.

I added a @time to several of the affected test groups and the output looks like this:

 22.716351 seconds (3.42 M allocations: 228.784 MiB, 0.16% gc time, 99.98% compilation time)

So this is indeed compilation time and not related to the actual computations. That timing was for this code block.

I bisected the regression to #41099 by @BioTurboNick but I don't really know how to debug this further. In these timings below I noticed most of the extra time seems to be spent in AST_COMPRESS, which went from 4% to 51%. CODEGEN_LLVM also went up a bit. But I don't really know what to make of this now, especially why or how our code is triggering this in combination with that PR.

Timings

I ran some tests on a nightly build with WITH_TIMING_COUNTS=1. I added some extra verbosity to the test sets to see more timings and disabled several test sets which do not exhibit this problem.

This is for the current master (Commit f407a4cac3):

JULIA TIMINGS
Event                    , Self Cycles (% of Total)      , Total Cycles (% of Total)     
ADD_METHOD               ,           3514143189 ( 0.08 %),           3665570689 ( 0.08 %)
AST_COMPRESS             ,        2348395690639 (51.58 %),        2348395690639 (51.58 %)
AST_UNCOMPRESS           ,          15147700000 ( 0.33 %),          15378217751 ( 0.34 %)
CODEGEN_Codeinst         ,             98448269 ( 0.00 %),         265528616248 ( 5.83 %)
CODEGEN_LLVM             ,         264305217253 ( 5.81 %),         264865339231 ( 5.82 %)
CODEGEN_Workqueue        ,            139461127 ( 0.00 %),          12221039711 ( 0.27 %)
CODEINST_COMPILE         ,           1192903873 ( 0.03 %),        2043389338089 (44.88 %)
DL_OPEN                  ,             78957620 ( 0.00 %),             78957620 ( 0.00 %)
GC                       ,             47382498 ( 0.00 %),          15914524530 ( 0.35 %)
GC_Finalizers            ,            265416755 ( 0.01 %),            299007204 ( 0.01 %)
GC_FullSweep             ,            105122446 ( 0.00 %),            105122446 ( 0.00 %)
GC_IncrementalSweep      ,           7270270064 ( 0.16 %),           7270270064 ( 0.16 %)
GC_Mark                  ,           8192644545 ( 0.18 %),           8192644545 ( 0.18 %)
GC_Stop                  ,                97773 ( 0.00 %),                97773 ( 0.00 %)
INFERENCE                ,          86903454075 ( 1.91 %),        2482940830167 (54.54 %)
INIT_MODULE              ,           8107964312 ( 0.18 %),          19631247198 ( 0.43 %)
JULIA_INIT               ,             65782478 ( 0.00 %),            237956976 ( 0.01 %)
LLVM_OPT                 ,        1059662566154 (23.28 %),        1059662566154 (23.28 %)
LLVM_ORC                 ,         716882046556 (15.75 %),        1776544612710 (39.02 %)
LOAD_Pkgimg              ,            860054924 ( 0.02 %),           7402571632 ( 0.16 %)
LOAD_Processor           ,              1568929 ( 0.00 %),              1568929 ( 0.00 %)
LOAD_Require             ,            266486681 ( 0.01 %),          28020265123 ( 0.62 %)
LOAD_Sysimg              ,            140595865 ( 0.00 %),            140595865 ( 0.00 %)
LOWERING                 ,           4938725087 ( 0.11 %),           6081574799 ( 0.13 %)
MACRO_INVOCATION         ,             20986884 ( 0.00 %),            876916164 ( 0.02 %)
METHOD_LOOKUP_FAST       ,           3206808468 ( 0.07 %),           3986616164 ( 0.09 %)
METHOD_LOOKUP_SLOW       ,             91857246 ( 0.00 %),            875939345 ( 0.02 %)
METHOD_MATCH             ,          15893573564 ( 0.35 %),          16805438247 ( 0.37 %)
PARSING                  ,            691756980 ( 0.02 %),            691756980 ( 0.02 %)
ROOT                     ,           2482796762 ( 0.05 %),        4552689821858 (100.00 %)
STAGED_FUNCTION          ,              1190547 ( 0.00 %),            300850387 ( 0.01 %)
TYPE_CACHE_INSERT        ,            601778215 ( 0.01 %),            601778215 ( 0.01 %)
TYPE_CACHE_LOOKUP        ,           2759831421 ( 0.06 %),           2759831421 ( 0.06 %)
VERIFY_Edges             ,            328049638 ( 0.01 %),           2308738847 ( 0.05 %)
VERIFY_Graph             ,             20569865 ( 0.00 %),             20569865 ( 0.00 %)
VERIFY_Methods           ,              7921156 ( 0.00 %),              7921156 ( 0.00 %)

JULIA COUNTERS
Counter                  , Value               
Invalidations            ,                 2333
HeapSize                 ,            744737644
JITSize                  ,             33254347
JITCodeSize              ,             31432685
JITDataSize              ,              1821662
ImageSize                ,            395643680

And from another build where I reverted #41099 (and three other commits which I needed to revert to avoid conflicts, see at the end of this issue [1]):

JULIA TIMINGS
Event                    , Self Cycles (% of Total)      , Total Cycles (% of Total)     
ADD_METHOD               ,           3672342454 ( 0.17 %),           3875313788 ( 0.18 %)
AST_COMPRESS             ,          84232107470 ( 3.93 %),          84232107470 ( 3.93 %)
AST_UNCOMPRESS           ,          13816531605 ( 0.64 %),          14016252819 ( 0.65 %)
CODEGEN_Codeinst         ,             89608396 ( 0.00 %),          67777565810 ( 3.16 %)
CODEGEN_LLVM             ,          66328775206 ( 3.10 %),          67204589383 ( 3.14 %)
CODEGEN_Workqueue        ,            131226434 ( 0.01 %),           7151191775 ( 0.33 %)
CODEINST_COMPILE         ,           1128842086 ( 0.05 %),        1899161320883 (88.64 %)
DL_OPEN                  ,             82986960 ( 0.00 %),             82986960 ( 0.00 %)
GC                       ,             48344120 ( 0.00 %),          15536714911 ( 0.73 %)
GC_Finalizers            ,            239514645 ( 0.01 %),            239537348 ( 0.01 %)
GC_FullSweep             ,            110201791 ( 0.01 %),            110201791 ( 0.01 %)
GC_IncrementalSweep      ,           7615966878 ( 0.36 %),           7615966878 ( 0.36 %)
GC_Mark                  ,           7522571182 ( 0.35 %),           7522571182 ( 0.35 %)
GC_Stop                  ,                93592 ( 0.00 %),                93592 ( 0.00 %)
INFERENCE                ,          84077627794 ( 3.92 %),         215491286569 (10.06 %)
INIT_MODULE              ,           8636965446 ( 0.40 %),          20269670034 ( 0.95 %)
JULIA_INIT               ,             64656364 ( 0.00 %),            228045797 ( 0.01 %)
LLVM_OPT                 ,        1091348663673 (50.93 %),        1091348663673 (50.93 %)
LLVM_ORC                 ,         738791901545 (34.48 %),        1830140565218 (85.42 %)
LOAD_Pkgimg              ,            865851913 ( 0.04 %),           7686754456 ( 0.36 %)
LOAD_Processor           ,              1495812 ( 0.00 %),              1495812 ( 0.00 %)
LOAD_Require             ,            234657082 ( 0.01 %),          28759724671 ( 1.34 %)
LOAD_Sysimg              ,            131844838 ( 0.01 %),            131844838 ( 0.01 %)
LOWERING                 ,           5515049860 ( 0.26 %),           6406164444 ( 0.30 %)
MACRO_INVOCATION         ,             22419703 ( 0.00 %),            891114584 ( 0.04 %)
METHOD_LOOKUP_FAST       ,           3087070252 ( 0.14 %),           3905972971 ( 0.18 %)
METHOD_LOOKUP_SLOW       ,             88961303 ( 0.00 %),            872603509 ( 0.04 %)
METHOD_MATCH             ,          17020788685 ( 0.79 %),          17815537140 ( 0.83 %)
PARSING                  ,            757186245 ( 0.04 %),            757186245 ( 0.04 %)
ROOT                     ,           3265830601 ( 0.15 %),        2142638130876 (100.00 %)
STAGED_FUNCTION          ,              1308655 ( 0.00 %),            309384130 ( 0.01 %)
TYPE_CACHE_INSERT        ,            619835544 ( 0.03 %),            619835544 ( 0.03 %)
TYPE_CACHE_LOOKUP        ,           2722668821 ( 0.13 %),           2722668821 ( 0.13 %)
VERIFY_Edges             ,            331854785 ( 0.02 %),           2309098244 ( 0.11 %)
VERIFY_Graph             ,             24758419 ( 0.00 %),             24758419 ( 0.00 %)
VERIFY_Methods           ,              7620717 ( 0.00 %),              7620717 ( 0.00 %)

JULIA COUNTERS
Counter                  , Value               
Invalidations            ,                 2338
HeapSize                 ,            703725988
JITSize                  ,             33254191
JITCodeSize              ,             31432569
JITDataSize              ,              1821622
ImageSize                ,            375575800

The corresponding test timings for one test group which illustrates the problem quite well. This is for master:

  Polymake.SparseMatrix                                                         | 1757   1757  12m28.0s
    Constructors/Converts                                                       |  592    592     12.2s
    Low-level operations                                                        |  340    340     16.5s
      Polymake.SparseMatrix{Int64}                                              |   39     39      0.3s
      Polymake.SparseMatrix{Polymake.Integer}                                   |   39     39      0.2s
      Polymake.SparseMatrix{Polymake.Rational}                                  |   39     39      0.4s
      Polymake.SparseMatrix{Float64}                                            |   89     89      0.3s
      Polymake.SparseMatrix{Polymake.QuadraticExtension{Polymake.Rational}}     |   39     39      0.3s
      Polymake.SparseMatrix{Polymake.OscarNumber}                               |   39     39      0.2s
      Equality                                                                  |   56     56     14.8s
    Arithmetic                                                                  |  786    786  11m58.5s
      Arithmetic1                                                               |  310    310   3m04.8s
      Arithmetic2                                                               |  260    260   2m45.7s
      Arithmetic3                                                               |   39     39     25.3s
      Arithmetic4                                                               |  140    140   5m09.4s
        Arithmetic4 Int32                                                       |   10     10     17.2s
        Arithmetic4 Int64                                                       |   10     10     15.3s
        Arithmetic4 UInt64                                                      |   10     10     19.8s
        Arithmetic4 BigInt                                                      |   10     10     21.1s
        Arithmetic4 Polymake.Integer                                            |   10     10     17.0s
        Arithmetic4 Rational{Int32}                                             |   10     10     15.9s
        Arithmetic4 Rational{Int64}                                             |   10     10     16.1s
        Arithmetic4 Rational{UInt64}                                            |   10     10     16.7s
        Arithmetic4 Rational{BigInt}                                            |   10     10     23.9s
        Arithmetic4 Polymake.Rational                                           |   10     10     16.2s
        Arithmetic4 Float32                                                     |   10     10     24.2s
        Arithmetic4 Float64                                                     |   10     10     35.9s
        Arithmetic4 BigFloat                                                    |   10     10     35.7s
        Arithmetic4 Polymake.QuadraticExtension{Polymake.Rational}              |   10     10     34.3s
      Arithmetic5                                                               |   20     20     26.0s
    findnz                                                                      |    3      3      0.4s

And with the reverted commits:

  Polymake.SparseMatrix                                                         | 1757   1757   5m00.6s
    Constructors/Converts                                                       |  592    592      7.9s
    Low-level operations                                                        |  340    340     10.9s
      Polymake.SparseMatrix{Int64}                                              |   39     39      0.3s
      Polymake.SparseMatrix{Polymake.Integer}                                   |   39     39      0.3s
      Polymake.SparseMatrix{Polymake.Rational}                                  |   39     39      0.3s
      Polymake.SparseMatrix{Float64}                                            |   89     89      0.3s
      Polymake.SparseMatrix{Polymake.QuadraticExtension{Polymake.Rational}}     |   39     39      0.3s
      Polymake.SparseMatrix{Polymake.OscarNumber}                               |   39     39      0.2s
      Equality                                                                  |   56     56      9.3s
    Arithmetic                                                                  |  786    786   4m41.0s
      Arithmetic1                                                               |  310    310   1m09.0s
      Arithmetic2                                                               |  260    260   1m08.1s
      Arithmetic3                                                               |   39     39      9.4s
      Arithmetic4                                                               |  140    140   2m02.6s
        Arithmetic4 Int32                                                       |   10     10      8.1s
        Arithmetic4 Int64                                                       |   10     10      7.0s
        Arithmetic4 UInt64                                                      |   10     10      6.8s
        Arithmetic4 BigInt                                                      |   10     10      6.8s
        Arithmetic4 Polymake.Integer                                            |   10     10      9.1s
        Arithmetic4 Rational{Int32}                                             |   10     10     10.6s
        Arithmetic4 Rational{Int64}                                             |   10     10      9.2s
        Arithmetic4 Rational{UInt64}                                            |   10     10      6.6s
        Arithmetic4 Rational{BigInt}                                            |   10     10      6.2s
        Arithmetic4 Polymake.Rational                                           |   10     10      6.6s
        Arithmetic4 Float32                                                     |   10     10      8.5s
        Arithmetic4 Float64                                                     |   10     10     12.2s
        Arithmetic4 BigFloat                                                    |   10     10     11.8s
        Arithmetic4 Polymake.QuadraticExtension{Polymake.Rational}              |   10     10     13.1s
      Arithmetic5                                                               |   20     20      8.8s
    findnz                                                                      |    3      3      0.3s

Other packages

While looking at the PkgEval report mentioned earlier I noticed that MultivariatePolynomials.jl might have similar issues, at least for some of its tests:

MutableArithmetics with polynomials in Int64 | 6478      14   6492  1m09.8s
MutableArithmetics with polynomials in Int64 | 6478      14   6492  3m23.8s

before and after, from the second block with MutableArithmetics, the last one before the tests were killed.

additional notes:

BioTurboNick commented 1 year ago

Unfortunately I won't be a lot of help on my own, though happy to do what I can with some pointers.

Is there a minimal test you could distill from this that I could use? At a minimum I can isolate which part of the PR may be responsible. Though I'd guess it's from storing MethodInstances in LineInfoNodes, somehow. But I may be a bit lost trying to figure out where in the compiler that's causing a problem.

benlorenz commented 1 year ago

I tried to find a minimal testcase, but the effect seems to disappear once I disable too many other tests. This somehow seems to depend on the amount of tests that are run before and the effect is a lot worse for the later tests:

For my previous test runs I had our container test groups like this, with SparseMatrix at the end taking very long: master:

  Polymake.Vector                          | 1539   1539   1m36.9s
  Polymake.SparseVector                    | 1704   1704   2m27.0s
  Polymake.Matrix                          | 1667   1667   8m57.0s
  Polymake.SparseMatrix                    | 1757   1757  10m54.7s

with reverts:

  Polymake.Vector                          | 1539   1539   1m17.5s
  Polymake.SparseVector                    | 1704   1704   1m00.8s
  Polymake.Matrix                          | 1667   1667   4m26.0s
  Polymake.SparseMatrix                    | 1757   1757   4m54.2s

While trying to reduce this I tried reordering the test groups and got this, SparseMatrix only taking about 15% longer, but now Vector taking 4 times as much time:

master:

  Polymake.SparseMatrix                    | 1757   1757   4m55.6s
  Polymake.Matrix                          | 1667   1667   6m19.4s
  Polymake.SparseVector                    | 1704   1704   3m29.8s
  Polymake.Vector                          | 1539   1539   4m54.0s

with reverts:

  Polymake.SparseMatrix                    | 1757   1757   4m14.1s
  Polymake.Matrix                          | 1667   1667   3m55.8s
  Polymake.SparseVector                    | 1704   1704   1m22.1s
  Polymake.Vector                          | 1539   1539   1m12.5s

Is it possible that there are some caches, lookup tables, ... that are somehow getting large and slow?

BioTurboNick commented 1 year ago

I suppose we know all the slowdown is in here: https://github.com/JuliaLang/julia/blob/c4d162ea08e37dfdf958a79051aa6f6e223606b2/src/ircode.c#L767-L858

I wonder if the timing mechanism could be extended to narrow down which specific portions are slower?

Though an educated guess would be that the linetables have more information in them and, therefore, take longer to compress? (But I may be reading this wrong...) https://github.com/JuliaLang/julia/blob/c4d162ea08e37dfdf958a79051aa6f6e223606b2/src/ircode.c#L830-L842

benlorenz commented 1 year ago

That does fit my observations, it looks like it is spending quite some time in compressing the LineInfoNodes that now store the Methods / MethodInstances instead of the name.

I repeatedly hit the running process with USR1 and quite often I get backtraces that look like this:

signal (10): User defined signal 1
jl_egal_ at /home/lorenz/software/polymake/julia/julia/julia/src/julia.h:1493 [inlined]
literal_val_id at /home/lorenz/software/polymake/julia/julia/julia/src/ircode.c:82 [inlined]
jl_encode_as_indexed_root at /home/lorenz/software/polymake/julia/julia/julia/src/ircode.c:106
jl_encode_value_ at /home/lorenz/software/polymake/julia/julia/julia/src/ircode.c:327
jl_encode_value_ at /home/lorenz/software/polymake/julia/julia/julia/src/ircode.c:358
ijl_compress_ir at /home/lorenz/software/polymake/julia/julia/julia/src/ircode.c:814
maybe_compress_codeinfo at ./compiler/typeinfer.jl:360
transform_result_for_cache at ./compiler/typeinfer.jl:379 [inlined]
cache_result! at ./compiler/typeinfer.jl:405
_typeinf at ./compiler/typeinfer.jl:283
typeinf at ./compiler/typeinfer.jl:216
typeinf_edge at ./compiler/typeinfer.jl:930
abstract_call_method at ./compiler/abstractinterpretation.jl:629
abstract_call_gf_by_type at ./compiler/abstractinterpretation.jl:95
abstract_call_known at ./compiler/abstractinterpretation.jl:2046
...

I also managed to hit such a backtrace with gdb which gives a bit more information:

#0  0x00007f8a4d2e7c17 in jl_egal__unboxed_ (dtag=140231691781904, b=0x7f89bc6e5500, a=0x7f8a428042e0)
    at /home/lorenz/software/polymake/julia/julia/julia/src/julia.h:1486
#1  jl_egal_ (b=0x7f89bc6e5500, a=0x7f8a428042e0)
    at /home/lorenz/software/polymake/julia/julia/julia/src/julia.h:1498
#2  literal_val_id (v=0x7f89bc6e5500, s=<optimized out>, rr=0x7ffe9addf2b0)
    at /home/lorenz/software/polymake/julia/julia/julia/src/ircode.c:82
#3  jl_encode_as_indexed_root (s=s@entry=0x7ffe9addf450, v=0x7f89bc6e5500)
    at /home/lorenz/software/polymake/julia/julia/julia/src/ircode.c:106
#4  0x00007f8a4d2e9176 in jl_encode_value_ (s=s@entry=0x7ffe9addf450, v=<optimized out>, 
    as_literal=as_literal@entry=0) at /home/lorenz/software/polymake/julia/julia/julia/src/ircode.c:433
#5  0x00007f8a4d2e9630 in jl_encode_value_ (s=s@entry=0x7ffe9addf450, v=<optimized out>, 
    as_literal=as_literal@entry=0) at /home/lorenz/software/polymake/julia/julia/julia/src/ircode.c:327
#6  0x00007f8a4d2e9a6d in jl_encode_value_ (s=s@entry=0x7ffe9addf450, v=<optimized out>, 
    as_literal=as_literal@entry=1) at /home/lorenz/software/polymake/julia/julia/julia/src/ircode.c:358
#7  0x00007f8a4d2ea160 in ijl_compress_ir (m=0x7f8a3a0253d0 <jl_system_image_data+35479824>, 
    code=0x7f89b0d92380) at /home/lorenz/software/polymake/julia/julia/julia/src/ircode.c:814
#8  0x00007f8a378b30b8 in julia_maybe_compress_codeinfo_38174 () at compiler/typeinfer.jl:360
#9  0x00007f8a3786f8db in transform_result_for_cache () at compiler/typeinfer.jl:379
#10 julia_cache_result!_38191 () at compiler/typeinfer.jl:405
#11 0x00007f8a37b171ac in julia__typeinf_38127 () at compiler/typeinfer.jl:283
#12 0x00007f8a3730c257 in julia_typeinf_38095 () at compiler/typeinfer.jl:216
#13 0x00007f8a37943e12 in julia_typeinf_edge_38296 () at compiler/typeinfer.jl:930
#14 0x00007f8a37bede16 in julia_abstract_call_method_36753 () at compiler/abstractinterpretation.jl:629
#15 0x00007f8a37705b73 in julia_abstract_call_gf_by_type_36579 ()
    at compiler/abstractinterpretation.jl:95
#16 0x00007f8a372143cc in julia_abstract_call_known_37367 () at compiler/abstractinterpretation.jl:2046
#17 0x00007f8a375bb2b2 in julia_abstract_call_37520 () at compiler/abstractinterpretation.jl:2122
#18 0x00007f8a3786e7e7 in julia_abstract_call_37602 () at compiler/abstractinterpretation.jl:2096
#19 0x00007f8a3730b88e in julia_abstract_call_37511 () at compiler/abstractinterpretation.jl:2279
#20 0x00007f8a375bbaef in julia_abstract_eval_call_37694 () at compiler/abstractinterpretation.jl:2299

Frame 7 is the ijl_compress_ir mentioned earlier, we are in line 814 which encodes the CodeInfo, and currently i=5, which should correspond to the linetable.

The next frame just unpacks that array.

Frame 5 is for encoding the LineInfoNode, and we are at i=1 which corresponds to the second field where the method / methodinstance should be:

324     else if (jl_typetagis(v, jl_lineinfonode_type)) {
325         write_uint8(s->s, TAG_LINEINFO);
326         for (i = 0; i < jl_datatype_nfields(jl_lineinfonode_type); i++)
327             jl_encode_value(s, jl_get_nth_field(v, i));
328     }

And then for this encoding (via jl_encode_as_indexed_root) we run into literal_val_id where we seem to compare such a MethodInstance with all method roots via jl_egal:

#1  literal_val_id (v=0x7f89bb4343d0, s=<optimized out>, rr=0x7ffe9addafd0)
    at /home/lorenz/software/polymake/julia/julia/julia/src/ircode.c:82

70  static void literal_val_id(rle_reference *rr, jl_ircode_state *s, jl_value_t *v) JL_GC_DISABLED
71  {
72      jl_array_t *rs = s->method->roots;
73      int i, l = jl_array_len(rs);
74      if (jl_is_symbol(v) || jl_is_concrete_type(v)) {
75          for (i = 0; i < l; i++) {
76              if (jl_array_ptr_ref(rs, i) == v)
77                  return tagged_root(rr, s, i);
78          }
79      }
80      else {
81          for (i = 0; i < l; i++) {
82              if (jl_egal(jl_array_ptr_ref(rs, i), v))
83                  return tagged_root(rr, s, i);
84          }
85      }
86      jl_add_method_root(s->method, jl_precompile_toplevel_module, v);
87      return tagged_root(rr, s, jl_array_len(rs) - 1);
88  }

(gdb) call jl_(jl_typeof(v))
Core.MethodInstance
(gdb) call jl_(v)
(::Type{CxxWrap.CxxWrapCore.ConstCxxRef{Polymake.LibPolymake.SparseVector{Polymake.LibPolymake.Rational}}})(Ptr{Nothing}) from (::Type{CxxWrap.CxxWrapCore.ConstCxxRef{T}})(Ptr{T} where T) where {T}
(gdb) p s->method->roots->length
$13 = 33340

This second loop seems to take quite a while, there are also quite a lot of roots. Without the change this field might have been a symbol and used the first (faster?) loop?

I wonder if the timing mechanism could be extended to narrow down which specific portions are slower?

Looks like it can, I added X(IRCOMP_LITVAL) to the timing subsystems in timing.h, added JL_TIMING(IRCOMP_LITVAL, IRCOMP_LITVAL); to the above function and got:

AST_COMPRESS             ,          19918970129 ( 0.55 %),        1487756848655 (41.14 %)
...
IRCOMP_LITVAL            ,        1467837878526 (40.59 %),        1467837878526 (40.59 %)

So if this works as expected, almost all the time inside AST_COMPRESS seems to be spent in this literal_val_id function.

PS: Line numbers correspond to commit f407a4cac3d.

Edit: For comparison I added the same timing statements to the build which has the reverted commits and I get just 2% in the new IRCOMP_LITVAL:

Event                    , Self Cycles (% of Total)      , Total Cycles (% of Total)     
ADD_METHOD               ,           3575935343 ( 0.16 %),           3767675666 ( 0.17 %)
AST_COMPRESS             ,          13102464145 ( 0.58 %),          61441313977 ( 2.74 %)
AST_UNCOMPRESS           ,          14145984302 ( 0.63 %),          14348582309 ( 0.64 %)
CODEGEN_Codeinst         ,             91633141 ( 0.00 %),          66084549691 ( 2.95 %)
CODEGEN_LLVM             ,          64617649983 ( 2.88 %),          65505328655 ( 2.92 %)
CODEGEN_Workqueue        ,            134335623 ( 0.01 %),           7532002742 ( 0.34 %)
CODEINST_COMPILE         ,           1145205447 ( 0.05 %),        1945422025519 (86.79 %)
DL_OPEN                  ,            100072739 ( 0.00 %),            100072739 ( 0.00 %)
GC                       ,             45712481 ( 0.00 %),          14239474721 ( 0.64 %)
GC_Finalizers            ,            259778430 ( 0.01 %),            259828974 ( 0.01 %)
GC_FullSweep             ,            107974667 ( 0.00 %),            107974667 ( 0.00 %)
GC_IncrementalSweep      ,           6800313932 ( 0.30 %),           6800313932 ( 0.30 %)
GC_Mark                  ,           7025563210 ( 0.31 %),           7025563210 ( 0.31 %)
GC_Stop                  ,                81457 ( 0.00 %),                81457 ( 0.00 %)
INFERENCE                ,          86722293188 ( 3.87 %),         195156770864 ( 8.71 %)
INIT_MODULE              ,          64294881505 ( 2.87 %),          86200640029 ( 3.85 %)
IRCOMP_LITVAL            ,          48338849832 ( 2.16 %),          48338849832 ( 2.16 %)
JULIA_INIT               ,             67818157 ( 0.00 %),            231329978 ( 0.01 %)
LLVM_OPT                 ,        1120071637440 (49.97 %),        1120071637440 (49.97 %)
LLVM_ORC                 ,         758003630861 (33.82 %),        1878075268301 (83.78 %)
LOAD_Pkgimg              ,            848177602 ( 0.04 %),           7388993626 ( 0.33 %)
LOAD_Processor           ,              1976376 ( 0.00 %),              1976376 ( 0.00 %)
LOAD_Require             ,            221449806 ( 0.01 %),          94751808669 ( 4.23 %)
LOAD_Sysimg              ,            131339745 ( 0.01 %),            131339745 ( 0.01 %)
LOWERING                 ,           5704511976 ( 0.25 %),           6833292884 ( 0.30 %)
MACRO_INVOCATION         ,             25892596 ( 0.00 %),           1128780908 ( 0.05 %)
METHOD_LOOKUP_FAST       ,           3149306459 ( 0.14 %),           4047836294 ( 0.18 %)
METHOD_LOOKUP_SLOW       ,             38612337 ( 0.00 %),            828309377 ( 0.04 %)
METHOD_MATCH             ,          17309173074 ( 0.77 %),          17875621316 ( 0.80 %)
PARSING                  ,            745536025 ( 0.03 %),            745536025 ( 0.03 %)
ROOT                     ,          20903426138 ( 0.93 %),        2241613887403 (100.00 %)
STAGED_FUNCTION          ,              2404728 ( 0.00 %),            514411583 ( 0.02 %)
TYPE_CACHE_INSERT        ,            600301181 ( 0.03 %),            600301181 ( 0.03 %)
TYPE_CACHE_LOOKUP        ,           2905647622 ( 0.13 %),           2905647622 ( 0.13 %)
VERIFY_Edges             ,            345359034 ( 0.02 %),           2150140787 ( 0.10 %)
VERIFY_Graph             ,             21222541 ( 0.00 %),             21222541 ( 0.00 %)
VERIFY_Methods           ,              7734280 ( 0.00 %),              7734280 ( 0.00 %)
BioTurboNick commented 1 year ago

That's awesome, thanks for poking into it! Love identifying a good quadratic-time algorithm to optimize.

Does Julia need to look up the root each time it's being compressed? If there are many copies of the same method instances, that's a lot of unnecessary repeated work. Can the root be stored with the method instance at an earlier point, so the lookup is only performed once?

Can the lookup operation be turned into a dictionary access rather than array iteration, so the lookup itself is constant-time?

BioTurboNick commented 1 year ago

@JeffBezanson @timholy - Since you've both most recently (2-3 years ago :-p) touched literal_val_id, what are your impressions? Are either of my suggestions potentially workable, or are there obvious reasons those couldn't work that I'm not seeing?

Trying to gauge if this is feasible for me to tackle, or if it's a much bigger problem to untangle.

vtjnash commented 1 year ago

That array should never be long, since that can cause a myriad of other problems. How is that happening and how can that be avoided?

benlorenz commented 1 year ago

I added a little bit of code to print the maximal size for that array:

    static int max = 0;
    if (l > 1000 && l > max) {
        max = l;
        printf("rootmax: %d\n", max);
    }

On master (more precisely f407a4cac, including #41099) our testsuite ends with:

rootmax: 111575

On my other build where I reverted #41099 (see my earlier posts for the precise commits) I get:

rootmax: 13894

13894 seems quite large as well? But I guess it is a lot better than 111575.

I really don't understand the details, I am just trying to debug what is happening here.

So adding the Methods / MethodInstances to those LineNumberNodes seems to trigger a lot more roots?

BioTurboNick commented 1 year ago

Was just exploring the same kind of testing.

When Julia is started for the first time after building with that PR (make cleanall, deleted .julia), literal_val_id is called 1,702,882 times (compared with 109 on second run).

With the commit just before the PR merge, this count is 870,750.

EDIT: (The fact that it's almost exactly double seems suspicious.)

BioTurboNick commented 1 year ago

It appears that literal_val_id is the only method that adds to a method's root table (via jl_add_method_root).

So the MethodInstances being added to LineInfoNodes aren't present in the parent Method's root table, which results in them being added to the Method's roots. Previously, they would all just map to the method name (as a Symbol), and thus share the same root. I believe?

benlorenz commented 1 year ago

13894 seems quite large as well? But I guess it is a lot better than 111575.

Some more details from running our testsuite, inspired by the plots from #14556, number of times a specific roots array length is encountered in literal_val_id:

rootcounts

Total number of calls to jl_egal inside literal_val_id:

giordano commented 1 year ago

I also noticed that running tests of a package of mine on c58e508fdd takes sensibly longer than on v1.9, but haven't investigate what's the cause of the longer runtime.

BioTurboNick commented 1 year ago

Thanks for that plot - some other observations from Julia on first run:

# previous
calls to `literal_val_id`: 870,711
----where it's a symbol/concrete type: 758,439
----where it isn't: 112,272
new roots added: 81,344
maximum number of roots per method observed: 1,153

# PR 41099
calls to `literal_val_id`: 1,702,467
----where it's a symbol/concrete type: 661,561
----where it isn't: 1,040,906
new roots added: 226,637
maximum number of roots per method observed: 2,834
BioTurboNick commented 1 year ago

The Base method with the most roots after the PR is materialize(bc::Base.Broadcast.Broadcasted)

Before the PR it is iterate(g::Base.Generator, s...)

BioTurboNick commented 1 year ago

Again with a fresh Julia build/start:

Before the PR, materialize was responsible for 197,722 calls to jl_egal, out of 7,340,891 total (3%).

After the PR, materialize is responsible for 16,344,836 calls to jl_egal, out of 183,833,844 total (9%).

So... is adding these roots incorrect in some way, or just inconvenient for the current implementation? And particularly bad for certain generic methods that have numerous, possibly single-use, specializations?

Is there documentation of what roots are designed to do somewhere, so we can better understand?

vtjnash commented 1 year ago

Generally speaking, it is incorrect, as these roots are supposed to be for syntax, not compiler-generated content. Do we need to revert that PR, or can we partly disable whatever part is breaking this array?

BioTurboNick commented 1 year ago

When you say the roots are for syntax, what does that mean?

The types of the roots of Base.materialize in 1.9.0 are:

7-element Vector{DataType}:
 Symbol
 DataType
 Core.MethodInstance
 Module
 UnionAll
 ErrorException
 String

Should Core.MethodInstance not be in there? (If that's what you mean by compiler-generated content?)

I'm looking back up the stack, in jl_encode_value_ - I notice three frames of that here. There's an else if case for jl_lineinfonode_type, which calls jl_encode_value_ again on each field.

The method roots get added in the fallback else, via jl_encode_as_indexed_root. So that is normally where MethodInstances end up.

We could add a case inside the jl_lineinfonode_type case for the field to handle it differently. But then how would the MethodInstance be compressed without a root?

BioTurboNick commented 1 year ago

Looking more:

jl_compress_ir is called from:

CodeInfo objects contain the linetable, and a parent MethodInstance field. So currently, the parent of each CodeInfo would end up added to a Method's roots. But with my PR, the linetable being compressed is also adding the linetable MethodInstances to the method's roots when previously they would have just been Symbols.

(For the CodeInstance case, it has a def MethodInstance field as well as an inferred CodeInfo field, so I think MethodInstance roots could also be added there?)

Looking at the roots of findfirst(ch::AbstractChar, string::AbstractString) in 1.9.0:

8-element Vector{Any}:
 :findfirst
 Symbol("strings/search.jl")
 Base.Fix2{typeof(==), Char}
 typeof(==) (singleton type of function ==, subtype of Function)
 MethodInstance for findnext(::Base.Fix2{typeof(==), Char}, ::String, ::Int64)
 :findnext
 :Fix2
 :EachStringIndex

Here, a MethodInstance contained in a specialization of findfirst, which I believe is coming from the code field:

mi = methods(findfirst)[1].specializations[1]
ci = ccall(:jl_uncompress_ir, Any, (Any, Any, Any), mi.def, mi.cache, mi.cache.inferred)
ci.code
3-element Vector{Any}:
 :(%new(Base.Fix2{typeof(==), Char}, ==, _2))
 :($(Expr(:invoke, MethodInstance for findnext(::Base.Fix2{typeof(==), Char}, ::String, ::Int64), :(Base.findnext), :(%1), Core.Argument(3), 1)))
 :(return %2)

Though this doesn't show up in the linetable:

ci.linetable
9-element Vector{Core.LineInfoNode}:
 Core.LineInfoNode(Base, :findfirst, Symbol("strings/search.jl"), 140, 0)
 Core.LineInfoNode(Base, :(==), Symbol("operators.jl"), 1147, 1)
 Core.LineInfoNode(Base, :Fix2, Symbol("operators.jl"), 1121, 2)
 Core.LineInfoNode(Base, :convert, Symbol("char.jl"), 187, 3)
 Core.LineInfoNode(Base, :findfirst, Symbol("array.jl"), 2122, 1)
 Core.LineInfoNode(Base, :keys, Symbol("strings/basic.jl"), 574, 5)
 Core.LineInfoNode(Base, :EachStringIndex, Symbol("strings/basic.jl"), 572, 6)
 Core.LineInfoNode(Base, :EachStringIndex, Symbol("strings/basic.jl"), 572, 7)
 Core.LineInfoNode(Base, :convert, Symbol("essentials.jl"), 298, 8)

Is one issue that roots is serving a dual purpose as a compression token table and something else?

BioTurboNick commented 1 year ago

It sounds like #32705 was trying to be a solution to this, but it is closed? Last comment from you @vtjnash suggested it just needed more work and is conflicted - have those conflicts resolved? Or are there other reasons that isn't a good approach?

BioTurboNick commented 1 year ago

Instructive to compare with the roots of this findfirst method in 1.9.0 with the current PR:

vs. current master:
11-element Vector{Any}:
 :findfirst
 Symbol("strings/search.jl")
 Base.Fix2{typeof(==), Char}
 MethodInstance for findnext(::Base.Fix2{typeof(==), Char}, ::String, ::Int64)
 :findnext
 typeof(==) (singleton type of function ==, subtype of Function)
 MethodInstance for ==(::Char)
 MethodInstance for Base.Fix2(::typeof(==), ::Char)
 MethodInstance for findfirst(::Base.Fix2{typeof(==), Char}, ::String)
 MethodInstance for keys(::String)
 MethodInstance for Base.EachStringIndex(::String)

Observations:

BioTurboNick commented 1 year ago

What if...

The names of the methods of LineInfoNodes are stored in the Method roots (current behavior) while any MethodInstance references are stored in a separate mi_roots table in the CodeInstance or MethodInstance?

Then the question would be where MethodInstances that appear in CodeInfo.code should go - do they stay in the Method roots table, or go in this new mi_roots table?

EDIT: After trying to implement something like this, learned of a flaw: in jl_decompress_ir, sometimes a CodeInstance is not available, so there would be no access to the corresponding MethodInstance to look up those roots there. So the MethodInstances that show up in code really do need to be in the Method roots; but possibly could encode both the symbol and the MethodInstance just for the linetable, using the symbol as a fallback if the MethodInstance isn't available.

gbaraldi commented 1 year ago

I've hit this again, we spend almost 10% of the time in some workloads on the test suite just doing jl_encode_as_indexed_root. I wonder if we should temporarily disable/revert the change until we can make not have such a big slowdown/size penalty

BioTurboNick commented 1 year ago

Ugh, yeah that might be a good idea - the part that would need to be reverted is minimal, the stacktrace lookup methods could be left in and ready.

But hopefully it can be restored in 1.11.

gbaraldi commented 1 year ago

@BioTurboNick could I bother you to open a PR to temporarily revert and mark it as 1.10 backport?