JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

precompilation #7977

Closed: StefanKarpinski closed this issue 10 years ago

StefanKarpinski commented 10 years ago

We already have precompilation of the Julia system image, and there's been a lot of discussion of how to precompile packages. This issue is for discussing how to approach package precompilation.

StefanKarpinski commented 10 years ago

There's some amount of processing of individual source files and packages that can be cached – parsing & lowering, and compressed AST generation. Some amount of type inference and code generation could also potentially be done independently, but that's much trickier. Full type inference, inlining, and code generation in general cannot be done until later. Currently, system image generation effectively takes a "snapshot" of the current process – what if we just make snapshotting really easy and make it possible to take a snapshot of a Julia process at any point and then start that snapshot again later? This gives a "layering" approach to precompilation and is quite easy to think about: get your REPL in a state you want to work from, snapshot it, and then start working from the snapshot instead of from an empty REPL.
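
To make the layering idea concrete, here is a purely hypothetical workflow; neither a snapshot function nor a --restore flag exists in Julia, they are only placeholders for whatever the snapshotting mechanism would look like:

# Hypothetical sketch of the "snapshot the REPL" workflow described above.
# snapshot() and --restore do not exist; they only illustrate the proposal.
using DataFrames, Gadfly    # get the session into the state you want to work from
snapshot("analysis.img")    # hypothetical: dump the current process image to disk

# Later, resume from the snapshot instead of starting from an empty REPL:
#   julia --restore analysis.img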

JeffBezanson commented 10 years ago

dup of #4373

vtjnash commented 10 years ago

The main issue is what a snapshot would even mean. If I opened a file or initialized a DLL, what does the new process see? If I start a task, presumably that is part of the state too, but what does it mean to snapshot a Task?

timholy commented 10 years ago

Whole-system snapshots don't seem to be that different from our current userimg.jl approach, and that doesn't seem to be very popular. (I seem to be one of very few people reporting bugs with it...).

I think its main limitation is that anything I use heavily (i.e., in practice contributes a lot to my load times) also tends to be something I change frequently. I don't want to have to rebuild all of Julia to test a one-line fix. (Yes, you have to do that if you're modifying base Julia, but we all know that's not ideal, and it gets far worse if you're building Julia and precompiling a dozen packages.) So I find myself choosing between whether package A should be precompiled or remain flexible. In practice I tend to precompile one or two packages, tops.

We need a granular approach: package A can be precompiled as a standalone object, to be updated if A's source (or any macros in other packages it depends on) changes. I know it's not easy, but it's reality.
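
As a rough sketch of what that granular tracking could look like (not anything that exists in Base; the paths in the usage comment are placeholders), a package's cache would be considered stale whenever any source file of the package, or of a package whose macros it uses, is newer than the cached object:

# Sketch of a per-package staleness check: rebuild the cache only when the
# package's own sources, or the sources of its macro dependencies, have changed.
function cache_stale(cachefile::AbstractString, srcdirs::AbstractString...)
    isfile(cachefile) || return true
    cache_time = mtime(cachefile)
    for dir in srcdirs, (root, _, files) in walkdir(dir), f in files
        if endswith(f, ".jl") && mtime(joinpath(root, f)) > cache_time
            return true
        end
    end
    return false
end

# e.g. cache_stale("DataFrames.cache", "DataFrames/src", "SomeMacroDep/src")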

johnmyleswhite commented 10 years ago

+1 for dependency tracking to determine when compilation needs to happen. I had to stop using userimg.jl because I would forget to recompile Julia when working on packages like DataFrames.

JeffBezanson commented 10 years ago

Speeding up the compiler can also be part of this. I don't know what kind of gains are possible, but a 2x speedup seems like a conservative estimate. It's still not well understood why the system does so much work when loading packages. In theory, if a package contains only type and method definitions, loading it should take hardly any time, because compilation is mostly lazy. In practice, many packages run code at load time, but my impression is that they don't run enough stuff to require compiling a large percentage of the package.
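
As a small illustration of that laziness (a generic example, not tied to any particular package): defining methods does essentially no compilation work, and the cost only shows up at the first call of each method.

# Defining methods is cheap because compilation is lazy; inference and code
# generation run when a method is first called with a given argument type.
module OnlyDefs
square_plus_one(x) = x^2 + 1
total(v::Vector) = sum(square_plus_one, v)
end

v = rand(1000)
@time OnlyDefs.total(v)   # first call: pays for inference and codegen
@time OnlyDefs.total(v)   # second call: already compiled, much faster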

timholy commented 10 years ago

Agreed, especially given that profiling indicates that most of the time is spent in flisp.c. See https://github.com/JuliaLang/julia/issues/7280#issuecomment-46241911 and the gist posted below that.

JeffBezanson commented 10 years ago

commit 8973ed1f95857bce3505e7f5241dba1b5da7bb4d might change that feature of the profile, though it's clearly not enough.

jakebolewski commented 10 years ago

@timholy have you gone back and profiled again after https://github.com/JuliaLang/julia/commit/8973ed1f95857bce3505e7f5241dba1b5da7bb4d?

timholy commented 10 years ago

Nope, but if your machine happens to give good C backtraces it's really easy to test this yourself:

using DataFrames, ProfileView
@profile reload("DataFrames")  # can't have using inside @profile
ProfileView.view(C=true)

I just don't happen to be on such a machine at the moment.

(If Tk doesn't work for you, the IJulia interface is really nice.)

timholy commented 10 years ago

It's all in flisp.c. More than 90% of the time.

I can't export the graph for you easily at the moment because of https://github.com/timholy/ProfileView.jl/issues/16. But if the above statements work for you, you can easily get this (and far more) information directly yourself.

vtjnash commented 10 years ago

Maybe you can parse using JuliaParser to see where the possible hot-spots are (since it shares a common structure with the flisp parser)?

timholy commented 10 years ago

Actually, I could do a screen capture. [Two ProfileView screenshots: reloaddataframes and reloaddataframes2]

Open the images in a separate tab and you'll see them at high resolution. The first shows me hovering over a brown bar, with flisp.c, fl_applyn: line 761. That bar takes up about 60% of the horizontal width, meaning it (and things that it calls, which represent the stack above it) accounts for 60% of the time. The second image is me hovering over a green bar in the next big stack to the right, which is also the same spot in the same file (but arrived at via a different call-chain). That accounts for most of the rest of the time.

So basically everything except the two narrow stacks on the right is flisp.c. And guess what: the first of these is flisp.c, too! The very last substantive stack is in gf.c, remove_conflicting: line 1209.

julia> versioninfo()
Julia Version 0.3.0-rc3+8
Commit 1465486* (2014-08-12 06:46 UTC)
Platform Info:
  System: Linux (x86_64-redhat-linux)
  CPU: Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT NO_AFFINITY SANDYBRIDGE)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

JeffBezanson commented 10 years ago

That makes me think eval and/or macro expansion is the culprit, since that requires running the front-end's lowering passes. We know the parser is not this slow.

timholy commented 10 years ago

It's hard to find a significant package that doesn't use any macros or call eval. But I just tested with HyperDualNumbers, which has just a single

@vectorize_1arg Real hyper

and yet has more than 300 lines of code.

The results are basically the same. If I hover over the bar corresponding to ./boot.jl, include: line 245 (screenshot: reloadhyperdual) and then work my way upward until it starts fragmenting into subfunctions, I get this call chain:

./boot.jl, include: line 245
/usr/local/julia/julia/src/toplevel.c, jl_load: line 578
/usr/local/julia/julia/src/toplevel.c, jl_parse_eval_all: line 531
/usr/local/julia/julia/src/ast.c, jl_parse_next: line 514
/usr/local/julia/julia/src/flisp/flisp.c, fl_applyn: line 761
/usr/local/julia/julia/src/flisp/flisp.c, _applyn: line 725
/usr/local/julia/julia/src/flisp/flisp.c, apply_cl: line 1892
/usr/local/julia/julia/src/flisp/flisp.c, do_trycatch: line 935

For the narrower stack immediately to the right, it starts this way:

./boot.jl, include: line 245
/usr/local/julia/julia/src/toplevel.c, jl_load: line 578
/usr/local/julia/julia/src/toplevel.c, jl_parse_eval_all: line 541
/usr/local/julia/julia/src/toplevel.c, jl_toplevel_eval_flex: line 455
/usr/local/julia/julia/src/ast.c, jl_expand: line 551
/usr/local/julia/julia/src/flisp/flisp.c, fl_applyn: line 761

and then continues in the same fashion (with the exact same lines).

Again, it's probably >90% of the total time, and this for a package with almost no macros.

JeffBezanson commented 10 years ago

HyperDualNumbers also takes very little total time to load. For a package that loads very quickly, it's possible most of the time is in the parser.

vtjnash commented 10 years ago

I have some code around showing that parsing a few hundred simple comprehensions in a function takes a few milliseconds. Lowering those to Julia AST for sticking in the function table then takes several seconds.
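
Something along these lines shows the shape of that measurement (a sketch using the current Meta API, not the original code; the function body and counts are invented, and timings vary a lot across machines and Julia versions):

# Time parsing vs. lowering for a function stuffed with simple comprehensions.
src = "function lots_of_comprehensions(n)\n" *
      join("    a$i = [x + $i for x in 1:n]\n" for i in 1:300) *
      "    a300\nend"

@time ex = Meta.parse(src)             # parsing alone
@time lowered = Meta.lower(Main, ex)   # lowering the parsed expression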

timholy commented 10 years ago

Nope. Just created my own, super-exciting package (I should register this sucker :wink:):

function makescript(filename)
    open(filename, "w") do file
        for i = 1:10000
            funcname = "myfunc$i"
            write(file, "function $funcname(x)\n    x^2\nend\n")
        end
    end
end
makescript("/tmp/testscript.jl")
@profile include("/tmp/testscript.jl")

Load time is 6.8 seconds. And the result is even clearer: looks like about 97% of the time is in flisp.c.

As I believe I once mentioned, I have seen an extremely rare bug in ProfileView, but in this case I've checked Profile.print(io, C=true) and gone through the resulting file in emacs. It confirms the ProfileView results.

timholy commented 10 years ago

Man, if we just cache the lowered code, who cares about full precompilation?!?

JeffBezanson commented 10 years ago

Wow that is really shady. The front end must be doing something really stupid.

Fortunately, flisp has a profiler:

https://github.com/JeffBezanson/femtolisp/blob/master/tests/test.lsp#L219

timholy commented 10 years ago

Well, these tests don't actually use the code, so of course we'll still want compilation too.

To test this, I modified my script to call each one of these 10000 functions once:

function makescript(filename)
    open(filename, "w") do file
        for i = 1:10000
            funcname = "myfunc$i"
            write(file, "function $funcname(x)\n    x^2\nend\n")
        end
        write(file, "x = 3.2\n")
        for i = 1:10000
            funcname = "myfunc$i"
            write(file, "$funcname(x)\n")
        end
    end
end

The load time went from 6.8 seconds to 18 seconds. However, lowering is still more than half the time (remember that calling the functions added another 10001 lines to the file). jl_compile and jl_generate_fptr account for most of the remainder, and inference is roughly 10%.

timholy commented 10 years ago

Glad to hear about the flisp profiler! I'll have to leave this one up to you.

jakebolewski commented 10 years ago

Parsing performance is definitely not the problem.

julia> @time require("/tmp/testscript.jl")
elapsed time: 85.278999524 seconds (140911932 bytes allocated, 0.18% gc time)

julia> @time parse(open(readall, "/tmp/testscript.jl"));
elapsed time: 1.522348909 seconds (8937864 bytes allocated)

julia> @time Parser.parse(open(readall, "/tmp/testscript.jl"));
elapsed time: 1.121571927 seconds (66361048 bytes allocated, 9.37% gc time)

JeffBezanson commented 10 years ago

This discussion has been fruitful. Everybody, I am really sorry. I made an absurd mistake, which has been causing us to call lowering twice on many top-level expressions. I was mis-identifying which expressions needed to be expanded. Oh-so-usefully, the lowering pass is idempotent so I never noticed until I profiled it carefully.

454344fcea17021cb6ca5687d0a9f41daedd7e9e fixes it. For me, Gadfly now loads in 27 seconds instead of 48. Too embarrassed to celebrate :|

johnmyleswhite commented 10 years ago

Julia's twice as fast now! Rejoice!

JeffBezanson commented 10 years ago

True, it's not bad for a 1-line change.

kmsquire commented 10 years ago

Really, Jeff, try not to be so hard on yourself! Most of us love everything else you've done in Julia so much that we're more than willing to give you a break for something like this--we're just happy that it's faster now! Cheers!

JeffBezanson commented 10 years ago

Thanks. It's ok, I just get scared thinking about how much time people have had to spend waiting for packages to load if you add it all up...

staticfloat commented 10 years ago

I think you need to sneak a few more of these things in here. Subtly, of course. Then, right before a release, BLAMO! Huge speed improvements! ;)

JeffBezanson commented 10 years ago

By the way, here is how to profile the front-end:

  1. Place profile.scm (https://gist.github.com/JeffBezanson/f58c2bc1cb862cbd4e63) in src/. You have to list the functions to profile at the bottom of that file.
  2. Add (load "profile.scm") at the bottom of jlfrontend.scm.
  3. Rebuild.
  4. Start julia and include a source file.
  5. Run ccall(:jl_lisp_prompt, Void, ())
  6. Run (show-profiles)

I get

Function                  #Calls     Time (seconds)
--------                  ------     --------------
expand-toplevel-expr      10007      3.036357641220093
expand-toplevel-expr--    10007      2.878895044326782
julia-expand1             10010      1.948884010314941
julia-parse               10011      1.167325019836426
julia-expand01            10010      0.6922774314880371
to-goto-form              10010      0.6674525737762451
identify-locals           10010      0.5465188026428223
analyze-variables         10010      0.5336084365844727
to-LFF                    10010      0.3194801807403564
expand-binding-forms      10010      0.2210719585418701
flatten-scopes            10010      0.1799592971801758
toplevel-expr-globals     10010      0.1713950634002686
expand-forms              10010      0.1364684104919434
julia-expand-macros       10007      0.03557300567626953
find-possible-globals     10010      0.0246436595916748

IainNZ commented 10 years ago

Can we cherry-pick this onto release-0.3, or would that be getting greedy?

JeffBezanson commented 10 years ago

Already done :)

jakebolewski commented 10 years ago

Wow, great find! These are the numbers I was getting on the Scheme side of things when poking around earlier tonight: https://gist.github.com/jakebolewski/105415d9a0c1883135e7.

JeffBezanson commented 10 years ago

Interesting. identify-locals clearly needs work; it seems to scale especially badly and end up taking most of the time. This is kind of good news since it's only ~50 lines of code. to-goto-form also takes a bit longer than it's entitled to.

jakebolewski commented 10 years ago

Yeah I kind of got sidetracked when I noticed that. It seems like a good optimization opportunity.

vtjnash commented 10 years ago

"True, it's not bad for a 1-line change."

To quote @JeffBezanson: "Would be great to find more low-hanging fruit like this." (https://github.com/JuliaLang/julia/commit/8973ed1f95857bce3505e7f5241dba1b5da7bb4d)

timholy commented 10 years ago

Really nice, @JeffBezanson! And thanks for the instructions on profiling the front-end.

timholy commented 10 years ago

@JeffBezanson, still looking at this from the @profile side of things and flisp performance. It looks like most of the time is spent on this (or occasionally this) nest of calls. The last is a bit of a guess, but it looks like the only line near the one indicated in the backtrace (1272, not 1270) that seems plausible.

It's recursive, so I see this at least 8 deep. The part of this that seems to do real work is apply_cl, since it's the only one for which one sees that other lines in the same function contribute to the execution time.

Anything to be done about this? Or simply a reflection of the amount of work that needs doing?

timholy commented 10 years ago

I'm also interested in figuring out where the bottlenecks in the compile chain are. Unfortunately, once it gets inside LLVM I don't get file/function names or line numbers anymore. I did a

make -C deps clean-llvm
make LLVM_DEBUG=1 -j 8

and while it's clear from the output that it did a Debug build, it didn't help. Any ideas?

Keno commented 10 years ago

Did you clean julia? We link LLVM statically, so you have to rebuild libjulia after reinstalling LLVM.

timholy commented 10 years ago

Having just been given a big performance boost, I'll show my gratitude by being greedy for more, but bear with me.

A couple more fun observations: macros (at least, simple ones) are not deadly for lowering performance, and their impact is on the flisp side, not the Julia side. Using a fresh build, the original version of makescript (the one that does not call the functions) gave this:

julia> @time include("/tmp/testscript.jl")
elapsed time: 4.690292857 seconds (17599748 bytes allocated)
myfunc10000 (generic function with 1 method)

When I changed it to

function makescript(filename)
    open(filename, "w") do file
        write(file, """
macro square(ex)
    dosquare(ex)
end
function dosquare(ex)
    :((\$ex)^2)
end
""")
        for i = 1:10000
            funcname = "myfunc$i"
            write(file, "function $funcname(x)\n    @square x\nend\n")
        end
    end
end

I got

julia> @time include("/tmp/testscript.jl")
elapsed time: 5.456327574 seconds (18961028 bytes allocated)
myfunc10000 (generic function with 1 method)

One part that surprised me was how little extra memory got allocated. Even more interestingly, check out the profile result (using format=:flat):

julia> Profile.print(format=:flat)
 Count File               Function            Line
  6110 ./boot.jl          include              245
  6110 ./loading.jl       include_from_node1   128
    10 /tmp/testscript.jl anonymous              2
     6 /tmp/testscript.jl dosquare               5
  6110 REPL.jl            eval_user_input       54
  6110 profile.jl         anonymous             14
  6110 task.jl            anonymous             96

The Julia side of this, dosquare, is negligible. So I infer that the actual cost of macros is that (1) you have to lower the original macro-call expression, and (2) you then have to lower the expression produced by the macro, both of which happen in flisp. Are you sure we couldn't write the lowering part in Julia and get better performance :smiling_imp:? No doubt it would hammer the gc, but hopefully we're about to get a better gc anyway.
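
For concreteness, the two steps can be pulled apart by hand (a sketch using the current two-argument macroexpand signature and an esc for hygiene, so it is only loosely modeled on the @square example above):

# (1) Expanding the macro call requires one trip through the front end;
# (2) the Expr the macro returns must itself be lowered when it is evaluated.
macro square(ex)
    :(($(esc(ex)))^2)
end

ex = :(myfunc(x) = @square x)      # the macro call is still unexpanded here
expanded = macroexpand(Main, ex)   # (1) expansion produces a new Expr
eval(expanded)                     # (2) the expansion is lowered and evaluated
myfunc(3.0)                        # returns 9.0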

Finally, I'll point out that even for loading a package like Gadfly, flisp is >90% of the time. So if nothing else, caching the lowered code should be a fantastic step forward.
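
As a toy version of that idea (nothing like the precompilation machinery that eventually landed in Base: this only caches parsed expressions, so it skips the parser but still pays for lowering, and it ignores macros, modules, and dependency tracking):

using Serialization

# Parse every top-level expression in a file and serialize the ASTs to a cache.
function cache_parsed(srcfile, cachefile)
    str = read(srcfile, String)
    exprs = Any[]
    pos = firstindex(str)
    while pos <= lastindex(str)
        ex, pos = Meta.parse(str, pos)
        ex === nothing || push!(exprs, ex)
    end
    open(io -> serialize(io, exprs), cachefile, "w")
end

# "Include" from the cache: evaluate the stored expressions without re-parsing.
function include_cached(cachefile)
    for ex in open(deserialize, cachefile)
        Core.eval(Main, ex)
    end
end

# cache_parsed("/tmp/testscript.jl", "/tmp/testscript.cache")
# include_cached("/tmp/testscript.cache")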

timholy commented 10 years ago

Thanks Keno, that did the trick.

I'll try to post the svg, if chrome ever finishes rendering it. (Tk renders it in a snap, whereas chrome has been working on it for > 2 minutes now. Less enthusiastic now about the-browser-as-a-graphics-toolkit :smile:.)

But as an executive summary, these are the 4 biggest items called from FunctionPassManager:

/usr/local/julia/julia/deps/llvm-3.3/lib/Analysis/LoopInfo.cpp, _ZN4llvm8LoopInfo13runOnFunctionERNS_8FunctionE: line 579
/usr/local/julia/julia/deps/llvm-3.3/lib/IR/Dominators.cpp, _ZN4llvm13DominatorTree13runOnFunctionERNS_8FunctionE: line 73
/usr/local/julia/julia/deps/llvm-3.3/lib/Transforms/InstCombine/InstructionCombining.cpp, _ZN4llvm12InstCombiner13runOnFunctionERNS_8FunctionE: line 2475
/usr/local/julia/julia/deps/llvm-3.3/lib/Transforms/InstCombine/InstructionCombining.cpp, _ZN4llvm12InstCombiner13runOnFunctionERNS_8FunctionE: line 2499

and another 3 noteworthy ones are

/usr/local/julia/julia/deps/llvm-3.3/lib/Transforms/Scalar/EarlyCSE.cpp, runOnFunction: line 604
/usr/local/julia/julia/deps/llvm-3.3/lib/Transforms/Scalar/GVN.cpp, runOnFunction: line 2299
/usr/local/julia/julia/deps/llvm-3.3/lib/Transforms/Scalar/Reassociate.cpp, runOnFunction: line 1985

From MachineFunctionPass.cpp, the huge item is

/usr/local/julia/julia/deps/llvm-3.3/lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp, _ZN4llvm16SelectionDAGISel20runOnMachineFunctionERNS_15MachineFunctionE: line 391

Don't know if that suggests any windows of opportunity. For example, if we could have very-fast-but-not-optimal compilation by default, people's practical/subjective experience might be improved. People who want the best runtime performance, at the cost of longer compilation, could start julia with julia -O. Or something.

jakebolewski commented 10 years ago

@timholy flisp has its own GC, which is independent of Julia's GC, so allocation happening in flisp will not show up in the final GC stats. Any allocation you are seeing due to the frontend is mostly allocation happening within the ast.c shim layer during the conversion from flisp's s-expression AST to Julia's Expr objects.

I haven't done any rigorous profiling of the lowering code, but I wouldn't be surprised if a full AST copy is happening at one or more points on the lowered AST. @JeffBezanson, is this correct? Regardless, there is always the cost of allocating at least the initial lowered form of the AST. Reducing or combining the number of passes in the frontend should reduce allocation pressure and be much faster. I want to look more into identify-locals, but it is doing a full traversal of the AST, which seems expensive.

I would like to see a move to a Julia frontend at some point. It was one of the reasons I ported over the parser. However, in the beginning it wasn't clear that this would necessarily be a performance win, as flisp is quite fast. That side of things is looking better, but it is still an all-or-nothing proposition: you have to rewrite the whole frontend and show that the gains are significant enough to overcome the additional bootstrapping complexity. I'm also sympathetic to the view that such rewrites do not necessarily have big end-user advantages. However, I feel a rewrite in this case would make it easier to solve some longstanding issues, such as poor parser error messages, confusing error messages due to macro expansion, and macro-hygiene issues.

JeffBezanson commented 10 years ago

I agree a rewrite would ultimately be worth it.

The interesting thing is that all the passes do full AST traversals, and yet identify-locals is still much slower than most.

jakebolewski commented 10 years ago

Profiling this a bit more, it seems like GC accounts for roughly half the runtime of the lowering passes. It represents ~45% of the total front-end time when profiling some of @timholy's examples, for instance. The parser doesn't hit the GC as hard (<20% of the time in GC). (Note: this is the GC built into flisp.)

Parser Elapsed time: 1.807146072387695 seconds
gc time     0.42045 sec (21.4%)

Syntax Elapsed time: 6.710570812225342 seconds
gc time     3.71732 sec (43.2%)

Function                  #Calls     GC Time (seconds)
--------                  ------     --------------
toplevel-expr-globals     1          0.01719307899475098
julia-expand-macros       1          0.03013706207275391
flatten-scopes            1          0.09003400802612305
analyze-variables         1          0.1040594577789307
julia-expand01            1          0.1680479049682617
to-goto-form              1          1.050569534301758
add-local-decls           1          1.821250200271606
identify-locals           1          1.821250200271606
julia-expand1             1          3.065913200378418
expand-toplevel-expr--    1          3.281291246414185
expand-toplevel-expr      1          3.296869277954102
expand-toplevel-expr-     1          3.296869277954102

Function                  #Calls     Time (seconds)
--------                  ------     --------------
julia-expand-macros       1          0.05322694778442383
flatten-scopes            1          0.2244889736175537
toplevel-expr-globals     1          0.3535900115966797
analyze-variables         1          0.4261181354522705
julia-expand01            1          0.5767209529876709
to-goto-form              1          0.7994349002838135
add-local-decls           1          3.806056976318359
identify-locals           1          3.806058883666992
julia-expand1             1          5.256119966506958
expand-toplevel-expr--    1          6.240168809890747
expand-toplevel-expr-     1          6.355370044708252
expand-toplevel-expr      1          6.355376005172729

timholy commented 10 years ago

Great that you can so easily extract gc time. Does the gc proportion scale as you'd expect? I.e., when testscript.jl has a large number of functions, the runtime should scale linearly and the gc proportion should be constant.

jakebolewski commented 10 years ago

It seems to behave as you would expect (running your macro example above).

julia> macro square(ex)
           dosquare(ex)
       end

julia> function dosquare(ex)
           :((\$ex)^2)
       end
dosquare (generic function with 1 method)

julia> ccall(:jl_lisp_prompt, Void, ())
;  _
; |_ _ _ |_ _ |  . _ _
; | (-||||_(_)|__|_)|_)
;-------------------|----------------------------------------------------------

> (load "test.scm")

500
Parser Elapsed time: 0.05537104606628418 seconds, GC time: 0.005069255828857422
Syntax Elapsed time: 0.157073974609375 seconds, GC time: 0.04458403587341309

1000
Elapsed time: 0.1065411567687988 seconds, GC time: 0.005226850509643555
Elapsed time: 0.308441162109375 seconds, GC time: 0.0815742015838623

5000
Elapsed time: 0.5404751300811768 seconds, GC time: 0.03504753112792969
Elapsed time: 2.353021860122681 seconds, GC time: 1.003882169723511

10000
Elapsed time: 1.0449538230896 seconds, GC time: 0.03677964210510254
Elapsed time: 5.802626132965088 seconds, GC time: 2.604867458343506

15000
Elapsed time: 1.554717063903809 seconds, GC time: 0.04884672164916992
Elapsed time: 8.680172920227051 seconds, GC time: 3.113471746444702

20000
Elapsed time: 2.053093910217285 seconds, GC time: 0.0437476634979248
Elapsed time: 16.08238792419434 seconds, GC time: 7.611555337905884
#t

At 25,000 functions the lowering step segfaults :-)

timholy commented 10 years ago

It's a little worse than linear, whether you look at raw times or subtract the gc time. For example, going from 5k to 20k functions (4x the input), lowering takes roughly 6.8x longer (16.1 s vs 2.4 s), or about 6.3x with the GC time subtracted. This isn't so far from linear that I think it hints at anything nefarious, however.

JeffBezanson commented 10 years ago

@jakebolewski Could you post the script that causes the segfault? I assume that's not supposed to happen? :)

JeffBezanson commented 10 years ago

Another batch of optimizations is in. For me it takes using Gadfly from ~27 sec to 23.7 sec.