StefanKarpinski closed this issue 10 years ago.
There's some amount of processing of individual source files and packages that can be cached – parsing & lowering, and compressed AST generation. Some amount of type inference and code generation could also potentially be done independently, but that's much trickier. Full type inference, inlining, and code generation in general cannot be done until later. Currently, system image generation effectively takes a "snapshot" of the current process – what if we just make snapshotting really easy and make it possible to take a snapshot of a Julia process at any point and then start that snapshot again later? This gives a "layering" approach to precompilation and is quite easy to think about: get your REPL in a state you want to work from, snapshot it, and then start working from the snapshot instead of from an empty REPL.
dup of #4373
The main issue is what a snapshot would even mean. If I opened a file, or initialized a dll, what does the new process see? If I start a task, presumably this is part of the state too, but what does it mean to snapshot a Task?
Whole-system snapshots don't seem to be that different from our current userimg.jl approach, and that doesn't seem to be very popular. (I seem to be one of very few people reporting bugs with it...)
I think its main limitation is that anything I use heavily (i.e., in practice contributes a lot to my load times) also tends to be something I change frequently. I don't want to have to rebuild all of Julia to test a one-line fix. (Yes, you have to do that if you're modifying base Julia, but we all know that's not ideal, and it gets far worse if you're building Julia and precompiling a dozen packages.) So I find myself choosing between whether package A should be precompiled or remain flexible. In practice I tend to precompile one or two packages, tops.
We need a granular approach: package A can be precompiled as a standalone object, to be updated if A's source (or any macros in other packages it depends on) changes. I know it's not easy, but it's reality.
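To make the idea concrete, here is a minimal sketch of timestamp-based staleness checking. All names here (cache_stale, cachefile, sources) are hypothetical illustrations, not an existing Julia API:

```julia
# Hypothetical sketch: decide whether a cached build of package A is stale.
function cache_stale(cachefile::AbstractString, sources::Vector{String})
    isfile(cachefile) || return true      # never built: must compile
    cache_mtime = mtime(cachefile)
    # Stale if any source file (including files of dependencies that
    # provide macros A uses) is newer than the cache.
    any(src -> mtime(src) > cache_mtime, sources)
end
```

A real mechanism would need to track macro-providing dependencies and probably content hashes rather than timestamps, but this captures the basic invalidation rule.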
+1 for dependency tracking to determine when compilation needs to happen. I had to stop using userimg.jl because I would forget to recompile Julia when working on packages like DataFrames.
Speeding up the compiler can also be part of this. I don't know what kind of gains are possible, but 2x seems a conservative assumption. It's still not well understood why the system does so much work when loading packages. In theory, if a package contains only type and method definitions, loading it should take hardly any time, because compilation is mostly lazy. In practice, many packages run code at load time, but my impression is that they don't run enough stuff to require compiling a large percentage of the package.
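As a small illustration of that laziness (not from the thread): defining a method records only its signature and AST, and compilation happens at the first call for each combination of argument types.

```julia
# Defining a method is cheap: nothing is compiled yet.
f(x) = x^2 + 1

# The first call triggers type inference and native code generation for
# f(::Float64); later calls with the same argument type reuse that code.
@time f(2.0)   # timing includes JIT compilation
@time f(3.0)   # essentially just the call itself
```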
Agreed, especially given that profiling indicates that most of the time is spent in flisp.c. See https://github.com/JuliaLang/julia/issues/7280#issuecomment-46241911 and the gist posted below that.
commit 8973ed1f95857bce3505e7f5241dba1b5da7bb4d might change that feature of the profile, though it's clearly not enough.
@timholy have you gone back and profiled again after https://github.com/JuliaLang/julia/commit/8973ed1f95857bce3505e7f5241dba1b5da7bb4d?
Nope, but if your machine happens to give good C backtraces it's really easy to test this yourself:
using DataFrames, ProfileView
@profile reload("DataFrames") # can't have using inside @profile
ProfileView.view(C=true)
I just don't happen to be on such a machine at the moment.
(If Tk doesn't work for you, the IJulia interface is really nice.)
It's all in flisp.c. More than 90% of the time.
I can't export the graph for you easily at the moment because of https://github.com/timholy/ProfileView.jl/issues/16. But if the above statements work for you, you can easily get this (and far more) information directly yourself.
Maybe you can parse using JuliaParser to see where the possible hot-spots are (since it shares a common structure with the flisp parser)?
Actually, I could do a screen capture:
Open the images in a separate tab and you'll see them at high resolution. The first shows me hovering over a brown bar, at flisp.c, fl_applyn: line 761. That bar takes up about 60% of the horizontal width, meaning it (and things that it calls, which represent the stack above it) accounts for 60% of the time. The second image is me hovering over a green bar in the next big stack to the right, which is the same spot in the same file (but arrived at via a different call chain). That accounts for most of the rest of the time.
So basically everything except the two narrow stacks on the right is flisp.c. And guess what: the first of these is flisp.c, too! The very last substantive stack is in gf.c, remove_conflicting: line 1209.
julia> versioninfo()
Julia Version 0.3.0-rc3+8
Commit 1465486* (2014-08-12 06:46 UTC)
Platform Info:
System: Linux (x86_64-redhat-linux)
CPU: Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT NO_AFFINITY SANDYBRIDGE)
LAPACK: libopenblas
LIBM: libopenlibm
LLVM: libLLVM-3.3
That makes me think eval and/or macro expansion is the culprit, since that requires running the front-end's lowering passes. We know the parser is not this slow.
It's hard to find a significant package that doesn't use any macros or call eval. But I just tested with HyperDualNumbers, which has just a single
@vectorize_1arg Real hyper
and yet has >300 lines of code.
The results are basically the same. If I hover over the bar corresponding to ./boot.jl, include: line 245:
and then work my way upward until it starts fragmenting into subfunctions, I get this call chain:
./boot.jl, include: line 245
/usr/local/julia/julia/src/toplevel.c, jl_load: line 578
/usr/local/julia/julia/src/toplevel.c, jl_parse_eval_all: line 531
/usr/local/julia/julia/src/ast.c, jl_parse_next: line 514
/usr/local/julia/julia/src/flisp/flisp.c, fl_applyn: line 761
/usr/local/julia/julia/src/flisp/flisp.c, _applyn: line 725
/usr/local/julia/julia/src/flisp/flisp.c, apply_cl: line 1892
/usr/local/julia/julia/src/flisp/flisp.c, do_trycatch: line 935
For the narrower stack immediately to the right, it starts this way:
./boot.jl, include: line 245
/usr/local/julia/julia/src/toplevel.c, jl_load: line 578
/usr/local/julia/julia/src/toplevel.c, jl_parse_eval_all: line 541
/usr/local/julia/julia/src/toplevel.c, jl_toplevel_eval_flex: line 455
/usr/local/julia/julia/src/ast.c, jl_expand: line 551
/usr/local/julia/julia/src/flisp/flisp.c, fl_applyn: line 761
and then continues in the same fashion (with the exact same lines).
Again, it's probably >90% of the total time, and this for a package with almost no macros. HyperDualNumbers also takes very little total time to load. For a package that loads very quickly, it's possible most of the time is in the parser.
I have some code around showing that parsing a few hundred simple comprehensions in a function takes a few milliseconds. Lowering those to Julia AST for sticking in the function table then takes several seconds.
Nope. Just created my own, super-exciting package (I should register this sucker :wink:):
function makescript(filename)
open(filename, "w") do file
for i = 1:10000
funcname = "myfunc$i"
write(file, "function $funcname(x)\n x^2\nend\n")
end
end
end
makescript("/tmp/testscript.jl")
@profile include("/tmp/testscript.jl")
Load time is 6.8 seconds. And the result is even clearer: it looks like about 97% of the time is in flisp.c.
As I believe I once mentioned, I have seen an extremely rare bug in ProfileView, but in this case I've checked Profile.print(io, C=true) and gone through the resulting file in emacs. It confirms the ProfileView results.
Man, if we just cache the lowered code, who cares about full precompilation?!?
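As an illustration of what caching lowered code would buy (using the modern name Meta.lower for the lowering entry point, which postdates this thread; persisting the lowered form to disk is omitted here as an assumption):

```julia
# Parsing and lowering (the flisp work) happen once, up front; the
# resulting thunk can be evaluated directly at "load" time, so the
# lowering passes would not need to run again.
ex = Meta.parse("square(x) = x^2")   # surface AST
lowered = Meta.lower(Main, ex)       # front-end lowering passes
eval(lowered)                        # defines square without re-lowering
```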
Wow that is really shady. The front end must be doing something really stupid.
Fortunately, flisp has a profiler:
https://github.com/JeffBezanson/femtolisp/blob/master/tests/test.lsp#L219
Well, of course these don't test actually using the code, so of course we'll want compilation too.
To test this, I modified my script to call each one of these 10000 functions once:
function makescript(filename)
open(filename, "w") do file
for i = 1:10000
funcname = "myfunc$i"
write(file, "function $funcname(x)\n x^2\nend\n")
end
write(file, "x = 3.2\n")
for i = 1:10000
funcname = "myfunc$i"
write(file, "$funcname(x)\n")
end
end
end
The load time went from 6.8 seconds to 18 seconds. However, lowering is still more than half the time (remember that calling the functions added another 10001 lines to the file). Of the remainder, jl_compile and jl_generate_fptr account for most of the time; inference is roughly 10%.
Glad to hear about the flisp profiler! I'll have to leave this one up to you.
Parsing performance is definitely not the problem.
julia> @time require("/tmp/testscript.jl")
elapsed time: 85.278999524 seconds (140911932 bytes allocated, 0.18% gc time)
julia> @time parse(open(readall, "/tmp/testscript.jl"));
elapsed time: 1.522348909 seconds (8937864 bytes allocated)
julia> @time Parser.parse(open(readall, "/tmp/testscript.jl"));
elapsed time: 1.121571927 seconds (66361048 bytes allocated, 9.37% gc time)
This discussion has been fruitful. Everybody, I am really sorry. I made an absurd mistake, which has been causing us to call lowering twice on many top-level expressions. I was mis-identifying which expressions needed to be expanded. Oh-so-usefully, the lowering pass is idempotent so I never noticed until I profiled it carefully.
454344fcea17021cb6ca5687d0a9f41daedd7e9e fixes it. For me, Gadfly now loads in 27 seconds instead of 48. Too embarrassed to celebrate :|
Julia's twice as fast now! Rejoice!
True, it's not bad for a 1-line change.
Really, Jeff, try not to be so hard on yourself! Most of us love everything else you've done in Julia so much that we're more than willing to give you a break for something like this--we're just happy that it's faster now! Cheers!
Thanks. It's ok, I just get scared thinking about how much time people have had to spend waiting for packages to load if you add it all up...
I think you need to sneak a few more of these things in here. Subtly, of course. Then, right before a release, BLAMO! Huge speed improvements! ;)
By the way, here is how to profile the front-end:
1. Put profile.scm (https://gist.github.com/JeffBezanson/f58c2bc1cb862cbd4e63) in src/. You have to list the functions to profile at the bottom.
2. Add (load "profile.scm") at the bottom of jlfrontend.scm.
3. From Julia, run ccall(:jl_lisp_prompt, Void, ()) to get a front-end lisp prompt.
4. At that prompt, run (show-profiles).
I get
Function #Calls Time (seconds)
-------- ------ --------------
expand-toplevel-expr 10007 3.036357641220093
expand-toplevel-expr-- 10007 2.878895044326782
julia-expand1 10010 1.948884010314941
julia-parse 10011 1.167325019836426
julia-expand01 10010 0.6922774314880371
to-goto-form 10010 0.6674525737762451
identify-locals 10010 0.5465188026428223
analyze-variables 10010 0.5336084365844727
to-LFF 10010 0.3194801807403564
expand-binding-forms 10010 0.2210719585418701
flatten-scopes 10010 0.1799592971801758
toplevel-expr-globals 10010 0.1713950634002686
expand-forms 10010 0.1364684104919434
julia-expand-macros 10007 0.03557300567626953
find-possible-globals 10010 0.0246436595916748
Can we cherry pick this onto release-0.3 or would that be getting greedy?
Already done :)
Wow great find! These are the numbers I was getting on the scheme side of things when poking around earlier tonight. https://gist.github.com/jakebolewski/105415d9a0c1883135e7.
Interesting. identify-locals clearly needs work; it seems to scale especially badly and end up taking most of the time. This is kind of good news since it's only ~50 lines of code. to-goto-form also takes a bit longer than it's entitled to.
Yeah I kind of got sidetracked when I noticed that. It seems like a good optimization opportunity.
True, it's not bad for a 1-line change.
To quote @JeffBezanson: "Would be great to find more low-hanging fruit like this." (https://github.com/JuliaLang/julia/commit/8973ed1f95857bce3505e7f5241dba1b5da7bb4d)
Really nice, @JeffBezanson! And thanks for the instructions on profiling the front-end.
@JeffBezanson, still looking at this from the @profile side of things and flisp performance. It looks like most of the time is spent on this (or occasionally this) nest of calls. The last is a bit of a guess, but it looks like the only line near the one indicated in the backtrace (1272, not 1270) that seems plausible.
It's recursive, so I see this at least 8 deep. The part of this that seems to do real work is apply_cl, since it's the only one for which one sees that other lines in the same function contribute to the execution time.
Anything to be done about this? Or is it simply a reflection of the amount of work that needs doing?
I'm also interested in figuring out where the bottlenecks in the compile chain are. Unfortunately, once it gets inside LLVM I don't get file/function names or line numbers anymore. I did a
make -C deps clean-llvm
make LLVM_DEBUG=1 -j 8
and while it's clear from the output that it did a Debug build, it didn't help. Any ideas?
Did you clean julia? We link LLVM statically, so you have to rebuild libjulia after reinstalling.
Having just been given a big performance boost, I'll show my gratitude by being greedy for more, but bear with me.
A couple more fun observations: macros (at least, simple ones) are not deadly for lowering performance, and their impact is on the flisp side, not the Julia side. Using a fresh build, the original version of makescript (the one that does not call the functions) gave this:
julia> @time include("/tmp/testscript.jl")
elapsed time: 4.690292857 seconds (17599748 bytes allocated)
myfunc10000 (generic function with 1 method)
When I changed it to
function makescript(filename)
open(filename, "w") do file
write(file, """
macro square(ex)
dosquare(ex)
end
function dosquare(ex)
:((\$ex)^2)
end
""")
for i = 1:10000
funcname = "myfunc$i"
write(file, "function $funcname(x)\n @square x\nend\n")
end
end
end
I got
julia> @time include("/tmp/testscript.jl")
elapsed time: 5.456327574 seconds (18961028 bytes allocated)
myfunc10000 (generic function with 1 method)
One part that surprised me was how little extra memory got allocated. Even more interestingly, check out the profile result (using format=:flat):
julia> Profile.print(format=:flat)
Count File Function Line
6110 ./boot.jl include 245
6110 ./loading.jl include_from_node1 128
10 /tmp/testscript.jl anonymous 2
6 /tmp/testscript.jl dosquare 5
6110 REPL.jl eval_user_input 54
6110 profile.jl anonymous 14
6110 task.jl anonymous 96
The julia side of this, dosquare, is negligible. So I infer that the actual impact of macros is that (1) you have to lower the original macro expression, and (2) you then have to lower the expression produced by the macro. Both are handled in flisp. Are you sure we couldn't write the lowering part in Julia and get better performance :smiling_imp:? No doubt it would hammer the gc, but hopefully we're about to get a better gc anyway.
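Those two stages can be seen directly with macroexpand (shown with its modern two-argument form; in 0.3 it took only the expression): the front end lowers the expression containing the macro call, and then must also lower whatever the macro returns.

```julia
# Same shape as the dosquare example above.
macro square(ex)
    :(($ex)^2)
end

ex = :(@square 3)                        # stage 1: contains the macro call
expanded = macroexpand(@__MODULE__, ex)  # stage 2: the macro's output
# `expanded` is an ordinary call expression, which must be lowered too.
```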
Finally, I'll point out that even for loading a package like Gadfly, flisp is >90% of the time. So if nothing else, caching the lowered code should be a fantastic step forward.
Thanks Keno, that did the trick.
I'll try to post the svg, if chrome ever finishes rendering it. (Tk renders it in a snap, whereas chrome has been working on it for > 2 minutes now. Less enthusiastic now about the-browser-as-a-graphics-toolkit :smile:.)
But as an executive summary, these are the 4 biggest items called from FunctionPassManager:
/usr/local/julia/julia/deps/llvm-3.3/lib/Analysis/LoopInfo.cpp, _ZN4llvm8LoopInfo13runOnFunctionERNS_8FunctionE: line 579
/usr/local/julia/julia/deps/llvm-3.3/lib/IR/Dominators.cpp, _ZN4llvm13DominatorTree13runOnFunctionERNS_8FunctionE: line 73
/usr/local/julia/julia/deps/llvm-3.3/lib/Transforms/InstCombine/InstructionCombining.cpp, _ZN4llvm12InstCombiner13runOnFunctionERNS_8FunctionE: line 2475
/usr/local/julia/julia/deps/llvm-3.3/lib/Transforms/InstCombine/InstructionCombining.cpp, _ZN4llvm12InstCombiner13runOnFunctionERNS_8FunctionE: line 2499
and another 3 noteworthy ones are
/usr/local/julia/julia/deps/llvm-3.3/lib/Transforms/Scalar/EarlyCSE.cpp, runOnFunction: line 604
/usr/local/julia/julia/deps/llvm-3.3/lib/Transforms/Scalar/GVN.cpp, runOnFunction: line 2299
/usr/local/julia/julia/deps/llvm-3.3/lib/Transforms/Scalar/Reassociate.cpp, runOnFunction: line 1985
From MachineFunctionPass.cpp, the huge item is
/usr/local/julia/julia/deps/llvm-3.3/lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp, _ZN4llvm16SelectionDAGISel20runOnMachineFunctionERNS_15MachineFunctionE: line 391
Don't know if that suggests any windows of opportunity. For example, if we could have very-fast-but-not-optimal compilation by default, people's practical/subjective experience might be improved. People who want the best runtime performance, at the cost of longer compilation, could start julia with julia -O. Or something.
@timholy flisp has its own GC, which is independent of Julia's GC, so allocation happening in flisp will not show up in the final GC stats. Any allocation you are seeing due to the frontend is mostly allocation happening within the ast.c shim layer during the conversion from flisp's sexpr AST to Julia's Expr objects.
I haven't done any rigorous profiling of the lowering code, but I wouldn't be surprised if a full AST copy is happening at one or more points on the lowered AST. @JeffBezanson, is this correct? Regardless, there is always the cost of allocating at least the initial lowered form of the AST. Reducing or combining the number of passes in the frontend should help reduce allocation pressure and be much faster. I want to look more into identify-locals, but it is doing a full traversal of the AST, which seems expensive.
I would like to see a move to a Julia frontend at some point. It was one of the reasons I ported over the parser. However, in the beginning it wasn't clear that this would necessarily be a performance win, as flisp is quite fast. That side of things is looking better, but it is still an all-or-nothing proposition. You have to rewrite the whole frontend and show that the gains are significant enough to overcome the additional bootstrapping complexity. I'm also sympathetic to the view that such rewrites do not necessarily have big end-user advantages. However, I feel a rewrite in this case would make it easier to solve some longstanding issues such as better parser error messages, poor error messages due to macro expansion, and macro-hygiene issues.
I agree a rewrite would ultimately be worth it.
The interesting thing is that all the passes do full AST traversals, and yet identify-locals is still much slower than most.
Profiling this a bit more, it seems like gc accounts for ~half the runtime of the lowering passes. It represents ~45% of the total front-end time when profiling some of @timholy's examples, for instance. The parser doesn't hit the gc as hard (<20% of the time in gc). (Note: this is the gc built into flisp.)
Parser Elapsed time: 1.807146072387695 seconds
gc time 0.42045 sec (21.4%)
Syntax Elapsed time: 6.710570812225342 seconds
gc time 3.71732 sec (43.2%)
Function #Calls GC Time (seconds)
-------- ------ --------------
toplevel-expr-globals 1 0.01719307899475098
julia-expand-macros 1 0.03013706207275391
flatten-scopes 1 0.09003400802612305
analyze-variables 1 0.1040594577789307
julia-expand01 1 0.1680479049682617
to-goto-form 1 1.050569534301758
add-local-decls 1 1.821250200271606
identify-locals 1 1.821250200271606
julia-expand1 1 3.065913200378418
expand-toplevel-expr-- 1 3.281291246414185
expand-toplevel-expr 1 3.296869277954102
expand-toplevel-expr- 1 3.296869277954102
Function #Calls Time (seconds)
-------- ------ --------------
julia-expand-macros 1 0.05322694778442383
flatten-scopes 1 0.2244889736175537
toplevel-expr-globals 1 0.3535900115966797
analyze-variables 1 0.4261181354522705
julia-expand01 1 0.5767209529876709
to-goto-form 1 0.7994349002838135
add-local-decls 1 3.806056976318359
identify-locals 1 3.806058883666992
julia-expand1 1 5.256119966506958
expand-toplevel-expr-- 1 6.240168809890747
expand-toplevel-expr- 1 6.355370044708252
expand-toplevel-expr 1 6.355376005172729
Great that you can so easily extract gc time. Does the gc proportion scale as you'd expect? I.e., when testscript.jl has a large number of functions, the runtime should scale linearly and the gc proportion should be constant.
It seems to behave as you would expect (running your macro example above).
julia> macro square(ex)
dosquare(ex)
end
julia> function dosquare(ex)
:(($ex)^2)
end
dosquare (generic function with 1 method)
julia> ccall(:jl_lisp_prompt, Void, ())
; _
; |_ _ _ |_ _ | . _ _
; | (-||||_(_)|__|_)|_)
;-------------------|----------------------------------------------------------
> (load "test.scm")
500
Parser Elapsed time: 0.05537104606628418 seconds, GC time: 0.005069255828857422
Syntax Elapsed time: 0.157073974609375 seconds, GC time: 0.04458403587341309
1000
Elapsed time: 0.1065411567687988 seconds, GC time: 0.005226850509643555
Elapsed time: 0.308441162109375 seconds, GC time: 0.0815742015838623
5000
Elapsed time: 0.5404751300811768 seconds, GC time: 0.03504753112792969
Elapsed time: 2.353021860122681 seconds, GC time: 1.003882169723511
10000
Elapsed time: 1.0449538230896 seconds, GC time: 0.03677964210510254
Elapsed time: 5.802626132965088 seconds, GC time: 2.604867458343506
15000
Elapsed time: 1.554717063903809 seconds, GC time: 0.04884672164916992
Elapsed time: 8.680172920227051 seconds, GC time: 3.113471746444702
20000
Elapsed time: 2.053093910217285 seconds, GC time: 0.0437476634979248
Elapsed time: 16.08238792419434 seconds, GC time: 7.611555337905884
#t
At 25,000 functions the lowering step segfaults :-)
It's a little worse than linear, whether you look at raw times or subtract the gc time. For example, for 20k it's about 6-fold worse than 5k no matter how you slice it. This isn't so far from linear that I think it hints at anything nefarious, however.
@jakebolewski Could you post the script that causes the segfault? I assume that's not supposed to happen? :)
Another batch of optimizations is in. For me it takes using Gadfly from ~27 sec to 23.7 sec.
We already have precompilation of the Julia system image. There has been a lot of discussion of how to precompile packages; this issue is for working out how to approach that.