Closed: ruby0x1 closed this issue 8 years ago
Could you try compiling with -D no_simplify?
haxe bin/mac64.build/build.hxml -D no_simplify --times
Total time : 2.230s
------------------------------------
filters : 0.400s, 18%
macro execution : 0.270s, 12%
other : 1.050s, 47%
parsing : 0.240s, 11%
typing : 0.270s, 12%
real 0m2.278s
user 0m2.165s
sys 0m0.102s
Seems related!
Ouch! Thanks for checking. I knew that the SSA transformer was a bit slow, but I've never had any such issues with the simplifier. I'll see what I can find.
Great, thanks. This way I can at least work without reverting each time.
I pinged you on gitter, you can also find me on IRC if you need me to try stuff, set up a dev env or anything like that.
This was mostly addressed by https://github.com/HaxeFoundation/haxe/commit/933778e13742d6ea327e31bb6340fb712f0092ab
New stats:
Total time : 4.280s
------------------------------------
analyzer-simplify-apply : 0.190s, 4%
filters : 2.050s, 48%
macro execution : 0.330s, 8%
other : 1.170s, 27%
parsing : 0.250s, 6%
typing : 0.290s, 7%
I would like to keep this issue open because gen_local has problems. Take a look at this:
class Main {
static function main() {
makeSomeNoise();
}
macro static function makeSomeNoise() {
var el = [];
for (i in 0...10000) {
el.push(macro [1]);
}
return macro $b{el};
}
}
Total time : 0.325s
------------------------------------
filters : 0.108s, 33%
macro execution : 0.109s, 34%
neko generation : 0.048s, 15%
other : 0.000s, 0%
parsing : 0.029s, 9%
typing : 0.031s, 9%
Now we change the generated expression to a Map comprehension which internally creates a temp variable:
class Main {
static function main() {
makeSomeNoise();
}
macro static function makeSomeNoise() {
var el = [];
for (i in 0...10000) {
el.push(macro [1 => 2]);
}
return macro $b{el};
}
}
Total time : 25.143s
------------------------------------
filters : 0.934s, 4%
macro execution : 0.158s, 1%
neko generation : 0.202s, 1%
other : 0.000s, 0%
parsing : 0.029s, 0%
typing : 23.819s, 95%
If I replace the internal gen_local with alloc_var "tmp" (and let rename_local_vars deal with the rest):
Total time : 1.429s
------------------------------------
filters : 0.895s, 63%
macro execution : 0.175s, 12%
neko generation : 0.201s, 14%
other : 0.000s, 0%
parsing : 0.030s, 2%
typing : 0.127s, 9%
I know we usually don't have 10000 temp vars, but that's two orders of magnitude!
@ncannasse: Is there any reason we actually need gen_local, or could we just alloc every generated temp var as "tmp" and let rename_local_vars rename it? That should also give a small boost to completion performance because we don't do any string acrobatics until the post-processing stage.
We could try it, I think gen_local is quite old and was most likely introduced before rename_local_vars.
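To illustrate the cost being discussed, here is a hypothetical Python sketch (not the compiler's actual OCaml code): generating each temp name by probing "tmp", "tmp1", "tmp2", ... until one is free means the nth allocation performs about n probes, which is quadratic overall, whereas a simple counter does constant work per variable.

```python
# Hypothetical sketch of the two naming strategies discussed above.
# Not the actual compiler code; names and structure are illustrative.

def gen_local_scan(existing):
    # gen_local-style: probe candidate names until a free one is found.
    # The number of probes grows with the number of existing locals,
    # so allocating n locals costs O(n^2) probes in total.
    i = 0
    name = "tmp"
    while name in existing:
        i += 1
        name = "tmp" + str(i)
    existing.add(name)
    return name

def gen_local_counter(counter):
    # Numeric-ID style: one increment per variable, no probing.
    # Collisions are left for a later rename_local_vars-like pass.
    counter[0] += 1
    return "tmp" + str(counter[0])

names = set()
scanned = [gen_local_scan(names) for _ in range(5)]
counter = [0]
counted = [gen_local_counter(counter) for _ in range(5)]
print(scanned)  # ['tmp', 'tmp1', 'tmp2', 'tmp3', 'tmp4']
print(counted)  # ['tmp1', 'tmp2', 'tmp3', 'tmp4', 'tmp5']
```

This is why 10000 generated temp vars can swing the timings by two orders of magnitude even though each individual allocation looks cheap.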
My test was invalid (I'm guessing I had rename_local_vars commented out). The actual times are these:
Total time : 25.703s
------------------------------------
command : 0.162s, 1%
filters : 25.147s, 98%
generate js : 0.089s, 0%
macro execution : 0.157s, 1%
other : 0.001s, 0%
parsing : 0.026s, 0%
typing : 0.121s, 0%
It's still preferable to have this because at least completion can work, but rename_local_vars obviously has the same problems as gen_local.
Bah, there's actually code which relies on the name being unique during typing, so I don't see an easy fix for this right now.
@Simn maybe we should replace this code by using var ids? That should give us a bit of a speedup as well.
I thought about that too, but then I remembered you wanting the temp vars to be nicely enumerated. Or was that just about them being enumerated consistently for the compilation server?
It's still not entirely safe because in theory there could be a user variable named tmp1234.
I don't remember why I would care that much about local variable names, as long as the output is stable (doesn't vary between compilations) and we keep the original user var naming as much as possible (for code output readability).
Using the var ids works nicely. There are some problems with the optimization tests though which compare the output with assumptions on the temp var naming.
Now that 3.2.0 is released (woo!) I was curious if there was any further investigation here?
For reference, against Haxe 3.1.3:
time haxe bin/mac64.build/build.hxml --times
Total time : 2.810s
------------------------------------
filters : 0.660s, 23%
macro execution : 0.280s, 10%
other : 1.270s, 45%
parsing : 0.260s, 9%
typing : 0.340s, 12%
real 0m2.873s
user 0m2.732s
sys 0m0.140s
And 3.2.0
sven$ time haxe bin/mac64.build/build.hxml --times
Total time : 5.540s
------------------------------------
analyzer-simplify-apply : 0.160s, 3%
filters : 2.480s, 45%
macro execution : 0.330s, 6%
other : 1.940s, 35%
parsing : 0.290s, 5%
typing : 0.340s, 6%
real 0m5.601s
user 0m5.431s
sys 0m0.171s
That's a lot slower than before, and it concerns me because one of the arguments for Haxe, fast iteration, has slipped: I can no longer praise this when developing against the C++ target. With the JS target it used to be a 2:1 difference while still being super quick in wall clock time, but now it's 5:1, and that's wall clock time spent staring at the CLI. Previously I'd blink and it would already be continuing to the actual slow part (the C++ compiler).
Here is 3.2.0 with -D no-simplify, which is on par with 3.1.3 and back to respectably quick iteration times. So my real question, I guess, is: what are the negative consequences of permanently enabling the no-simplify flag? I know the team does excellent work on the input and output of the compiler, so I hesitate to assume it's of no consequence. I don't want to disable things that are on by default, but the cost of builds taking this much longer is far from negligible.
Is there anything I can help test with? Anything to investigate on the user side?
sven$ time haxe bin/mac64.build/build.hxml --times -D no-simplify
Total time : 2.720s
------------------------------------
filters : 0.500s, 18%
macro execution : 0.320s, 12%
other : 1.220s, 45%
parsing : 0.240s, 9%
typing : 0.440s, 16%
real 0m2.770s
user 0m2.633s
sys 0m0.135s
For comparison, with the js target:
3.1.3
time haxe bin/web.build/build.hxml --times
Total time : 1.040s
------------------------------------
filters : 0.250s, 24%
generate js : 0.070s, 7%
macro execution : 0.270s, 26%
other : 0.000s, 0%
parsing : 0.240s, 23%
typing : 0.210s, 20%
3.2.0
time haxe bin/web.build/build.hxml --times
Total time : 1.150s
------------------------------------
filters : 0.260s, 23%
generate js : 0.070s, 6%
macro execution : 0.320s, 28%
other : 0.020s, 2%
parsing : 0.230s, 20%
typing : 0.250s, 22%
real 0m1.166s
user 0m1.093s
sys 0m0.073s
The whole simplifier/analysis part is in its first iteration so there are still a few implementation problems with it that can be optimized. The reason simplification is enabled by default is to ensure correct evaluation order on the C++ target. Consider this example:
class Main {
static function main() {
var x = 0;
call(x++, x);
}
static function call(a:Int, b:Int) {
Sys.println(a + " " + b);
}
}
With the simplifier this translates to:
int x = (int)0;
int tmp = (x)++;
int tmp1 = x;
::Main_obj::call(tmp,tmp1);
Whereas without it you get:
int x = (int)0;
::Main_obj::call((x)++,x);
This version has undefined evaluation order in C++. You will likely get 0 0 as output as opposed to the expected 0 1.
In 3.1.3 there was a simplified version of the simplifier which was much faster but missed some cases. My strategy here is to first get something that works, cover all cases in the tests and then start optimizing the implementation. We also lose a lot of time simply because the AST is much larger than it used to be which can be alleviated by reconstructing parts of the AST that are harmless.
@Simn could you investigate ways to speed up this part? Even with a big AST, that should not take most of the time spent in the Haxe compiler (if we exclude "other", which is the C++ compiler, simplify seems to take 55% of the compilation time by itself).
Where does the 55% figure come from? I'm looking at analyzer-simplify-apply : 0.160s, 3%. Our unit tests have analyzer-simplify-apply : 0.072s, 0% with other : 52.346s, 93%.
haxe compile-cpp.hxml --times --no-output
Total time : 2.815s
------------------------------------
analyzer-simplify-apply : 0.071s, 3%
filters : 1.329s, 47%
And with rename_local_vars commented out:
haxe compile-cpp.hxml --times --no-output
Total time : 1.778s
------------------------------------
analyzer-simplify-apply : 0.056s, 3%
filters : 0.321s, 18%
So this is again the same problem which I originally identified and there is no reason to panic.
(time spent in filters 2.48 − time spent in filters with no-simplify 0.5) = 1.98, divided by (total compilation time 5.54 − time spent in other 1.94) = 3.60, gives 1.98 / 3.60 ≈ 0.55
Maybe it's not analyzer-simplify-apply but some other filter that gets applied when there's no -D no-simplify and that takes this time. Don't worry, I'm not panicking in any way, I just want it investigated so we can improve our compilation speed :)
Yes but I almost panicked because I thought this was some new problem that suddenly came up. :)
I will look into optimizing rename_local_vars. My thinking is that we should just use the variable IDs if we use the simplifier without unapplying it. In that case the output is gonna "look like a mess" anyway, so we might as well use numeric variable names.
The reason simplification is enabled by default is to ensure correct evaluation order on the C++ target.
Does this mean js doesn't use it (yet)?
Also: I am curious about "other". Considering that JS is 0.000/0.020 and C++ is 1.270/1.940 for 3.1.3/3.2.0 respectively, this is also potentially a place for a general improvement (aside from the simplification costs); perhaps a few hundred ms can be shaved on the C++ side here too.
I don't doubt the vast differences in target, but i am curious what makes the difference that wide, from 0 to 1.27s. Is there a way to make the timing as verbose as possible?
It looks like C++ has "other" because it has no "generate cpp" timer. I'll unify the timer behavior so the individual generators don't have to deal with it.
And no, JS doesn't use the simplifier, because it defines the same left-to-right evaluation order Haxe does.
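To make that concrete, here is a small JS sketch of the same pattern shown in the C++ example earlier in the thread. Since JS guarantees left-to-right evaluation of call arguments, the result is well-defined without any temp-var hoisting:

```javascript
// Same pattern as the earlier C++ example: call(x++, x).
// JS evaluates call arguments strictly left to right, so x++ yields 0
// first, and only then is x read, by which time it is 1.
function call(a, b) {
  return a + " " + b;
}

let x = 0;
const result = call(x++, x);
console.log(result); // "0 1", the order Haxe semantics expect
```

This is exactly the "expected 0 1" output from the C++ discussion above, obtained here without the simplifier's temp variables.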
@underscorediscovery yes, other is most likely time spent by the C++ compiler/linker, which JS does not have
@ncannasse the C++ compiler & linker is not included in the C++ generation time above. I was specifically referring to "other" in the haxe --times list.
? Yes, I'm also talking about that "other". Why are you saying it's not included? Usually the Haxe compiler will call the hxcpp build tool, which will run the C++ compiler. This should be included in "other" unless hxcpp has a timer specific to it, or you disabled compilation in your build.hxml, which we can't know because we don't know what's inside.
I clarified my reply, I was trying to be explicit/clear. There is no c++ compilation/linking happening.
(I was in the mindset that this post is about cpp gen, I would not include the c++ compiler time in there. Also: I would love if c++ compiling/linking took < 2s! But no - I wouldn't hold haxe to the c++ compiler times ofc, apologies for confusion of the wording)
I thought this was just because the output files had no changes which would only invoke the linker. In that case I'm not sure what "other" actually is, but as I said I'll go ahead and work on the timers.
Ah, ok, well here is some more reference of the code I was testing against.
sven$ haxe bin/mac64.build/build.hxml --times --no-output -D no-compilation
Total time : 3.830s
------------------------------------
analyzer-simplify-apply : 0.210s, 5%
filters : 2.530s, 66%
macro execution : 0.400s, 10%
other : 0.000s, 0%
parsing : 0.300s, 8%
typing : 0.390s, 10%
sven$ haxe bin/mac64.build/build.hxml --times -D no-compilation
Total time : 5.840s
------------------------------------
analyzer-simplify-apply : 0.160s, 3%
filters : 2.600s, 45%
macro execution : 0.350s, 6%
other : 2.070s, 35%
parsing : 0.250s, 4%
typing : 0.410s, 7%
This is haxe 3.2.0 and the "other" seems to imply it's the c++ output itself.
I have updated the timers. From our unit tests I'm now getting:
analyzer-simplify-apply : 0.053s, 1%
filters : 1.332s, 13%
generate cpp : 1.579s, 16%
generate cpp - native compilation : 5.577s, 56%
other : 0.020s, 0%
parsing : 0.259s, 3%
typing : 0.513s, 5%
Let's wait for a few days until we can merge the haxe-3.3 branch into development so you can check for yourself with the nightly builds.
Cool, I'll keep an eye out, thanks. Also note the times above: "other" is 0.000 when there is --no-output. That's 2s+; if that is somehow just unoptimized, it might be a good place to save time short-term until the simplifier progresses further.
That "other" is likely the part of gencpp that writes out the .cpp/.h files, i.e. "generate cpp" in the new timers. There might be room for improvement but I'm not sure if that's worth it given that the total compilation time is going to be dominated by the native compilation anyway.
This entire thread is about the time here being "worth it". It contributes 2/5 seconds, and 2/2.7 seconds in 3.1.3, this is definitely something to consider worth optimizing, if possible, as it's a large chunk of the entire problem (long cpp gen times).
That's not what your output says. Between your 3.2.0 and 3.1.3 times there's a 0.7s difference for "other", whereas there's a 2.2s difference for "filters". The latter is what Nicolas described as "in the Haxe compiler" and we should definitely address that. A portion of the 0.7s comes from the simple fact that the output is larger, and whatever difference remains is insignificant next to the native compilation times.
I was referring to the whole, my current problem/concern (as in 3.2.0 current) is that 5+ seconds is too long. If just less than half of that is the output - I was just positing it may be low hanging fruit - as you said you wanted to get the simplifier working right and I agree with the approach.
I'm sure this gencpp time can be improved, and this will offer a significant speedup in the case only one file has been modified since in that case we only compile this specific file + link
I don't doubt that there can be improvements, but significant speedup?
generate cpp : 1.496s, 14%
generate cpp - native compilation : 6.482s, 59%
That's for a no-change compilation. For any real compilation involving a few files the result is gonna be skewed towards the native compilation even more.
Granted, VC is not exactly known for its amazing compilation speed.
Just to re-note, I am not considering the cpp compilation times, ever, above. I never run cpp builds through haxe itself, my long wait is well before that happens - I don't run hxcpp via the haxe executable.
I see, so I suppose my problem is that I view both generation steps as belonging together.
Still, I don't think there's much to gain in the gencpp part itself.
I'll add my 5 cents. Haxe 3.1.3 and 3.2.0, openfl 2.2.4, hxcpp 3.1.68, gcc 4.8.4.
$ haxelib run openfl create PiratePig
$ cd PiratePig/
$ time haxelib run openfl build linux
real 0m58.774s
user 3m20.226s
sys 0m9.687s
$ time haxelib run openfl build linux
real 0m0.883s
user 0m0.791s
sys 0m0.089s
$ # comment some irrelevant line in Source/piratepig/PiratePig.hx
$ time haxelib run openfl build linux
real 0m1.676s
user 0m1.498s
sys 0m0.171s
The first one compiles everything; the second compiles nothing but still runs haxe; the last one compiles one cpp file, links one binary, and strips it.
$ haxe Export/linux64/cpp/haxe/release.hxml -D HXCPP_M64 --times
Total time : 0.690s
------------------------------------
filters : 0.120s, 17%
macro execution : 0.070s, 10%
other : 0.250s, 36%
parsing : 0.140s, 20%
typing : 0.110s, 16%
$ HAXE_STD_PATH=~/haxe/haxe-3.2.0/std:. ~/haxe/haxe-3.2.0/haxe Export/linux64/cpp/haxe/release.hxml -D HXCPP_M64 --times
Total time : 1.100s
------------------------------------
analyzer-simplify-apply : 0.010s, 1%
filters : 0.350s, 32%
macro execution : 0.090s, 8%
other : 0.390s, 35%
parsing : 0.130s, 12%
typing : 0.130s, 12%
$ HAXE_STD_PATH=~/haxe/haxe-3.2.0/std:. ~/haxe/haxe-3.2.0/haxe Export/linux64/cpp/haxe/release.hxml -D HXCPP_M64 --times -D no_simplify
Total time : 0.830s
------------------------------------
filters : 0.090s, 11%
macro execution : 0.090s, 11%
other : 0.350s, 42%
parsing : 0.140s, 17%
typing : 0.160s, 19%
The hxml used:
-main ApplicationMain
-cp /home/cheshkov/haxelib/openfl/2,2,4
-D openfl=2.2.4
-cp /home/cheshkov/haxelib/hxcpp/3,1,68
-D hxcpp=3.1.68
-cp /home/cheshkov/haxelib/actuate/1,8,1
-D actuate=1.8.1
-cp Source
-cp /home/cheshkov/haxelib/lime/2,0,6/legacy
-D tools=2.0.6
-D lime-legacy
-D no-compilation
-D native
-D openfl-native
-D lime-native
-D linux
-D desktop
--remap flash:openfl
-cp Export/linux64/cpp/haxe
-cpp Export/linux64/cpp/obj/
--macro keep("piratepig.PiratePig")
So yes, short iteration with little changes with native compilation can be as short as two seconds, and up to half of that is haxe itself.
Could you guys check again with current development version?
I looked into the local name issue again and it occurred to me that we might have to be careful with generated local names after all, in case a typed expression gets re-typed through macros. If we generated all locals with the same name, it could lead to a wrong ctx.locals lookup when two locals are generated in the same scope.
This is probably a rather rare case, but I would rather not take any chances. A better approach is to keep track of the number of locals per field and generate them as `_1`, `_2` etc.
Regarding rename_local_vars, I'm afraid using variable IDs would lead to terrible diffs, which at this point is something I care about. It might be better to just keep a count per field. This might lead to numbering holes in some cases, but I can't see anyone caring about that.
I'll give this a try as soon as I can (holidays), thanks!
I haven't heard of any problems so I'll consider this closed. Feel free to open a new issue if you come across similar problems.
@Simn I tried the latest nightly just now, sorry for the delay I have been swamped!
Testing against 3.3.0 (git build development @ a1d47d6)
I also just tried 3.2.1 on the same test and it took ~6.240s, way higher.
It definitely seems to be improved, good work. Now the bulk of the time is spent in generate cpp and in macros that I have to optimize:
haxe bin/mac64.build/build.hxml --times -D no-compilation
Total time : 4.840s
------------------------------------
analyzer-cleanup : 0.020s, 0%
analyzer-filter-apply : 0.020s, 0%
analyzer-from-texpr : 0.170s, 4%
analyzer-fusion : 0.120s, 2%
analyzer-to-texpr : 0.040s, 1%
filters : 0.580s, 12%
generate cpp : 1.470s, 30%
macro execution : 1.130s, 23%
other : 0.020s, 0%
parsing : 0.640s, 13%
typing : 0.630s, 13%
In a smaller code base the difference is bigger because macros are the largest cost:
haxe bin/mac64.build/build.hxml --times -D no-compilation
Total time : 1.920s
------------------------------------
analyzer-cleanup : 0.000s, 0%
analyzer-filter-apply : 0.000s, 0%
analyzer-from-texpr : 0.010s, 1%
analyzer-fusion : 0.020s, 1%
analyzer-to-texpr : 0.010s, 1%
filters : 0.150s, 8%
generate cpp : 0.120s, 6%
macro execution : 0.820s, 43%
other : 0.010s, 1%
parsing : 0.280s, 15%
typing : 0.500s, 26%
Cool, thanks for checking!
It possibly affects other platforms as well, but I have seen this on Mac.
I saw this happen around 4 months ago, but didn't require 3.2 and figured it was general development hiccups that would pass.
Now, I need to use 3.2 daily, and it's near impossible :)
The version I am using is
haxe_2015-01-27_development_456a1a8.tar.gz
No matter if I use the zip or the installer, the result is the same. I had tried latest on the 28th, but it wasn't working for other reasons. Below is the output, run directly from the CLI, using
haxe bin/mac64.build/build.hxml --times -v
A bonus screenshot: the profiler showing where it's spending its time, if that helps at all.