LouisJenkinsCS opened this issue 5 years ago (Open)
To back up my assertion that --ccflags=-O3 can significantly speed things up and that the --no-checks overhead is rather minimal in comparison, I ran the lulesh benchmark with perfcompopts and with various compilation arguments...
chpl lulesh.chpl
Total Time: 42.6639
Time/Cycle: 0.0806501
Number of cycles: 529
chpl lulesh.chpl --ccflags=-O3
Total Time: 3.96433
Time/Cycle: 0.00749401
Number of cycles: 529
chpl lulesh.chpl --fast
Total Time: 1.46651
Time/Cycle: 0.00277224
Number of cycles: 529
So in terms of slowdown: without any compilation flags (i.e. the default), it is roughly 10x slower than with C backend optimizations enabled and nearly 30x slower than with --fast; however, with just the C backend optimizations it is only about 2-3x slower than --fast, which is essentially what the added --no-checks accounts for. The takeaway is that you get the benefits of runtime checking as well as pretty darn good performance. Any thoughts?
Note: Obtained from release/1.19
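For anyone who wants to repeat this kind of comparison, here is a minimal shell sketch. It is not the exact setup used above (the perfcompopts configuration and input mesh from the original runs aren't reproduced); it assumes lulesh.chpl from the Chapel release's example benchmarks sits in the current directory, that chpl is on the PATH, and the output-binary naming is just a placeholder.

#!/usr/bin/env bash
# Sketch: compile lulesh with a few flag sets and run each resulting binary.
# lulesh itself prints Total Time / Time/Cycle / Number of cycles.
set -e
for flags in "" "--ccflags=-O3" "--fast"; do
  out="lulesh$(echo "$flags" | tr -cd '[:alnum:]')"   # lulesh, luleshccflagsO3, luleshfast
  echo "=== chpl lulesh.chpl $flags ==="
  chpl lulesh.chpl $flags -o "$out"
  ./"$out"
done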
Another important consideration is how much the optimizations impact compile time. (Edit: I think they're unlikely to have a big impact today, because the Chapel compiler itself is pretty slow, but one day the Chapel compiler will be faster. Nonetheless it's worth measuring).
Note that --baseline does not perform --no-fast behavior today. There are already changes that are applied by default which can get turned off by --baseline.
I tend to find myself wanting this as well, though I agree with Michael that additional compilation time is the primary reason we haven't turned -O on by default.
Collecting compilation time (makeBinary vs. total time) + execution time (w/ --filename=lmeshes/sedov15oct.lmesh):
chpl lulesh.chpl --print-passes --no-local
makeBinary took 20.162s (49.4%)
Total Time: 204.553
Time/Cycle: 0.386679
Number of cycles: 529
chpl lulesh.chpl --print-passes --no-local --ccflags=-O1
makeBinary took 47.904s (69.6%)
Total Time: 19.6616
Time/Cycle: 0.0371675
Number of cycles: 529
(~2x compilation time, 10x faster execution time)
chpl lulesh.chpl --print-passes --no-local --ccflags=-O2
makeBinary took 60.471s (71.3%)
Total Time: 18.9515
Time/Cycle: 0.0358252
Number of cycles: 529
(only slightly faster execution time, a bit longer compilation time)
chpl lulesh.chpl --print-passes --no-local --ccflags=-O3
makeBinary took 69.009s (76.8%)
Total Time: 11.7666
Time/Cycle: 0.0222431
Number of cycles: 529
chpl lulesh.chpl --print-passes --no-local --fast
makeBinary took 51.635s (74.0%)
Total Time: 4.61577
Time/Cycle: 0.00872546
Number of cycles: 529
In the end, there definitely is some extra compilation time, quite a bit in fact. However, comparing it to the 10x execution-time speedup, I tend to believe optimization should be on by default regardless of the additional compilation time.
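As a side note, the makeBinary numbers above come straight out of --print-passes; something along these lines pulls out just that line (a sketch; the exact formatting of the pass table may vary between compiler versions):

# Grab the backend-compile timing from the pass report.
chpl lulesh.chpl --no-local --ccflags=-O2 --print-passes 2>&1 | grep -i makeBinary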
I think that -O2 should be the default, as -O3 is considered to be "too aggressive". I have had issues with its optimizations in the past, and so has @ajoshpratt.
"I think that -O2 should be the default"
This is along the lines of what I was thinking about on the way in: maybe we default to something like -O1 or -O2, consider throwing something more aggressive when --fast is used, and have --baseline and -g drop these flags (where specific choices are likely to be compiler-dependent).
We are definitely in agreement then. I sort of assumed* --fast already threw -O2 in there, so having -O2 be the default is probably good.
*yes yes, I know.
If the user does chpl prog.chpl, it should perform -O2 optimization.
If the user does chpl -g prog.chpl, it should perform -Og optimization.
If the user does chpl --baseline prog.chpl, it should perform -O0.
If the user does chpl prog.chpl --fast, it should perform -O3 optimization (perhaps -Ofast?).
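Until the defaults actually change, a thin wrapper script along these lines approximates that mapping using today's flags. This is only a sketch: the name chplopt is made up, it only recognizes --fast / --baseline / -g as standalone arguments, and it assumes the extra --ccflags setting is simply passed through to the backend compile.

#!/usr/bin/env bash
# chplopt: hypothetical wrapper approximating the proposed optimization defaults.
#   (no flag)   -> add --ccflags=-O2
#   -g          -> add --ccflags=-Og
#   --baseline  -> add --ccflags=-O0
#   --fast      -> leave alone (already enables backend optimization)
extra=()
case " $* " in
  *" --fast "*)     ;;
  *" --baseline "*) extra=(--ccflags=-O0) ;;
  *" -g "*)         extra=(--ccflags=-Og) ;;
  *)                extra=(--ccflags=-O2) ;;
esac
exec chpl "$@" "${extra[@]}"

Usage would then look like chplopt prog.chpl, chplopt -g prog.chpl, and so on, with --fast and --baseline behaving as they do now.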
Is the current guidance to use --fast when compiling for production? As a user, I wouldn't necessarily want --fast to result in hard-to-fix bugs that are a result of optimization alone. I would just assume --fast would lead to reasonable stability, assuming it worked without it (I get that's not always the case, but). Perhaps a separate flag other than --fast for fast-but-possibly-unstable optimizations?
It's highly arbitrary, of course, but I sort of feel like not using --fast should avoid optimizations, as personally I tend to want to make sure my code is running before taking off the speed limits, as it were. However, that's just me, and I imagine other people have very different opinions. Just some minor input.
"I wouldn't necessarily want --fast to result in hard to fix bugs that are a result of optimization alone"
Can you clarify what you're referring to here?
In the event that -O3 does result in weird problems, adding it as a default to --fast could produce an issue. I wish I had more examples of -O3 giving problems; it was years ago that I was trying that level of optimization out.
EDIT: It could be that my information is super outdated on that. It does apparently still yield bugs where the programmer was trying something funky, though.
EDIT 2: THE RETURN: according to a random Stack Overflow result, it could even result in slowdowns due to funrolling the loops, so who knows. Seems like a dicey option. (https://stackoverflow.com/questions/11546075/is-optimisation-level-o3-dangerous-in-g)
Looking at it, currently Chapel with --fast compiles with -O3.
Edit: Added 'clangUtil', which directly states that -O3 is used.
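Anyone who wants to confirm what their own installation does can ask the compiler to echo its backend invocations; --print-commands shows the C compiler command lines, so something like the following sketch surfaces whatever -O level --fast ends up passing:

# Print the backend compile/link commands and pull out the -O flags they contain.
chpl --fast --print-commands lulesh.chpl 2>&1 | grep -o -- '-O[0-9a-z]*' | sort -u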
One compilation flag that is often a staple for any build, whether for benchmarking or not, is the --fast flag. However, from what I can tell, it does two things...
1) Uses --no-checks to disable all runtime checks for things like stack overflow, nil-pointer dereference, and bounds checking (good)
2) Enables C optimizations via -O (good)
However, I think that even without --fast, C backend optimizations as well as Chapel optimizations should be performed, and that if no optimization is desired, the user should use --baseline as a way to not perform any optimizations. Under most circumstances from what I can see, the user would want runtime checks as well as some decently fast code. In fact, when I was running some of my code with just --ccflags=-O3, I saw that it was almost as fast as --fast but came with the benefit of completing in time.
Hence I ask this: should backend compiler optimizations be enabled by default? If not, then why?
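For reference, the compromise argued for here (runtime checks kept on, backend optimizations turned on) is expressible with today's flags, and the cost of the checks alone can be isolated by toggling --no-checks at the same optimization level. A sketch, with prog.chpl standing in for any program:

# Checks on, backend optimization on (the configuration advocated above):
chpl --ccflags=-O3 prog.chpl -o prog_checked_opt
# Same optimization level with checks off, to see what --no-checks alone buys:
chpl --ccflags=-O3 --no-checks prog.chpl -o prog_unchecked_opt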