chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

[Discussion] Should optimizations be enabled by default? #13321

Open LouisJenkinsCS opened 5 years ago

LouisJenkinsCS commented 5 years ago

One compilation flag that is a staple of almost any build, whether for benchmarking or not, is --fast. However, from what I can tell, it does two things:

1) Uses --no-checks to disable all runtime checks, such as stack-overflow, nil-pointer-dereference, and bounds checking (good)
2) Enables C backend optimizations via -O (good)
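
In other words, and ignoring anything else --fast may imply, the two invocations below should be roughly equivalent for some prog.chpl (a sketch based only on the description above, not an exhaustive account of the flag):

chpl --fast prog.chpl
chpl --no-checks -O prog.chpl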

However, I think that even without --fast, C backend optimizations as well as Chapel optimizations should be performed, and that a user who wants no optimizations at all should use --baseline. Under most circumstances, from what I can see, the user wants runtime checks as well as decently fast code. In fact, when I ran some of my code with just --ccflags=-O3, it was almost as fast as with --fast, but with the benefit that runtime checks stayed enabled.
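
To make the runtime-check side of that tradeoff concrete, here is a minimal toy sketch (hypothetical code, not taken from any benchmark): with checks enabled, the out-of-bounds access below halts with a clear error, while under --no-checks (which --fast implies) its behavior is undefined.

var A: [1..10] int;
var i = 11;      // runtime value outside A's domain
writeln(A[i]);   // bounds check halts execution here when checks are enabled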

Hence I ask this: Should backend compiler optimizations be enabled by default? If not, why not?

LouisJenkinsCS commented 5 years ago

To back up my assertion that --ccflags=-O3 can significantly speed things up and that the overhead removed by --no-checks is rather minimal in comparison, I ran the lulesh benchmark with its perfcompopts and various compilation arguments...

chpl lulesh.chpl

Total Time: 42.6639
Time/Cycle: 0.0806501
Number of cycles: 529

chpl lulesh.chpl --ccflags=-O3

Total Time: 3.96433
Time/Cycle: 0.00749401
Number of cycles: 529

chpl lulesh.chpl --fast

Total Time: 1.46651
Time/Cycle: 0.00277224
Number of cycles: 529

So in terms of slowdown: without any compilation flags, i.e. the default, it is roughly 10x slower than with C backend optimizations enabled and nearly 30x slower than with --fast; with just C backend optimizations, however, it is only about 2.7x slower than --fast, and that remaining gap is essentially what the added --no-checks buys. The takeaway is that you keep the benefits of runtime checking while still getting pretty darn good performance. Any thoughts?
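
Working the ratios out from the timings above:

42.6639 / 3.96433 ≈ 10.8 (default vs. --ccflags=-O3)
42.6639 / 1.46651 ≈ 29.1 (default vs. --fast)
3.96433 / 1.46651 ≈ 2.7 (--ccflags=-O3 vs. --fast)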

Note: Obtained from release/1.19

mppf commented 5 years ago

Another important consideration is how much the optimizations impact compile time. (Edit: I think they're unlikely to have a big impact today, because the Chapel compiler itself is pretty slow, but one day the Chapel compiler will be faster. Nonetheless it's worth measuring).

lydia-duncan commented 5 years ago

Note that --baseline does not simply perform --no-fast behavior today: there are already changes applied by default that --baseline turns off.

bradcray commented 5 years ago

I tend to find myself wanting this as well, though I agree with Michael that additional compilation time is the primary reason we haven't turned -O on by default.

LouisJenkinsCS commented 5 years ago

Collecting compilation time (makeBinary vs. total time) plus execution time (with --filename=lmeshes/sedov15oct.lmesh):

chpl lulesh.chpl --print-passes --no-local

makeBinary took 20.162s (49.4%)

Total Time: 204.553
Time/Cycle: 0.386679
Number of cycles: 529

chpl lulesh.chpl --print-passes --no-local --ccflags=-O1

makeBinary took 47.904s (69.6%)

Total Time: 19.6616
Time/Cycle: 0.0371675
Number of cycles: 529

(~2x compilation time, 10x faster execution time)

chpl lulesh.chpl --print-passes --no-local --ccflags=-O2

makeBinary took 60.471s (71.3%)

Total Time: 18.9515
Time/Cycle: 0.0358252
Number of cycles: 529

(only slightly faster execution time, a bit longer compilation time)

chpl lulesh.chpl --print-passes --no-local --ccflags=-O3

makeBinary took 69.009s (76.8%)

Total Time: 11.7666
Time/Cycle: 0.0222431
Number of cycles: 529

chpl lulesh.chpl --print-passes --no-local --fast

makeBinary took 51.635s (74.0%)

Total Time: 4.61577
Time/Cycle: 0.00872546
Number of cycles: 529
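
Putting those numbers side by side, relative to the unoptimized build:

Compile time (makeBinary): -O1 ≈ 2.4x, -O2 ≈ 3.0x, -O3 ≈ 3.4x, --fast ≈ 2.6x
Execution time: -O1 ≈ 10x faster, -O2 ≈ 11x faster, -O3 ≈ 17x faster, --fast ≈ 44x faster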

In the end, there definitely is some extra compilation time, quite a bit in fact. However, weighing that against the 10x execution-time speedup, I still believe backend optimizations should be enabled by default regardless of the additional compilation time.

LouisJenkinsCS commented 5 years ago

I think that -O2 should be the default, as -O3 is considered to be "too aggressive". I have had issues with its optimizations in the past, and so has @ajoshpratt.

bradcray commented 5 years ago

I think that -O2 should be the default

This is along the lines of what I was thinking about on the way in: Maybe we default to something like -O1 or -O2, consider throwing something more aggressive when --fast is used, and have --baseline and -g drop these flags (where specific choices are likely to be compiler-dependent).

LouisJenkinsCS commented 5 years ago

We are definitely in agreement then.

astatide commented 5 years ago

I sort of assumed* --fast already threw -O2 in there, so having -O2 be the default is probably good.

* yes yes, I know.

LouisJenkinsCS commented 5 years ago

If the user does chpl prog.chpl, it should perform -O2 optimization.
If the user does chpl -g prog.chpl, it should perform -Og optimization.
If the user does chpl --baseline prog.chpl, it should perform -O0.
If the user does chpl prog.chpl --fast, it should perform -O3 optimization (perhaps -Ofast?).

astatide commented 5 years ago

Is the current guidance to use --fast when compiling for production? As a user, I wouldn't necessarily want --fast to result in hard-to-fix bugs that stem from optimization alone. I would just assume --fast leads to reasonable stability, assuming the code worked without it (I get that's not always the case, but still). Perhaps a separate flag, other than --fast, for fast-but-possibly-unstable optimizations?

It's highly arbitrary, of course, but I sort of feel like not using --fast should avoid optimizations; personally, I tend to want to make sure my code is running correctly before taking off the speed limits, as it were. However, that's just me, and I imagine other people have very different opinions. Just some minor input.

bradcray commented 5 years ago

I wouldn't necessarily want --fast to result in hard to fix bugs that are a result of optimization alone

Can you clarify what you're referring to here?

astatide commented 5 years ago

In the event that -O3 does result in weird problems, adding it as a default to --fast could produce an issue.

I wish I had more examples of -O3 giving problems; it was years ago that I was trying that level of optimization out.

EDIT: It could be that my information is super outdated on that. Apparently it does still yield bugs when the programmer is trying something funky, though.

EDIT 2: THE RETURN: according to a random stack overflow result, it could even result in slowdowns due to funrolling the loops, so who knows. Seems like a dicey option. (https://stackoverflow.com/questions/11546075/is-optimisation-level-o3-dangerous-in-g)

LouisJenkinsCS commented 5 years ago

Looking at it, Chapel with --fast currently compiles with -O3:

https://github.com/chapel-lang/chapel/blob/92d0a38872e9bb2e63b436397641c2853204b75d/compiler/main/driver.cpp#L896

https://github.com/chapel-lang/chapel/blob/92d0a38872e9bb2e63b436397641c2853204b75d/compiler/main/driver.cpp#L695-L729

https://github.com/chapel-lang/chapel/blob/47cedac210576a2a21e0b11ee469f8ed7c0b3a2a/compiler/llvm/clangUtil.cpp#L1159-L1162

Edit: Added the 'clangUtil' link, which directly shows that -O3 is used.

LouisJenkinsCS commented 5 years ago

Looking around for -O3 bugs (1, 2), a lot of the ones I have found seem to stem from -ftree-vectorize; perhaps that optimization should be disabled by default and only enabled via an explicit flag? It's basically a trial-and-error kind of thing.
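
In the meantime, a possible workaround (a sketch, assuming a GCC-compatible C backend where -fno-tree-vectorize is the relevant switch, and assuming --ccflags is appended after the -O3 that --fast implies) would be to keep --fast but turn that one optimization off explicitly:

chpl lulesh.chpl --fast --ccflags=-fno-tree-vectorize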