Closed by omalyshe 6 years ago
Interesting...yes, that overhead is higher than expected. What was the relative overhead for the non-tasking benchmarks? Also, which compiler / version did you use?
There were (1 year ago?) known problems in the Intel/LLVM OpenMP library relating to GUID generation and excessive string processing. I think this might be related to the need to generate GUIDS for tasks - although that is speculation on my part. @jmellorcrummey and I have discussed this in the past, and he might know the details of both the overhead source and whether it has been resolved.
We don't generate any IDs anymore, since TR4.
I reviewed kmp_tasking.cpp and checked for OMPT code that is executed without a tool active. There were some code blocks where the if (ompt_enabled) check was missing (71cfd42). Also, the function calls for ompt_task_start / ompt_task_finish might come with some cost, so I moved the condition out of the function call. All in all, this should not account for 7-13% overhead.
A breakdown of the overhead to kmpc-functions would be very helpful to understand and identify the issue.
Also, which compiler / version did you use?
Intel compiler 16.0.4. But the compiler version shouldn't matter when comparing the performance of the two libraries.
Overhead of non-tasking benchmarks measured with EPCC is less than 10% (this is a threshold that we usually use in our measurements). For non-tasking SPEC OMP2012 performance difference of two libraries < 3% (which is considered acceptable).
I reviewed the changes that Joachim submitted a few minutes ago and these should help. Some of the EPCC tasking benchmarks are extremely fine-grained, so I would expect that these changes will make a difference there.
Olga - if you can provide any further information about which benchmarks incur the higher costs and information gathered with a sampling-based tool (e.g. VTune) that shows where the overhead is, that would help us resolve the issue.
Joachim, thank you for this quick change! I will verify if it improves performance.
John has a good suggestion - VTune or equivalent would help determine where the time is spent.
One other consideration - did you run with OMP_NUM_THREADS=max? Is the overhead different (lower) with fewer threads?
John, the affected SPEC OMP2012 benchmark is 376.kdtree (13% performance drop, libomp-ompt vs. libomp); the affected LCPC (BOTS) benchmarks are 342.concom (9%) and 356.rubik (3.3%).
Kevin, the SPEC OMP2012 tests are run with one thread per core only (that is, not using hyper-threads). The LCPC benchmarks are run with different numbers of threads (for the 48-core system, #threads = 12, 24, 48, 96; the more threads, the better the performance for the mentioned concom benchmark; rubik has the problem only with the maximum number of threads). The EPCC micro-benchmarks are run with thread counts from 1 up to the number of hyperthreads.
I don't expect OMPT specific overhead to depend on #threads. For kdtree at medium problem size, ~ 10^9 tasks are generated, but only ~10^6 can be deferred, the other tasks are generated with if(0).
The source of overhead I have on the list is the additional bytes in the taskdata structure. This could probably be optimized by moving the ompt_task_info to a place that does not affect runtime behavior. Maybe align it to a cache line and make sure to add padding?
Not sure about the real performance impact of commit eaadcba , but it might reduce pressure on registers and cache.
Would __builtin_expect help avoid executing the OMPT branch by default? Like in:

    if (__builtin_expect(ompt_enabled.enabled, 0)) {
I think so, theoretically, since it helps branch prediction. But would that make compiling with gcc a requirement?
@omalyshe I'm trying to reproduce your SPEC results for kdtree.
I'm running 24 threads on a dual-socket E5-2650 v4 @ 2.20GHz with 2x12 cores. Unfortunately, our SMP nodes (144 cores) are currently blocked. I compiled the benchmark with the Intel/17 compiler and execute it with the Intel OpenMP runtime, the LLVM/OpenMP runtime, and the LLVM/OpenMP+OMPT runtime.
With train problem size, all combinations run 7.67 seconds.
With ref problem size, the benchmark runs 794, 790, 792 seconds (median of 3 runs) for the three combinations.
When I run with the sources from 20170718, the execution with the LLVM/OpenMP+OMPT runtime takes 801 seconds. That would be about 1% longer execution, not ~13%.
I downloaded BOTS and ran it with more significant problem sizes. I measured the highest overhead for nqueens.icc.omp-tasks-if_clause. With commit be1bce4 I use __builtin_expect for the tasking calls. This reduced the runtime of nqueens(15) from 18.8 to 18 seconds, which is almost the same as the runtime without OMPT support in the runtime.
Perhaps you have some tools to profile missed branch predictions? That way we might identify the points in the runtime where __builtin_expect could improve the situation for OMPT.
Joachim, I didn't notice any performance difference (in my runs) with eaadcba. I currently run benchmarks on IVB E7-4850 v2 @ 2.30GHz, 4 x 12 cores. kdtree times are 454 s (no OMPT) and 498 s (OMPT) on 20170810 sources.
Olga, Can you use VTune (or another sampling based tool) to pinpoint what causes the slowdown with OMPT?
I found the issue in my setup: while the execution on the frontend node used the intended libraries, the execution through LSF somehow skipped some options and always used the same default OpenMP library.
I reran the experiments with train size on the frontend and could see the performance differences:

    OpenMP runtime of icc-17:    7.68 s
    LLVM/OpenMP noompt:          7.57 s
    LLVM/OpenMP+ompt-20170718:   8.46 s
    LLVM/OpenMP+ompt-20170805:   8.38 s
    LLVM/OpenMP+ompt-20170812:   8.27 s
So, the latest changes improved the situation a bit, but there seems to be more potential for improvement.
I used amplxe-perf to get some key numbers:
OpenMP runtime of icc-17:
166412,646715 task-clock (msec) # 21,712 CPUs utilized
12.357 context-switches # 0,074 K/sec
101 cpu-migrations # 0,001 K/sec
28.909 page-faults # 0,174 K/sec
415.995.917.449 cycles # 2,500 GHz
0 stalled-cycles-frontend # 0,00% frontend cycles idle
0 stalled-cycles-backend # 0,00% backend cycles idle
917.142.895.434 instructions # 2,20 insns per cycle
144.814.377.863 branches # 870,213 M/sec
429.930.955 branch-misses # 0,30% of all branches
7,664657303 seconds time elapsed
LLVM/OpenMP+ompt-20170718 (11c393e):
186505,125583 task-clock (msec) # 21,776 CPUs utilized
13.649 context-switches # 0,073 K/sec
67 cpu-migrations # 0,000 K/sec
34.555 page-faults # 0,185 K/sec
466.215.195.231 cycles # 2,500 GHz
0 stalled-cycles-frontend # 0,00% frontend cycles idle
0 stalled-cycles-backend # 0,00% backend cycles idle
1.003.269.186.890 instructions # 2,15 insns per cycle
150.054.429.267 branches # 804,559 M/sec
672.916.519 branch-misses # 0,45% of all branches
8,564871638 seconds time elapsed
LLVM/OpenMP+ompt-20170812 (be1bce4):
182924,549814 task-clock (msec) # 21,764 CPUs utilized
12.575 context-switches # 0,069 K/sec
68 cpu-migrations # 0,000 K/sec
35.609 page-faults # 0,195 K/sec
457.278.019.609 cycles # 2,500 GHz
0 stalled-cycles-frontend # 0,00% frontend cycles idle
0 stalled-cycles-backend # 0,00% backend cycles idle
978.581.862.200 instructions # 2,14 insns per cycle
142.881.269.750 branches # 781,094 M/sec
435.929.071 branch-misses # 0,31% of all branches
8,404851094 seconds time elapsed
With commit 6d7dd24
180060,077806 task-clock (msec) # 21,809 CPUs utilized
12.451 context-switches # 0,069 K/sec
54 cpu-migrations # 0,000 K/sec
30.255 page-faults # 0,168 K/sec
450.125.645.834 cycles # 2,500 GHz
0 stalled-cycles-frontend # 0,00% frontend cycles idle
0 stalled-cycles-backend # 0,00% backend cycles idle
960.600.037.579 instructions # 2,13 insns per cycle
138.882.375.727 branches # 771,311 M/sec
435.255.061 branch-misses # 0,31% of all branches
8,256088591 seconds time elapsed
The number of branches is at the level of the Intel runtime, still a bit higher than the LLVM/OpenMP runtime (which is kind of expected). The number of branch misses is at the level of both runtimes. The OMPT code still has significantly more instructions.
By selectively commenting out OMPT_SUPPORT blocks, I could break down some of the overhead: in __kmpc_omp_task_begin_if0 ~30 billion instructions, 0.35 s; in __kmpc_omp_task_complete_if0 ~15 billion instructions, 0.1 s.
VTune Amplifier shows the top hotspots in the following functions. The largest difference in instructions retired is for __kmpc_omp_task_complete_if0(), __kmpc_omp_task_begin_if0(), __kmpc_omp_taskwait(), and __kmp_get_global_thread_id_reg().
Olga - Is the left column the overhead and instructions retired WITHOUT OMPT support compiled in and the right column is WITH OMPT support compiled in?
John, sorry, yes - the left column is without OMPT and the right one is with OMPT.
Comparing the object dumps for task_begin_if0 shows that 7 registers are pushed/popped at the beginning/end of the function.
Solutions discussed during the openmp-tools call:
I will be working on fixes for this issue. I just committed a quick fix for 376.kdtree to my fork - 40bbdc6. The overhead was reduced from 10-13% down to 2-3% on a 96-core test machine.
I can confirm the 3% runtime overhead at 144 threads on 144 cores. If this overhead is acceptable, we can close this issue unless Olga finds a regression on a different benchmark (since we focused on kdtree).
I would like to see Intel investigate this a bit more. In my opinion, 3% is a bit high for simply including the OMPT code without using it. I wouldn't be surprised if there is another place where function outlining would help.
Thank you, Joachim! 3% is the threshold we internally (Intel) use to distinguish acceptable overhead from not acceptable. So, we'll be investigating more.
kdtree and nqueens are very similar to a recursive Fibonacci. The benchmarks use a cut-off with #pragma omp task if(limit). For kdtree, 99.9% of the tasks are undeferred. In tutorials we teach people to use a manual cut-off in these cases, because a cut-off with if(0) cannot perform well for tiny tasks.
A while ago, I counted the tasks in kdtree for train size: 1 529 199 931 tasks in total; only 1.5M of them are deferrable tasks.
We are now down to ~919e9 instructions for OMPT, compared with ~898e9 for noOMPT. We have 1 529 199 931 invocations each of __kmpc_omp_task_complete_if0(), __kmpc_omp_task_begin_if0(), and __kmpc_omp_taskwait(). Per task, the difference is ~13.7 instructions, distributed across function calls that are <5 instructions each. We need to load ompt_enabled, do the comparison and the conditional jump; additionally we need to save the frame and return pointer.
To be honest, I realistically don't see potential to reduce the overhead for this worst-case benchmark below 3%. That should not prevent anyone from finding improvements for other cases.
I verified that all overheads come from the three __kmpc_* entries we modified, so if we intend to improve the performance, we must do something within these three functions (I don't see any better way, though). I also counted the number of invocations of these three functions with the ref data set: they are called 1.85 billion times per thread, so just a few cycles of overhead per call can make a noticeable performance difference.
Here is an idea that might improve performance:
Take the original __kmpc_omp_task_begin_if0 and rename it __kmpc_omp_task_begin_if0_internal, mark it to be inlined, and add an extra argument:

    void __kmpc_omp_task_begin_if0_internal(ident_t *loc_ref, kmp_int32 gtid,
                                            kmp_task_t *task, bool WITH_OMPT) {
      // in this function, take out the ompt_enabled.enabled checks and
      // replace them with if (WITH_OMPT) { ... }
    }

Then, declare two routines:

    void __kmpc_omp_task_begin_if0_ompt(ident_t *loc_ref, kmp_int32 gtid,
                                        kmp_task_t *task) {
      __kmpc_omp_task_begin_if0_internal(loc_ref, gtid, task, true);
    }

    void __kmpc_omp_task_begin_if0_noompt(ident_t *loc_ref, kmp_int32 gtid,
                                          kmp_task_t *task) {
      __kmpc_omp_task_begin_if0_internal(loc_ref, gtid, task, false);
    }

Then add a new __kmpc_omp_task_begin_if0 routine that dispatches through a function pointer:

    void __kmpc_omp_task_begin_if0(ident_t *loc_ref, kmp_int32 gtid,
                                   kmp_task_t *task) {
      __kmpc_omp_task_begin_if0_internal_fn(loc_ref, gtid, task);
    }

Then, after a tool may have initialized OMPT, set the function pointer:

    __kmpc_omp_task_begin_if0_internal_fn =
        (ompt_enabled.enabled ? __kmpc_omp_task_begin_if0_ompt
                              : __kmpc_omp_task_begin_if0_noompt);

The cost of the indirect call through the function pointer __kmpc_omp_task_begin_if0_internal_fn might be less than the cost of evaluating if (UNLIKELY(ompt_enabled.enabled)) { ... } every time when OMPT is not enabled.

If that helps, then repeat for the other important routines, including __kmpc_omp_task_complete_if0 and __kmpc_omp_taskwait.
You left out the issue of getting the codeptr and the frame address. There is no reliable approach to get these values if there is a runtime function call in between. The only way I see is to make __kmpc_omp_task_begin_if0 itself a pointer and set that pointer according to ompt_enabled.
I verified that commit https://github.com/OpenMPToolsInterface/LLVM-openmp/pull/30/commits/40bbdc6c9e11e0a3e4e691fa26df77180fab2b2a fixed the performance issue with the kdtree SPEC OMP2012 benchmark. On 48-core IVB, 96-core BDW and 64-core Xeon Phi, performance of the SPEC OMP2012 benchmarks is the same with and without OMPT. Testing was run on sources as of Sept 18.
A new templated implementation of the fix is available in the branch template-versioning-inline (https://github.com/OpenMPToolsInterface/LLVM-openmp/tree/template-versioning-inline). From a maintainability point of view, this solution would be the preferred one.
@omalyshe can you verify that this solution has about the same performance behavior?
Sure, I will run performance tests.
I ran SPECOMP 2012 on 48-core IVB, 96-core BDW and 64-core Xeon Phi. I compared libomp built from original code without ompt versus templated code with ompt.
There is no issue if libomp is built with clang 6.0: kdtree shows a -2.97% perf degradation, but this is in the acceptable range. This slight slowdown is introduced by OMPT in the templated version; the template itself doesn't cause any degradation (orig_no-ompt is the same as templ_no-ompt).
However, if libomp is built with ICC 17.0, kdtree shows a -5.15% perf degradation on IVB. Both the template and OMPT add to the slowdown: templ_no-ompt vs. orig_no-ompt gives -2.91% and templ-ompt vs. templ-no-ompt gives -2.30%. We will probably do further internal investigation with ICC. For now I think this issue can be closed.
I compared the performance of libomp without OMPT and with OMPT enabled (by "enabled" I mean just building with LIBOMP_OMPT_SUPPORT=on, without using an OMPT tool). Sources were used as of 20170718. The following benchmarks were used: EPCC microbenchmarks, SPEC OMP2012 benchmarks, and a tasking suite based on BOTS. Tests were run on a 48-core IVB and a 96-core BDW. For some benchmarks (they all use tasking), performance of the OMPT-enabled library is significantly (7-13%) worse than performance of the library without OMPT.
This might be a serious problem for compiler vendors to enable OMPT in their OpenMP libraries.