Closed by omalyshe 6 years ago
Interesting...yes, that overhead is higher than expected. What was the relative overhead for the non-tasking benchmarks? Also, which compiler / version did you use?
There were (1 year ago?) known problems in the Intel/LLVM OpenMP library relating to GUID generation and excessive string processing. I think this might be related to the need to generate GUIDS for tasks - although that is speculation on my part. @jmellorcrummey and I have discussed this in the past, and he might know the details of both the overhead source and whether it has been resolved.
We don't generate any IDs anymore, since TR4.
I reviewed kmp_tasking.cpp and checked for OMPT code that is executed without a tool active. There were some code blocks where the if (ompt_enabled) check was missing (71cfd42). Also, the function calls for ompt_task_start / ompt_task_finish might come with some cost, so I moved the condition out of the function call. All in all, this should not account for 7-13% overhead.
A breakdown of the overhead to kmpc-functions would be very helpful to understand and identify the issue.
Also, which compiler / version did you use?
Intel compiler 16.0.4. But the compiler version shouldn't matter when comparing the performance of the two libraries.
Overhead of non-tasking benchmarks measured with EPCC is less than 10% (this is a threshold that we usually use in our measurements). For non-tasking SPEC OMP2012 performance difference of two libraries < 3% (which is considered acceptable).
I reviewed the changes that Joachim submitted a few minutes ago and these should help. Some of the EPCC tasking benchmarks are extremely fine-grained, so I would expect that these changes will make a difference there.
Olga - if you can provide any further information about which benchmarks incur the higher costs and information gathered with a sampling-based tool (e.g. VTune) that shows where the overhead is, that would help us resolve the issue.
Joachim, thank you for this quick change! I will verify if it improves performance.
John has a good suggestion - VTune or equivalent would help determine where the time is spent.
One other consideration - did you run with OMP_NUM_THREADS=max? Is the overhead different (lower) with fewer threads?
John, the affected SPEC OMP2012 benchmark is 376.kdtree (13% performance drop, libomp-ompt vs. libomp); the affected LCPC (BOTS) benchmarks are 342.concom (9%) and 356.rubik (3.3%).
Kevin, the SPEC OMP2012 tests are run with one thread per core only (that is, not using hyper-threads). The LCPC benchmarks are run with different numbers of threads (for the 48-core system, #threads = 12, 24, 48, 96; the more threads, the better the performance for the mentioned concom benchmark; rubik has the problem only with the maximum number of threads). The EPCC micro-benchmarks are run with thread counts from 1 up to the number of hyperthreads.
I don't expect OMPT specific overhead to depend on #threads. For kdtree at medium problem size, ~ 10^9 tasks are generated, but only ~10^6 can be deferred, the other tasks are generated with if(0).
The source of overhead I have on the list is the additional bytes in the taskdata structure. This could probably be optimized by moving the ompt_task_info to a place that does not affect runtime behavior. Maybe align it to a cache line and make sure to add padding?
Not sure about the real performance impact of commit eaadcba , but it might reduce pressure on registers and cache.
Would __builtin_expect help avoid executing the OMPT branch by default? Like in:

    if (__builtin_expect(ompt_enabled.enabled, 0)) {
I think so, theoretically, since it helps branch prediction. But would that make compiling with gcc a requirement?
@omalyshe I'm trying to reproduce your SPEC results for kdtree.
I'm running 24 threads on a dual-socket E5-2650 v4 @ 2.20GHz with 2x12 cores. Unfortunately, our SMP nodes (144 cores) are currently blocked. I compiled the benchmark with the Intel/17 compiler and execute it with the Intel OpenMP runtime, the LLVM/OpenMP runtime, and the LLVM/OpenMP+OMPT runtime.
With train problem size, all combinations run 7.67 seconds.
With ref problem size, the benchmark runs 794, 790, 792 seconds (median of 3 runs) for the three combinations.
When I run with the sources from 20170718, the execution with the LLVM/OpenMP+OMPT runtime takes 801 seconds. That would be about 1% longer execution, not ~13%.
I downloaded BOTS and ran it with more significant problem sizes. I measured the highest overhead for nqueens.icc.omp-tasks-if_clause. With commit be1bce4 I use __builtin_expect for the tasking calls. This reduced the runtime of nqueens(15) from 18.8 to 18 seconds, which is almost the same as the runtime without OMPT support in the runtime.
Perhaps you have some tools to profile missed branch predictions? That way we might identify the points in the runtime where __builtin_expect could improve the situation for OMPT.
Joachim, I didn't notice any performance difference (in my runs) with eaadcba. I currently run benchmarks on IVB E7-4850 v2 @ 2.30GHz, 4 x 12 cores. kdtree times are 454 s (no OMPT) and 498 s (OMPT) on 20170810 sources.
Olga, Can you use VTune (or another sampling based tool) to pinpoint what causes the slowdown with OMPT?
I found the issue in my setup: while the execution on the frontend node used the intended libraries, the execution through LSF somehow skipped some options and always used the same default OpenMP library.
I reran the experiments with train size on the frontend and could see the performance differences:

    OpenMP runtime of icc-17:    7.68 s
    LLVM/OpenMP noompt:          7.57 s
    LLVM/OpenMP+ompt-20170718:   8.46 s
    LLVM/OpenMP+ompt-20170805:   8.38 s
    LLVM/OpenMP+ompt-20170812:   8.27 s
So, the latest changes improved the situation a bit, but there seems to be more potential for improvement.
I used amplxe-perf to get some key numbers:
OpenMP runtime of icc-17:
166412,646715 task-clock (msec) # 21,712 CPUs utilized
12.357 context-switches # 0,074 K/sec
101 cpu-migrations # 0,001 K/sec
28.909 page-faults # 0,174 K/sec
415.995.917.449 cycles # 2,500 GHz
0 stalled-cycles-frontend # 0,00% frontend cycles idle
0 stalled-cycles-backend # 0,00% backend cycles idle
917.142.895.434 instructions # 2,20 insns per cycle
144.814.377.863 branches # 870,213 M/sec
429.930.955 branch-misses # 0,30% of all branches
7,664657303 seconds time elapsed
LLVM/OpenMP+ompt-20170718 (11c393e):
186505,125583 task-clock (msec) # 21,776 CPUs utilized
13.649 context-switches # 0,073 K/sec
67 cpu-migrations # 0,000 K/sec
34.555 page-faults # 0,185 K/sec
466.215.195.231 cycles # 2,500 GHz
0 stalled-cycles-frontend # 0,00% frontend cycles idle
0 stalled-cycles-backend # 0,00% backend cycles idle
1.003.269.186.890 instructions # 2,15 insns per cycle
150.054.429.267 branches # 804,559 M/sec
672.916.519 branch-misses # 0,45% of all branches
8,564871638 seconds time elapsed
LLVM/OpenMP+ompt-20170812 (be1bce4):
182924,549814 task-clock (msec) # 21,764 CPUs utilized
12.575 context-switches # 0,069 K/sec
68 cpu-migrations # 0,000 K/sec
35.609 page-faults # 0,195 K/sec
457.278.019.609 cycles # 2,500 GHz
0 stalled-cycles-frontend # 0,00% frontend cycles idle
0 stalled-cycles-backend # 0,00% backend cycles idle
978.581.862.200 instructions # 2,14 insns per cycle
142.881.269.750 branches # 781,094 M/sec
435.929.071 branch-misses # 0,31% of all branches
8,404851094 seconds time elapsed
With commit 6d7dd24
180060,077806 task-clock (msec) # 21,809 CPUs utilized
12.451 context-switches # 0,069 K/sec
54 cpu-migrations # 0,000 K/sec
30.255 page-faults # 0,168 K/sec
450.125.645.834 cycles # 2,500 GHz
0 stalled-cycles-frontend # 0,00% frontend cycles idle
0 stalled-cycles-backend # 0,00% backend cycles idle
960.600.037.579 instructions # 2,13 insns per cycle
138.882.375.727 branches # 771,311 M/sec
435.255.061 branch-misses # 0,31% of all branches
8,256088591 seconds time elapsed
The number of branches is at the level of the Intel runtime, still a bit higher than the LLVM/OpenMP runtime (which is kind of expected). The number of branch misses is at the level of both runtimes. The OMPT code still has significantly more instructions.
By selectively commenting out OMPT_SUPPORT blocks, I could break down some of the overhead: in __kmpc_omp_task_begin_if0 ~30 billion instructions, 0.35 s; in __kmpc_omp_task_complete_if0 ~15 billion instructions, 0.1 s.
VTune Amplifier shows the top hotspots in the following functions. The largest difference in instructions retired is for __kmpc_omp_task_complete_if0(), __kmpc_omp_task_begin_if0(), __kmpc_omp_taskwait(), and __kmp_get_global_thread_id_reg().
Olga - Is the left column the overhead and instructions retired WITHOUT OMPT support compiled in and the right column is WITH OMPT support compiled in?
John, sorry, yes - the left column is without OMPT and the right one is with OMPT.
Comparing the object dumps for task_begin_if0 shows that 7 registers are pushed/popped at the beginning/end of the function.
Solutions discussed during the openmp-tools call:
I will be working on fixes for this issue. I just committed a quick fix for 376.kdtree to my fork - 40bbdc6. The overhead was reduced from 10-13% down to 2-3% on a 96-core test machine.
I can confirm the 3% runtime overhead at 144 threads on 144 cores. If this overhead is acceptable, we can close this issue unless Olga finds a regression on a different benchmark (since we focused on kdtree).
I would like to see Intel investigate this a bit more. In my opinion, 3% is a bit high for simply including the OMPT code without using it. I wouldn't be surprised if there is another place where function outlining would help.
Thank you, Joachim! 3% is the threshold we internally (Intel) use to distinguish acceptable overhead from not acceptable. So, we'll be investigating more.
kdtree and nqueens are very similar to a recursive Fibonacci. The benchmarks use a cut-off with #pragma omp task if(limit). For kdtree, 99.9% of the tasks are undeferred. In tutorials we teach people to use a manual cut-off in these cases, because a cut-off with if(0) cannot perform well for tiny tasks.
A while ago, I counted the tasks in kdtree for train size: 1 529 199 931 tasks in total; only 1.5M of them are deferrable tasks.
We are now down to ~919e9 instructions for OMPT, compared with ~898e9 for noOMPT. We have 1 529 199 931 invocations each of __kmpc_omp_task_complete_if0(), __kmpc_omp_task_begin_if0(), and __kmpc_omp_taskwait(). Per task, the difference is ~13.7 instructions, distributed across function calls that are <5 instructions each. We need to load ompt_enabled, do the comparison and the conditional jump; additionally we need to save the frame and return pointer.
To be honest, I realistically don't see potential to reduce the overhead for this worst-case benchmark below 3%. That should not prevent anyone from finding improvements for other cases.
I verified that all overheads come from the three __kmpc_* entries we modified, so if we intend to improve the performance, we must do something within these three functions (I don't see any better way, though). I also counted the number of invocations of these three functions with the ref data set: they are called 1.85 billion times per thread, so just a few cycles of overhead per call can make a noticeable performance difference.
Here is an idea that might improve performance:
Take the original __kmpc_omp_task_begin_if0 and rename it __kmpc_omp_task_begin_if0_internal, mark it to be inlined, and add an extra argument:

    void __kmpc_omp_task_begin_if0_internal(ident_t *loc_ref, kmp_int32 gtid,
                                            kmp_task_t *task, bool WITH_OMPT) {
      // in this function, take out the ompt_enabled.enabled checks and
      // replace them with if (WITH_OMPT) { ... }
    }

Then, declare two routines:

    void __kmpc_omp_task_begin_if0_ompt(ident_t *loc_ref, kmp_int32 gtid,
                                        kmp_task_t *task) {
      __kmpc_omp_task_begin_if0_internal(loc_ref, gtid, task, true);
    }

    void __kmpc_omp_task_begin_if0_noompt(ident_t *loc_ref, kmp_int32 gtid,
                                          kmp_task_t *task) {
      __kmpc_omp_task_begin_if0_internal(loc_ref, gtid, task, false);
    }

Then add a new __kmpc_omp_task_begin_if0 routine that dispatches through a function pointer:

    void __kmpc_omp_task_begin_if0(ident_t *loc_ref, kmp_int32 gtid,
                                   kmp_task_t *task) {
      __kmpc_omp_task_begin_if0_internal_fn(loc_ref, gtid, task);
    }

Then, after a tool may have initialized OMPT, set the function pointer:

    __kmpc_omp_task_begin_if0_internal_fn =
        (ompt_enabled.enabled ? __kmpc_omp_task_begin_if0_ompt
                              : __kmpc_omp_task_begin_if0_noompt);

The cost of the indirect call through the function pointer __kmpc_omp_task_begin_if0_internal_fn might be less than the cost of evaluating if (UNLIKELY(ompt_enabled.enabled)) { ... } every time when OMPT is not enabled.

If that helps, then repeat for the other important routines, including __kmpc_omp_task_complete_if0 and __kmpc_omp_taskwait.
You left out the issue of getting the codeptr and the frame address. There is no reliable approach to get these values if there is a runtime function call in between. The only way I see is to make __kmpc_omp_task_begin_if0 itself a pointer and set that pointer according to ompt_enabled.
I verified that commit https://github.com/OpenMPToolsInterface/LLVM-openmp/pull/30/commits/40bbdc6c9e11e0a3e4e691fa26df77180fab2b2a fixed the performance issue with the kdtree SPEC OMP2012 benchmark. On 48-core IVB, 96-core BDW and 64-core Xeon Phi, performance of the SPEC OMP2012 benchmarks is the same with and without OMPT. Testing was run on sources as of Sept 18.
A new templated implementation of the fix is available in the branch template-versioning-inline (https://github.com/OpenMPToolsInterface/LLVM-openmp/tree/template-versioning-inline). From a maintainability point of view, this solution would be the preferred one.
@omalyshe can you verify that this solution has about the same performance behavior?
Sure, I will run performance tests.
I ran SPECOMP 2012 on 48-core IVB, 96-core BDW and 64-core Xeon Phi. I compared libomp built from original code without ompt versus templated code with ompt.
There is no issue if libomp is built with clang 6.0: kdtree shows a -2.97% perf degradation, but this is in the acceptable range. This slight slowdown is introduced by OMPT in the templated version; the template itself doesn't cause any degradation (orig_no-ompt is the same as templ_no-ompt).
However, if libomp is built with ICC 17.0, kdtree shows a -5.15% perf degradation on IVB. Both the template and OMPT add to the slowdown: templ_no-ompt vs. orig_no-ompt gives -2.91% and templ-ompt vs. templ-no-ompt gives -2.30%. We will probably do further internal investigation with ICC. For now I think this issue can be closed.
I compared the performance of libomp without OMPT and with OMPT enabled (by "enabled" I mean just building with LIBOMP_OMPT_SUPPORT=on, without using an OMPT tool). Sources were used as of 20170718. The following benchmarks were used: EPCC microbenchmarks, SPEC OMP2012 benchmarks, and a tasking suite based on BOTS. Tests were run on a 48-core IVB and a 96-core BDW. For some benchmarks (they all use tasking), performance of the OMPT-enabled library is significantly (7-13%) worse than performance of the library without OMPT.
This might be a serious problem for compiler vendors to enable OMPT in their OpenMP libraries.