cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0
1.08k stars 4.31k forks source link

igprof pp profiling outputs no longer work #33297

Closed jpata closed 3 years ago

jpata commented 3 years ago

There seems to be an issue with igprof pp (CPU profiling) in 11_3_X between pre4 and pre5, starting at least from March 14. The stack in 11_3_X is no longer capturing main properly and the cumulative number is not meaningful. @slava77 pointed it out to me yesterday. Things look fine in 11_2_X.

11_3_X: https://cmssdt.cern.ch/SDT/cgi-bin/igprof-navigator/CMSSW_11_3_X_2021-03-14-0000/slc7_amd64_gcc900/pp/23434.21_TTbar_14TeV+2026D49PU_ProdLike+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14+DigiTriggerPU+RecoGlobalPU+MiniAODPU/step3___580bd4d7c9a1fca4d329fb1738ee817d___10_EndOfJob 11_2_X: https://cmssdt.cern.ch/SDT/cgi-bin/igprof-navigator/CMSSW_11_2_X_2021-03-14-0000/slc7_amd64_gcc900/pp/23434.21_TTbar_14TeV+2026D49PU_ProdLike+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14+DigiTriggerPU+RecoGlobalPU+MiniAODPU/step3___580bd4d7c9a1fca4d329fb1738ee817d___10_EndOfJob

In mattermost, Daniel Sherman Riley pointed to this as a possible culprit: cms-sw/cmssw#32804

Perhaps something needs to be adjusted on the igprof side, but it'd be nice to get the profiles working again.

cmsbuild commented 3 years ago

A new Issue was created by @jpata Joosep Pata.

@Dr15Jones, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel commented 3 years ago

assign core

cmsbuild commented 3 years ago

New categories assigned: core

@Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel commented 3 years ago

I agree with @dan131riley that https://github.com/cms-sw/cmssw/pull/32804 is the likely culprit (given that we saw TBB to "move stack between threads", I'd easily believe IgProf to get confused by that, among others).

makortel commented 3 years ago

@jpata Is this specific to pp? Or in other words, do mp profiles work properly?

jpata commented 3 years ago

As far as I can tell, the memory profiling output did not change qualitatively from pre4 to pre5:

Dr15Jones commented 3 years ago

The “moving between threads” should also have been stopped when I later attend the use of a tbb::task_arena call.

makortel commented 3 years ago

I poked around between 11_2_X and 11_3_X IgProf reports and noticed something

In CMSSW_11_2_X_2021-03-30-2300 23434.21

But in CMSSW_11_3_X_2021-03-30-2300 23434.21

Could it be that in 11_3_X the 10_EndOfJob files somehow get overridden by the 1 files? @gartung Could you take a look?

In the GEN-SIM job (TTbar_14TeV_TuneCP5...) the 1 and 10_EndOfJob results look different, in step4 they look identical.

jpata commented 3 years ago

We observe the same issue with igprof pp in the reco profiling, which is run independently with different scripts. From pre5, they don't look meaningful.

makortel commented 3 years ago

Thanks Joosep. Perhaps the symptom should be interpreted then as time accounting effectively being lost some time before the capture point of 1 profiles, and nothing being accumulated after that.

It's still curious that in the GEN-SIM job time does get added after capture point of 1 profile.

gartung commented 3 years ago

The output is completely controlled by the the runTheMatrix.py switch --profile prof which enables igprof as a service. https://github.com/cms-sw/cms-bot/blob/9cab720ce0ad62820694846d1153b03f48d21ebe/run-ib-igprof#L19 For the memory profiling the commands are generated with --dryRun and then cmsRunGlibC is used instead.

gartung commented 3 years ago

The output name of the sqlite files is derived from the igprof service output name here https://github.com/cms-sw/cms-bot/blob/9cab720ce0ad62820694846d1153b03f48d21ebe/run-ib-igprof#L41

gartung commented 3 years ago

Luckily I kept the original gzipped tarfile output from igprof. I you click on Logs from the igprof top page you get a directory listing like this where you can download the original igprof output. https://cmssdt.cern.ch/SDT/jenkins-artifacts/igprof/CMSSW_11_3_X_2021-03-30-2300/slc7_amd64_gcc900/pp/23434.21_TTbar_14TeV+2026D49PU_ProdLike+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14+DigiTriggerPU+RecoGlobalPU+MiniAODPU/

gartung commented 3 years ago

If Igprof is doing something strange you should be able to investigate with the original output.

slava77 commented 3 years ago

curiously, pp for run3 makes sense in pre5 (main is close to 100%) https://cms-reco-profiling.web.cern.ch/cms-reco-profiling/cgi-bin/igprof-navigator/releases/CMSSW_11_3_0_pre5/11834.21/step3/cpu_endjob

although I'm less sure about the test in the IB https://cmssdt.cern.ch/SDT/cgi-bin/igprof-navigator/CMSSW_11_3_X_2021-03-29-2300/slc7_amd64_gcc900/profiling/11834.21/igprofCPU_step3

gartung commented 3 years ago

The run-ib-profining job run on cms-oc-gpu-01 while the run-ib-igprof job is run on cmsprofile-01 or cmsprofile-02 which are 4 core VM's. The release-run-reco-profling jobs are run on vocms11 which is a 32 core bare metal machine. Since Igprof is using sampling some of the calls might be missed on cms-oc-gpu-01 because of single thread frequency scaling.

gartung commented 3 years ago

The main sum from the run-ib-igprof jobs run on cmsprofile-01 is even lower -- 89% https://cmssdt.cern.ch/SDT/cgi-bin/igprof-navigator/CMSSW_11_3_X_2021-03-29-2300/slc7_amd64_gcc900/pp/11834.21_TTbar_14TeV+2021PU_ProdLike+TTbar_14TeV_TuneCP5_GenSim+DigiPU+RecoPU+MiniAODPU+Nano/step3___70f2a3ea22adee383702bb184802c396___10_EndOfJob

gartung commented 3 years ago

I would like to move the run-ib-profiling job and the run-ib-igprof jobs to vocms11 so that the hardware is the same when making comparisons.

jpata commented 3 years ago

I would like to move the run-ib-profiling job and the run-ib-igprof jobs to vocms11 so that the hardware is the same when making comparisons.

@gartung No problem for reco side, I think! Feel free to do what you need in jenkins. Just not to side-track this issue here, we can discuss in mattermost in case of need.

makortel commented 3 years ago

Apparently we still observe stacks being moved between threads https://github.com/cms-sw/cmssw/issues/33289#issuecomment-813397608, I would guess that to confuse IgProf.

dan131riley commented 3 years ago

I'm running some tests now that get the thread-id at the top of main() and assert at then end that it hasn't changed. No hits so far, so if this is happening I don't think it is common enough to account for the igrpof changes.

gartung commented 3 years ago

I noticed that the runTheMatrix.py --profile generates the command igprof -t cmsRun which is to contrain which process is profiled. The scripts in the cms-cmpwg/profiling repo do not use this option. If I compare the profile produced with the -t option to the profile produced without the -t option The percentage for main is higher although the tbb threads still have a higher percentage.

gartung commented 3 years ago

I think I might know what is causing the time difference. Could this call to rdtsc https://github.com/oneapi-src/oneTBB/blob/9e15720bc7744f85dff611d34d65e9099e077da4/src/tbb/scheduler_common.h#L188 be interfering with this call to rdtsc https://github.com/igprof/igprof/blob/03d16c15bd970256affcfde9e510a8404856b233/src/macros.h#L70 The assembly instruction rdtsc loads the current value of the processor's time-stamp counter into the EDX:EAX registers. https://en.wikipedia.org/wiki/Time_Stamp_Counter This is used by igprof to get the number of tics https://github.com/igprof/igprof/blob/03d16c15bd970256affcfde9e510a8404856b233/src/profile-calls.cc#L70

gartung commented 3 years ago

rdtsc is used by TBB in prolonged_pause_impl which calls machine_time_stamp which calls rdtsc. https://github.com/oneapi-src/oneTBB/blob/9e15720bc7744f85dff611d34d65e9099e077da4/src/tbb/scheduler_common.h#L197 and in prolonged_pause https://github.com/oneapi-src/oneTBB/blob/9e15720bc7744f85dff611d34d65e9099e077da4/src/tbb/scheduler_common.h#L224

gartung commented 3 years ago

Definition os machine_time_stamp here https://github.com/oneapi-src/oneTBB/blob/9e15720bc7744f85dff611d34d65e9099e077da4/src/tbb/scheduler_common.h#L188

gartung commented 3 years ago

In igprof the ticks are returned by the add function which is used to add the time in various memory allocation calls https://github.com/igprof/igprof/blob/03d16c15bd970256affcfde9e510a8404856b233/src/profile-calls.cc#L58

gartung commented 3 years ago

I think changing the rdtsc calls to libc's clock calls could avoid the problem https://www.gnu.org/software/libc/manual/html_node/CPU-Time.html

gartung commented 3 years ago

I tried this patch with igprof and generated the text output https://gartung.web.cern.ch/gartung/igprof.cmsRun.24441.1618006530.571135.txt The main function still only shows 60% cumulative. Looking down the file to “Call tree profile (cumulative)” I see

Call tree profile (cumulative)

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Rank    % total       Self       Self / Children   Function
[1]       100.0      87.76       0.00 / 87.76      <spontaneous>
           66.0  .........      57.91 / 57.91        _start [4]
           33.4  .........      29.29 / 29.29        __start_context [20]
            0.5  .........       0.41 / 0.41         clone [723]
            0.1  .........       0.10 / 0.13         PFDisplacedVertexCandidateFinder::associate(std::_List_iterator<edm::RefToBase<reco::Track> >, std::_List_iterator<edm::RefToBase<reco::Track> >, reco::PFDisplacedVertexCandidate&) [clone .localalias] [1362]
            0.0  .........       0.01 / 0.01         @{libz.so.1.2.8+78012} [4193]
            0.0  .........       0.00 / 0.00         @?0x33706574{<dynamically-generated>} [5825]
            0.0  .........       0.00 / 0.00         @{libz.so.1.2.8+77964} [5826]
            0.0  .........       0.00 / 0.00         @{libz.so.1.2.8+77995} [5827]
            0.0  .........       0.00 / 0.00         @{libz.so.1.2.8+77999} [5828]
            0.0  .........       0.00 / 0.00         @{libz.so.1.2.8+78007} [5829]

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
            0.0  .........       0.00 / 0.25         tbb::internal::generic_scheduler::local_spawn_root_and_wait(tbb::task*, tbb::task*&) [988]
            0.0  .........       0.00 / 0.36         edm::eventsetup::synchronousEventSetupForInstance(edm::IOVSyncValue const&, tbb::task_group&, edm::eventsetup::EventSetupsController&) [806]
            0.0  .........       0.03 / 0.03         tbb::internal::task_group_base::wait() [3225]
           33.4  .........      29.29 / 29.29        tbb::internal::co_local_wait_for_all(void*) [21]
           37.2  .........      32.66 / 32.66        edm::EventProcessor::processLumis(std::shared_ptr<void> const&) [11]
[2]        70.6      62.00       0.00 / 62.00      tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task&, tbb::task*)
           70.6  .........      62.00 / 62.00        tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::process_bypass_loop(tbb::internal::context_guard_helper<false>&, tbb::task*, long) [3]

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
           70.6  .........      62.00 / 62.00        tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task&, tbb::task*) [2]
[3]        70.6      62.00       0.00 / 62.00      tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::process_bypass_loop(tbb::internal::context_guard_helper<false>&, tbb::task*, long)
           35.7  .........      31.36 / 31.36        tbb::internal::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute() [16]
           34.7  .........      30.45 / 30.45        tbb::internal::function_task<edm::SerialTaskQueue::spawn(edm::SerialTaskQueue::TaskBase&)::{lambda()#1}>::execute() [19]
            0.2  .........       0.16 / 0.16         tbb::internal::function_task<edm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::{lambda()#1}>::execute() [1221]
            0.0  .........       0.01 / 0.01         tbb::internal::function_task<edm::Worker::doWorkNoPrefetchingAsync<edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)1> >(edm::WaitingTaskHolder, edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::ServiceToken const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>::execute() [4779]
            0.0  .........       0.01 / 0.01         tbb::internal::function_task<edm::Worker::doWorkNoPrefetchingAsync<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::WaitingTaskHolder, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::ServiceToken const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>::execute() [4789]
            0.0  .........       0.00 / 0.00         tbb::interface9::internal::start_for<tbb::blocked_range<unsigned int>, tbb::internal::parallel_for_body<std::function<void (unsigned int)>, unsigned int>, tbb::auto_partitioner const>::execute() [7092]

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
           66.0  .........      57.91 / 87.76        <spontaneous> [1]
[4]        66.0      57.91       0.00 / 57.91      _start
           66.0  .........      57.91 / 57.91        __libc_start_main [5]

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
           66.0  .........      57.91 / 57.91        _start [4]
[5]        66.0      57.91       0.00 / 57.91      __libc_start_main
           66.0  .........      57.91 / 57.91        main [6]
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
           66.0  .........      57.91 / 57.91        __libc_start_main [5]
[6]        66.0      57.91       0.00 / 57.91      main
           66.0  .........      57.91 / 57.91        main::{lambda()#1}::operator()() const [7]

and lower down at [20] I see the __start_context where the “missing” cumulative total from spontaneous goes and the main::lambda eventually appears

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
           33.4  .........      29.29 / 87.76        <spontaneous> [1]
[20]       33.4      29.29       0.00 / 29.29      __start_context
           33.4  .........      29.29 / 29.29        tbb::internal::co_local_wait_for_all(void*) [21]

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Rank    % total       Self       Self / Children   Function
           33.4  .........      29.29 / 29.29        __start_context [20]
[21]       33.4      29.29       0.00 / 29.29      tbb::internal::co_local_wait_for_all(void*)
           33.4  .........      29.29 / 62.00        tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task&, tbb::task*) [2]

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
           24.8  .........      21.78 / 55.16        tbb::interface7::internal::delegated_function<main::{lambda()#1}::operator()() const::{lambda()#1} const, void>::operator()() const [9]
[22]       24.8      21.78       0.00 / 21.78      edm::EventProcessor::EventProcessor(std::shared_ptr<edm::ProcessDesc>, edm::ServiceToken const&, edm::serviceregistry::ServiceLegacy)
           24.8  .........      21.78 / 21.78        edm::EventProcessor::init(std::shared_ptr<edm::ProcessDesc>&, edm::ServiceToken const&, edm::serviceregistry::ServiceLegacy) [23]
gartung commented 3 years ago

A similar breakdown of cumulative percentages can be seen in this text output https://cmssdt.cern.ch/SDT/jenkins-artifacts/igprof/CMSSW_11_3_X_2021-04-09-1100/slc7_amd64_gcc900/profiling/23434.21/RES_CPU_step3.txt

The difference seems to be that igprof-analyze or igprof-navigator to not show spontaneous as a valid base node.

gartung commented 3 years ago

The function tbb::internal::co_local_wait_for_all is not connected to main in call stacks and needs to be added to main to get the total.

gartung commented 3 years ago

Searching co_local_wait_for_all in the cmssw repo shows 4 issues where co_local_wait_for_all appears 2 or third from the bottom of stack traces. It appears after __start_context which is why it is not linked to main, https://github.com/cms-sw/cmssw/search?q=co_local_wait_for_all&type=issues

gartung commented 3 years ago

Besides __start_context there is a second glibc function that connects to co_local_wait_for_all __correctly_grouped_prefixwc

gartung commented 3 years ago

For instance _start_context calls __correctly_grouped_prefixwc which calls co_local_wait_for_all http://cms-reco-profiling.web.cern.ch/cms-reco-profiling/cgi-bin/igprof-navigator/releases/CMSSW_11_3_0_pre5/23434.21/step3/cpu_endjob/12 which calls tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task&, tbb::task*) http://cms-reco-profiling.web.cern.ch/cms-reco-profiling/cgi-bin/igprof-navigator/releases/CMSSW_11_3_0_pre5/23434.21/step3/cpu_endjob/3

gartung commented 3 years ago

I found a second instance of local_wait_for_all in the text report as indicated by '2 added to the end

            0.0  .........       0.00 / 0.00         tbb::interface7::internal::delegated_function<ROOT::TThreadExecutor::ParallelFor(unsigned int, unsigned int, unsigned int, std::function<void (unsigned int)> const&)::{lambda()#1}::operator()() const::{lambda()#1} const, void>::operator()() const [7472]
            0.3  .........       0.25 / 0.25         tbb::interface7::internal::delegated_function<HGCalCLUEAlgoT<HGCalLayerTilesT<HGCalTilesConstants> >::makeClusters()::{lambda()#1} const, void>::operator()() const [998]
[988]       0.3       0.25       0.00 / 0.25       tbb::internal::generic_scheduler::local_spawn_root_and_wait(tbb::task*, tbb::task*&)
            0.3  .........       0.25 / 0.25         tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task&, tbb::task*)'2 [1000]
            0.0  .........       0.00 / 62.32        tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task&, tbb::task*) [2]

which does not appear in the sql3 database used by igprof0-navigator.

gartung commented 3 years ago

Perf report also shows less than 100%

image

gartung commented 3 years ago

perf text report

+   61.18%     0.00%  cmsRun           libtbb_preview.so.2                                             [.] tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::process_bypass_loop
+   60.35%     0.00%  cmsRun           libtbb_preview.so.2                                             [.] tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all
+   53.37%     0.00%  cmsRun           libFWCoreFramework.so                                           [.] edm::convertException::wrap<bool edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTra
+   53.32%     0.00%  cmsRun           libFWCoreFramework.so                                           [.] edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >
+   53.15%     0.00%  cmsRun           libFWCoreFramework.so                                           [.] edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >
+   52.45%     0.00%  cmsRun           libFWCoreFramework.so                                           [.] edm::EventProcessor::processLumis
+   52.31%     0.00%  cmsRun           libFWCoreFramework.so                                           [.] edm::EventProcessor::runToCompletion
+   50.64%     0.00%  cmsRun           libtbb_preview.so.2                                             [.] tbb::interface7::internal::task_arena_base::internal_execute
+   50.54%     0.00%  cmsRun           cmsRun                                                          [.] tbb::interface7::internal::delegated_function<main::{lambda()#1}::operator()() const::{lambda()#1} const, void>::operator()
+   48.71%     0.00%  cmsRun           cmsRun                                                          [.] main::{lambda()#1}::operator()
+   34.86%     0.00%  cmsRun           libFWCoreFramework.so                                           [.] edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute
+   34.34%     0.00%  cmsRun           libFWCoreConcurrency.so                                         [.] tbb::internal::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute
+   34.26%     0.00%  cmsRun           libFWCoreFramework.so                                           [.] edm::stream::EDProducerAdaptorBase::doEvent
+   34.01%     0.00%  cmsRun           libFWCoreFramework.so                                           [.] edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo
+   26.60%     0.00%  cmsRun           libFWCoreConcurrency.so                                         [.] tbb::internal::function_task<edm::SerialTaskQueue::spawn(edm::SerialTaskQueue::TaskBase&)::{lambda()#1}>::execute
+   21.57%     0.00%  cmsRun           cmsRun                                                          [.] main
+   19.05%     0.00%  cmsRun           libc-2.17.so                                                    [.] __libc_start_main
+   18.15%     0.00%  cmsRun           libIOPoolOutput.so                                              [.] edm::RootOutputFile::fillBranches
+   18.15%     0.00%  cmsRun           libTree.so                                                      [.] TTree::Fill
+   18.15%     0.00%  cmsRun           libFWCoreFramework.so                                           [.] edm::WorkerT<edm::one::OutputModuleBase>::implDo
+   18.15%     0.00%  cmsRun           libFWCoreFramework.so                                           [.] edm::one::OutputModuleBase::doEvent
+   18.15%     0.00%  cmsRun           libIOPoolOutput.so                                              [.] edm::PoolOutputModule::write
+   18.15%     0.00%  cmsRun           libIOPoolOutput.so                                              [.] edm::RootOutputFile::writeOne
+   18.15%     0.00%  cmsRun           libtbb_preview.so.2                                             [.] tbb::interface7::internal::isolate_within_arena
+   18.15%     0.00%  cmsRun           libIOPoolOutput.so                                              [.] edm::RootOutputTree::fillTree
+   18.15%     0.00%  cmsRun           libFWCoreFramework.so                                           [.] edm::SerialTaskQueue::QueuedTask<void edm::SerialTaskQueueChain::push<edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm:
+   18.15%     0.00%  cmsRun           libFWCoreFramework.so                                           [.] edm::SerialTaskQueueChain::actionToRun<edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute
+   18.00%     0.00%  cmsRun           libCore.so                                                      [.] R__zipMultipleAlgorithm
+   18.00%     0.00%  cmsRun           libCore.so                                                      [.] R__zipLZMA
+   17.98%     0.00%  cmsRun           liblzma.so.5.2.4                                                [.] lzma_code
+   17.98%     0.00%  cmsRun           liblzma.so.5.2.4                                                [.] 0x00007fded255d9c4
+   17.92%     0.00%  cmsRun           liblzma.so.5.2.4                                                [.] 0x00007fded255beda
+   17.92%     0.00%  cmsRun           libTree.so                                                      [.] TBasket::WriteBuffer
+   17.90%     0.00%  cmsRun           liblzma.so.5.2.4                                                [.] 0x00007fded2564a11
+   17.89%     0.00%  cmsRun           liblzma.so.5.2.4                                                [.] 0x00007fded256fb6a
+   17.55%     0.00%  cmsRun           libImt.so                                                       [.] ROOT::Experimental::TTaskGroup::Wait
+   17.55%     0.00%  cmsRun           libImt.so                                                       [.] tbb::internal::function_task<std::function<void ()> >::execute
+   17.55%     0.00%  cmsRun           libTree.so                                                      [.] std::_Function_handler<void (), void ROOT::Internal::TBranchIMTHelper::Run<TBranch::WriteBasketImpl(TBasket*, int, ROOT::Internal::TBranchIMTHelp
+   15.88%     0.00%  cmsRun           liblzma.so.5.2.4                                                [.] 0x00007fded25675ed
+   15.68%     0.00%  cmsRun           cmsRun                                                          [.] _start
+    8.44%     0.00%  cmsRun           libRecoTrackerCkfPattern.so                                     [.] cms::CkfTrackCandidateMakerBase::produceBase
+    8.28%     0.01%  cmsRun           libRecoTrackerCkfPattern.so                                     [.] cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&)::{lambda(unsigned long)#1}::operator()
+    7.28%     0.00%  cmsRun           libtbb_preview.so.2                                             [.] tbb::internal::co_local_wait_for_all
+    7.27%     0.00%  cmsRun           libc-2.17.so                                                    [.] 0x00007fded11e21df
+    5.78%     0.00%  cmsRun           liblzma.so.5.2.4                                                [.] 0x00007fded25658d2
+    5.21%     0.03%  cmsRun           libDetectorDescriptionCore.so                                   [.] DDFilteredView::next
+    5.06%     0.01%  cmsRun           pluginRecoTrackerCkfPatternPlugins.so                           [.] GroupedCkfTrajectoryBuilder::groupedLimitedCandidates
+    5.03%     0.07%  cmsRun           pluginRecoTrackerCkfPatternPlugins.so                           [.] GroupedCkfTrajectoryBuilder::advanceOneLayer
+    5.02%     0.00%  cmsRun           liblzma.so.5.2.4                                                [.] 0x00007fded2569dbc
+    4.59%     0.09%  cmsRun           libRecoPixelVertexingPixelTriplets.so                           [.] CAHitQuadrupletGenerator::hitNtuplets
+    4.57%     0.00%  cmsRun           pluginRecoPixelVertexingPixelTripletsPlugins.so                 [.] CAHitNtupletEDProducerT<CAHitQuadrupletGenerator>::produce
+    4.29%     3.57%  cmsRun           libRecoPixelVertexingPixelTriplets.so                           [.] CellularAutomaton::createAndConnectCells
+    4.16%     0.05%  cmsRun           pluginRecoTrackerCkfPatternPlugins.so                           [.] TrajectorySegmentBuilder::segments
+    4.01%     0.00%  cmsRun           pluginRecoTrackerCkfPatternPlugins.so                           [.] GroupedCkfTrajectoryBuilder::buildTrajectories
+    3.98%     0.00%  cmsRun           liblzma.so.5.2.4                                                [.] 0x00007fded2566701
+    3.97%     1.21%  cmsRun           libRecoEgammaEgammaElectronAlgos.so                             [.] PixelHitMatcher::operator()
+    3.87%     0.01%  cmsRun           pluginTrackingToolsTrackFittersPlugins.so                       [.] (anonymous namespace)::KFFittingSmoother::fitOne
+    3.77%     0.00%  cmsRun           pluginGeometryMTDNumberingBuilderPlugins.so                     [.] edm::SerialTaskQueue::QueuedTask<void edm::SerialTaskQueueChain::push<edm::eventsetup::Callback<MTDGeometricTimingDetESModule, std::unique_ptr<Ge
+    3.77%     0.00%  cmsRun           pluginGeometryMTDNumberingBuilderPlugins.so                     [.] edm::SerialTaskQueueChain::actionToRun<edm::eventsetup::Callback<MTDGeometricTimingDetESModule, std::unique_ptr<GeometricTimingDet, std::default_
+    3.77%     0.00%  cmsRun           pluginGeometryMTDNumberingBuilderPlugins.so                     [.] edm::eventsetup::Callback<MTDGeometricTimingDetESModule, std::unique_ptr<GeometricTimingDet, std::default_delete<GeometricTimingDet> >, IdealGeom
+    3.77%     0.00%  cmsRun           pluginGeometryMTDNumberingBuilderPlugins.so                     [.] edm::convertException::wrap<edm::eventsetup::Callback<MTDGeometricTimingDetESModule, std::unique_ptr<GeometricTimingDet, std::default_delete<Geom
+    3.77%     0.00%  cmsRun           pluginGeometryMTDNumberingBuilderPlugins.so                     [.] MTDGeometricTimingDetESModule::produce
+    3.77%     0.00%  cmsRun           pluginGeometryMTDNumberingBuilderPlugins.so                     [.] DDCmsMTDConstruction::construct
+    3.75%     0.39%  cmsRun           pluginGeometryMTDNumberingBuilderPlugins.so                     [.] DDNameFilter::accept
+    3.66%     0.12%  cmsRun           libTrackingToolsMaterialEffects.so                              [.] PropagatorWithMaterial::propagateWithPath
+    3.47%     0.00%  cmsRun           pluginRecoTrackerTrackProducerPlugins.so                        [.] GsfTrackProducer::produce
+    3.33%     0.00%  cmsRun           libFWCoreFramework.so                                           [.] edm::Factory::makeModule
+    3.33%     0.00%  cmsRun           libFWCoreFramework.so                                           [.] edm::ModuleRegistry::getModule
gartung commented 3 years ago

The fun fact is that using the TBB test executable test_resumable_tasks.exe I get similar Igprof data

and the perf data seems to be missing 30%

+   67.35%     0.12%  test_resumable_  libtbb_preview.so.2       [.] tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all
+   56.81%    30.57%  test_resumable_  libtbb_preview.so.2       [.] tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::receive_or_steal_task
+   53.80%     0.22%  test_resumable_  libtbb_preview.so.2       [.] tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::process_bypass_loop
+   47.91%     0.06%  test_resumable_  libtbb_preview.so.2       [.] tbb::internal::co_local_wait_for_all
+   41.77%     0.21%  test_resumable_  test_resumable_tasks.exe  [.] tbb::internal::function_task<TestCleanupMaster()::{lambda(int)#1}::operator()(int) const::{lambda()#1}>::execute
+   39.77%     0.05%  test_resumable_  libtbb_preview.so.2       [.] sched_yield@plt
+   32.23%     0.01%  test_resumable_  libtbb_preview.so.2       [.] tbb::internal::market::process
+   28.30%     0.04%  test_resumable_  libtbb_preview.so.2       [.] tbb::internal::arena::process
+   24.77%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f794f92
+   20.00%    20.00%  test_resumable_  [kernel.kallsyms]         [k] sysret_check
+   18.87%     0.02%  test_resumable_  libtbb_preview.so.2       [.] tbb::internal::rml::private_worker::run
+   11.67%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f795095
+   10.35%     0.34%  test_resumable_  libtbb_preview.so.2       [.] tbb::internal::internal_suspend
+    9.25%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f0d9d9c
+    9.24%     0.16%  test_resumable_  libtbb_preview.so.2       [.] tbb::internal::concurrent_queue_base_v3::internal_pop
+    8.35%     0.00%  test_resumable_  test_resumable_tasks.exe  [.] AsyncActivity::asyncLoop
+    8.33%     0.01%  test_resumable_  libtbb_preview.so.2       [.] tbb::internal::generic_scheduler::create_worker
+    8.14%     0.26%  test_resumable_  libtbb_preview.so.2       [.] tbb::internal::micro_queue::pop
+    8.11%     8.11%  test_resumable_  [kernel.kallsyms]         [k] system_call_after_swapgs
+    8.03%     0.08%  test_resumable_  libtbb_preview.so.2       [.] tbb::internal::concurrent_queue_base_v3::internal_insert_item
+    6.94%     0.01%  test_resumable_  test_resumable_tasks.exe  [.] tbb::internal::suspend_callback<SuspendBody>
+    6.71%     0.02%  test_resumable_  libtbb_preview.so.2       [.] tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::allocate_scheduler
+    6.30%     0.03%  test_resumable_  libtbb_preview.so.2       [.] tbb::internal::generic_scheduler::generic_scheduler
+    6.24%     0.01%  test_resumable_  libtbb_preview.so.2       [.] tbb::internal::create_coroutine
+    6.13%     0.07%  test_resumable_  test_resumable_tasks.exe  [.] tbb::interface9::internal::start_for<tbb::blocked_range<int>, tbb::internal::parallel_for_body<ParForSuspendBody, int>, tbb::affinity_partitioner>::execute
+    6.07%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f786aaa
+    5.89%     0.15%  test_resumable_  libtbb_preview.so.2       [.] tbb::internal::micro_queue::push
+    5.78%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f794fd8
+    5.69%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f794eb9
+    4.86%     0.02%  test_resumable_  libtbb_preview.so.2       [.] tbb::internal::arena::free_arena
+    4.78%     0.02%  test_resumable_  libtbb_preview.so.2       [.] tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::~custom_scheduler
+    4.76%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f1fcbf2
+    4.75%     0.00%  test_resumable_  libtbb_preview.so.2       [.] munmap@plt
+    4.64%     0.00%  test_resumable_  libtbb_preview.so.2       [.] tbb::internal::generic_scheduler::destroy
+    4.00%     0.00%  test_resumable_  libtbb_preview.so.2       [.] syscall@plt
+    3.90%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f78655d
+    3.90%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f397597
+    3.67%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f0e9942
+    3.59%     0.00%  test_resumable_  libtbb_preview.so.2       [.] swapcontext@plt
+    3.58%     0.03%  test_resumable_  test_resumable_tasks.exe  [.] tbb::interface9::internal::start_for<tbb::blocked_range<int>, tbb::internal::parallel_for_body<TestSuspendResume()::{lambda(int)#1}::operator()(int) const::{lambda(int)#1}, int>, tbb::auto_partitioner const>::exe
+    3.50%     0.00%  test_resumable_  libpthread-2.17.so        [.] start_thread
+    3.32%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f0daefb
+    3.04%     0.00%  test_resumable_  [unknown]                 [k] 0xffffffff8f115270
+    2.68%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f1fb955
+    2.52%     0.00%  test_resumable_  libtbb_preview.so.2       [.] mmap@plt
+    2.51%     0.00%  test_resumable_  libc-2.17.so              [.] __clone
+    2.50%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f031f12
+    2.47%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f1fa8c6
+    2.43%     0.00%  test_resumable_  libtbb_preview.so.2       [.] mprotect@plt
+    2.43%     0.00%  test_resumable_  libtbb_preview.so.2       [.] tbb::internal::generic_scheduler::local_spawn_root_and_wait
+    2.42%     0.00%  test_resumable_  [unknown]                 [k] 0xffffffff8f1fe933
+    2.39%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f1fb715
+    2.36%     0.18%  test_resumable_  libtbb_preview.so.2       [.] tbb::internal::concurrent_monitor::notify_relaxed<tbb::internal::predicate_leq>
+    2.30%     1.09%  test_resumable_  libtbb_preview.so.2       [.] tbb::internal::generic_scheduler::steal_task
+    2.22%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f787229
+    2.16%     0.00%  test_resumable_  libtbb_preview.so.2       [.] tbb::internal::generic_scheduler::do_post_resume_action
+    2.10%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f795176
+    2.09%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f788805
     2.04%     2.04%  test_resumable_  [kernel.kallsyms]         [k] osq_lock
+    1.95%     0.01%  test_resumable_  test_resumable_tasks.exe  [.] tbb::interface9::internal::start_for<tbb::blocked_range<int>, tbb::internal::parallel_for_body<TestSuspendResume()::{lambda(int)#1}, int>, tbb::auto_partitioner const>::execute
+    1.93%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f1179e5
+    1.93%     0.01%  test_resumable_  libtbb_preview.so.2       [.] tbb::internal::market::arena_in_need
+    1.90%     0.02%  test_resumable_  libtbb_preview.so.2       [.] tbb::spin_rw_mutex_v3::internal_acquire_reader
+    1.89%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f1ef933
+    1.88%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f3975cb
+    1.85%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f1ef80f
+    1.85%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f1f9124
+    1.84%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f0cad35
+    1.83%     0.00%  test_resumable_  [unknown]                 [.] 0xffffffff8f1edd46
gartung commented 3 years ago

The [k] in the perf output indicates a kernel call. As far as I know igprof does not show kernel calls.

gartung commented 3 years ago

Currently rebuilding CMSSW_11_3_0_pre6 with oneTBB 2021.2. Tests of igprof with the TBB test executable test_resumable_tasks show that main still does not account for 100% cumulative time, but summing _clone and _start does.

gartung commented 3 years ago

We'll see what the TBB developers say https://github.com/oneapi-src/oneTBB/issues/384

gartung commented 3 years ago

Using Google perftools profiler library I am able to get a profile which seems to account for all activity. The pprof tools interprets the data collected and can produce text output or data suitable for producing a flamegraph.

gartung commented 3 years ago

TBB resumable tasks uses makecontext and swapcontext with coroutines. Igprof does not seem to be able to handle correctly. Rather than try to make Igprof work with these coroutines it might be easier to make pprof output Igprof compatible data.

gartung commented 3 years ago

Text output with template arguments not stripped: https://gartung.web.cern.ch/gartung/cmsRun.prof.cumul.templargs.txt.gz

gartung commented 3 years ago

When the coroutine is create the parent link is set to 0 https://github.com/oneapi-src/oneTBB/blob/9e15720bc7744f85dff611d34d65e9099e077da4/src/tbb/co_context.h#L176 When the coroutine is swapped back the thread exits after running the task. (edited) The IgProf threadWrapper sees the task return and disables the profiling timer. https://github.com/igprof/igprof/blob/03d16c15bd970256affcfde9e510a8404856b233/src/profile.cc#L835 Gperftools does not ever disables the profiling timer on the main thread.

gartung commented 3 years ago

This patch for igprof seems to resolve the undercounting problem. https://github.com/cms-externals/igprof/pull/2

gartung commented 3 years ago

https://gartung.web.cern.ch/gartung/cgi-bin/igprof-navigator/igprofCPU_step3/cumulative

gartung commented 3 years ago

Memory profiling looks more accurate after the patch https://gartung.web.cern.ch/gartung/cgi-bin/igprof-navigator/igprofMEM_step3/cumulative

jpata commented 3 years ago

solved thanks to @gartung!