A new Issue was created by @jpata Joosep Pata.
@Dr15Jones, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
assign core
New categories assigned: core
@Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks
I agree with @dan131riley that https://github.com/cms-sw/cmssw/pull/32804 is the likely culprit (given that we saw TBB "move stacks between threads", I'd easily believe IgProf gets confused by that, among other things).
@jpata Is this specific to pp? Or in other words, do mp profiles work properly?
As far as I can tell, the memory profiling output did not change qualitatively from pre4 to pre5:
The "moving between threads" should also have been stopped when I later added the use of a tbb::task_arena call.
I poked around between 11_2_X and 11_3_X IgProf reports and noticed something
In CMSSW_11_2_X_2021-03-30-2300 23434.21
But in CMSSW_11_3_X_2021-03-30-2300 23434.21
Could it be that in 11_3_X the 10_EndOfJob files somehow get overwritten by the 1 files? @gartung Could you take a look?
In the GEN-SIM job (TTbar_14TeV_TuneCP5...) the 1 and 10_EndOfJob results look different; in step4 they look identical.
We observe the same issue with igprof pp in the reco profiling, which is run independently with different scripts. From pre5, they don't look meaningful.
Thanks Joosep. Perhaps the symptom should then be interpreted as time accounting effectively being lost some time before the capture point of the 1 profiles, and nothing being accumulated after that.
It's still curious that in the GEN-SIM job time does get added after the capture point of the 1 profile.
The output is completely controlled by the runTheMatrix.py switch --profile prof, which enables igprof as a service. https://github.com/cms-sw/cms-bot/blob/9cab720ce0ad62820694846d1153b03f48d21ebe/run-ib-igprof#L19 For the memory profiling the commands are generated with --dryRun and then cmsRunGlibC is used instead.
The output name of the sqlite files is derived from the igprof service output name here https://github.com/cms-sw/cms-bot/blob/9cab720ce0ad62820694846d1153b03f48d21ebe/run-ib-igprof#L41
Luckily I kept the original gzipped tarfile output from igprof. If you click on Logs from the igprof top page you get a directory listing like this, where you can download the original igprof output. https://cmssdt.cern.ch/SDT/jenkins-artifacts/igprof/CMSSW_11_3_X_2021-03-30-2300/slc7_amd64_gcc900/pp/23434.21_TTbar_14TeV+2026D49PU_ProdLike+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14+DigiTriggerPU+RecoGlobalPU+MiniAODPU/
If Igprof is doing something strange you should be able to investigate with the original output.
Curiously, pp for run3 makes sense in pre5 (main is close to 100%):
https://cms-reco-profiling.web.cern.ch/cms-reco-profiling/cgi-bin/igprof-navigator/releases/CMSSW_11_3_0_pre5/11834.21/step3/cpu_endjob
although I'm less sure about the test in the IB https://cmssdt.cern.ch/SDT/cgi-bin/igprof-navigator/CMSSW_11_3_X_2021-03-29-2300/slc7_amd64_gcc900/profiling/11834.21/igprofCPU_step3
The run-ib-profiling job runs on cms-oc-gpu-01, while the run-ib-igprof job runs on cmsprofile-01 or cmsprofile-02, which are 4-core VMs. The release-run-reco-profiling jobs run on vocms11, which is a 32-core bare-metal machine. Since IgProf uses sampling, some of the calls might be missed on cms-oc-gpu-01 because of single-thread frequency scaling.
The main sum from the run-ib-igprof jobs run on cmsprofile-01 is even lower -- 89% https://cmssdt.cern.ch/SDT/cgi-bin/igprof-navigator/CMSSW_11_3_X_2021-03-29-2300/slc7_amd64_gcc900/pp/11834.21_TTbar_14TeV+2021PU_ProdLike+TTbar_14TeV_TuneCP5_GenSim+DigiPU+RecoPU+MiniAODPU+Nano/step3___70f2a3ea22adee383702bb184802c396___10_EndOfJob
I would like to move the run-ib-profiling job and the run-ib-igprof jobs to vocms11 so that the hardware is the same when making comparisons.
@gartung No problem for reco side, I think! Feel free to do what you need in jenkins. Just not to side-track this issue here, we can discuss in mattermost in case of need.
Apparently we still observe stacks being moved between threads (https://github.com/cms-sw/cmssw/issues/33289#issuecomment-813397608); I would guess that confuses IgProf.
I'm running some tests now that get the thread id at the top of main() and assert at the end that it hasn't changed. No hits so far, so if this is happening I don't think it is common enough to account for the igprof changes.
I noticed that runTheMatrix.py --profile generates the command igprof -t cmsRun, where -t constrains which process is profiled.
The scripts in the cms-cmpwg/profiling repo do not use this option.
If I compare the profile produced with the -t option to the profile produced without it, the percentage for main is higher, although the tbb threads still have a higher percentage.
I think I might know what is causing the time difference. Could this call to rdtsc https://github.com/oneapi-src/oneTBB/blob/9e15720bc7744f85dff611d34d65e9099e077da4/src/tbb/scheduler_common.h#L188 be interfering with this call to rdtsc https://github.com/igprof/igprof/blob/03d16c15bd970256affcfde9e510a8404856b233/src/macros.h#L70 ? The assembly instruction rdtsc loads the current value of the processor's time-stamp counter into the EDX:EAX registers (https://en.wikipedia.org/wiki/Time_Stamp_Counter). This is used by igprof to get the number of ticks: https://github.com/igprof/igprof/blob/03d16c15bd970256affcfde9e510a8404856b233/src/profile-calls.cc#L70
rdtsc is used by TBB in prolonged_pause_impl which calls machine_time_stamp which calls rdtsc. https://github.com/oneapi-src/oneTBB/blob/9e15720bc7744f85dff611d34d65e9099e077da4/src/tbb/scheduler_common.h#L197 and in prolonged_pause https://github.com/oneapi-src/oneTBB/blob/9e15720bc7744f85dff611d34d65e9099e077da4/src/tbb/scheduler_common.h#L224
Definition of machine_time_stamp here: https://github.com/oneapi-src/oneTBB/blob/9e15720bc7744f85dff611d34d65e9099e077da4/src/tbb/scheduler_common.h#L188
In igprof the ticks are returned by the add function which is used to add the time in various memory allocation calls https://github.com/igprof/igprof/blob/03d16c15bd970256affcfde9e510a8404856b233/src/profile-calls.cc#L58
I think changing the rdtsc calls to libc's clock calls could avoid the problem https://www.gnu.org/software/libc/manual/html_node/CPU-Time.html
I tried this patch with igprof and generated the text output https://gartung.web.cern.ch/gartung/igprof.cmsRun.24441.1618006530.571135.txt The main function still only shows 60% cumulative. Looking down the file to “Call tree profile (cumulative)” I see
Call tree profile (cumulative)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Rank % total Self Self / Children Function
[1] 100.0 87.76 0.00 / 87.76 <spontaneous>
66.0 ......... 57.91 / 57.91 _start [4]
33.4 ......... 29.29 / 29.29 __start_context [20]
0.5 ......... 0.41 / 0.41 clone [723]
0.1 ......... 0.10 / 0.13 PFDisplacedVertexCandidateFinder::associate(std::_List_iterator<edm::RefToBase<reco::Track> >, std::_List_iterator<edm::RefToBase<reco::Track> >, reco::PFDisplacedVertexCandidate&) [clone .localalias] [1362]
0.0 ......... 0.01 / 0.01 @{libz.so.1.2.8+78012} [4193]
0.0 ......... 0.00 / 0.00 @?0x33706574{<dynamically-generated>} [5825]
0.0 ......... 0.00 / 0.00 @{libz.so.1.2.8+77964} [5826]
0.0 ......... 0.00 / 0.00 @{libz.so.1.2.8+77995} [5827]
0.0 ......... 0.00 / 0.00 @{libz.so.1.2.8+77999} [5828]
0.0 ......... 0.00 / 0.00 @{libz.so.1.2.8+78007} [5829]
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
0.0 ......... 0.00 / 0.25 tbb::internal::generic_scheduler::local_spawn_root_and_wait(tbb::task*, tbb::task*&) [988]
0.0 ......... 0.00 / 0.36 edm::eventsetup::synchronousEventSetupForInstance(edm::IOVSyncValue const&, tbb::task_group&, edm::eventsetup::EventSetupsController&) [806]
0.0 ......... 0.03 / 0.03 tbb::internal::task_group_base::wait() [3225]
33.4 ......... 29.29 / 29.29 tbb::internal::co_local_wait_for_all(void*) [21]
37.2 ......... 32.66 / 32.66 edm::EventProcessor::processLumis(std::shared_ptr<void> const&) [11]
[2] 70.6 62.00 0.00 / 62.00 tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task&, tbb::task*)
70.6 ......... 62.00 / 62.00 tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::process_bypass_loop(tbb::internal::context_guard_helper<false>&, tbb::task*, long) [3]
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
70.6 ......... 62.00 / 62.00 tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task&, tbb::task*) [2]
[3] 70.6 62.00 0.00 / 62.00 tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::process_bypass_loop(tbb::internal::context_guard_helper<false>&, tbb::task*, long)
35.7 ......... 31.36 / 31.36 tbb::internal::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute() [16]
34.7 ......... 30.45 / 30.45 tbb::internal::function_task<edm::SerialTaskQueue::spawn(edm::SerialTaskQueue::TaskBase&)::{lambda()#1}>::execute() [19]
0.2 ......... 0.16 / 0.16 tbb::internal::function_task<edm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::{lambda()#1}>::execute() [1221]
0.0 ......... 0.01 / 0.01 tbb::internal::function_task<edm::Worker::doWorkNoPrefetchingAsync<edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)1> >(edm::WaitingTaskHolder, edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::ServiceToken const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>::execute() [4779]
0.0 ......... 0.01 / 0.01 tbb::internal::function_task<edm::Worker::doWorkNoPrefetchingAsync<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::WaitingTaskHolder, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::ServiceToken const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>::execute() [4789]
0.0 ......... 0.00 / 0.00 tbb::interface9::internal::start_for<tbb::blocked_range<unsigned int>, tbb::internal::parallel_for_body<std::function<void (unsigned int)>, unsigned int>, tbb::auto_partitioner const>::execute() [7092]
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
66.0 ......... 57.91 / 87.76 <spontaneous> [1]
[4] 66.0 57.91 0.00 / 57.91 _start
66.0 ......... 57.91 / 57.91 __libc_start_main [5]
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
66.0 ......... 57.91 / 57.91 _start [4]
[5] 66.0 57.91 0.00 / 57.91 __libc_start_main
66.0 ......... 57.91 / 57.91 main [6]
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
66.0 ......... 57.91 / 57.91 __libc_start_main [5]
[6] 66.0 57.91 0.00 / 57.91 main
66.0 ......... 57.91 / 57.91 main::{lambda()#1}::operator()() const [7]
Lower down, at [20], I see __start_context, where the "missing" cumulative total from <spontaneous> goes; the main::{lambda} eventually appears:
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
33.4 ......... 29.29 / 87.76 <spontaneous> [1]
[20] 33.4 29.29 0.00 / 29.29 __start_context
33.4 ......... 29.29 / 29.29 tbb::internal::co_local_wait_for_all(void*) [21]
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Rank % total Self Self / Children Function
33.4 ......... 29.29 / 29.29 __start_context [20]
[21] 33.4 29.29 0.00 / 29.29 tbb::internal::co_local_wait_for_all(void*)
33.4 ......... 29.29 / 62.00 tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task&, tbb::task*) [2]
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
24.8 ......... 21.78 / 55.16 tbb::interface7::internal::delegated_function<main::{lambda()#1}::operator()() const::{lambda()#1} const, void>::operator()() const [9]
[22] 24.8 21.78 0.00 / 21.78 edm::EventProcessor::EventProcessor(std::shared_ptr<edm::ProcessDesc>, edm::ServiceToken const&, edm::serviceregistry::ServiceLegacy)
24.8 ......... 21.78 / 21.78 edm::EventProcessor::init(std::shared_ptr<edm::ProcessDesc>&, edm::ServiceToken const&, edm::serviceregistry::ServiceLegacy) [23]
A similar breakdown of cumulative percentages can be seen in this text output https://cmssdt.cern.ch/SDT/jenkins-artifacts/igprof/CMSSW_11_3_X_2021-04-09-1100/slc7_amd64_gcc900/profiling/23434.21/RES_CPU_step3.txt
The difference seems to be that igprof-analyze and igprof-navigator do not show <spontaneous> as a valid base node.
The function tbb::internal::co_local_wait_for_all is not connected to main in call stacks, and needs to be added to main to get the total.
Searching for co_local_wait_for_all in the cmssw repo shows 4 issues where co_local_wait_for_all appears second or third from the bottom of stack traces. It appears after __start_context, which is why it is not linked to main: https://github.com/cms-sw/cmssw/search?q=co_local_wait_for_all&type=issues
Besides __start_context there is a second glibc function that connects to co_local_wait_for_all: __correctly_grouped_prefixwc. For instance, __start_context calls __correctly_grouped_prefixwc, which calls co_local_wait_for_all (http://cms-reco-profiling.web.cern.ch/cms-reco-profiling/cgi-bin/igprof-navigator/releases/CMSSW_11_3_0_pre5/23434.21/step3/cpu_endjob/12), which calls tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task&, tbb::task*) (http://cms-reco-profiling.web.cern.ch/cms-reco-profiling/cgi-bin/igprof-navigator/releases/CMSSW_11_3_0_pre5/23434.21/step3/cpu_endjob/3).
I found a second instance of local_wait_for_all in the text report, as indicated by the '2 added to the end:
0.0 ......... 0.00 / 0.00 tbb::interface7::internal::delegated_function<ROOT::TThreadExecutor::ParallelFor(unsigned int, unsigned int, unsigned int, std::function<void (unsigned int)> const&)::{lambda()#1}::operator()() const::{lambda()#1} const, void>::operator()() const [7472]
0.3 ......... 0.25 / 0.25 tbb::interface7::internal::delegated_function<HGCalCLUEAlgoT<HGCalLayerTilesT<HGCalTilesConstants> >::makeClusters()::{lambda()#1} const, void>::operator()() const [998]
[988] 0.3 0.25 0.00 / 0.25 tbb::internal::generic_scheduler::local_spawn_root_and_wait(tbb::task*, tbb::task*&)
0.3 ......... 0.25 / 0.25 tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task&, tbb::task*)'2 [1000]
0.0 ......... 0.00 / 62.32 tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task&, tbb::task*) [2]
which does not appear in the sqlite3 database used by igprof-navigator.
The perf report also shows main at less than 100%.
perf text report
+ 61.18% 0.00% cmsRun libtbb_preview.so.2 [.] tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::process_bypass_loop
+ 60.35% 0.00% cmsRun libtbb_preview.so.2 [.] tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all
+ 53.37% 0.00% cmsRun libFWCoreFramework.so [.] edm::convertException::wrap<bool edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTra
+ 53.32% 0.00% cmsRun libFWCoreFramework.so [.] edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >
+ 53.15% 0.00% cmsRun libFWCoreFramework.so [.] edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >
+ 52.45% 0.00% cmsRun libFWCoreFramework.so [.] edm::EventProcessor::processLumis
+ 52.31% 0.00% cmsRun libFWCoreFramework.so [.] edm::EventProcessor::runToCompletion
+ 50.64% 0.00% cmsRun libtbb_preview.so.2 [.] tbb::interface7::internal::task_arena_base::internal_execute
+ 50.54% 0.00% cmsRun cmsRun [.] tbb::interface7::internal::delegated_function<main::{lambda()#1}::operator()() const::{lambda()#1} const, void>::operator()
+ 48.71% 0.00% cmsRun cmsRun [.] main::{lambda()#1}::operator()
+ 34.86% 0.00% cmsRun libFWCoreFramework.so [.] edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute
+ 34.34% 0.00% cmsRun libFWCoreConcurrency.so [.] tbb::internal::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute
+ 34.26% 0.00% cmsRun libFWCoreFramework.so [.] edm::stream::EDProducerAdaptorBase::doEvent
+ 34.01% 0.00% cmsRun libFWCoreFramework.so [.] edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo
+ 26.60% 0.00% cmsRun libFWCoreConcurrency.so [.] tbb::internal::function_task<edm::SerialTaskQueue::spawn(edm::SerialTaskQueue::TaskBase&)::{lambda()#1}>::execute
+ 21.57% 0.00% cmsRun cmsRun [.] main
+ 19.05% 0.00% cmsRun libc-2.17.so [.] __libc_start_main
+ 18.15% 0.00% cmsRun libIOPoolOutput.so [.] edm::RootOutputFile::fillBranches
+ 18.15% 0.00% cmsRun libTree.so [.] TTree::Fill
+ 18.15% 0.00% cmsRun libFWCoreFramework.so [.] edm::WorkerT<edm::one::OutputModuleBase>::implDo
+ 18.15% 0.00% cmsRun libFWCoreFramework.so [.] edm::one::OutputModuleBase::doEvent
+ 18.15% 0.00% cmsRun libIOPoolOutput.so [.] edm::PoolOutputModule::write
+ 18.15% 0.00% cmsRun libIOPoolOutput.so [.] edm::RootOutputFile::writeOne
+ 18.15% 0.00% cmsRun libtbb_preview.so.2 [.] tbb::interface7::internal::isolate_within_arena
+ 18.15% 0.00% cmsRun libIOPoolOutput.so [.] edm::RootOutputTree::fillTree
+ 18.15% 0.00% cmsRun libFWCoreFramework.so [.] edm::SerialTaskQueue::QueuedTask<void edm::SerialTaskQueueChain::push<edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm:
+ 18.15% 0.00% cmsRun libFWCoreFramework.so [.] edm::SerialTaskQueueChain::actionToRun<edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute
+ 18.00% 0.00% cmsRun libCore.so [.] R__zipMultipleAlgorithm
+ 18.00% 0.00% cmsRun libCore.so [.] R__zipLZMA
+ 17.98% 0.00% cmsRun liblzma.so.5.2.4 [.] lzma_code
+ 17.98% 0.00% cmsRun liblzma.so.5.2.4 [.] 0x00007fded255d9c4
+ 17.92% 0.00% cmsRun liblzma.so.5.2.4 [.] 0x00007fded255beda
+ 17.92% 0.00% cmsRun libTree.so [.] TBasket::WriteBuffer
+ 17.90% 0.00% cmsRun liblzma.so.5.2.4 [.] 0x00007fded2564a11
+ 17.89% 0.00% cmsRun liblzma.so.5.2.4 [.] 0x00007fded256fb6a
+ 17.55% 0.00% cmsRun libImt.so [.] ROOT::Experimental::TTaskGroup::Wait
+ 17.55% 0.00% cmsRun libImt.so [.] tbb::internal::function_task<std::function<void ()> >::execute
+ 17.55% 0.00% cmsRun libTree.so [.] std::_Function_handler<void (), void ROOT::Internal::TBranchIMTHelper::Run<TBranch::WriteBasketImpl(TBasket*, int, ROOT::Internal::TBranchIMTHelp
+ 15.88% 0.00% cmsRun liblzma.so.5.2.4 [.] 0x00007fded25675ed
+ 15.68% 0.00% cmsRun cmsRun [.] _start
+ 8.44% 0.00% cmsRun libRecoTrackerCkfPattern.so [.] cms::CkfTrackCandidateMakerBase::produceBase
+ 8.28% 0.01% cmsRun libRecoTrackerCkfPattern.so [.] cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&)::{lambda(unsigned long)#1}::operator()
+ 7.28% 0.00% cmsRun libtbb_preview.so.2 [.] tbb::internal::co_local_wait_for_all
+ 7.27% 0.00% cmsRun libc-2.17.so [.] 0x00007fded11e21df
+ 5.78% 0.00% cmsRun liblzma.so.5.2.4 [.] 0x00007fded25658d2
+ 5.21% 0.03% cmsRun libDetectorDescriptionCore.so [.] DDFilteredView::next
+ 5.06% 0.01% cmsRun pluginRecoTrackerCkfPatternPlugins.so [.] GroupedCkfTrajectoryBuilder::groupedLimitedCandidates
+ 5.03% 0.07% cmsRun pluginRecoTrackerCkfPatternPlugins.so [.] GroupedCkfTrajectoryBuilder::advanceOneLayer
+ 5.02% 0.00% cmsRun liblzma.so.5.2.4 [.] 0x00007fded2569dbc
+ 4.59% 0.09% cmsRun libRecoPixelVertexingPixelTriplets.so [.] CAHitQuadrupletGenerator::hitNtuplets
+ 4.57% 0.00% cmsRun pluginRecoPixelVertexingPixelTripletsPlugins.so [.] CAHitNtupletEDProducerT<CAHitQuadrupletGenerator>::produce
+ 4.29% 3.57% cmsRun libRecoPixelVertexingPixelTriplets.so [.] CellularAutomaton::createAndConnectCells
+ 4.16% 0.05% cmsRun pluginRecoTrackerCkfPatternPlugins.so [.] TrajectorySegmentBuilder::segments
+ 4.01% 0.00% cmsRun pluginRecoTrackerCkfPatternPlugins.so [.] GroupedCkfTrajectoryBuilder::buildTrajectories
+ 3.98% 0.00% cmsRun liblzma.so.5.2.4 [.] 0x00007fded2566701
+ 3.97% 1.21% cmsRun libRecoEgammaEgammaElectronAlgos.so [.] PixelHitMatcher::operator()
+ 3.87% 0.01% cmsRun pluginTrackingToolsTrackFittersPlugins.so [.] (anonymous namespace)::KFFittingSmoother::fitOne
+ 3.77% 0.00% cmsRun pluginGeometryMTDNumberingBuilderPlugins.so [.] edm::SerialTaskQueue::QueuedTask<void edm::SerialTaskQueueChain::push<edm::eventsetup::Callback<MTDGeometricTimingDetESModule, std::unique_ptr<Ge
+ 3.77% 0.00% cmsRun pluginGeometryMTDNumberingBuilderPlugins.so [.] edm::SerialTaskQueueChain::actionToRun<edm::eventsetup::Callback<MTDGeometricTimingDetESModule, std::unique_ptr<GeometricTimingDet, std::default_
+ 3.77% 0.00% cmsRun pluginGeometryMTDNumberingBuilderPlugins.so [.] edm::eventsetup::Callback<MTDGeometricTimingDetESModule, std::unique_ptr<GeometricTimingDet, std::default_delete<GeometricTimingDet> >, IdealGeom
+ 3.77% 0.00% cmsRun pluginGeometryMTDNumberingBuilderPlugins.so [.] edm::convertException::wrap<edm::eventsetup::Callback<MTDGeometricTimingDetESModule, std::unique_ptr<GeometricTimingDet, std::default_delete<Geom
+ 3.77% 0.00% cmsRun pluginGeometryMTDNumberingBuilderPlugins.so [.] MTDGeometricTimingDetESModule::produce
+ 3.77% 0.00% cmsRun pluginGeometryMTDNumberingBuilderPlugins.so [.] DDCmsMTDConstruction::construct
+ 3.75% 0.39% cmsRun pluginGeometryMTDNumberingBuilderPlugins.so [.] DDNameFilter::accept
+ 3.66% 0.12% cmsRun libTrackingToolsMaterialEffects.so [.] PropagatorWithMaterial::propagateWithPath
+ 3.47% 0.00% cmsRun pluginRecoTrackerTrackProducerPlugins.so [.] GsfTrackProducer::produce
+ 3.33% 0.00% cmsRun libFWCoreFramework.so [.] edm::Factory::makeModule
+ 3.33% 0.00% cmsRun libFWCoreFramework.so [.] edm::ModuleRegistry::getModule
The fun fact is that using the TBB test executable test_resumable_tasks.exe I get similar IgProf data, and the perf data seems to be missing 30%:
+ 67.35% 0.12% test_resumable_ libtbb_preview.so.2 [.] tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all
+ 56.81% 30.57% test_resumable_ libtbb_preview.so.2 [.] tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::receive_or_steal_task
+ 53.80% 0.22% test_resumable_ libtbb_preview.so.2 [.] tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::process_bypass_loop
+ 47.91% 0.06% test_resumable_ libtbb_preview.so.2 [.] tbb::internal::co_local_wait_for_all
+ 41.77% 0.21% test_resumable_ test_resumable_tasks.exe [.] tbb::internal::function_task<TestCleanupMaster()::{lambda(int)#1}::operator()(int) const::{lambda()#1}>::execute
+ 39.77% 0.05% test_resumable_ libtbb_preview.so.2 [.] sched_yield@plt
+ 32.23% 0.01% test_resumable_ libtbb_preview.so.2 [.] tbb::internal::market::process
+ 28.30% 0.04% test_resumable_ libtbb_preview.so.2 [.] tbb::internal::arena::process
+ 24.77% 0.00% test_resumable_ [unknown] [.] 0xffffffff8f794f92
+ 20.00% 20.00% test_resumable_ [kernel.kallsyms] [k] sysret_check
+ 18.87% 0.02% test_resumable_ libtbb_preview.so.2 [.] tbb::internal::rml::private_worker::run
+ 11.67% 0.00% test_resumable_ [unknown] [.] 0xffffffff8f795095
+ 10.35% 0.34% test_resumable_ libtbb_preview.so.2 [.] tbb::internal::internal_suspend
+ 9.25% 0.00% test_resumable_ [unknown] [.] 0xffffffff8f0d9d9c
+ 9.24% 0.16% test_resumable_ libtbb_preview.so.2 [.] tbb::internal::concurrent_queue_base_v3::internal_pop
+ 8.35% 0.00% test_resumable_ test_resumable_tasks.exe [.] AsyncActivity::asyncLoop
+ 8.33% 0.01% test_resumable_ libtbb_preview.so.2 [.] tbb::internal::generic_scheduler::create_worker
+ 8.14% 0.26% test_resumable_ libtbb_preview.so.2 [.] tbb::internal::micro_queue::pop
+ 8.11% 8.11% test_resumable_ [kernel.kallsyms] [k] system_call_after_swapgs
+ 8.03% 0.08% test_resumable_ libtbb_preview.so.2 [.] tbb::internal::concurrent_queue_base_v3::internal_insert_item
+ 6.94% 0.01% test_resumable_ test_resumable_tasks.exe [.] tbb::internal::suspend_callback<SuspendBody>
+ 6.71% 0.02% test_resumable_ libtbb_preview.so.2 [.] tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::allocate_scheduler
+ 6.30% 0.03% test_resumable_ libtbb_preview.so.2 [.] tbb::internal::generic_scheduler::generic_scheduler
+ 6.24% 0.01% test_resumable_ libtbb_preview.so.2 [.] tbb::internal::create_coroutine
+ 6.13% 0.07% test_resumable_ test_resumable_tasks.exe [.] tbb::interface9::internal::start_for<tbb::blocked_range<int>, tbb::internal::parallel_for_body<ParForSuspendBody, int>, tbb::affinity_partitioner>::execute
+ 6.07% 0.00% test_resumable_ [unknown] [.] 0xffffffff8f786aaa
+ 5.89% 0.15% test_resumable_ libtbb_preview.so.2 [.] tbb::internal::micro_queue::push
+ 5.78% 0.00% test_resumable_ [unknown] [.] 0xffffffff8f794fd8
+ 5.69% 0.00% test_resumable_ [unknown] [.] 0xffffffff8f794eb9
+ 4.86% 0.02% test_resumable_ libtbb_preview.so.2 [.] tbb::internal::arena::free_arena
+ 4.78% 0.02% test_resumable_ libtbb_preview.so.2 [.] tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::~custom_scheduler
+ 4.76% 0.00% test_resumable_ [unknown] [.] 0xffffffff8f1fcbf2
+ 4.75% 0.00% test_resumable_ libtbb_preview.so.2 [.] munmap@plt
+ 4.64% 0.00% test_resumable_ libtbb_preview.so.2 [.] tbb::internal::generic_scheduler::destroy
+ 4.00% 0.00% test_resumable_ libtbb_preview.so.2 [.] syscall@plt
+ 3.90% 0.00% test_resumable_ [unknown] [.] 0xffffffff8f78655d
+ 3.90% 0.00% test_resumable_ [unknown] [.] 0xffffffff8f397597
+ 3.67% 0.00% test_resumable_ [unknown] [.] 0xffffffff8f0e9942
+ 3.59% 0.00% test_resumable_ libtbb_preview.so.2 [.] swapcontext@plt
+ 3.58% 0.03% test_resumable_ test_resumable_tasks.exe [.] tbb::interface9::internal::start_for<tbb::blocked_range<int>, tbb::internal::parallel_for_body<TestSuspendResume()::{lambda(int)#1}::operator()(int) const::{lambda(int)#1}, int>, tbb::auto_partitioner const>::exe
+ 3.50% 0.00% test_resumable_ libpthread-2.17.so [.] start_thread
+ 3.32% 0.00% test_resumable_ [unknown] [.] 0xffffffff8f0daefb
+ 3.04% 0.00% test_resumable_ [unknown] [k] 0xffffffff8f115270
+ 2.68% 0.00% test_resumable_ [unknown] [.] 0xffffffff8f1fb955
+ 2.52% 0.00% test_resumable_ libtbb_preview.so.2 [.] mmap@plt
+ 2.51% 0.00% test_resumable_ libc-2.17.so [.] __clone
+ 2.50% 0.00% test_resumable_ [unknown] [.] 0xffffffff8f031f12
+ 2.47% 0.00% test_resumable_ [unknown] [.] 0xffffffff8f1fa8c6
+ 2.43% 0.00% test_resumable_ libtbb_preview.so.2 [.] mprotect@plt
+ 2.43% 0.00% test_resumable_ libtbb_preview.so.2 [.] tbb::internal::generic_scheduler::local_spawn_root_and_wait
+ 2.42% 0.00% test_resumable_ [unknown] [k] 0xffffffff8f1fe933
+ 2.39% 0.00% test_resumable_ [unknown] [.] 0xffffffff8f1fb715
+ 2.36% 0.18% test_resumable_ libtbb_preview.so.2 [.] tbb::internal::concurrent_monitor::notify_relaxed<tbb::internal::predicate_leq>
+ 2.30% 1.09% test_resumable_ libtbb_preview.so.2 [.] tbb::internal::generic_scheduler::steal_task
+ 2.22% 0.00% test_resumable_ [unknown] [.] 0xffffffff8f787229
+ 2.16% 0.00% test_resumable_ libtbb_preview.so.2 [.] tbb::internal::generic_scheduler::do_post_resume_action
+ 2.10% 0.00% test_resumable_ [unknown] [.] 0xffffffff8f795176
+ 2.09% 0.00% test_resumable_ [unknown] [.] 0xffffffff8f788805
2.04% 2.04% test_resumable_ [kernel.kallsyms] [k] osq_lock
+ 1.95% 0.01% test_resumable_ test_resumable_tasks.exe [.] tbb::interface9::internal::start_for<tbb::blocked_range<int>, tbb::internal::parallel_for_body<TestSuspendResume()::{lambda(int)#1}, int>, tbb::auto_partitioner const>::execute
+ 1.93% 0.00% test_resumable_ [unknown] [.] 0xffffffff8f1179e5
+ 1.93% 0.01% test_resumable_ libtbb_preview.so.2 [.] tbb::internal::market::arena_in_need
+ 1.90% 0.02% test_resumable_ libtbb_preview.so.2 [.] tbb::spin_rw_mutex_v3::internal_acquire_reader
+ 1.89% 0.00% test_resumable_ [unknown] [.] 0xffffffff8f1ef933
+ 1.88% 0.00% test_resumable_ [unknown] [.] 0xffffffff8f3975cb
+ 1.85% 0.00% test_resumable_ [unknown] [.] 0xffffffff8f1ef80f
+ 1.85% 0.00% test_resumable_ [unknown] [.] 0xffffffff8f1f9124
+ 1.84% 0.00% test_resumable_ [unknown] [.] 0xffffffff8f0cad35
+ 1.83% 0.00% test_resumable_ [unknown] [.] 0xffffffff8f1edd46
The [k] in the perf output indicates a kernel call. As far as I know igprof does not show kernel calls.
Currently rebuilding CMSSW_11_3_0_pre6 with oneTBB 2021.2. Tests of igprof with the TBB test executable test_resumable_tasks show that main still does not account for 100% cumulative time, but summing _clone and _start does.
We'll see what the TBB developers say https://github.com/oneapi-src/oneTBB/issues/384
Using the Google perftools profiler library I am able to get a profile which seems to account for all activity. The pprof tool interprets the collected data and can produce text output, or data suitable for producing a flamegraph.
TBB resumable tasks use makecontext and swapcontext for coroutines, which IgProf does not seem able to handle correctly. Rather than try to make IgProf work with these coroutines, it might be easier to make pprof output IgProf-compatible data.
Text output with template arguments not stripped: https://gartung.web.cern.ch/gartung/cmsRun.prof.cumul.templargs.txt.gz
When the coroutine is created, the parent link is set to 0: https://github.com/oneapi-src/oneTBB/blob/9e15720bc7744f85dff611d34d65e9099e077da4/src/tbb/co_context.h#L176 When the coroutine is swapped back, the thread exits after running the task. The IgProf threadWrapper sees the task return and disables the profiling timer: https://github.com/igprof/igprof/blob/03d16c15bd970256affcfde9e510a8404856b233/src/profile.cc#L835 Gperftools never disables the profiling timer on the main thread.
This patch for igprof seems to resolve the undercounting problem. https://github.com/cms-externals/igprof/pull/2
Memory profiling looks more accurate after the patch https://gartung.web.cern.ch/gartung/cgi-bin/igprof-navigator/igprofMEM_step3/cumulative
solved thanks to @gartung!
There seems to be an issue with igprof pp (CPU profiling) in 11_3_X between pre4 and pre5, starting at least from March 14. The stack in 11_3_X is no longer capturing main properly and the cumulative number is not meaningful. @slava77 pointed it out to me yesterday. Things look fine in 11_2_X.
11_3_X: https://cmssdt.cern.ch/SDT/cgi-bin/igprof-navigator/CMSSW_11_3_X_2021-03-14-0000/slc7_amd64_gcc900/pp/23434.21_TTbar_14TeV+2026D49PU_ProdLike+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14+DigiTriggerPU+RecoGlobalPU+MiniAODPU/step3___580bd4d7c9a1fca4d329fb1738ee817d___10_EndOfJob 11_2_X: https://cmssdt.cern.ch/SDT/cgi-bin/igprof-navigator/CMSSW_11_2_X_2021-03-14-0000/slc7_amd64_gcc900/pp/23434.21_TTbar_14TeV+2026D49PU_ProdLike+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14+DigiTriggerPU+RecoGlobalPU+MiniAODPU/step3___580bd4d7c9a1fca4d329fb1738ee817d___10_EndOfJob
In mattermost, Daniel Sherman Riley pointed to this as a possible culprit: cms-sw/cmssw#32804
Perhaps something needs to be adjusted on the igprof side, but it'd be nice to get the profiles working again.