Closed fabiocos closed 5 years ago
A new Issue was created by @fabiocos Fabio Cossutti.
@davidlange6, @Dr15Jones, @smuzaffar, @fabiocos, @kpedro88 can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
assign dqm,upgrade
@apsallid FYI
New categories assigned: dqm,upgrade
@jfernan2,@andrius-k,@schneiml,@fioriNTU,@kmaeshima,@kpedro88 you have been requested to review this Pull request/Issue and eventually sign? Thanks
@kpedro88 we have few Phase2 tests with PU so far in the IB:
10:39 cmsdev25 1435> runTheMatrix.py -n | grep ^2 | grep D[0-9][0-9] | grep PU
20234.99 TTbar_14TeV_TuneCUETP8M1_2023D17PU_GenSimHLBeamSpotFull14+SingleNuE10_cf_2023D17PU_PremixHLBeamSpotFull14PU_Premix+DigiFullTriggerPUPRMXCombined_2023D17PU+RecoFullGlobalPUPRMX_2023D17PU+HARVESTFullGlobalPU_2023D17PU
20261.97 SingleNuE10_cf_2023D17PU_PremixHLBeamSpotFullPU_Premix
27634.21 TTbar_14TeV_TuneCUETP8M1_2023D35PU_GenSimHLBeamSpotFull14+DigiFullTriggerPU_2023D35PU+RecoFullGlobalPU_ProdLike_2023D35PU+MiniAODFullGlobalPU_ProdLike_2023D35PU
and in 27634.21 (scenario D35) as far as I can see the hgcalValidation does not seem to run:
TimeReport 0.002477 0.002477 0.002477 hfreco
TimeReport 0.166499 0.166499 0.166499 hgcalDigis
TimeReport 0.869086 0.869086 0.869086 hgcalLayerClusters
TimeReport 0.530712 0.530712 0.530712 hgcalMultiClusters
TimeReport 0.030309 0.030309 0.030309 hgcalTrackCollection
TimeReport 0.003030 0.003030 0.003030 highPtTripletStepClusters
Is this desired/done on purpose? I would imagine it is not...
@fabiocos yes, DQM is deliberately disabled in the 27634.21 "prod-like" workflow. This one is designed to be used for profiling (but we haven't finished setting up the other components yet).
@kpedro88 ok, because that workflow is meant to mimic the prod workflow. I would at this point suggest to have a more modern scenario anyway where to probe validation with PU
I'll update that in the course of the big cleanup (when all the WF numbers will change anyway)
Running the Timing Service on 10 events of workflow 20234.99:
TimeReport> Time report complete in 21245.9 seconds
Time Summary:
- Min event: 1486.45
- Max event: 3422.01
- Avg event: 2103.35
- Total loop: 21096.9
- Total init: 142.586
- Total job: 21245.9
- EventSetup Lock: 0.000298262
- EventSetup Get: 137.195
Event Throughput: 0.000474004 ev/s
CPU Summary:
- Total loop: 21053.2
- Total init: 121.809
- Total extra: 0
- Total job: 21180.6
Processing Summary:
- Number of Events: 10
- Number of Global Begin Lumi Calls: 1
- Number of Global Begin Run Calls: 1
where
16:58 cmsdev25 2013> grep TimeModule cmsExec_step4_RAW2DIGI_L1Reco_RECO_RECOSIM_PAT_VALIDATION_DQM_20190702_160700.log | grep hgcalValidator
TimeModule> 107 1 hgcalValidator HGCalValidator 1341.72
TimeModule> 101 1 hgcalValidator HGCalValidator 2242.74
TimeModule> 105 1 hgcalValidator HGCalValidator 3162.35
TimeModule> 106 1 hgcalValidator HGCalValidator 1629.87
TimeModule> 104 1 hgcalValidator HGCalValidator 1812.17
TimeModule> 108 1 hgcalValidator HGCalValidator 1569.94
TimeModule> 109 1 hgcalValidator HGCalValidator 2921.56
TimeModule> 102 1 hgcalValidator HGCalValidator 1458.72
TimeModule> 103 1 hgcalValidator HGCalValidator 1751.13
TimeModule> 115 1 hgcalValidator HGCalValidator 1339.1
and
TimeReport ------ Modules in End-Path: validation_step8 ---[Real sec]----
TimeReport per event per visit Name
TimeReport 0.136016 0.136016 caloparticlevalidation
TimeReport 1922.930542 1922.930542 hgcalValidator
Running igprof PERF_TICKS analysis on the first event (taking forever):
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
6.5 .......... 1'238.65 / 1'238.65 edm::WorkerT<edm::global::EDAnalyzerBase>::implDo(edm::EventPrincipal const&, edm::EventSetupImpl const&, edm::ModuleCallingContext const*) [37]
[38] 6.5 1'238.65 0.00 / 1'238.65 edm::global::EDAnalyzerBase::doEvent(edm::EventPrincipal const&, edm::EventSetupImpl const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*)
6.5 .......... 1'234.27 / 1'234.27 HGCalValidator::dqmAnalyze(edm::Event const&, edm::EventSetup const&, HGCalValidatorHistograms const&) const [39]
0.0 .......... 4.26 / 4.26 MultiTrackValidator::dqmAnalyze(edm::Event const&, edm::EventSetup const&, MultiTrackValidatorHistograms const&) const [359]
0.0 .......... 0.11 / 0.11 CaloParticleValidation::dqmAnalyze(edm::Event const&, edm::EventSetup const&, std::unordered_map<int, Histogram_CaloParticleSingle, std::hash<int>, std::equal_
to<int>, std::allocator<std::pair<int const, Histogram_CaloParticleSingle> > > const&) const [4189]
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
6.5 .......... 1'234.27 / 1'238.65 edm::global::EDAnalyzerBase::doEvent(edm::EventPrincipal const&, edm::EventSetupImpl const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) [38]
[39] 6.5 1'234.27 0.00 / 1'234.27 HGCalValidator::dqmAnalyze(edm::Event const&, edm::EventSetup const&, HGCalValidatorHistograms const&) const
6.5 .......... 1'231.13 / 1'231.13 HGVHistoProducerAlgo::fill_multi_cluster_histos(HGVHistoProducerAlgoHistograms const&, int, std::vector<reco::HGCalMultiCluster, std::allocator<reco::HGCalMult
iCluster> > const&, std::vector<CaloParticle, std::allocator<CaloParticle> > const&, std::map<DetId, HGCRecHit const*, std::less<DetId>, std::allocator<std::pair<DetId const, HGCRecHit const*> > > const&, unsigned i
nt) const [40]
0.0 .......... 3.03 / 3.03 HGVHistoProducerAlgo::fill_generic_cluster_histos(HGVHistoProducerAlgoHistograms const&, int, std::vector<reco::CaloCluster, std::allocator<reco::CaloCluster>
> const&, std::map<DetId, float, std::less<DetId>, std::allocator<std::pair<DetId const, float> > > const&, std::vector<CaloParticle, std::allocator<CaloParticle> > const&, std::map<DetId, HGCRecHit const*, std::les
s<DetId>, std::allocator<std::pair<DetId const, HGCRecHit const*> > > const&, std::map<double, double, std::less<double>, std::allocator<std::pair<double const, double> > >, unsigned int, std::vector<int, std::alloc
ator<int> >) const [457]
0.0 .......... 0.08 / 0.08 HGCalValidator::fillHitMap(std::map<DetId, HGCRecHit const*, std::less<DetId>, std::allocator<std::pair<DetId const, HGCRecHit const*> > >&, edm::SortedCollect
ion<HGCRecHit, edm::StrictWeakOrdering<HGCRecHit> > const&, edm::SortedCollection<HGCRecHit, edm::StrictWeakOrdering<HGCRecHit> > const&, edm::SortedCollection<HGCRecHit, edm::StrictWeakOrdering<HGCRecHit> > const&)
const [4886]
0.0 .......... 0.02 / 0.11 std::_Rb_tree<DetId, std::pair<DetId const, HGCRecHit const*>, std::_Select1st<std::pair<DetId const, HGCRecHit const*> >, std::less<DetId>, std::allocator<std
::pair<DetId const, HGCRecHit const*> > >::_M_erase(std::_Rb_tree_node<std::pair<DetId const, HGCRecHit const*> >*) [4299]
0.0 .......... 0.01 / 0.01 HGVHistoProducerAlgo::fill_cluster_histos(HGVHistoProducerAlgoHistograms const&, int, reco::CaloCluster const&) const [8298]
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
6.5 .......... 1'231.13 / 1'234.27 HGCalValidator::dqmAnalyze(edm::Event const&, edm::EventSetup const&, HGCalValidatorHistograms const&) const [39]
[40] 6.5 1'231.13 0.07 / 1'231.05 HGVHistoProducerAlgo::fill_multi_cluster_histos(HGVHistoProducerAlgoHistograms const&, int, std::vector<reco::HGCalMultiCluster, std::allocator<reco::HGCalMultiC
luster> > const&, std::vector<CaloParticle, std::allocator<CaloParticle> > const&, std::map<DetId, HGCRecHit const*, std::less<DetId>, std::allocator<std::pair<DetId const, HGCRecHit const*> > > const&, unsigned int
) const
6.5 .......... 1'230.99 / 1'230.99 HGVHistoProducerAlgo::multiClusters_to_CaloParticles(HGVHistoProducerAlgoHistograms const&, std::vector<reco::HGCalMultiCluster, std::allocator<reco::HGCalMult
iCluster> > const&, std::vector<CaloParticle, std::allocator<CaloParticle> > const&, std::map<DetId, HGCRecHit const*, std::less<DetId>, std::allocator<std::pair<DetId const, HGCRecHit const*> > > const&, unsigned i
nt, std::vector<bool, std::allocator<bool> >) const [41]
0.0 .......... 0.02 / 0.02 std::__detail::_Map_base<unsigned int, std::pair<unsigned int const, std::vector<unsigned int, std::allocator<unsigned int> > >, std::allocator<std::pair<unsig
ned int const, std::vector<unsigned int, std::allocator<unsigned int> > > >, std::__detail::_Select1st, std::equal_to<unsigned int>, std::hash<unsigned int>, std::__detail::_Mod_range_hashing, std::__detail::_Defaul
t_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true>, true>::operator[](unsigned int const&) [7663]
0.0 .......... 0.01 / 2.47 TH2::Fill(double, double, double) [520]
0.0 .......... 0.01 / 0.02 std::_Hashtable<int, std::pair<int const, ConcurrentMonitorElement>, std::allocator<std::pair<int const, ConcurrentMonitorElement> >, std::__detail::_Select1st
, std::equal_to<int>, std::hash<int>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::count(int co
nst&) const [clone .isra.741] [7129]
0.0 .......... 0.00 / 2.60 TH1::Fill(double, double) [507]
0.0 .......... 0.00 / 2.03 MonitorElement::Fill(double) [619]
0.0 .......... 0.00 / 0.10 TProfile::Fill(double, double, double) [4447]
0.0 .......... 0.00 / 0.01 hgcal::RecHitTools::zside(DetId const&) const [9372]
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
@apsallid the problem looks clearly located in https://cmssdt.cern.ch/lxr/source/Validation/HGCalValidation/src/HGVHistoProducerAlgo.cc#1644 Is there any progress? Otherwise I think it makes sense to temporarily revert #27337, as it is now it looks unmanageable on samples with PU
@fabiocos Unfortunately, I am away and I cannot look for the next few days into it. My suspicion is that this is due to the current multicluster algorithm which could easily produce above 100 multiclusters from a simple single photon particle gun of 35 GeV. I imaging what this could do with PU. With TICL multiclusters i personally expect that this should not happen (@rovere , @felicepantaleo , @amartelli ). Again this is without looking at all at the specific wf with PU, which I would do first thing next Monday morning.
@apsallid for what I can see, the time is spent into the final loop over caloparticles starting in https://cmssdt.cern.ch/lxr/source/Validation/HGCalValidation/src/HGVHistoProducerAlgo.cc#2077
@fabiocos Sorry for this late reply. Thank you for pointing the exact point that produced the problem. We have found a solution and I will very soon make a pull request.
the failing workflow has been removed by #27449 and the D41PU workflow now present is ProdLike, so no hgcalValidator. Anyway it was clear that there was an issue (that may be we might now miss...)
The new workflow 20634.99 is in the IB tests and still suffers the same problem. We can test that workflow with the fix PR #27478.
@kpedro88 yes, that wf is based on premix instead of classical mixing, but for the practical purpose of this test it behaves in the same way. Anyway using the --what upgrade
selection I have made a private test with wf 20634.0 , and it shows that the time of the hgcalValidator is drastically reduced down to a few seconds:
13:43 cmsdev25 2883> grep hgcalV cmsExec_step3_RAW2DIGI_L1Reco_RECO_RECOSIM_PAT_VALIDATION_DQM_PU_20190712_132352.log
TimeModule> 1 1 hgcalValidator HGCalValidator 4.88448
TimeModule> 2 1 hgcalValidator HGCalValidator 2.83317
TimeModule> 3 1 hgcalValidator HGCalValidator 3.63473
+upgrade
The problem looks to have been finally solved by #27478
In recent days, starting with CMSSW_11_0_X_2019-06-27-1100, test wf 20234.99 (old scenario D17 with PU) has started to frequently fail with a TimeOut error:
This issues appears after the integration of #27337, where the HGCal validation code was updated. Even if the module does not fail, one should compare the TimeReport before this update
with that for instance in CMSSW_11_0_X_2019-06-30-0000:
This problem should be urgently addressed.