cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0
1.08k stars 4.32k forks source link

Test wf 20234.99 frequently failing in HGCal validation #27417

Closed fabiocos closed 5 years ago

fabiocos commented 5 years ago

In recent days, starting with CMSSW_11_0_X_2019-06-27-1100, test wf 20234.99 (old scenario D17 with PU) has started to frequently fail with a TimeOut error:

%MSG-e ExcessiveTime:  HGCalValidator:hgcalValidator 30-Jun-2019 20:00:56 CEST  Run: 1 Event: 109
ExcessiveTime: Module used 2551.57 seconds of time which exceeds the error threshold configured in the Timing Service of 600 seconds.
%MSG

This issues appears after the integration of #27337, where the HGCal validation code was updated. Even if the module does not fail, one should compare the TimeReport before this update

TimeReport   3.735532     3.735532     3.735532  hgcalValidator

with that for instance in CMSSW_11_0_X_2019-06-30-0000:

TimeReport 2106.703824  2106.703824  2106.703824  hgcalValidator

This problem should be urgently addressed.

cmsbuild commented 5 years ago

A new Issue was created by @fabiocos Fabio Cossutti.

@davidlange6, @Dr15Jones, @smuzaffar, @fabiocos, @kpedro88 can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

fabiocos commented 5 years ago

assign dqm,upgrade

@apsallid FYI

cmsbuild commented 5 years ago

New categories assigned: dqm,upgrade

@jfernan2,@andrius-k,@schneiml,@fioriNTU,@kmaeshima,@kpedro88 you have been requested to review this Pull request/Issue and eventually sign? Thanks

fabiocos commented 5 years ago

@kpedro88 we have few Phase2 tests with PU so far in the IB:

10:39 cmsdev25 1435> runTheMatrix.py -n | grep ^2 | grep D[0-9][0-9] | grep PU
20234.99 TTbar_14TeV_TuneCUETP8M1_2023D17PU_GenSimHLBeamSpotFull14+SingleNuE10_cf_2023D17PU_PremixHLBeamSpotFull14PU_Premix+DigiFullTriggerPUPRMXCombined_2023D17PU+RecoFullGlobalPUPRMX_2023D17PU+HARVESTFullGlobalPU_2023D17PU 
20261.97 SingleNuE10_cf_2023D17PU_PremixHLBeamSpotFullPU_Premix 
27634.21 TTbar_14TeV_TuneCUETP8M1_2023D35PU_GenSimHLBeamSpotFull14+DigiFullTriggerPU_2023D35PU+RecoFullGlobalPU_ProdLike_2023D35PU+MiniAODFullGlobalPU_ProdLike_2023D35PU 

and in 27634.21 (scenario D35) as far as I can see the hgcalValidation does not seem to run:

TimeReport   0.002477     0.002477     0.002477  hfreco
TimeReport   0.166499     0.166499     0.166499  hgcalDigis
TimeReport   0.869086     0.869086     0.869086  hgcalLayerClusters
TimeReport   0.530712     0.530712     0.530712  hgcalMultiClusters
TimeReport   0.030309     0.030309     0.030309  hgcalTrackCollection
TimeReport   0.003030     0.003030     0.003030  highPtTripletStepClusters

Is this desired/done on purpose? I would imagine it is not...

kpedro88 commented 5 years ago

@fabiocos yes, DQM is deliberately disabled in the 27634.21 "prod-like" workflow. This one is designed to be used for profiling (but we haven't finished setting up the other components yet).

fabiocos commented 5 years ago

@kpedro88 ok, because that workflow is meant to mimic the prod workflow. I would at this point suggest to have a more modern scenario anyway where to probe validation with PU

kpedro88 commented 5 years ago

I'll update that in the course of the big cleanup (when all the WF numbers will change anyway)

fabiocos commented 5 years ago

Running the Timing Service on 10 events of workflow 20234.99:

TimeReport> Time report complete in 21245.9 seconds
 Time Summary: 
 - Min event:   1486.45
 - Max event:   3422.01
 - Avg event:   2103.35
 - Total loop:  21096.9
 - Total init:  142.586
 - Total job:   21245.9
 - EventSetup Lock:   0.000298262
 - EventSetup Get:   137.195
 Event Throughput: 0.000474004 ev/s
 CPU Summary: 
 - Total loop:  21053.2
 - Total init:  121.809
 - Total extra: 0
 - Total job:   21180.6
 Processing Summary: 
 - Number of Events:  10
 - Number of Global Begin Lumi Calls:  1
 - Number of Global Begin Run Calls: 1

where

16:58 cmsdev25 2013> grep TimeModule cmsExec_step4_RAW2DIGI_L1Reco_RECO_RECOSIM_PAT_VALIDATION_DQM_20190702_160700.log  | grep hgcalValidator
TimeModule> 107 1 hgcalValidator HGCalValidator 1341.72
TimeModule> 101 1 hgcalValidator HGCalValidator 2242.74
TimeModule> 105 1 hgcalValidator HGCalValidator 3162.35
TimeModule> 106 1 hgcalValidator HGCalValidator 1629.87
TimeModule> 104 1 hgcalValidator HGCalValidator 1812.17
TimeModule> 108 1 hgcalValidator HGCalValidator 1569.94
TimeModule> 109 1 hgcalValidator HGCalValidator 2921.56
TimeModule> 102 1 hgcalValidator HGCalValidator 1458.72
TimeModule> 103 1 hgcalValidator HGCalValidator 1751.13
TimeModule> 115 1 hgcalValidator HGCalValidator 1339.1

and

TimeReport ------ Modules in End-Path: validation_step8 ---[Real sec]----
TimeReport  per event    per visit  Name
TimeReport   0.136016     0.136016  caloparticlevalidation
TimeReport 1922.930542  1922.930542  hgcalValidator

Running igprof PERF_TICKS analysis on the first event (taking forever):

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
            6.5  ..........    1'238.65 / 1'238.65      edm::WorkerT<edm::global::EDAnalyzerBase>::implDo(edm::EventPrincipal const&, edm::EventSetupImpl const&, edm::ModuleCallingContext const*) [37]
[38]        6.5    1'238.65        0.00 / 1'238.65    edm::global::EDAnalyzerBase::doEvent(edm::EventPrincipal const&, edm::EventSetupImpl const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*)
            6.5  ..........    1'234.27 / 1'234.27      HGCalValidator::dqmAnalyze(edm::Event const&, edm::EventSetup const&, HGCalValidatorHistograms const&) const [39]
            0.0  ..........        4.26 / 4.26          MultiTrackValidator::dqmAnalyze(edm::Event const&, edm::EventSetup const&, MultiTrackValidatorHistograms const&) const [359]
            0.0  ..........        0.11 / 0.11          CaloParticleValidation::dqmAnalyze(edm::Event const&, edm::EventSetup const&, std::unordered_map<int, Histogram_CaloParticleSingle, std::hash<int>, std::equal_
to<int>, std::allocator<std::pair<int const, Histogram_CaloParticleSingle> > > const&) const [4189]

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
            6.5  ..........    1'234.27 / 1'238.65      edm::global::EDAnalyzerBase::doEvent(edm::EventPrincipal const&, edm::EventSetupImpl const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) [38]
[39]        6.5    1'234.27        0.00 / 1'234.27    HGCalValidator::dqmAnalyze(edm::Event const&, edm::EventSetup const&, HGCalValidatorHistograms const&) const
            6.5  ..........    1'231.13 / 1'231.13      HGVHistoProducerAlgo::fill_multi_cluster_histos(HGVHistoProducerAlgoHistograms const&, int, std::vector<reco::HGCalMultiCluster, std::allocator<reco::HGCalMult
iCluster> > const&, std::vector<CaloParticle, std::allocator<CaloParticle> > const&, std::map<DetId, HGCRecHit const*, std::less<DetId>, std::allocator<std::pair<DetId const, HGCRecHit const*> > > const&, unsigned i
nt) const [40]
            0.0  ..........        3.03 / 3.03          HGVHistoProducerAlgo::fill_generic_cluster_histos(HGVHistoProducerAlgoHistograms const&, int, std::vector<reco::CaloCluster, std::allocator<reco::CaloCluster> 
> const&, std::map<DetId, float, std::less<DetId>, std::allocator<std::pair<DetId const, float> > > const&, std::vector<CaloParticle, std::allocator<CaloParticle> > const&, std::map<DetId, HGCRecHit const*, std::les
s<DetId>, std::allocator<std::pair<DetId const, HGCRecHit const*> > > const&, std::map<double, double, std::less<double>, std::allocator<std::pair<double const, double> > >, unsigned int, std::vector<int, std::alloc
ator<int> >) const [457]
            0.0  ..........        0.08 / 0.08          HGCalValidator::fillHitMap(std::map<DetId, HGCRecHit const*, std::less<DetId>, std::allocator<std::pair<DetId const, HGCRecHit const*> > >&, edm::SortedCollect
ion<HGCRecHit, edm::StrictWeakOrdering<HGCRecHit> > const&, edm::SortedCollection<HGCRecHit, edm::StrictWeakOrdering<HGCRecHit> > const&, edm::SortedCollection<HGCRecHit, edm::StrictWeakOrdering<HGCRecHit> > const&)
 const [4886]
            0.0  ..........        0.02 / 0.11          std::_Rb_tree<DetId, std::pair<DetId const, HGCRecHit const*>, std::_Select1st<std::pair<DetId const, HGCRecHit const*> >, std::less<DetId>, std::allocator<std
::pair<DetId const, HGCRecHit const*> > >::_M_erase(std::_Rb_tree_node<std::pair<DetId const, HGCRecHit const*> >*) [4299]
            0.0  ..........        0.01 / 0.01          HGVHistoProducerAlgo::fill_cluster_histos(HGVHistoProducerAlgoHistograms const&, int, reco::CaloCluster const&) const [8298]

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
            6.5  ..........    1'231.13 / 1'234.27      HGCalValidator::dqmAnalyze(edm::Event const&, edm::EventSetup const&, HGCalValidatorHistograms const&) const [39]
[40]        6.5    1'231.13        0.07 / 1'231.05    HGVHistoProducerAlgo::fill_multi_cluster_histos(HGVHistoProducerAlgoHistograms const&, int, std::vector<reco::HGCalMultiCluster, std::allocator<reco::HGCalMultiC
luster> > const&, std::vector<CaloParticle, std::allocator<CaloParticle> > const&, std::map<DetId, HGCRecHit const*, std::less<DetId>, std::allocator<std::pair<DetId const, HGCRecHit const*> > > const&, unsigned int
) const
            6.5  ..........    1'230.99 / 1'230.99      HGVHistoProducerAlgo::multiClusters_to_CaloParticles(HGVHistoProducerAlgoHistograms const&, std::vector<reco::HGCalMultiCluster, std::allocator<reco::HGCalMult
iCluster> > const&, std::vector<CaloParticle, std::allocator<CaloParticle> > const&, std::map<DetId, HGCRecHit const*, std::less<DetId>, std::allocator<std::pair<DetId const, HGCRecHit const*> > > const&, unsigned i
nt, std::vector<bool, std::allocator<bool> >) const [41]
            0.0  ..........        0.02 / 0.02          std::__detail::_Map_base<unsigned int, std::pair<unsigned int const, std::vector<unsigned int, std::allocator<unsigned int> > >, std::allocator<std::pair<unsig
ned int const, std::vector<unsigned int, std::allocator<unsigned int> > > >, std::__detail::_Select1st, std::equal_to<unsigned int>, std::hash<unsigned int>, std::__detail::_Mod_range_hashing, std::__detail::_Defaul
t_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true>, true>::operator[](unsigned int const&) [7663]
            0.0  ..........        0.01 / 2.47          TH2::Fill(double, double, double) [520]
            0.0  ..........        0.01 / 0.02          std::_Hashtable<int, std::pair<int const, ConcurrentMonitorElement>, std::allocator<std::pair<int const, ConcurrentMonitorElement> >, std::__detail::_Select1st
, std::equal_to<int>, std::hash<int>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::count(int co
nst&) const [clone .isra.741] [7129]
            0.0  ..........        0.00 / 2.60          TH1::Fill(double, double) [507]
            0.0  ..........        0.00 / 2.03          MonitorElement::Fill(double) [619]
            0.0  ..........        0.00 / 0.10          TProfile::Fill(double, double, double) [4447]
            0.0  ..........        0.00 / 0.01          hgcal::RecHitTools::zside(DetId const&) const [9372]

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
fabiocos commented 5 years ago

@apsallid the problem looks clearly located in https://cmssdt.cern.ch/lxr/source/Validation/HGCalValidation/src/HGVHistoProducerAlgo.cc#1644 Is there any progress? Otherwise I think it makes sense to temporarily revert #27337, as it is now it looks unmanageable on samples with PU

apsallid commented 5 years ago

@fabiocos Unfortunately, I am away and I cannot look for the next few days into it. My suspicion is that this is due to the current multicluster algorithm which could easily produce above 100 multiclusters from a simple single photon particle gun of 35 GeV. I imaging what this could do with PU. With TICL multiclusters i personally expect that this should not happen (@rovere , @felicepantaleo , @amartelli ). Again this is without looking at all at the specific wf with PU, which I would do first thing next Monday morning.

fabiocos commented 5 years ago

@apsallid for what I can see, the time is spent into the final loop over caloparticles starting in https://cmssdt.cern.ch/lxr/source/Validation/HGCalValidation/src/HGVHistoProducerAlgo.cc#2077

apsallid commented 5 years ago

@fabiocos Sorry for this late reply. Thank you for pointing the exact point that produced the problem. We have found a solution and I will very soon make a pull request.

fabiocos commented 5 years ago

the failing workflow has been removed by #27449 and the D41PU workflow now present is ProdLike, so no hgcalValidator. Anyway it was clear that there was an issue (that may be we might now miss...)

kpedro88 commented 5 years ago

The new workflow 20634.99 is in the IB tests and still suffers the same problem. We can test that workflow with the fix PR #27478.

fabiocos commented 5 years ago

@kpedro88 yes, that wf is based on premix instead of classical mixing, but for the practical purpose of this test it behaves in the same way. Anyway using the --what upgrade selection I have made a private test with wf 20634.0 , and it shows that the time of the hgcalValidator is drastically reduced down to a few seconds:

13:43 cmsdev25 2883> grep hgcalV cmsExec_step3_RAW2DIGI_L1Reco_RECO_RECOSIM_PAT_VALIDATION_DQM_PU_20190712_132352.log 
TimeModule> 1 1 hgcalValidator HGCalValidator 4.88448
TimeModule> 2 1 hgcalValidator HGCalValidator 2.83317
TimeModule> 3 1 hgcalValidator HGCalValidator 3.63473
kpedro88 commented 5 years ago

+upgrade

fabiocos commented 5 years ago

The problem looks to have been finally solved by #27478