Long startup time for simplest workflows

felicepantaleo commented 1 year ago

Developing reconstruction algorithms for Phase-2 is becoming more and more difficult and frustrating as CMSSW startup times have increased in the last few years. This is what happens when we want to reconstruct 10 events of a single charged pion shot in front of HGCAL with a single thread. The workflow is 24896.0_CloseByPGun_CE_E_Front_120um+2026D98 using CMSSW_13_2_0_pre3 It takes 4.5 minutes to launch the job, 30 seconds to reconstruct 10 events with a single thread. Reducing the startup time would reduce drastically the number of coffees we have every day while increasing morale.

$ date && cmsRun step3_RAW2DIGI_RECO_RECOSIM_PAT_VALIDATION_DQM.py 
**Don Okt 19 10:55:37 CEST 2023**
19-Oct-2023 10:57:51 CEST  Initiating request to open file file:step2_16065463_0.root
19-Oct-2023 10:58:06 CEST  Successfully opened file file:step2_16065463_0.root
2023-10-19 10:58:48.203903: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Begin processing the 1st record. Run 1, Event 1, LumiSection 1 on stream 0 at 19-Oct-2023 11:00:50.708 CEST
%MSG-w UndefinedPartonMode:  HadronAndPartonSelector:selectedHadronsAndPartons  19-Oct-2023 11:00:52 CEST Run: 1 Event: 1
Could not automatically determine the hadronizer type and set the correct parton selection mode. Parton-based jet flavour will not be defined.
%MSG
#--------------------------------------------------------------------------
#                         FastJet release 3.4.1
#                 M. Cacciari, G.P. Salam and G. Soyez                  
#     A software package for jet finding and analysis at colliders      
#                           http://fastjet.fr                           
#                                                                             
# Please cite EPJC72(2012)1896 [arXiv:1111.6097] if you use this package
# for scientific work and optionally PLB641(2006)57 [hep-ph/0512210].   
#                                                                       
# FastJet is provided without warranty under the GNU GPL v2 or higher.  
# It uses T. Chan's closest pair algorithm, S. Fortune's Voronoi code
# and 3rd party plugin jet algorithms. See COPYING file for details.
#--------------------------------------------------------------------------
%MSG-w UndefinedPartonMode:  HadronAndPartonSelector:patJetPartons  19-Oct-2023 11:00:52 CEST Run: 1 Event: 1
Could not automatically determine the hadronizer type and set the correct parton selection mode. Parton-based jet flavour will not be defined.
%MSG
%MSG-w UndefinedPartonMode:   HadronAndPartonSelector:selectedHadronsAndPartonsForGenJetsFlavourInfos  19-Oct-2023 11:01:05 CEST Run: 1 Event: 1
Could not automatically determine the hadronizer type and set the correct parton selection mode. Parton-based jet flavour will not be defined.
%MSG
Begin processing the 2nd record. Run 1, Event 2, LumiSection 1 on stream 0 at 19-Oct-2023 11:01:05.542 CEST
Begin processing the 3rd record. Run 1, Event 3, LumiSection 1 on stream 0 at 19-Oct-2023 11:01:08.168 CEST
Begin processing the 4th record. Run 1, Event 4, LumiSection 1 on stream 0 at 19-Oct-2023 11:01:10.733 CEST
Begin processing the 5th record. Run 1, Event 5, LumiSection 1 on stream 0 at 19-Oct-2023 11:01:13.286 CEST
Begin processing the 6th record. Run 1, Event 6, LumiSection 1 on stream 0 at 19-Oct-2023 11:01:15.435 CEST
Begin processing the 7th record. Run 1, Event 8, LumiSection 1 on stream 0 at 19-Oct-2023 11:01:17.857 CEST
Begin processing the 8th record. Run 1, Event 7, LumiSection 1 on stream 0 at 19-Oct-2023 11:01:20.059 CEST
Begin processing the 9th record. Run 1, Event 9, LumiSection 1 on stream 0 at 19-Oct-2023 11:01:22.279 CEST
Begin processing the 10th record. Run 1, Event 11, LumiSection 1 on stream 0 at 19-Oct-2023 11:01:25.115 CEST

@waredjeb @makortel fyi

cmsbuild commented 1 year ago

A new Issue was created by @felicepantaleo Felice Pantaleo.

@antoniovilela, @makortel, @Dr15Jones, @rappoccio, @smuzaffar, @sextonkennedy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel commented 1 year ago

assign core,upgrade

cmsbuild commented 1 year ago

New categories assigned: core,upgrade

@Dr15Jones,@AdrianoDee,@makortel,@smuzaffar,@srimanob you have been requested to review this Pull request/Issue and eventually sign? Thanks

Dr15Jones commented 1 year ago

So I used my recent changes to the Tracer service to gather statistics on your job. What I found was the following break down

 51s parsing python
  8.4s setting up source
 19.7s constructing the 2200 modules
  0.05s in begin job
  0.05s in begin stream
143s in CondDBESSource to find the conditions for the first Run
  0.0003s source reading begin run
 24s in global Begin Run
    1.8s in dqmCSCClient
    2.1s in XMLIdealGeometryESSource
    0.9s in hcalParameters
    1.6s in EcalBarrelGeometryEP
    0.9s in hgcalEEParametersInitialize
    1.1s in hgcalHESiParametersInitialize
    1.2s in gemGeometry
    2.7s in TrackerAlignmentRcd
    1.0s in ecalMonitorTask
    4.2s in VolumeBasedMagneticFieldESProducer
    1.0s in trackerGeometry

  0.2s in CondDBESSource to find conditions for first lumi

50s in stream begin Run
    48.3s in mtdNumberingGeometry
 0.002s source reading begin lumi
 0.02s in global begin lumi
 0.2s in stream begin lumi
 0.004s source reading first event

317s from start to start processing first event

So the obvious speed ups would be to

use a more efficient configuration file, either from .dumpPython or one where you pickle to results and always re-read from the pickle file.
choose a different way to get your conditions. You could choose https://github.com/cms-sw/cmssw/tree/master/CondCore/CondHDF5ESSource#readme
optimize mtdNumberingGeometry

srimanob commented 1 year ago

Thanks very much @Dr15Jones for analysis. FYI @cms-sw/mtd-dpg-l2 @fabiocos on mtdNumberingGeometry

makortel commented 1 year ago

I investigated the first 3 steps of the workflow with VTune (as @felicepantaleo told me it felt like all the steps have long startup time). On step 1 (GEN-SIM) the job divides clearly into three phases

The first phase is the overall job startup (shown in the top thread), taking about 102 seconds wall clock time, and 85 seconds of CPU time. The CPU time divides into

2 seconds of reading the configuration
19 seconds of constructing the EventProcessor (i.e. module construction and all that)
63 seconds spent in EventSetup
- 40 seconds spent in MuonOffsetESProducer
- 5 seconds in HcalSimParametersESModule
- 4 seconds in VolumeBasedMagneticFieldESProducerFromDB
- 4 seconds in XMLIdealGeometryESSource
- 4 seconds in HGCalParametersESModule
- 3 seconds in EcalSimParametersESModule
- rest is 1 second or below

Here is flame graph of what happens in MuonOffsetESProducer

The question is then, does it really need to be this expensive? @cms-sw/geometry-l2

Second phase (middle thread in the timeline) is the Geant4 initialization in the OscarMTMasterThread, taking about 50 seconds. I don't know if there is anything that could be easily done there, but here is the flame graph of that anyway @cms-sw/simulation-l2

Third phase (mostly in the bottom thread in the timeline) is the actual event processing (most time spent in Geant4, so shows in a separate non-TBB thread), taking about 45 seconds for 10 events.

makortel commented 1 year ago

assign geometry

cmsbuild commented 1 year ago

New categories assigned: geometry

@Dr15Jones,@civanch,@bsunanda,@makortel,@mdhildreth you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel commented 1 year ago

The step 2 (DIGI-HLT) took in total 240 seconds wall clock time and 220 second CPU time.

The configuration defines 901 EDModules, 208 ESModules, 57 (End)Paths, 116 Sequences, and 157 Tasks.

The startup phase (72 seconds CPU time) divides into

5 seconds of reading configuration
46 seconds in EventProcessor constructor
- 27 seconds in StreamSchedule constructor, i.e. loading libraries, creating modules, registering data products (of which a lot of time is spent in cling), and all that
- 12 seconds spent in Maker::makeModule, of which 8 seconds in creating edm::stream modules. This 8 seconds increases with the number of streams, but is run serially!
  - 3 seconds in L1TPFCaloProducer constructor
  - Most of the time seems to be spent in reading data from a ROOT file via l1tpf::corrector::init_()
  - 2 seconds in DTTrigPhase2Prod constructor
  - Most of the time seems to be spent in GlobalCoordsObtainer::generate_luts()
  - 2 seconds in l1tpf::PFClusterProducerFromHGC3DClusters constructor
  - Most of the time seems to be spent in creating cut parser, and creating expression parser in l1tpf::HGC3DClusterEgID constructor
  - 0.3 seconds in MixingModule constructor, rest is below that
- 1 second constructing the source

@cms-sw/l1-l2 Is there anything that could be done to speed up the constructors of the aforementioned 3 L1T modules?

Next is beginRun transition (including most, if not all, of EventSetup), about 100 seconds of CPU time, divided into

EventSetup
- 58 seconds in MTDGeometricTimingDetESModule::produce() (which @Dr15Jones also noted above)
- 7.0 seconds in getting the CondDB data
- notably 4 seconds for SiPixelGainCalibrationOffline, rest are below 1 second
- 4.3 seconds in magneticfield::VolumeBasedMagneticFieldESProducerFromDB::produce()
- 4.3 seconds in XMLIdealGeometryESSource::produceGeom()
- 4.0 seconds in HGCalParametersESModule::produce()
- 3.3 seconds in TrackerDigiGeometryESModule::produce()
- 2.7 seconds in CaloGeometryEP<EcalBarrelGeometry, DDCompactView>::produceAligned()
- 2.0 seconds in DTGeometryESModule::produce()
- 1.7 seconds in CSCGeometryESModule::produce()
- 1.7 seconds in GEMGeometryESModule::produce()
- 1.5 seconds in HcalParametersESModule::produce()
- 1.2 seconds in HGCalGeometryESProducer::produce()
- 1.1 seconds in HcalTPGCoderULUT::produce()
- 1.0 seconds in RPCGeometryESModule::produce()
- 0.6 seconds in MuonGeometryConstantsESModule::produce()
- rest is smaller
EDModule beginRun
- 3 seconds in trklet::TrackletEventProcessor::init()

Then comes the event data processing, 63 seconds for 10 events

4.6 seconds in OutputModule writing the data
56 seconds in EDModules
- 45 seconds in MixingModule
- 41 seconds in HGCDigiProducer::finalizeEvent()
- 2.7 seconds in HcalDigiProducer::finalizeEvent()
- 0.5 seconds in EcalDigiProducer::finalizeEvent()
- rest:

makortel commented 1 year ago

assign l1

cmsbuild commented 1 year ago

New categories assigned: l1

@epalencia,@aloeliger you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel commented 1 year ago

The step 3 (RECO+PAT+VALIDATION+DQM) took 290 seconds wall clock time with 260 seconds of CPU time.

The configuration defines 5431(!) EDModules, 483 ESModules, 72 (End)Paths, 1291 Sequences, and 758 Tasks.

The startup phase took about 170 seconds wall clock time (150 seconds CPU time), divided into (in CPU time)

70 seconds in reading the configuration
- 50 seconds is spent in miniAOD_customizeAllMC() customization function
- 32 seconds is spent in toModify()
  - 19 seconds in _add_jetsPuppi()
  - 19 seconds in applyDeepBtagging()
    - 17 seconds in setupBTagging()
    - 16 seconds in addToProcessAndTask()
  - 12 seconds _add_metPuppi()
- 18 seconds is spent in runMetCorAndUncForMiniAODProduction()
- Based on the profile, most of the time seems to be spent in assignments like process.foo = cms.EDModule(...), where the process.foo already exists. In this case the framework goes to replace the module in all Tasks/Sequence/Schedule
  - 35 seconds spent in _replaceInTasks()
  - 12 seconds spent in _replaceInSequences()
  - 1 second is spent in _replaceInSchedule()
  - 1 second is spent in _delattrFromSetattr()
79 seconds in constructing EventProcessor
- 39 seconds in constructing Schedule
- 25 seconds in edm::Factory::makeModule()
  - 15 seconds in constructing EDModules
  - 5 seconds in constructing edm::stream EDModules. This 5 seconds increases with the number of streams, but is run serially!
    - 3 seconds is spent in MuonIdProducer constructor, in reading ROOT file via MuonCaloCompatibility::configure()
  - 9 seconds in loading shared libraries
- 15 seconds in creating EventSetup modules
- 13 seconds in loading the shared libraries
- 13 seconds in creating source

@cms-sw/xpog-l2 I found that the miniAOD_customizeAllMC() being expensive was reported already in https://github.com/cms-sw/cmssw/issues/20220, maybe it would be finally time to speed it up?

The beginRun transition took about 93 seconds of wall clock time (87 seconds CPU time), divided into (in CPU time)

EventSetup
- 52 seconds in MTDGeometricTimingDetESModule::produce() (also noted earlier)
- 4.5 seconds in XMLIdealGeometryESSource::produceGeom()
- 4.4 seconds in magneticfield::VolumeBasedMagneticFieldESProducerFromDB::produce()
- 3.6 seconds in HGCalParametersESModule::produce()
- 2.7 seconds in CaloGeometryEP<EcalBarrelGeometry, DDCompactView>::produceAligned()
- 1.7 seconds in GEMGeometryESModule::produce()
- 1.6 seconds in TrackerDigiGeometryESModule::produce()
- 1.3 seconds in HcalParametersESModule::produce()
- 1.2 seconds in HGCalGeometryESProducer::produce()
- 1.0 seconds in DTGeometryESModule::produce()
- rest are below 1 second
EDModules
- 4.2 seconds in edm::one EDModules
- 2.3 seconds in cscdqm::Dispatcher::book()
- 1.5 seconds in EcalDQMonitorTask::bookHistograms()
- 1.5 seconds in edm::stream EDModules
- 0.7 seconds in edm::global EDModules

Event data processing and shutdown took about 23 seconds in wall clock time (22.5 seconds in CPU time), divided into (in CPU time)

EventSetup
- 6.7 seconds in getting the CondDB data
- 3.7 seconds for SiPixelGainCalibrationOffline in SiPixelGainCalibrationOfflineRcd
- 0.9 seconds for GBRForestD in GBRDWrapperRcd
- 0.7 seconds for GBRForest in GBRWrapperRcd
- rest is less
5.5 seconds in EDModules
- 0.7 seconds in edm::one modules (looks like DQM)
- 4.7 seconds in edm::global and edm::stream modules (i.e. reconstruction etc, nothing seems to really smell here)
3.6 seconds in DelayedReader (i.e. reading input data)
2.5 seconds in output module writing data
2.5 seconds in closing the output file

makortel commented 1 year ago

assign xpog

cmsbuild commented 1 year ago

New categories assigned: xpog

@vlimant,@simonepigazzini you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel commented 1 year ago

assign db

For the cost of

143s in CondDBESSource to find the conditions for the first Run

in https://github.com/cms-sw/cmssw/issues/43062#issuecomment-1771705646

cmsbuild commented 1 year ago

New categories assigned: db

@francescobrivio,@saumyaphor4252,@perrotta,@consuegs you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel commented 1 year ago

+core

I think our part (in the analysis) is done

davidlange6 commented 1 year ago

re MuonOffsetESProducer and mtdNumberingGeometry - are these not just the unlikely module that builds the entire geometry? [which might be much faster if it were in the db]

Dr15Jones commented 1 year ago

As a test I did

> time python3 -c 'import step3_RAW2DIGI_RECO_RECOSIM_PAT_VALIDATION_DQM'
45.100u 1.567s 0:49.83 93.6%    0+0k 24+0io 0pf+0w

Then in python I did

> python3 -i step3_RAW2DIGI_RECO_RECOSIM_PAT_VALIDATION_DQM.py
>>> import pickle
>>> pickle.dump(process, open('step3.pkl', 'wb'))

and then

[cdj@cmslpc-el8-heavy01 24896.0_CloseByPGun_CE_E_Front_120um+2026D98]$ time python3 -c 'import pickle; process = pickle.load(open("step3.pkl", "rb"));'
1.833u 0.153s 0:02.16 91.6%     0+0k 0+0io 0pf+0w

so using a pickle file is 25x faster

Dr15Jones commented 1 year ago

@davidlange6

re MuonOffsetESProducer and mtdNumberingGeometry - are these not just the unlikely module that builds the entire geometry?

it also seems that Phase 2 is still using the old DD instead of DD4Hep.

davidlange6 commented 1 year ago

I believe that is true - (so startup will get slower with DD4Hep I guess)

makortel commented 1 year ago

@Dr15Jones asked for the flame graph for MTDGeometricTimingDetESModule::produce()

davidlange6 commented 1 year ago

looking at some of the others putting a theSiPixelGainCalibrationOffline.reserve(detUnitDimensions.first*(detUnitDimensions.second + detUnitDimensions.first/80);

here https://github.com/cms-sw/cmssw/blob/245daa4d7ad8e1412084c2d66381edfd8835d7ad/CalibTracker/SiPixelESProducers/plugins/SiPixelFakeGainOfflineESSource.cc#L52

probably helps a lot (I don't really understand the resize(...) followed by a memcpy going on, so there might be another factor to gain in this code

srimanob commented 1 year ago

@davidlange6

re MuonOffsetESProducer and mtdNumberingGeometry - are these not just the unlikely module that builds the entire geometry?

it also seems that Phase 2 is still using the old DD instead of DD4Hep.

Do we really gain something in performance with dd4hep? It was also a question I get from CHEP, but I never have a clear result of gaining.

By the way, we are trying to move. Last validation still see few issues, and we plan to do again with coming 13_3_0_pre4

Dr15Jones commented 1 year ago

@srimanob

Do we really gain something in performance with dd4hep?

My take is we'd rather spend time optimizing code using DD4Hep then spend the time on the obsolete DD.

mmusich commented 1 year ago

here cmssw/CalibTracker/SiPixelESProducers/plugins/SiPixelFakeGainOfflineESSource.cc

This should not even be executed in phase2

makortel commented 1 year ago

here cmssw/CalibTracker/SiPixelESProducers/plugins/SiPixelFakeGainOfflineESSource.cc

This should not even be executed in phase2

I confirm that in this workflow the SiPixelGainCalibrationOffline comes from the CondDB via cond::persistency::PayloadProxy<SiPixelGainCalibrationOffline>::loadPayload() etc.

mmusich commented 1 year ago

I confirm that in this workflow the SiPixelGainCalibrationOffline comes from the CondDB

I thought we cleaned that here https://github.com/cms-sw/cmssw/pull/42794

makortel commented 1 year ago

I confirm that in this workflow the SiPixelGainCalibrationOffline comes from the CondDB

I thought we cleaned that here #42794

The reported behavior was in CMSSW_13_2_0_pre3 (the job used 131X_mcRun4_realistic_v6 Global Tag), i.e. before that cleanup PR. So one message to @felicepantaleo and @waredjeb would be to move to more recent CMSSW.

aloeliger commented 1 year ago

The step 2 (DIGI-HLT) took in total 240 seconds wall clock time and 220 second CPU time.

The configuration defines 901 EDModules, 208 ESModules, 57 (End)Paths, 116 Sequences, and 157 Tasks.

The startup phase (72 seconds CPU time) divides into

5 seconds of reading configuration

46 seconds in EventProcessor constructor

27 seconds in StreamSchedule constructor, i.e. loading libraries, creating modules, registering data products (of which a lot of time is spent in cling), and all that

12 seconds spent in Maker::makeModule, of which 8 seconds in creating edm::stream modules. This 8 seconds increases with the number of streams, but is run serially!

3 seconds in L1TPFCaloProducer constructor

Most of the time seems to be spent in reading data from a ROOT file via l1tpf::corrector::init_()

2 seconds in DTTrigPhase2Prod constructor

Most of the time seems to be spent in GlobalCoordsObtainer::generate_luts()

2 seconds in l1tpf::PFClusterProducerFromHGC3DClusters constructor

Most of the time seems to be spent in creating cut parser, and creating expression parser in l1tpf::HGC3DClusterEgID constructor

0.3 seconds in MixingModule constructor, rest is below that

1 second constructing the source

@cms-sw/l1-l2 Is there anything that could be done to speed up the constructors of the aforementioned 3 L1T modules?

@gpetruc

Yours is the name I recall in connection with the correlator team and Phase2L1ParticleFlow (That is correct right? My apologies if not). From my look at the L1TPFCaloProducer code, 3 correctors get produced from a string, an int, and a debug bool: https://github.com/cms-sw/cmssw/blob/245daa4d7ad8e1412084c2d66381edfd8835d7ad/L1Trigger/Phase2L1ParticleFlow/plugins/L1TPFCaloProducer.cc#L73-L77

Which calls this constructor here: https://github.com/cms-sw/cmssw/blob/245daa4d7ad8e1412084c2d66381edfd8835d7ad/L1Trigger/Phase2L1ParticleFlow/src/corrector.cc#L34-L38, calling init_ with two strings and two bools here: https://github.com/cms-sw/cmssw/blob/245daa4d7ad8e1412084c2d66381edfd8835d7ad/L1Trigger/Phase2L1ParticleFlow/src/corrector.cc#L52-L83 This is mostly file checks which eventually calls another init_ here: https://github.com/cms-sw/cmssw/blob/245daa4d7ad8e1412084c2d66381edfd8835d7ad/L1Trigger/Phase2L1ParticleFlow/src/corrector.cc#L85-L144 and if the emulation flag is set, it calls another initEmulation_ here: https://github.com/cms-sw/cmssw/blob/245daa4d7ad8e1412084c2d66381edfd8835d7ad/L1Trigger/Phase2L1ParticleFlow/src/corrector.cc#L146-L175

Notably, each of the init functions does some loop through TFile keys, which would be my first instinct to check for slow down in this constructor.

Would it be possible to ask if you or one of the people responsible for Phase2L1ParticleFlow could take a look at this and see if where the slowdown occurs and if it could be reworked?

@NTrevisani

You are listed as having most recently added this line: https://github.com/cms-sw/cmssw/blob/245daa4d7ad8e1412084c2d66381edfd8835d7ad/L1Trigger/DTTriggerPhase2/plugins/DTTrigPhase2Prod.cc#L197 to DTTriggerPhase2Prod

Which in turn calls a for loop doing a long list of mathematical functions here (listed with the atan calculation): https://github.com/cms-sw/cmssw/blob/245daa4d7ad8e1412084c2d66381edfd8835d7ad/L1Trigger/DTTriggerPhase2/src/GlobalCoordsObtainer.cc#L43-L195

which seems a primary candidate for slowdown in this function. Could you, or someone responsible for this code look into the performance of this, and see if it could be reworked? If not, would it maybe be possible to cache or look up the results of these math functions/loops? At a quick pass these seem to be calculated against some constants.

@rekovic & @cerminar

You are listed as authors on PFClusterProducerFromHGC3DClusters and HGC3DClusterEgID. This one I cannot find a likely candidate for slow-down on, other than perhaps the instantiation of the TMVA and the creation of another corrector (see the first section).

Could you, or someone responsible for this code look into the performance of this and see if it could be reworked?

gpetruc commented 1 year ago

Hi,

I can have a look but not with terribly high priority - after all, we're discussing shaving off seconds of initialization, once one runs a workflow with more events it becomes irrelevant. The two PF modules mentioned have to load a bunch of calibration histograms (in several bins of eta and em fraction), and BDTs for cluster identification, so they do need some time. Also, eventually both features will anyway have to be replaced with bitwise-accurate emulators (timescale of some months, at most a year)

Giovanni

Il Ven 20 Ott 2023, 17:54 Andrew Loeliger @.***> ha scritto:

The step 2 (DIGI-HLT) took in total 240 seconds wall clock time and 220 second CPU time.

The configuration defines 901 EDModules, 208 ESModules, 57 (End)Paths, 116 Sequences, and 157 Tasks.

The startup phase (72 seconds CPU time) divides into

-

5 seconds of reading configuration

46 seconds in EventProcessor constructor
  27 seconds in StreamSchedule constructor, i.e. loading libraries,
  creating modules, registering data products (of which a lot of time is
  spent in cling), and all that
  -

     12 seconds spent in Maker::makeModule, of which 8 seconds in
     creating edm::stream modules. *This 8 seconds increases with the
     number of streams, but is run serially!*
     -

        3 seconds in L1TPFCaloProducer constructor
        - Most of the time seems to be spent in reading data from a
           ROOT file via l1tpf::corrector::init_()
        -

        2 seconds in DTTrigPhase2Prod constructor
        - Most of the time seems to be spent in
           GlobalCoordsObtainer::generate_luts()
        -

        2 seconds in l1tpf::PFClusterProducerFromHGC3DClusters
        constructor
        - Most of the time seems to be spent in creating cut parser,
           and creating expression parser in l1tpf::HGC3DClusterEgID
           constructor
        -

        0.3 seconds in MixingModule constructor, rest is below that
        -

  1 second constructing the source
@cms-sw/l1-l2 https://github.com/orgs/cms-sw/teams/l1-l2 Is there anything that could be done to speed up the constructors of the aforementioned 3 L1T modules?

@gpetruc https://github.com/gpetruc

Yours is the name I recall in connection with the correlator team and Phase2L1ParticleFlow (That is correct right? My apologies if not). From my look at the L1TPFCaloProducer code, 3 correctors get produced from a string, an int, and a debug bool: https://github.com/cms-sw/cmssw/blob/245daa4d7ad8e1412084c2d66381edfd8835d7ad/L1Trigger/Phase2L1ParticleFlow/plugins/L1TPFCaloProducer.cc#L73-L77

Which calls this constructor here: https://github.com/cms-sw/cmssw/blob/245daa4d7ad8e1412084c2d66381edfd8835d7ad/L1Trigger/Phase2L1ParticleFlow/src/corrector.cc#L34-L38, calling init with two strings and two bools here: https://github.com/cms-sw/cmssw/blob/245daa4d7ad8e1412084c2d66381edfd8835d7ad/L1Trigger/Phase2L1ParticleFlow/src/corrector.cc#L52-L83 This is mostly file checks which eventually calls another init here: https://github.com/cms-sw/cmssw/blob/245daa4d7ad8e1412084c2d66381edfd8835d7ad/L1Trigger/Phase2L1ParticleFlow/src/corrector.cc#L85-L144 and if the emulation flag is set, it calls another initEmulation_ here: https://github.com/cms-sw/cmssw/blob/245daa4d7ad8e1412084c2d66381edfd8835d7ad/L1Trigger/Phase2L1ParticleFlow/src/corrector.cc#L146-L175

Notably, each of the init functions does some loop through TFile keys, which would be my first instinct to check for slow down in this constructor.

Would it be possible to ask if you or one of the people responsible for Phase2L1ParticleFlow could take a look at this and see if where the slowdown occurs and if it could be reworked?

@NTrevisani https://github.com/NTrevisani

You are listed as having most recently added this line: https://github.com/cms-sw/cmssw/blob/245daa4d7ad8e1412084c2d66381edfd8835d7ad/L1Trigger/DTTriggerPhase2/plugins/DTTrigPhase2Prod.cc#L197 to DTTriggerPhase2Prod

Which in turn calls a for loop doing a long list of mathematical functions here (listed with the atan calculation): https://github.com/cms-sw/cmssw/blob/245daa4d7ad8e1412084c2d66381edfd8835d7ad/L1Trigger/DTTriggerPhase2/src/GlobalCoordsObtainer.cc#L43-L195

which seems a primary candidate for slowdown in this function. Could you, or someone responsible for this code look into the performance of this, and see if it could be reworked? If not, would it maybe be possible to cache or look up the results of these math functions/loops? At a quick pass these seem to be calculated against some constants.

@rekovic https://github.com/rekovic & @cerminar https://github.com/cerminar

You are listed as authors on PFClusterProducerFromHGC3DClusters and HGC3DClusterEgID. This one I cannot find a likely candidate for slow-down on, other than perhaps the instantiation of the TMVA and the creation of another corrector (see the first section).

Could you, or someone responsible for this code look into the performance of this and see if it could be reworked?

— Reply to this email directly, view it on GitHub https://github.com/cms-sw/cmssw/issues/43062#issuecomment-1772993772, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABDNWR65YPAQI4UAPYD6OXTYAKND7AVCNFSM6AAAAAA6G3HLI6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTONZSHE4TGNZXGI . You are receiving this because you were mentioned.Message ID: @.***>

vlimant commented 1 year ago

it is on our to-do list to review how PAT is constructed and simplify it together with dropping miniAOD_customizeAllMC

Dr15Jones commented 1 year ago

As an additional test, I dumped the data associated to the global tag used by the job into an HDF5 file by doing

conddb2hdf5.py --compressor zlib --output 133X_mcRun4_realistic_v1.h5conf 133X_mcRun4_realistic_v1

As the tool only handles full global tags and not individual tags, I changed the cmsRun configuration to read the global tag from the HDF5 file but had the individual tags still come from CondDBESSource. The change to the configuration was the addition of the following lines

process.GlobalTag.globaltag = ""
process.H5Tag = cms.ESSource("CondHDF5ESSource", filename = cms.untracked.string("133X_mcRun4_realistic_v1.h5conf"), globalTag = cms.string("133X_mcRun4_realistic_v1"),
                             excludeRecords = cms.vstring("SiPixelGenErrorDBObjectRcd", "SiPixelLorentzAngleRcd",
                                                                    "SiPixelLorentzAngleSimRcd","SiPixelTemplateDBObjectRcd",
                                                                    "TrackerAlignmentErrorExtendedRcd",
                                                                    "TrackerAlignmentRcd",
                                                                    "TrackerSurfaceDeformationRcd"))

Running the original job at FNAL, it took 140s for the job to retrieve all the IOVs during the begin Run phase. With this change, it now takes 1.5s.

VinInn commented 1 year ago

So for local tests one can first dump the global tag and then run. since which version H5Tag is supported?

Dr15Jones commented 1 year ago

The CondCore/CondHDF5ESSource package was first introduced in CMSSW_13_3_0_pre2. See https://github.com/cms-sw/cmssw/pull/42431.

davidlange6 commented 1 year ago

@Dr15Jones - what is the role of excludeRecords here? Just an ability to skip known to be not needed records? Is that responsible for much of the (very interesting..) speedup

Dr15Jones commented 1 year ago

what is the role of excludeRecords here? Just an ability to skip known to be not needed records? Is that responsible for much of the (very interesting..) speedup

For the configuration given, the CondDBESSource overrides the GlobalTag for those records and gives explicit tags to be used (see below). The CondHDF5ESSource (and its accompanying scripts) only handle using a GlobalTag (they could be extended to handle explicit tags as well) so I needed to still get these explicit tags (and their associated Records) from CondDBESSource. In order to avoid having HDF5 and the DB fight each other over which one would deliver the Record (one could es an ESAlias to explicitly set which one wins) I decided to just have the CondHDF5ESSource not deliver those Record at all.

    toGet = cms.VPSet(
        cms.PSet(
            connect = cms.string('frontier://FrontierProd/CMS_CONDITIONS'),
            record = cms.string('SiPixelGenErrorDBObjectRcd'),
            snapshotTime = cms.string('2021-04-17 20:00:00'),
            tag = cms.string('SiPixelGenErrorDBObject_phase2_IT_v7.0.2_25x100_v2_mc')
        ),
        cms.PSet(
            connect = cms.string('frontier://FrontierProd/CMS_CONDITIONS'),
            record = cms.string('SiPixelLorentzAngleRcd'),
            snapshotTime = cms.string('2021-03-16 20:00:00.000'),
            tag = cms.string('SiPixelLorentzAngle_phase2_T25_v0_mc')
        ),
        cms.PSet(
            connect = cms.string('frontier://FrontierProd/CMS_CONDITIONS'),
            label = cms.untracked.string('forWidth'),
            record = cms.string('SiPixelLorentzAngleRcd'),
            snapshotTime = cms.string('2020-02-23 14:00:00.000'),
            tag = cms.string('SiPixelLorentzAngle_phase2_T19_mc_forWidthEmpty')
        ),
        cms.PSet(
            connect = cms.string('frontier://FrontierProd/CMS_CONDITIONS'),
            label = cms.untracked.string('fromAlignment'),
            record = cms.string('SiPixelLorentzAngleRcd'),
            snapshotTime = cms.string('2020-02-23 14:00:00.000'),
            tag = cms.string('SiPixelLorentzAngle_phase2_T19_mc_forWidthEmpty')
        ),
        cms.PSet(
            connect = cms.string('frontier://FrontierProd/CMS_CONDITIONS'),
            record = cms.string('SiPixelLorentzAngleSimRcd'),
            snapshotTime = cms.string('2021-03-16 20:00:00.000'),
            tag = cms.string('SiPixelSimLorentzAngle_phase2_T25_v0_mc')
        ),
        cms.PSet(
            connect = cms.string('frontier://FrontierProd/CMS_CONDITIONS'),
            record = cms.string('SiPixelTemplateDBObjectRcd'),
            snapshotTime = cms.string('2021-04-17 20:00:00'),
            tag = cms.string('SiPixelTemplateDBObject_phase2_IT_v7.0.2_25x100_v2_mc')
        ),
        cms.PSet(
            connect = cms.string('frontier://FrontierProd/CMS_CONDITIONS'),
            record = cms.string('TrackerAlignmentErrorExtendedRcd'),
            snapshotTime = cms.string('2023-03-16 15:30:00'),
            tag = cms.string('TrackerAlignmentErrorsExtended_Upgrade2026_T25_design_v0')
        ),
        cms.PSet(
            connect = cms.string('frontier://FrontierProd/CMS_CONDITIONS'),
            record = cms.string('TrackerAlignmentRcd'),
            snapshotTime = cms.string('2023-03-16 15:30:00'),
            tag = cms.string('TrackerAlignment_Upgrade2026_T25_design_v0')
        ),
        cms.PSet(
            connect = cms.string('frontier://FrontierProd/CMS_CONDITIONS'),
            record = cms.string('TrackerSurfaceDeformationRcd'),
            snapshotTime = cms.string('2023-03-16 15:30:00'),
            tag = cms.string('TrackerSurfaceDeformations_Upgrade2026_Zero')
        )
    )

fabiocos commented 1 year ago

for the MTD geometry construction part, please have a look at #43124 . The DD4hep version is definitely less problematic as far as MTD is concerned, but not the production default yet. Profiling 24807.911 with IgProf:

    5.7      20.92  MTDGeometricTimingDetESModule::produce(IdealGeometryRecord const&) [122]
    5.7      20.92  DDCmsMTDConstruction::construct(cms::DDCompactView const&) [123]

fwyzard commented 1 year ago

so using a pickle file is 25x faster

@Dr15Jones I wasn't able to understand... is the pickled file faster than the original configuration build by cmsDriver.py, or faster than the fully expanded one from edmConfigDump ?

Dr15Jones commented 1 year ago

I wasn't able to understand... is the pickled file faster than the original configuration build by cmsDriver.py, or faster than the fully expanded one from edmConfigDump ?

Faster than the original one from cmsDriver.py.

civanch commented 12 months ago

+geometry

it seems that the main problem in geometry connected with MTD is fixed.

Dr15Jones commented 12 months ago

If I use CMSSW_13_3_X_2023-11-08-1100, use a pickled configuration file and use the HDF5 conditions code then for a 8 thread job (which can do some initialization in parallel) it takes 68 seconds to go from starting the job to the first event. This job processed 100 events.

[The image is from the new tracer log viewer application]

cms-sw / cmssw

Long startup time for simplest workflows #43062

5 seconds of reading configuration

46 seconds in EventProcessor constructor