cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

Nano consumes alignments #39197

Closed tvami closed 1 year ago

tvami commented 2 years ago

The following table was produced to see what records are consumed in which step: https://twiki.cern.ch/twiki/bin/view/CMS/AlCaDBHLT2019#Table_of_conditions_for_2018_MC

As you can see, the following records are consumed in the Nano step:

Record/Label GEN SIM DIGI L1 DIGI2RAW HLT AOD MINIAOD NANOAOD
CSCAlignmentErrorExtendedRcd / Yes Yes Yes Yes Yes
CSCAlignmentRcd / Yes Yes Yes Yes Yes
CSCRecoDigiParametersRcd / Yes Yes Yes Yes Yes Yes
CSCRecoGeometryRcd / Yes Yes Yes Yes Yes Yes
DTAlignmentErrorExtendedRcd / Yes Yes Yes Yes Yes
DTAlignmentRcd / Yes Yes Yes Yes Yes
DTRecoGeometryRcd / Yes Yes Yes Yes Yes Yes
GEMRecoGeometryRcd / Yes Yes Yes Yes Yes Yes
GlobalPositionRcd / Yes Yes Yes Yes Yes Yes
IdealGeometryRecord / Yes Yes Yes Yes Yes Yes
JetCorrectionsRecord / AK4PF Yes Yes Yes
JetCorrectionsRecord / AK4PFchs Yes Yes
JetCorrectionsRecord / AK8PF Yes Yes Yes
JetCorrectionsRecord / AK8PFPuppi Yes Yes
JetResolutionRcd / AK4PFchs_phi Yes Yes
JetResolutionRcd / AK4PFchs_pt Yes Yes
JetResolutionScaleFactorRcd / AK4PF Yes Yes
MFGeometryFileRcd / 160812 Yes Yes Yes Yes Yes Yes Yes
MagFieldConfigRcd / 3.8T Yes Yes Yes Yes Yes Yes Yes
RPCRecoGeometryRcd / Yes Yes Yes Yes Yes Yes
RunInfoRcd / Yes Yes Yes Yes Yes Yes Yes
TrackerAlignmentErrorExtendedRcd / Yes Yes Yes Yes Yes
TrackerAlignmentRcd / Yes Yes Yes Yes Yes
TrackerSurfaceDeformationRcd / Yes Yes Yes Yes Yes

From this, the JEC + JER + Geometry + RunInfo records are expected; however, the Muon + tracker + global alignment records are not.
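A minimal Python sketch of the expected/unexpected split described above, applied to the record names from the table. The name-based rules are assumptions made for illustration, not an official CMSSW classification: anything containing "Alignment", plus GlobalPositionRcd and TrackerSurfaceDeformationRcd, counts as alignment, and the magnetic-field records are grouped with geometry.

```python
# Records consumed in the NANO step, copied from the table above.
NANO_RECORDS = [
    "CSCAlignmentErrorExtendedRcd", "CSCAlignmentRcd", "CSCRecoDigiParametersRcd",
    "CSCRecoGeometryRcd", "DTAlignmentErrorExtendedRcd", "DTAlignmentRcd",
    "DTRecoGeometryRcd", "GEMRecoGeometryRcd", "GlobalPositionRcd",
    "IdealGeometryRecord", "JetCorrectionsRecord", "JetResolutionRcd",
    "JetResolutionScaleFactorRcd", "MFGeometryFileRcd", "MagFieldConfigRcd",
    "RPCRecoGeometryRcd", "RunInfoRcd", "TrackerAlignmentErrorExtendedRcd",
    "TrackerAlignmentRcd", "TrackerSurfaceDeformationRcd",
]

# Assumed grouping rules (illustrative only).
EXTRA_ALIGNMENT = {"GlobalPositionRcd", "TrackerSurfaceDeformationRcd"}
EXPECTED_PREFIXES = ("JetCorrections", "JetResolution", "RunInfo", "MagField")


def classify(record):
    """Return 'alignment', 'expected' or 'other' for a record name."""
    if "Alignment" in record or record in EXTRA_ALIGNMENT:
        return "alignment"  # muon, tracker and global alignment: not expected in NANO
    if record.startswith(EXPECTED_PREFIXES) or "Geometry" in record:
        return "expected"  # JEC, JER, RunInfo, magnetic field, geometry
    return "other"  # not covered by the rules, e.g. CSCRecoDigiParametersRcd


groups = {}
for rec in NANO_RECORDS:
    groups.setdefault(classify(rec), []).append(rec)

for name in sorted(groups):
    print(f"{name}: {len(groups[name])}")
# alignment: 8
# expected: 11
# other: 1
```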

The results can be reproduced with:

cmsrel CMSSW_10_6_30
cd CMSSW_10_6_30/src
cmsenv

echo '{ "319450" : [[1, 100]] }' > step1_lumiRanges.log
(dasgoclient --limit 0 --query 'lumi,file dataset=/JetHT/Run2018C-UL2018_MiniAODv2-v1/MINIAOD run=319450' --format json | das-selected-lumis.py 1,100 ) > step1_dasquery.log

cmsDriver.py step2  --conditions auto:run2_data_promptlike -s NANO --datatier NANOAOD -n 10 --data  --era Run2_2018,run2_nanoAOD_106Xv2 --eventcontent NANOEDMAOD --filein filelist:step1_dasquery.log --lumiToProcess step1_lumiRanges.log --fileout file:step2.root --customise_commands='process.GlobalTag.DumpStat=cms.untracked.bool(True)' 

(inspired by wf 136.8523)

cmsbuild commented 2 years ago

A new Issue was created by @tvami Tamas Vami.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

tvami commented 2 years ago

assign xpog

cmsbuild commented 2 years ago

New categories assigned: xpog

@mariadalfonso,@gouskos,@swertz,@vlimant you have been requested to review this Pull request/Issue and eventually sign? Thanks

tvami commented 2 years ago

I also checked CMSSW_12_4_8 and CMSSW_12_5_X_2022-08-25-1100; the alignment records are still consumed.

mmusich commented 2 years ago

Out of curiosity, I tried to run step2 of wf 136.8523 with overridden tracker alignment records, using tags that have no conditions data for the run used in the relval, prepared via:

$ conddb_import -c sqlite_file:myAlignments.db -f frontier://FrontierProd/CMS_CONDITIONS -i TrackerAlignment_v29_offline -t Alignments -b 345747
$ conddb_import -c sqlite_file:myAlignments.db -f frontier://FrontierProd/CMS_CONDITIONS -i TrackerAlignmentExtendedErrors_v16_offline_IOVs -t AlignmentErrors -b 345747

and then patching the configuration with:

# Other statements
from Configuration.AlCa.GlobalTag import GlobalTag
process.GlobalTag = GlobalTag(process.GlobalTag, 'auto:run2_data', '')
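# Override the tracker alignment and alignment-error records with the tags imported into myAlignments.db
process.GlobalTag.toGet = cms.VPSet(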
  cms.PSet(record = cms.string("TrackerAlignmentRcd"),
           tag = cms.string("Alignments"),
           connect = cms.string("sqlite_file:myAlignments.db")
          ),
  cms.PSet(record = cms.string("TrackerAlignmentErrorExtendedRcd"),
           tag = cms.string("AlignmentErrors"),
           connect = cms.string("sqlite_file:myAlignments.db")
          )
)

in order to see which module triggers the crash. I see:

----- Begin Fatal Exception 26-Aug-2022 10:36:07 CEST-----------------------
An exception of category 'NoRecord' occurred while
   [0] Processing  Event run: 319450 lumi: 31 event: 42789123 stream: 0
   [1] Running path 'dqmoffline_step'
   [2] Prefetching for module NanoAODDQM/'nanoDQM'
   [3] Prefetching for module SimpleCandidateFlatTableProducer/'fatJetTable'
   [4] Prefetching for module PATJetRefSelector/'finalJetsAK8'
   [5] Prefetching for module PATJetUserDataEmbedder/'updatedJetsAK8WithUserData'
   [6] Prefetching for module PATJetUpdater/'updatedJetsAK8'
   [7] Prefetching for module PATJetSelector/'selectedUpdatedPatJetsAK8WithDeepInfo'
   [8] Prefetching for module PATJetUpdater/'updatedPatJetsTransientCorrectedAK8WithDeepInfo'
   [9] Prefetching for module BoostedJetONNXJetTagsProducer/'pfParticleNetMassRegressionJetTagsAK8WithDeepInfo'
   [10] Prefetching for module DeepBoostedJetTagInfoProducer/'pfParticleNetTagInfosAK8WithDeepInfo'
   [11] Prefetching for EventSetup module TransientTrackBuilderESProducer/''
   [12] Prefetching for EventSetup module GlobalTrackingGeometryESProducer/''
   [13] Calling method for EventSetup module TrackerDigiGeometryESModule/'trackerGeometryDB'
   [14] While getting dependent Record from Record TrackerDigiGeometryRecord
Exception Message:
No "TrackerAlignmentRcd" record found in the EventSetup.

 The Record is delivered by an ESSource or ESProducer but there is no valid IOV for the synchronization value.
 Please check 
   a) if the synchronization value is reasonable and report to the hypernews if it is not.
   b) else check that all ESSources have been properly configured. 
----- End Fatal Exception -------------------------------------------------

So it seems that the alignments are used because of the GlobalTrackingGeometry, which is in turn used for running some BTV taggers.
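The context lines of the exception above can be turned into the chain of modules that led to the alignment lookup. A minimal Python sketch, assuming the context lines keep exactly the "[N] Prefetching for ..." / "Calling method for ..." wording shown above; only an excerpt of the trace is used here:

```python
import re

# Matches context lines such as
#   [10] Prefetching for module DeepBoostedJetTagInfoProducer/'pfParticleNet...'
#   [13] Calling method for EventSetup module TrackerDigiGeometryESModule/'trackerGeometryDB'
CONTEXT_LINE = re.compile(
    r"^\s*\[\d+\]\s+(?:Prefetching for|Calling method for)\s+"
    r"(EventSetup module|module)\s+(\w+)/'([^']*)'"
)


def module_chain(exception_text):
    """Return (kind, C++ type, label) tuples in the order they appear in the context."""
    chain = []
    for line in exception_text.splitlines():
        match = CONTEXT_LINE.match(line)
        if match:
            kind = "EventSetup" if match.group(1) == "EventSetup module" else "Event"
            chain.append((kind, match.group(2), match.group(3)))
    return chain


EXCERPT = """
   [2] Prefetching for module NanoAODDQM/'nanoDQM'
   [9] Prefetching for module BoostedJetONNXJetTagsProducer/'pfParticleNetMassRegressionJetTagsAK8WithDeepInfo'
   [10] Prefetching for module DeepBoostedJetTagInfoProducer/'pfParticleNetTagInfosAK8WithDeepInfo'
   [11] Prefetching for EventSetup module TransientTrackBuilderESProducer/''
   [12] Prefetching for EventSetup module GlobalTrackingGeometryESProducer/''
   [13] Calling method for EventSetup module TrackerDigiGeometryESModule/'trackerGeometryDB'
   [14] While getting dependent Record from Record TrackerDigiGeometryRecord
"""

chain = module_chain(EXCERPT)
# In this trace, the last Event module listed is the one whose EventSetup
# request (TransientTrackBuilder) leads down to the tracker geometry.
last_event_module = [c for c in chain if c[0] == "Event"][-1]
print(last_event_module[1], last_event_module[2])
# DeepBoostedJetTagInfoProducer pfParticleNetTagInfosAK8WithDeepInfo
```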

mariadalfonso commented 2 years ago

Another case of alignment dependency seems to be in the PPS.

With this commit in 10_6, @nsmith was able to bypass it, i.e. by commenting out nanoSequenceOnlyData = cms.Sequence(protonTables + lhcInfoTable): https://github.com/cms-sw/cmssw/pull/39040/commits/cd90ed159e5829e0404841f478ea59615f88840f

mmusich commented 2 years ago

Another case of alignment dependency seems in the PPS

Judging from the table above, it doesn't seem that any PPS condition is used (at least for wf 136.8523).

mariadalfonso commented 2 years ago

Another case of alignment dependency seems in the PPS

judging from the table above doesn't seem any PPS condition is used (at least for wf 136.8523)

The geometry is called here. https://github.com/cms-sw/cmssw/pull/32616/files#diff-67a90255f126a36efd13f3396aa5aef7edb0ab7bd4744ece6c300e4f0ea7113bR158

mmusich commented 2 years ago

The geometry is called here.

Geometry is not alignment (which is the topic of the issue...). Quoting:

From this the JEC + JER + Geometry + RunInfo are expected, however the Muon + tracker + global alignment records are not.

mariadalfonso commented 2 years ago

so it seems that the alignment are used because of the GlobalTrackingGeometry which is in turn used for running some BTV taggers.

136.8523

OK, this is Run2, and we explicitly re-run the tagger since it wasn't in UL-mini. But for Run3 nano the re-running should be off, and the dependency on BoostedJetONNXJetTagsProducer/'pfParticleNetMassRegressionJetTagsAK8WithDeepInfo' should be gone: https://github.com/cms-sw/cmssw/blob/b54d06d5447f6bb55d1da6c42d3a413f526150a0/PhysicsTools/NanoAOD/python/jetsAK8_cff.py#L261

tvami commented 2 years ago

@mariadalfonso

ok, this is Run2 and we explicitly re-run the tagger since wasn't in UL-mini.

Right, but you are asking us to provide GTs for Run-2 data as well.

But for Run3 nano the re-running should be off and dependency from

OK, that sounds reassuring. Can you please post the cmsDriver command to reproduce the Run-3 wf? Is there a runTheMatrix wf for that too? I'll check that as well. But even if that's true, it's just half the solution to the problem.

mariadalfonso commented 2 years ago

@mariadalfonso

ok, this is Run2 and we explicitly re-run the tagger since wasn't in UL-mini.

Right but you are asking us to provide you GTs for Run-2 data as well

Tracking and PF candidates are already done in mini, so I would not expect re-running ParticleNet for b-tagging to really need the alignment; I asked the expert (@hqucms) to check whether something can be cleaned up.

mmusich commented 2 years ago

Tracking and pf-candidate is already done in mini, so I do not expect re-running the particlenet for the b-tagging should really use the alignment;

A consumes statement for TransientTrackRecord is declared here:

https://github.com/cms-sw/cmssw/blob/24e950159584b8260e4d3ca0085d9c24a31a3101/RecoBTag/FeatureTools/plugins/DeepBoostedJetTagInfoProducer.cc#L197-L198

it should be consumed here:

https://github.com/cms-sw/cmssw/blob/24e950159584b8260e4d3ca0085d9c24a31a3101/RecoBTag/FeatureTools/plugins/DeepBoostedJetTagInfoProducer.cc#L263

tvami commented 2 years ago

Isn't that https://github.com/cms-sw/cmssw/blob/master/RecoBTag/FeatureTools/plugins/DeepDoubleXTagInfoProducer.cc#L118 and then https://github.com/cms-sw/cmssw/blob/master/RecoBTag/FeatureTools/plugins/DeepDoubleXTagInfoProducer.cc#L224 ?

Or is this DeepDoubleXTagInfo not relevant for Nano?

mmusich commented 2 years ago

Or this DeepDoubleXTagInfo is not relevant for Nano?

it's not in the stack trace I posted above...

tvami commented 2 years ago

Or this DeepDoubleXTagInfo is not relevant for Nano?

it's not in the stack trace I posted above...

Right, I also didn't find anything in that stack trace; I found this other thing instead, and now I'm asking whether it's relevant or not.

mmusich commented 2 years ago

Right, in that stack trace I also didnt find anything,

I think I pointed already to the exact point in which there is the consumes statement :)

slava77 commented 2 years ago

Right, in that stack trace I also didnt find anything,

I think I pointed already to the exact point in which there is the consumes statement :)

This points to the use of TransientTrackBuilder, which is used in IPTools, the primary tool for many b-tagging algorithms. Poking around a bit, it seems that the TrackingGeometry is used mainly to build the Surfaces of the track inner/outer states.
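The chain discussed here (TransientTrackBuilder -> GlobalTrackingGeometry -> tracker geometry -> alignment) can be written as a small dependency graph and walked to list the records a consumer transitively needs. This is a simplified Python sketch: the tracker edges follow the stack trace above, while the muon geometry modules and the exact records each geometry module reads are assumptions for illustration, not the full CMSSW EventSetup graph.

```python
from collections import deque

# Simplified, partly assumed EventSetup dependency graph (see lead-in).
DEPENDS_ON = {
    "DeepBoostedJetTagInfoProducer": ["TransientTrackBuilderESProducer"],
    "TransientTrackBuilderESProducer": ["GlobalTrackingGeometryESProducer"],
    "GlobalTrackingGeometryESProducer": [
        "TrackerDigiGeometryESModule", "DTGeometryESModule", "CSCGeometryESModule",
    ],
    "TrackerDigiGeometryESModule": [
        "TrackerAlignmentRcd", "TrackerAlignmentErrorExtendedRcd",
        "TrackerSurfaceDeformationRcd", "GlobalPositionRcd",
    ],
    "DTGeometryESModule": ["DTAlignmentRcd", "DTAlignmentErrorExtendedRcd", "GlobalPositionRcd"],
    "CSCGeometryESModule": ["CSCAlignmentRcd", "CSCAlignmentErrorExtendedRcd", "GlobalPositionRcd"],
}


def reachable(start, graph):
    """Breadth-first walk returning every node reachable from `start`."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for dep in graph.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


# Keep only the records (names ending in "Rcd") among the reachable nodes.
needed = sorted(n for n in reachable("DeepBoostedJetTagInfoProducer", DEPENDS_ON) if n.endswith("Rcd"))
for rec in needed:
    print(rec)
```

Under these assumptions the walk returns the eight alignment-type records listed in the Nano column of the first table, which would be consistent with the muon records also entering through GlobalTrackingGeometry; this is not verified here.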

slava77 commented 2 years ago

Poking around a bit, it seems like the TrackingGeometry is used mainly for the track inner/outer state Surface building.

The TSOS (state on surface) built from TrackExtra may not be meaningful, though, if the surface derived from the DetId and the new geometry is inconsistent with the free state. 🤔

tvami commented 2 years ago

But for Run3 nano the re-running should be off and dependency from

Ok that sounds reassuring, can you please post the cmsDriver command to reproduce the Run-3 wf?

So I ran in CMSSW_12_5_X_2022-08-25-1100 the following:

cmsDriver.py RECO --conditions 124X_dataRun3_Prompt_v4 --datatier NANOAOD --era Run3 --eventcontent NANOEDMAOD --filein /store/data/Run2022C/DoubleMuon/MINIAOD/PromptReco-v1/000/355/862/00000/5e7b3b04-fb7b-432a-9ca5-f249163a8ced.root --fileout file:nano.root -n 10 --python_filename ReReco_Nano_0_cfg.py --scenario pp --step NANO --data --customise_commands='process.GlobalTag.DumpStat=cms.untracked.bool(True)'

Indeed, for Run-3 in master we don't have the alignment dependency, so this issue is about Run-2 only (which is anyway the era that triggered the discussion).

mariadalfonso commented 2 years ago

For completeness: can you list the records and labels used here?

tvami commented 2 years ago

for completeness: can you list the Record Label used here ?

Record / Label / Tag
GBRDWrapperRcd / electron_eb_ECALTRK / GEDelectron_track_EBCorrection_80X_EGM_v4
GBRDWrapperRcd / electron_eb_ECALTRK_lowpt / GEDelectron_track_lowpt_EBCorrection_80X_EGM_v4
GBRDWrapperRcd / electron_eb_ECALTRK_lowpt_var / GEDelectron_track_lowpt_EBUncertainty_80X_EGM_v4
GBRDWrapperRcd / electron_eb_ECALTRK_var / GEDelectron_track_EBUncertainty_80X_EGM_v4
GBRDWrapperRcd / electron_ee_ECALTRK / GEDelectron_track_EECorrection_80X_EGM_v4
GBRDWrapperRcd / electron_ee_ECALTRK_lowpt / GEDelectron_track_lowpt_EECorrection_80X_EGM_v4
GBRDWrapperRcd / electron_ee_ECALTRK_lowpt_var / GEDelectron_track_lowpt_EEUncertainty_80X_EGM_v4
GBRDWrapperRcd / electron_ee_ECALTRK_var / GEDelectron_track_EEUncertainty_80X_EGM_v4
JetCorrectionsRecord / AK4PFPuppi / JetCorrectorParametersCollection_Summer16_23Sep2016AllV4_DATA_AK4PFPuppi
JetCorrectionsRecord / AK8PFPuppi / JetCorrectorParametersCollection_Summer16_23Sep2016AllV4_DATA_AK8PFPuppi
L1GtTriggerMaskAlgoTrigRcd / / L1GtTriggerMaskAlgoTrig_CRAFT09v2_hlt
L1GtTriggerMenuRcd / / L1GtTriggerMenu_CRAFT09_hlt
L1TUtmTriggerMenuRcd / / L1TUtmTriggerMenu_Stage2v0_hlt
LHCInfoRcd / / LHCInfoEndFill_Run3_v0

vlimant commented 1 year ago

@tvami I am a bit confused looking at the Rcd report in recent nano workflows:

runTheMatrix.py --what nano --site "" --command '--number 5 --customise_commands "process.GlobalTag.DumpStat=cms.untracked.bool(True)"' -l 2500.601,2500.332 --job-report

It seems to pull in a very large number of records. Have I done something wrong?

tvami commented 1 year ago

Hi @vlimant, sorry, I was on vacation until now. It does always give a huge list, but when a record is accessed it also shows which IOV is accessed. It's true that the output is not very user-friendly to read.

vlimant commented 1 year ago

OK, thanks for confirming. Having filtered out the records with no payload, I somehow still have a larger set than in your latest comment. Is there a simple way to trace which module loaded a specific record?

vlimant commented 1 year ago

The latest I got is under https://cernbox.cern.ch/s/JqTmbbfQSjdtyDc

tvami commented 1 year ago

Interesting: I was looking at 136.8523, which is for data... so it seems the MC has more dependencies.
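Comparing the data and MC record lists amounts to a set difference once records without payloads are dropped. A Python sketch, assuming the DumpStat printout has already been reduced to a {record: payloads_read} dictionary; that parsing step is not shown, and the record names and counts below are hypothetical:

```python
def loaded_records(dump):
    """Records for which at least one payload was actually read."""
    return {record for record, payloads_read in dump.items() if payloads_read > 0}


def only_in_mc(data_dump, mc_dump):
    """Records loaded by the MC workflow but not by the data workflow."""
    return sorted(loaded_records(mc_dump) - loaded_records(data_dump))


# Hypothetical inputs, for illustration only.
data_dump = {"RunInfoRcd": 1, "TrackerAlignmentRcd": 1, "UnusedExampleRcd": 0}
mc_dump = {"RunInfoRcd": 1, "TrackerAlignmentRcd": 1, "MCOnlyExampleRcd": 2, "UnusedExampleRcd": 0}

print(only_in_mc(data_dump, mc_dump))  # ['MCOnlyExampleRcd']
```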

Is there a simple way to trace what module loaded a specific record ?

No, not that I know of

Dr15Jones commented 1 year ago

Is there a simple way to trace what module loaded a specific record ?

If you run a single-threaded job and add

process.add_(cms.Service("Tracer", dumpEventSetupInfo = cms.untracked.bool(True)))

you will see a module make a prefetch request, followed by the EventSetup modules it triggers.
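Once the single-threaded Tracer log has been reduced to (requester, provider) pairs, a reverse walk answers "which module pulled in this record". A Python sketch under that assumption: parsing the Tracer output is not shown, since its exact format is not reproduced in this thread, and the edges below are simply transcribed from the stack trace earlier in the issue (Event modules by label, unlabeled EventSetup modules by C++ type; the intermediate TrackerDigiGeometryRecord is skipped for brevity).

```python
from collections import defaultdict, deque

# (requester, provider) pairs, transcribed from the stack trace above.
ES_EDGES = [
    ("pfParticleNetTagInfosAK8WithDeepInfo", "TransientTrackBuilderESProducer"),
    ("TransientTrackBuilderESProducer", "GlobalTrackingGeometryESProducer"),
    ("GlobalTrackingGeometryESProducer", "TrackerDigiGeometryESModule"),
    ("TrackerDigiGeometryESModule", "TrackerAlignmentRcd"),
]


def consumers_of(record, edges):
    """Return the modules at the top of every request chain ending in `record`."""
    requested_by = defaultdict(set)
    for requester, provider in edges:
        requested_by[provider].add(requester)
    roots, seen, queue = set(), {record}, deque([record])
    while queue:
        node = queue.popleft()
        parents = requested_by.get(node)  # .get does not insert into the defaultdict
        if not parents:
            if node != record:
                roots.add(node)  # nobody requested this node: it started the chain
            continue
        for parent in parents:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return sorted(roots)


print(consumers_of("TrackerAlignmentRcd", ES_EDGES))
# ['pfParticleNetTagInfosAK8WithDeepInfo']
```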

vlimant commented 1 year ago

please close

We can reopen if this ever becomes a problem again, at which point we will need a proper tool to investigate the dependency.