Failure in reading back 12_4_HLT_X `Run3ScoutingVertex` collection from CMSSW_14_0_X

mmusich commented 1 week ago

@AdrianoDee reported a failure when setting up Summer24 campaigns (see here for some previous discussion). When running (as agreed elsewhere) the data generation chain in CMSSW_14_0_X (while running the HLT in CMSSW_12_4_HLT_X) a Fatal Root Error occurs:

----- Begin Fatal Exception 05-Sep-2024 08:21:25 UTC-----------------------
An exception of category 'FileReadError' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 1 stream: 0
   [1] Running path 'RECOSIMoutput_step'
   [2] Prefetching for module PoolOutputModule/'RECOSIMoutput'
   [3] While reading from source std::vector<Run3ScoutingVertex> hltScoutingPrimaryVertexPacker 'primaryVtx' HLT
   [4] Reading branch Run3ScoutingVertexs_hltScoutingPrimaryVertexPacker_primaryVtx_HLT.
   Additional Info:
      [a] Fatal Root Error: @SUB=TBufferFile::CheckByteCount
object of class vector<Run3ScoutingVertex> read too many bytes: 57 instead of 45
----- End Fatal Exception -------------------------------------------------

Suspicion is that Run3ScoutingVertex data format changed in 14_0_X, that is the release PdmV uses for all the steps but HLT. In this specific case, where new data members have been added to the original format. To be honest at the moment I am baffled by this. Backwards compatibility of scouting data-formats should be guaranteed as they are treated as any other "RAW" data-tier. There are dedicated unit tests that ensure we can continue to be able reading in newer releases HLT scouting files produced in 12_4_0 and 13_0_3, see here and here for more details. A reproducer (in plain CMSSW_12_4_0) is:

wrkdir=${PWD}/test_${RANDOM}

mkdir $wrkdir

release=CMSSW_12_4_0
if [ ! -r ${release}/src ] ;
   then scram p CMSSW ${release} ;
fi

cd ${release}/src
eval `scram runtime -sh`
cd $wrkdir

cmsDriver.py SingleMuPt1_pythia8_cfi  -s GEN,SIM -n 10 --conditions 124X_mcRun3_2022_realistic_v5 --beamspot Run3RoundOptics25ns13TeVLowSigmaZ --datatier GEN-SIM --eventcontent FEVTDEBUG --geometry DB:Extended --era Run3 --relval 25000,100 --fileout file:step1.root

cmsDriver.py step2  -s DIGI:pdigi_valid,L1,DIGI2RAW,HLT:@relval2021 --conditions 124X_mcRun3_2022_realistic_v5 --datatier GEN-SIM-DIGI-RAW -n 10 --eventcontent FEVTDEBUGHLT --geometry DB:Extended --era Run3 --filein  file:step1.root  --fileout file:step2.root

release=CMSSW_14_0_0

if [ ! -r ${release}/src ] ;
   then scram p CMSSW ${release} ;
fi

cd ${release}/src
eval `scram runtime -sh`
cd $wrkdir

cmsDriver.py step3  -s RAW2DIGI,L1Reco,RECO,RECOSIM,PAT,NANO,VALIDATION:@standardValidation+@miniAODValidation,DQM:@standardDQM+@ExtraHLT+@miniAODDQM+@nanoAODDQM --conditions 140X_mcRun3_2022_realistic_v5 --datatier GEN-SIM-RECO,MINIAODSIM,NANOAODSIM,DQMIO -n 10 --eventcontent RECOSIM,MINIAODSIM,NANOEDMAODSIM,DQM --geometry DB:Extended --era Run3 --filein  file:step2.root  --fileout file:step3.root

cmsbuild commented 1 week ago

cms-bot internal usage

cmsbuild commented 1 week ago

A new Issue was created by @mmusich.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

mmusich commented 1 week ago

I am not sure if it is relevant, but testing with:

void myScript() {
    // Open the ROOT file
    //TFile *file = TFile::Open("testRun3Scouting_v3_v5_v3_v4_v5_v3_v5_v3_v3_CMSSW_12_4_0.root");
    TFile *file = TFile::Open("step2.root");

    // Check if file is successfully opened
    if (!file || file->IsZombie()) {
        std::cerr << "Error opening file!" << std::endl;
        return;
    }

    // Show streamer information
    file->ShowStreamerInfo();

    // Close the file
    file->Close();
}

either the file out of the test at https://github.com/cms-sw/cmssw/issues/45888#issue-2507801930 or the dedicated unit test file for CMSSW_12_4_X (here), I am getting different results: In one case I have:

StreamerInfo for class: edm::Wrapper<vector<Run3ScoutingVertex> >, version=3, checksum=0xfab077ed
  edm::WrapperBase BASE            offset=  0 type= 0
  bool           present         offset=  0 type=18
  vector<Run3ScoutingVertex> obj             offset=  0 type=300 ,stl=1, ctype=61,  (Run3ScoutingVertex)

StreamerInfo for class: Run3ScoutingVertex, version=3, checksum=0x12dc9576
  float          x_              offset=  0 type= 5
  float          y_              offset=  0 type= 5
  float          z_              offset=  0 type= 5
  float          zError_         offset=  0 type= 5
  float          xError_         offset=  0 type= 5
  float          yError_         offset=  0 type= 5
  int            tracksSize_     offset=  0 type= 3
  float          chi2_           offset=  0 type= 5
  int            ndof_           offset=  0 type= 3
  bool           isValidVtx_     offset=  0 type=18

while in the other case (the file coming from the HLT) only:

StreamerInfo for class: edm::Wrapper<vector<Run3ScoutingVertex> >, version=3, checksum=0xfab077ed
  edm::WrapperBase BASE            offset=  0 type= 0
  bool           present         offset=  0 type=18
  vector<Run3ScoutingVertex> obj             offset=  0 type=300 ,stl=1, ctype=61,

vlimant commented 1 week ago

N.B. changing the split level of the step2 FEVTDEBUGHLT output above to 1 (instead of 0 default), let things to run smoothly

vlimant commented 1 week ago

do we know how were those testRun3Scouting_*root files created ? I suspect they were split further then what files coming from HLT are ; my 2 cents

mmusich commented 1 week ago

do we know how were those testRun3Scouting_*root files created ? I suspect they were split further then what files coming from HLT are ; my 2 cents

yes, there's a detailed description here.

vlimant commented 1 week ago

If I read this correctly https://github.com/cms-sw/cmssw/blob/2a53ad993325fe72593bd588fbc7750927b434fb/DataFormats/Scouting/test/create_Run3Scouting_test_file_cfg.py#L97 the split level is the default (= 9) of the output module ; would explain why the test succeeds there, but not in the recipe above. the read test should be done, IMO, from an actual file that was produced by the HLT. I'm confused why it's not the case, but there must be a reason.

mmusich commented 1 week ago

the split level is the default (= 9)

somewhat irrelevant, but I think the default is 99 (not 9):

https://github.com/cms-sw/cmssw/blob/2a53ad993325fe72593bd588fbc7750927b434fb/IOPool/Output/src/PoolOutputModule.cc#L480

I'm confused why it's not the case, but there must be a reason.

the test was created by the core framework group in https://github.com/cms-sw/cmssw/pull/41834 (tagging @wddgit)

mmusich commented 1 week ago

assign core, hlt

cmsbuild commented 1 week ago

New categories assigned: core,hlt

@Dr15Jones,@Martin-Grunewald,@mmusich,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

mmusich commented 1 week ago

I'm confused why it's not the case, but there must be a reason.

the main advantage I see, is that you can fully control the values you put in input (and check that you get exactly those back when reading)

mmusich commented 1 week ago

the read test should be done, IMO, from an actual file that was produced by the HLT.

moreover the actual HLT is using the default splitLevel, as the scouting dataset is configured as

process.hltOutputScoutingPF = cms.OutputModule( "PoolOutputModule",
    fileName = cms.untracked.string( "outputScoutingPF.root" ),
    compressionAlgorithm = cms.untracked.string( "ZLIB" ),
    compressionLevel = cms.untracked.int32( 1 ),
    fastCloning = cms.untracked.bool( False ),
    dataset = cms.untracked.PSet(
        filterName = cms.untracked.string( "" ),
        dataTier = cms.untracked.string( "RAW" )
    ),
    SelectEvents = cms.untracked.PSet(  SelectEvents = cms.vstring( 'Dataset_ScoutingPFRun3' ) ),
    outputCommands = cms.untracked.vstring( 'drop *',
      'keep *_hltFEDSelectorL1_*_*',
      'keep *_hltScoutingEgammaPacker_*_*',
      'keep *_hltScoutingMuonPacker_*_*',
      'keep *_hltScoutingPFPacker_*_*',
      'keep *_hltScoutingPrimaryVertexPacker_*_*',
      'keep *_hltScoutingTrackPacker_*_*',
      'keep edmTriggerResults_*_*_*' )
)

splitLevel = cms.untracked.int32(0) is coming from EventContent_cff:

https://github.com/cms-sw/cmssw/blob/c9df55d4cbef41348ffc85cbd9bd2d43da29f6a4/Configuration/EventContent/python/EventContent_cff.py#L620-L623

makortel commented 1 week ago

I'm confused why it's not the case, but there must be a reason.

the main advantage I see, is that you can fully control the values you put in input (and check that you get exactly those back when reading)

This is indeed the case. In order to test all the nuances of the ROOT I/O, the inputs are better to be fully controlled rather than coming from the data (or simulation).

The point on the behavior depending on the split level is good though (given we have seen the schema evolution behavior to depend on it), perhaps we need to think of testing both splitlevel 0 and 99 in all of these tests.

makortel commented 1 week ago

I guess the problem wasn't caught by the test_MC_22_setup test, because the HLT step there still fails with

----- Begin Fatal Exception 05-Sep-2024 12:47:53 CEST-----------------------
An exception of category 'ConditionsError' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 1 stream: 0
   [1] Running path 'HLT_PPSMaxTracksPerRP4_v2'
   [2] Calling method for module L1TGlobalProducer/'hltGtStage2ObjectMap'
Exception Message:
 Error L1 menu loaded in via conditions does not match the L1 actually run 1517097079 vs 2016981387. This means that the mapping of the names to the bits may be incorrect. Please check the L1TUtmTriggerMenuRcd record supplied. Unless you know what you are doing, do not simply disable this check via the config as this a major error and the indication of something very wrong
----- End Fatal Exception -------------------------------------------------

and therefore the subsequent steps have never been tested (until now).

mmusich commented 1 week ago

I guess the problem wasn't caught by the test_MC_22_setup test, because the HLT step there still fails with

the reasons of these failures are explained at https://gitlab.cern.ch/cms-ppd/dataset-management/dm-coordination/-/issues/15#note_8410470 and are entirely ascribable to inconsistent settings of the input Global Tags (namely the tag associated to L1TUtmTriggerMenuRcd) by PdmV / AlCa

makortel commented 1 week ago

I guess the problem wasn't caught by the test_MC_22_setup test, because the HLT step there still fails with

the reasons of these failures are explained at https://gitlab.cern.ch/cms-ppd/dataset-management/dm-coordination/-/issues/15#note_8410470 and are entirely ascribable to inconsistent settings of the input Global Tags (namely the tag associated to L1TUtmTriggerMenuRcd) by PdmV / AlCa

But I guess those inconsistencies have been resolved somewhere given the recipe in the description?

makortel commented 1 week ago

I guess the problem is https://github.com/cms-sw/cmssw/issues/41348 and can be fixed by including --customise IOPool/Input/fixReading_12_4_X_Files.fixReading_12_4_X_Files to cmsDriver.py.

mmusich commented 1 week ago

But I guess those inconsistencies have been resolved somewhere given the recipe in the description?

The recipe in description works, because of

cmsDriver.py step2  -s DIGI:pdigi_valid,L1,DIGI2RAW,HLT:@relval2021 --conditions 124X_mcRun3_2022_realistic_v5 --datatier GEN-SIM-DIGI-RAW -n 10 --eventcontent FEVTDEBUGHLT --geometry DB:Extended --era Run3 --filein  file:step1.root  --fileout file:step2.root

i.e. L1,DIGI2RAW and HLT are running in the same step with the same GlobalTag (which I would not call a solution).

makortel commented 1 week ago

I tested the recipe in the description, and can confirm adding the --customise IOPool/Input/fixReading_12_4_X_Files.fixReading_12_4_X_Files fixes the problem.

mmusich commented 1 week ago

indeed, just for the record this [1] works fine.

[1]

Click me

```bash wrkdir=${PWD}/test_${RANDOM} mkdir $wrkdir release=CMSSW_12_4_0 if [ ! -r ${release}/src ] ; then scram p CMSSW ${release} ; fi cd ${release}/src eval `scram runtime -sh` cd $wrkdir cmsDriver.py SingleMuPt1_pythia8_cfi -s GEN,SIM -n 10 --conditions 124X_mcRun3_2022_realistic_v5 --beamspot Run3RoundOptics25ns13TeVLowSigmaZ --datatier GEN-SIM --eventcontent FEVTDEBUG --geometry DB:Ex tended --era Run3 --relval 25000,100 --fileout file:step1.root cmsDriver.py step2 -s DIGI:pdigi_valid,L1,DIGI2RAW,HLT:@relval2021 --conditions 124X_mcRun3_2022_realistic_v5 --datatier GEN-SIM-DIGI-RAW -n 10 --eventcontent FEVTDEBUGHLT --geometry DB:Extended --era R un3 --filein file:step1.root --fileout file:step2.root release=CMSSW_14_0_0 if [ ! -r ${release}/src ] ; then scram p CMSSW ${release} ; fi cd ${release}/src eval `scram runtime -sh` cd $wrkdir cmsDriver.py step3 -s RAW2DIGI,L1Reco,RECO,RECOSIM,PAT,NANO,VALIDATION:@standardValidation+@miniAODValidation,DQM:@standardDQM+@ExtraHLT+@miniAODDQM+@nanoAODDQM --conditions 140X_mcRun3_2022_realistic_v 5 --datatier GEN-SIM-RECO,MINIAODSIM,NANOAODSIM,DQMIO -n 10 --eventcontent RECOSIM,MINIAODSIM,NANOEDMAODSIM,DQM --geometry DB:Extended --era Run3 --filein file:step2.root --fileout file:step3.root --customise IOPool/Input/fixReading_12_4_X_Files.fixReading_12_4_X_Files ```

mmusich commented 1 week ago

re-iterating on this:

I guess the problem wasn't caught by the test_MC_22_setup test, because the HLT step there still fails with

This patch

diff --git a/Configuration/AlCa/python/autoCond.py b/Configuration/AlCa/python/autoCond.py
index 01e4c09ddac..96ef7ec02a1 100644
--- a/Configuration/AlCa/python/autoCond.py
+++ b/Configuration/AlCa/python/autoCond.py
@@ -115,6 +115,9 @@ autoCond = autoCondRelValForRun2(autoCond)
 from Configuration.AlCa.autoCondModifiers import autoCondRelValForRun3
 autoCond = autoCondRelValForRun3(autoCond)

+from Configuration.AlCa.autoCondModifiers import autoCond2022MCTest
+autoCond = autoCond2022MCTest(autoCond)
+
 # GlobalTag for Run1 data reprocessing, history was carried over to run2 GTs
 autoCond['run1_data']        = autoCond['run2_data']

diff --git a/Configuration/AlCa/python/autoCondModifiers.py b/Configuration/AlCa/python/autoCondModifiers.py
index 7523b7b9943..89eaf320bdd 100644
--- a/Configuration/AlCa/python/autoCondModifiers.py
+++ b/Configuration/AlCa/python/autoCondModifiers.py
@@ -142,3 +142,14 @@ def autoCondBSHLLHC13TeV(autoCond):

     autoCond.update(GlobalTagBSHLLHC13TeV)
     return autoCond
+
+def autoCond2022MCTest(autoCond):
+
+    GlobalTagFor2022MC = {}
+    L1TUtmTriggerMenuFor2022v12 = ','.join( ['L1Menu_Collisions2022_v1_2_0_xml' , "L1TUtmTriggerMenuRcd", connectionString, "", ""] )
+    for key,val in autoCond.items():
+        if 'phase1_2022_realistic' in key :
+            GlobalTagFor2022MC[key+'_2022v12'] = (autoCond[key],
+                                                  L1TUtmTriggerMenuFor2022v12)
+    autoCond.update(GlobalTagFor2022MC)
+    return autoCond
diff --git a/Configuration/PyReleaseValidation/test/BuildFile.xml b/Configuration/PyReleaseValidation/test/BuildFile.xml
index 23681eb2d06..8abf9ff7ca9 100644
--- a/Configuration/PyReleaseValidation/test/BuildFile.xml
+++ b/Configuration/PyReleaseValidation/test/BuildFile.xml
@@ -22,7 +22,7 @@
 <!-- In CMSSW_12_4_20 the auto:phase1_2022_realistic (pre EE) is 124X_mcRun3_2022_realistic_v12 -->
 <ifarchitecture name="el8_amd64">
   <!-- Run this test for production arch of CMSSW_12_4_21_HLT release only -->
-  <test name="test_MC_22_setup" command="test_mc_setup/test_MC_setup.sh auto:phase1_2022_realistic Run3 2022v14 CMSSW_12_4_21_HLT 124X_mcRun3_2022_realistic_v12 Realistic25ns13p6TeVEarly2022Collision" />
+  <test name="test_MC_22_setup" command="test_mc_setup/test_MC_setup.sh auto:phase1_2022_realistic_2022v12 Run3 2022v12 CMSSW_12_4_21_HLT 124X_mcRun3_2022_realistic_v12 Realistic25ns13p6TeVEarly2022Collision" />
 </ifarchitecture>

 <!-- 
diff --git a/Configuration/PyReleaseValidation/test/test_mc_setup/test_MC_setup_gen_sim.sh b/Configuration/PyReleaseValidation/test/test_mc_setup/test_MC_setup_gen_sim.sh
index 1b37bee2e89..95e44d735b7 100755
--- a/Configuration/PyReleaseValidation/test/test_mc_setup/test_MC_setup_gen_sim.sh
+++ b/Configuration/PyReleaseValidation/test/test_mc_setup/test_MC_setup_gen_sim.sh
@@ -31,6 +31,7 @@ cmsDriver.py  --python_filename raw_digi.py --eventcontent RAWSIM \
 --customise Configuration/DataProcessing/Utils.addMonitoring \
 --datatier GEN-SIM-RAW --filein file:step1.root \
 --fileout file:step2.root --conditions $conditions --step DIGI,L1,DIGI2RAW \
+--customise_commands 'process.GlobalTag.DumpStat = cms.untracked.bool(True)' \
 --geometry DB:Extended --era $era --mc -n -1 --no_exec

 if [ $? -ne 0 ]; then

exposes the problem. I hesitate to make a PR to "fix" the test as at the moment we're still discussing within TSG / PPD if not nto move to yet another menu (v1.5) for re-HLT in 2022 MC. @cms-sw/pdmv-l2 @cms-sw/alca-l2

makortel commented 1 week ago

There are dedicated unit tests that ensure we can continue to be able reading in newer releases HLT scouting files produced in 12_4_0 and 13_0_3, see here and here for more details.

I tested re-producing the 12_4_0 test file with splitlevel set to 0, and with that input the TestRun3ScoutingDataFormats unit test indeed fails with similar symptoms, and the failure gets fixed with the IOPool/Input/fixReading_12_4_X_Files.fixReading_12_4_X_Files customize function.

The point on the behavior depending on the split level is good though (given we have seen the schema evolution behavior to depend on it), perhaps we need to think of testing both splitlevel 0 and 99 in all of these tests.

I think we really need to do this, and am spinning off that work into a separate issue https://github.com/cms-sw/cmssw/issues/45931.

makortel commented 1 week ago

+core

AdrianoDee commented 1 week ago

I hesitate to make a PR to "fix" the test as at the moment we're still discussing within TSG / PPD if not nto move to yet another menu (v1.5) for re-HLT in 2022 MC.

Thanks @mmusich, I was waiting too the final decision on the setup to decide how to fix the tests.

mmusich commented 1 week ago

I was waiting too the final decision on the setup to decide how to fix the tests.

It was ultimately decided to have the option of using V1.5 instead: thus I opened https://github.com/cms-sw/cmssw/pull/45939. If this is what we'll actually use, all tests in open cycles will need to be updated once we have a release containing it.

cms-sw / cmssw

Failure in reading back 12_4_HLT_X `Run3ScoutingVertex` collection from CMSSW_14_0_X #45888