cms-sw / cmssw

Fatal Exception in ALCAPPSExpress processing in Tier0 (run 366035) #41335

Open francescobrivio opened 1 year ago

francescobrivio commented 1 year ago

This issue tracks possible code-level fixes for the problem reported in this CMSTalk post. The error appeared while processing the ALCAPPSExpress stream for run 366035, and the exception reported is:

----- Begin Fatal Exception 13-Apr-2023 09:21:07 CEST-----------------------
An exception of category 'StdException' occurred while
   [0] Processing global end Run run: 366035
   [1] Calling method for module PPSAlignmentHarvester/'ppsAlignmentHarvester'
Exception Message:
A std::exception was thrown.
map::at
----- End Fatal Exception -------------------------------------------------

The actual issue was traced back to an update of the PPSAlignmentConfig conditions made a few days ago (CMSTalk announcement). The conditions have now been rolled back until the problem is understood by the PPS experts.

@mmusich kindly pointed out that the issue most probably originates in this line: https://github.com/cms-sw/cmssw/blob/8c5f3c7d2257166af259dc5517c462dce5ce199c/CalibPPS/AlignmentGlobal/plugins/PPSAlignmentHarvester.cc#L614 when rpc.id_ is equal to 23 (I'll let the PPS experts comment further on the exact meaning of this).

In addition, Marco pointed out that:

"the code is packed with std::map element evaluations with bound checks (which leads to exceptions at runtime)"

which should probably be fixed as well.
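For readers unfamiliar with the failure mode, here is a minimal sketch of how a bounds-checked `std::map::at` call produces this kind of exception, next to a `find`-based lookup that does not throw. The `RPData` alias and both function names are hypothetical and are not taken from `PPSAlignmentHarvester.cc`. The exact `what()` text is implementation-defined; with GCC's standard library it is typically `map::at`, which matches the log above.

```cpp
#include <map>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for per-RP data; not the real CMSSW type.
using RPData = std::map<unsigned int, double>;

// Throwing lookup: std::map::at raises std::out_of_range for a missing key.
// The exception text is returned so the caller can see what would be reported.
std::string describeThrowingLookup(const RPData& data, unsigned int rpId) {
  try {
    return std::to_string(data.at(rpId));
  } catch (const std::out_of_range& e) {
    return std::string("out_of_range: ") + e.what();
  }
}

// Non-throwing lookup: find() reports a missing key without an exception.
std::optional<double> findValue(const RPData& data, unsigned int rpId) {
  const auto it = data.find(rpId);
  if (it == data.end()) {
    return std::nullopt;
  }
  return it->second;
}
```

With a map that has no entry for key 23, `findValue(data, 23)` returns an empty optional that the caller can handle explicitly, whereas an unguarded `data.at(23)` would propagate the exception up to the framework.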

francescobrivio commented 1 year ago

assign alca, ctpps-dpg

cmsbuild commented 1 year ago

New categories assigned: ctpps-dpg,alca

@vavati,@fabferro,@jan-kaspar,@francescobrivio,@saumyaphor4252,@tvami you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild commented 1 year ago

A new Issue was created by @francescobrivio .

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

francescobrivio commented 1 year ago

I forgot to add the recipe to reproduce the error:

cmsrel CMSSW_13_0_3
cd CMSSW_13_0_3/src/
cmsenv
cp -r /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2023A/job_250879/job/WMTaskSpace/ .
cd WMTaskSpace/cmsRun1/
cmsRun PSet.py

To use the rolled-back conditions instead (and avoid the crash), edit PSet.py to read:

import FWCore.ParameterSet.Config as cms
import pickle
with open('PSet.pkl', 'rb') as handle:
    process = pickle.load(handle)
    process.GlobalTag.globaltag = '130X_dataRun3_Express_RecoverPPS_v1'

wpcarvalho commented 1 year ago

Updating this thread. The crash is now understood: it was caused by an empty (zero-sized) data field (matchingReferencePoints) in the payload introduced at IOV 365978 of PPSAlignmentConfig_reference_Run3_v1_express, which is consumed by the PCL. It has been properly fixed and a new payload will be submitted.

mmusich commented 1 year ago

It has been properly fixed and a new payload will be submitted.

Is this related to https://cms-talk.web.cern.ch/t/ppd-alcadb-gt-online-hlt-express-prompt-updated-pps-alignment-conditions-for-pcl/24053/1 ?

If so, I would suggest validating the new payload in an express replay as well, @cms-sw/alca-l2.

wpcarvalho commented 1 year ago

It has been properly fixed and a new payload will be submitted.

is this related to https://cms-talk.web.cern.ch/t/ppd-alcadb-gt-online-hlt-express-prompt-updated-pps-alignment-conditions-for-pcl/24053/1 ?

Yes.

francescobrivio commented 1 year ago

Hi @mmusich, we were planning to test it with the dedicated relvals that exercise the PPS PCL (we are also setting this up in the AlcaVal tool), but indeed using an Express replay sounds like a good idea: I'll prepare it tomorrow.

mmusich commented 1 year ago

but indeed using an Express replay sounds like a good idea: I'll prepare it tomorrow.

Thanks @francescobrivio.

I was wondering if the PPS experts can also comment on this:

"the code is packed with std::map element evaluations with bound checks (which leads to exceptions at runtime)"

Crashes at Tier-0 have a human and computing cost. It would be better to avoid having the code crash on a bad configuration. @wpcarvalho
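One possible way to avoid crashing on a bad configuration is to validate the payload up front and skip the step with a warning when it is incomplete. The sketch below only illustrates the idea: `AlignmentConfigSketch`, its layout, and `findConfigProblems` are hypothetical and do not reflect the actual CMSSW classes. It flags both a missing entry and the zero-sized case described earlier in this thread.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified configuration; the field layout is illustrative only.
struct AlignmentConfigSketch {
  std::map<unsigned int, std::vector<double>> matchingReferencePoints;
};

// Returns a list of human-readable problems. An empty list means every
// requested RP id has a non-empty set of matching reference points.
std::vector<std::string> findConfigProblems(const AlignmentConfigSketch& cfg,
                                            const std::vector<unsigned int>& rpIds) {
  std::vector<std::string> problems;
  for (const unsigned int id : rpIds) {
    const auto it = cfg.matchingReferencePoints.find(id);
    if (it == cfg.matchingReferencePoints.end()) {
      problems.push_back("no matchingReferencePoints entry for RP " + std::to_string(id));
    } else if (it->second.empty()) {
      problems.push_back("empty matchingReferencePoints for RP " + std::to_string(id));
    }
  }
  return problems;
}
```

A caller could run this check once at end of run, log any reported problems, and return early instead of proceeding to lookups that would throw. Whether skipping is the right behaviour for the PCL is for the PPS experts to decide.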