cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0
1.09k stars 4.33k forks source link

Recurrent fastsim2 addOn test failure #22721

Closed fabiocos closed 6 years ago

fabiocos commented 6 years ago

In recent days a recurrent failure in the fastsim2 addOn test has been noticed, although this does not appear in the IB and it is not trivially reproducible. The pattern of the problem is:

----- Begin Fatal Exception 22-Mar-2018 21:58:34 CET----------------------- An exception of category 'fastsim::ParticleManager' occurred while [0] Processing Event run: 1 lumi: 1 event: 55 stream: 0 [1] Running path 'simulation_step' [2] Calling method for module FastSimProducer/'fastSimProducer' Exception Message: unknown pdg id: 9010221 ----- End Fatal Exception -------------------------------------------------

I am trying to reproduce it in a long run. The problem has become frequent enough to deserve an investigation.

cmsbuild commented 6 years ago

A new Issue was created by @fabiocos Fabio Cossutti.

@davidlange6, @Dr15Jones, @smuzaffar, @fabiocos can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

fabiocos commented 6 years ago

assign fastsim

cmsbuild commented 6 years ago

New categories assigned: fastsim

@mdhildreth,@ssekmen,@lveldere,@civanch you have been requested to review this Pull request/Issue and eventually sign? Thanks

fabiocos commented 6 years ago

The problem is triggered here:

https://github.com/cms-sw/cmssw/blob/master/FastSimulation/SimplifiedGeometryPropagator/src/ParticleManager.cc#L94

but clearly this is the consequence of some particle with a not recognized id being produced from time to time, apparently in a non reproducible way (which sounds worrying)

fabiocos commented 6 years ago

Extending the addOnTest configuration from 100 to 10000 events and running it twice:

first run:

Begin processing the 7648th record. Run 1, Event 7648, LumiSection 1 on stream 2 at 23-Mar-2018 12:58:02.844 CET %MSG-e MissingProduct: HybridClusterProducer:cleanedHybridSuperClusters 23-Mar-2018 12:58:02 CET Run: 1 Event: 7648 could not get a handle on the EcalRecHitCollection! %MSG %MSG-e MissingProduct: HybridClusterProducer:uncleanedHybridSuperClusters 23-Mar-2018 12:58:02 CET Run: 1 Event: 7648 could not get a handle on the EcalRecHitCollection! %MSG Begin processing the 7649th record. Run 1, Event 7649, LumiSection 1 on stream 0 at 23-Mar-2018 12:58:02.969 CET ----- Begin Fatal Exception 23-Mar-2018 12:58:02 CET----------------------- An exception of category 'fastsim::ParticleManager' occurred while [0] Processing Event run: 1 lumi: 1 event: 7648 stream: 2 [1] Running path 'simulation_step' [2] Calling method for module FastSimProducer/'fastSimProducer' Exception Message: unknown pdg id: 30443 ----- End Fatal Exception -------------------------------------------------

Second run:

Begin processing the 5192nd record. Run 1, Event 5192, LumiSection 1 on stream 2 at 23-Mar-2018 21:26:02.188 CET %MSG-e MissingProduct: HybridClusterProducer:cleanedHybridSuperClusters 23-Mar-2018 21:26:02 CET Run: 1 Event: 5192 could not get a handle on the EcalRecHitCollection! %MSG %MSG-e MissingProduct: HybridClusterProducer:uncleanedHybridSuperClusters 23-Mar-2018 21:26:02 CET Run: 1 Event: 5192 could not get a handle on the EcalRecHitCollection! %MSG ----- Begin Fatal Exception 23-Mar-2018 21:26:02 CET----------------------- An exception of category 'fastsim::ParticleManager' occurred while [0] Processing Event run: 1 lumi: 1 event: 5192 stream: 2 [1] Running path 'simulation_step' [2] Calling method for module FastSimProducer/'fastSimProducer' Exception Message: unknown pdg id: 9010221 ----- End Fatal Exception -------------------------------------------------

The code behaves in a non reproducible way, and in both crashes in the same event appears the same MSG-e error message, this is a pattern present also in the crashes during PR tests.

fabiocos commented 6 years ago

from some additional checks run during the week-end: a single-threaded run hangs (in a reproducible way) at event 1735. I have a complete dump with tracer and memory check available, here is the last piece:

++++ finished: processing event : stream = 0 run = 1 lumi = 1 event = 1734 time = 8670000001 ++++ starting: source event ++++ finished: source event Begin processing the 1735th record. Run 1, Event 1735, LumiSection 1 on stream 0 at 25-Mar-2018 18:27:05.263 CEST ++++ starting: processing event : stream = 0 run = 1 lumi = 1 event = 1735 time = 8675000001 ++++++ starting: processing path 'DQMoutput_step' : stream = 0 ++++++ starting: processing path 'AODSIMoutput_step' : stream = 0 ++++++ starting: processing path 'validation_step' : stream = 0 ++++++++ starting: prefetching before processing event for module: stream = 0 label = 'genstepfilter' id = 299 ++++++ starting: processing path 'genfiltersummary_step' : stream = 0 ++++++++ starting: prefetching before processing event for module: stream = 0 label = 'genFilterEfficiencyProducer' id = 297 ++++++ starting: processing path 'prevalidation_step' : stream = 0 ++++++++ starting: prefetching before processing event for module: stream = 0 label = 'generator' id = 16 ++++++ starting: processing path 'eventinterpretaion_step' : stream = 0 ++++++ starting: processing path 'reconstruction_step' : stream = 0 ++++++ starting: processing path 'L1Reco_step' : stream = 0 ++++++ starting: processing path 'digi2raw_step' : stream = 0 ++++++ starting: processing path 'L1simulation_step' : stream = 0 ++++++ starting: processing path 'digitisation_step' : stream = 0 ++++++ starting: processing path 'reconstruction_befmix_step' : stream = 0 ++++++ starting: processing path 'simulation_step' : stream = 0 ++++++ starting: processing path 'generation_step' : stream = 0 ++++++++ finished: prefetching before processing event for module: stream = 0 label = 'generator' id = 16 ++++++++ starting: processing event for module: stream = 0 label = 'generator' id = 16 ++++++++ finished: processing event for module: stream = 0 label = 'generator' id = 16 ++++++++ starting: prefetching before processing event for module: stream = 0 label = 'mix' id = 39 ++++++++ starting: prefetching before processing event for module: stream = 0 label = 'pfPileUpEI' id = 148 ++++++++++ starting: prefetching before processing event for module: stream = 0 label = 'offlinePrimaryVertices' id = 733 ++++++++++++ starting: prefetching before processing event for module: stream = 0 label = 'trackRefsForJetsBeforeSorting' id = 851 ++++++++++++++ starting: prefetching before processing event for module: stream = 0 label = 'trackWithVertexRefSelectorBeforeSorting' id = 853 ++++++++++++++++ starting: prefetching before processing event for module: stream = 0 label = 'unsortedOfflinePrimaryVertices' id = 860 ++++++++ starting: prefetching before processing event for module: stream = 0 label = 'bunchSpacingProducer' id = 79 ++++++++ starting: prefetching before processing event for module: stream = 0 label = 'l1extraParticles' id = 78 ++++++++ finished: prefetching before processing event for module: stream = 0 label = 'l1extraParticles' id = 78 ++++++++ starting: processing event for module: stream = 0 label = 'l1extraParticles' id = 78 ++++++++ finished: processing event for module: stream = 0 label = 'l1extraParticles' id = 78 ++++++++ starting: processing event for module: stream = 0 label = 'L1Reco_step' id = 8 ++++++++ finished: processing event for module: stream = 0 label = 'L1Reco_step' id = 8 ++++++ finished: processing path 'L1Reco_step' : stream = 0 ++++++++ starting: prefetching before processing event for module: stream = 0 label = 'caloLayer1RawFed1354' id = 63 ++++++++ starting: prefetching before processing event for module: stream = 0 label = 'simCaloStage2Layer1Digis' id = 51 ++++++++ starting: prefetching before processing event for module: stream = 0 label = 'generatorSmeared' id = 19 ++++++++ starting: prefetching before processing event for module: stream = 0 label = 'offlineBeamSpot' id = 33 ++++++++ finished: prefetching before processing event for module: stream = 0 label = 'offlineBeamSpot' id = 33 ++++++++ starting: processing event for module: stream = 0 label = 'offlineBeamSpot' id = 33 ++++++++ finished: processing event for module: stream = 0 label = 'offlineBeamSpot' id = 33 ++++++++ starting: prefetching before processing event for module: stream = 0 label = 'fastTrackerRecHits' id = 34 ++++++++ starting: prefetching before processing event for module: stream = 0 label = 'fastSimProducer' id = 31 ++++++++ starting: prefetching before processing event for module: stream = 0 label = 'randomEngineStateProducer' id = 17 ++++++++ finished: prefetching before processing event for module: stream = 0 label = 'randomEngineStateProducer' id = 17 ++++++++ starting: processing event for module: stream = 0 label = 'randomEngineStateProducer' id = 17 ++++++++ finished: processing event for module: stream = 0 label = 'randomEngineStateProducer' id = 17 ++++++++ starting: prefetching before processing event for module: stream = 0 label = 'VtxSmeared' id = 18 ++++++++ finished: prefetching before processing event for module: stream = 0 label = 'VtxSmeared' id = 18 ++++++++ starting: processing event for module: stream = 0 label = 'VtxSmeared' id = 18 ++++++++ finished: processing event for module: stream = 0 label = 'VtxSmeared' id = 18 ++++++++ finished: prefetching before processing event for module: stream = 0 label = 'generatorSmeared' id = 19 ++++++++ starting: processing event for module: stream = 0 label = 'generatorSmeared' id = 19 ++++++++ finished: processing event for module: stream = 0 label = 'generatorSmeared' id = 19 ++++++++ finished: prefetching before processing event for module: stream = 0 label = 'fastSimProducer' id = 31 ++++++++ starting: processing event for module: stream = 0 label = 'fastSimProducer' id = 31

It looks that fastsim needs some serious check about robustness in long runs, although IBs look generally clean.

fabiocos commented 6 years ago

The validation report at the PPD meeting of March 29 https://indico.cern.ch/event/718013/contributions/2950677/attachments/1625377/2588128/RelVal10_1_0_pre3_Mar29.pdf confirms that there is a ~ 10% rate of job failure for fastsim in the 10_1_0_pre3 relval production, possibly increased wrt previous releases.

kpedro88 commented 6 years ago

Might be helpful to valgrind it.

fabiocos commented 6 years ago

@ssekmen @civanch do you have any update on this issue?

ssekmen commented 6 years ago

@fabiocos , sorry for the silence. This was a problem we had encountered earlier, and thought to be resolved by correctly setting simtrack IDs. We are investigating further.

kpedro88 commented 6 years ago

Here is a backtrace and some gdb printouts for the hang on event 1735 in single-threaded mode:

#0  0x0000003644608dab in __ieee754_asin () from /lib64/libm.so.6
#1  0x0000003644624902 in asin () from /lib64/libm.so.6
#2  0x00007fffdbfa1029 in fastsim::HelixTrajectory::nextCrossingTimeC (this=0x7fff5fa8ee60, layer=..., onLayer=<optimized out>)
    at /uscms_data/d3/pedrok/hf/fast/vg/CMSSW_10_2_0_pre1/src/FastSimulation/SimplifiedGeometryPropagator/src/HelixTrajectory.cc:109
#3  0x00007fffdbfa698d in fastsim::LayerNavigator::moveParticleToNextLayer (this=this@entry=0x7fffffff4ee0, particle=..., layer=@0x7fffffff4de0: 0x0)
    at /uscms_data/d3/pedrok/hf/fast/vg/CMSSW_10_2_0_pre1/src/FastSimulation/SimplifiedGeometryPropagator/src/LayerNavigator.cc:234
#4  0x00007fffdbfee0bd in FastSimProducer::produce (this=0x7fffd0871000, iEvent=..., iSetup=...)
    at /uscms_data/d3/pedrok/hf/fast/vg/CMSSW_10_2_0_pre1/src/FastSimulation/SimplifiedGeometryPropagator/plugins/FastSimProducer.cc:246

(gdb) frame 2
#2  0x00007fffdbfa1029 in fastsim::HelixTrajectory::nextCrossingTimeC (this=0x7fff5fa8ee60, layer=..., onLayer=<optimized out>)
    at /uscms_data/d3/pedrok/hf/fast/vg/CMSSW_10_2_0_pre1/src/FastSimulation/SimplifiedGeometryPropagator/src/HelixTrajectory.cc:109
109             phi2 = std::asin((-b + sqrtDelta) / (2.*a));
(gdb) print phi2
$1 = 0
(gdb) print -b
value has been optimized out
(gdb) print sqrtDelta
$2 = <optimized out>
(gdb) print a
$3 = 1.6019137863755555e+17
(gdb) frame 3
#3  0x00007fffdbfa698d in fastsim::LayerNavigator::moveParticleToNextLayer (this=this@entry=0x7fffffff4ee0, particle=..., layer=@0x7fffffff4de0: 0x0)
    at /uscms_data/d3/pedrok/hf/fast/vg/CMSSW_10_2_0_pre1/src/FastSimulation/SimplifiedGeometryPropagator/src/LayerNavigator.cc:234
234             double tempDeltaTime = trajectory->nextCrossingTimeC(*_layer, particle.isOnLayer(_layer->isForward(), _layer->index()));
(gdb) print _layer
$4 = (const fastsim::SimplifiedGeometry *) 0x7fff5ec5ed00
(gdb) print *_layer
$5 = {_vptr.SimplifiedGeometry = 0x7fffdbfb2708 <vtable for fastsim::BarrelSimplifiedGeometry+16>, geomProperty_ = 3.0030000000000001, index_ = 0, 
  detLayer_ = 0x0, magneticFieldHist_ = std::unique_ptr<TH1F> containing 0x7fff5fc4ac00, thicknessHist_ = std::unique_ptr<TH1F> containing 0x7fff5fc76800, 
  nuclearInteractionThicknessFactor_ = 1, interactionModels_ = std::vector of length 6, capacity 8 = {0x7fffd062c9c0, 0x7fffd0b71800, 0x7fffd062c980, 
    0x7fffd0a68100, 0x7fffd062ca00, 0x7fffd06267a0}, caloType_ = fastsim::SimplifiedGeometry::NONE}
(gdb) frame 4
#4  0x00007fffdbfee0bd in FastSimProducer::produce (this=0x7fffd0871000, iEvent=..., iSetup=...)
    at /uscms_data/d3/pedrok/hf/fast/vg/CMSSW_10_2_0_pre1/src/FastSimulation/SimplifiedGeometryPropagator/plugins/FastSimProducer.cc:246
246                 while(layerNavigator.moveParticleToNextLayer(*particle,layer))
(gdb) print particle
$6 = std::unique_ptr<fastsim::Particle> containing 0x7fff5fa8edc0
(gdb) print *particle
$7 = {pdgId_ = -13, charge_ = 1, position_ = {fCoordinates = {fX = 0.031826737219976059, fY = -0.00027647409131310462, fZ = -1.2745637744665146, 
      fT = 0.10884436539269637}}, momentum_ = {fCoordinates = {fX = 161.63149706586913, fY = 0.053018557368889621, fZ = -132.28076371441142, 
      fT = 208.86109084179623}}, remainingProperLifeTimeC_ = 116093.80875165977, simTrackIndex_ = 0, simVertexIndex_ = 0, genParticleIndex_ = 422, 
  isOnForwardLayer_ = false, isOnLayerIndex_ = -1, energyDeposit_ = 0, isLooper_ = false, motherDeltaR_ = -1, motherPpdId_ = 0, motherSimTrackIndex_ = -999}

This looks like an infinite loop in the propagator. I hope someone more familiar with the code can look at it further.

fabiocos commented 6 years ago

@kpedro88 thanks for the investogation. @ssekmen is there any progress here?

ssekmen commented 6 years ago

Not yet unfortunately. Kevin's valgrind tests pointed out that either the material interaction modules or the decayer might be causing the issue. @skurz and I are investigating further.

skurz commented 6 years ago

I'm making progress in both issues (unknown pdgId and infinite loop) and I expect to have a PR tomorrow.

skurz commented 6 years ago

Both issues are expected to be fixed in this PR https://github.com/cms-sw/cmssw/pull/23019

fabiocos commented 6 years ago

@skurz very welcome, thanks, I have a question in that PR about the proposed fix