Closed fabiocos closed 6 years ago
A new Issue was created by @fabiocos Fabio Cossutti.
@davidlange6, @Dr15Jones, @smuzaffar, @fabiocos can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
assign fastsim
New categories assigned: fastsim
@mdhildreth,@ssekmen,@lveldere,@civanch you have been requested to review this Pull request/Issue and eventually sign? Thanks
The problem is triggered here:
but clearly this is the consequence of some particle with a not recognized id being produced from time to time, apparently in a non reproducible way (which sounds worrying)
Extending the addOnTest configuration from 100 to 10000 events and running it twice:
first run:
Begin processing the 7648th record. Run 1, Event 7648, LumiSection 1 on stream 2 at 23-Mar-2018 12:58:02.844 CET %MSG-e MissingProduct: HybridClusterProducer:cleanedHybridSuperClusters 23-Mar-2018 12:58:02 CET Run: 1 Event: 7648 could not get a handle on the EcalRecHitCollection! %MSG %MSG-e MissingProduct: HybridClusterProducer:uncleanedHybridSuperClusters 23-Mar-2018 12:58:02 CET Run: 1 Event: 7648 could not get a handle on the EcalRecHitCollection! %MSG Begin processing the 7649th record. Run 1, Event 7649, LumiSection 1 on stream 0 at 23-Mar-2018 12:58:02.969 CET ----- Begin Fatal Exception 23-Mar-2018 12:58:02 CET----------------------- An exception of category 'fastsim::ParticleManager' occurred while [0] Processing Event run: 1 lumi: 1 event: 7648 stream: 2 [1] Running path 'simulation_step' [2] Calling method for module FastSimProducer/'fastSimProducer' Exception Message: unknown pdg id: 30443 ----- End Fatal Exception -------------------------------------------------
Second run:
Begin processing the 5192nd record. Run 1, Event 5192, LumiSection 1 on stream 2 at 23-Mar-2018 21:26:02.188 CET %MSG-e MissingProduct: HybridClusterProducer:cleanedHybridSuperClusters 23-Mar-2018 21:26:02 CET Run: 1 Event: 5192 could not get a handle on the EcalRecHitCollection! %MSG %MSG-e MissingProduct: HybridClusterProducer:uncleanedHybridSuperClusters 23-Mar-2018 21:26:02 CET Run: 1 Event: 5192 could not get a handle on the EcalRecHitCollection! %MSG ----- Begin Fatal Exception 23-Mar-2018 21:26:02 CET----------------------- An exception of category 'fastsim::ParticleManager' occurred while [0] Processing Event run: 1 lumi: 1 event: 5192 stream: 2 [1] Running path 'simulation_step' [2] Calling method for module FastSimProducer/'fastSimProducer' Exception Message: unknown pdg id: 9010221 ----- End Fatal Exception -------------------------------------------------
The code behaves in a non reproducible way, and in both crashes in the same event appears the same MSG-e error message, this is a pattern present also in the crashes during PR tests.
from some additional checks run during the week-end: a single-threaded run hangs (in a reproducible way) at event 1735. I have a complete dump with tracer and memory check available, here is the last piece:
++++ finished: processing event : stream = 0 run = 1 lumi = 1 event = 1734 time = 8670000001 ++++ starting: source event ++++ finished: source event Begin processing the 1735th record. Run 1, Event 1735, LumiSection 1 on stream 0 at 25-Mar-2018 18:27:05.263 CEST ++++ starting: processing event : stream = 0 run = 1 lumi = 1 event = 1735 time = 8675000001 ++++++ starting: processing path 'DQMoutput_step' : stream = 0 ++++++ starting: processing path 'AODSIMoutput_step' : stream = 0 ++++++ starting: processing path 'validation_step' : stream = 0 ++++++++ starting: prefetching before processing event for module: stream = 0 label = 'genstepfilter' id = 299 ++++++ starting: processing path 'genfiltersummary_step' : stream = 0 ++++++++ starting: prefetching before processing event for module: stream = 0 label = 'genFilterEfficiencyProducer' id = 297 ++++++ starting: processing path 'prevalidation_step' : stream = 0 ++++++++ starting: prefetching before processing event for module: stream = 0 label = 'generator' id = 16 ++++++ starting: processing path 'eventinterpretaion_step' : stream = 0 ++++++ starting: processing path 'reconstruction_step' : stream = 0 ++++++ starting: processing path 'L1Reco_step' : stream = 0 ++++++ starting: processing path 'digi2raw_step' : stream = 0 ++++++ starting: processing path 'L1simulation_step' : stream = 0 ++++++ starting: processing path 'digitisation_step' : stream = 0 ++++++ starting: processing path 'reconstruction_befmix_step' : stream = 0 ++++++ starting: processing path 'simulation_step' : stream = 0 ++++++ starting: processing path 'generation_step' : stream = 0 ++++++++ finished: prefetching before processing event for module: stream = 0 label = 'generator' id = 16 ++++++++ starting: processing event for module: stream = 0 label = 'generator' id = 16 ++++++++ finished: processing event for module: stream = 0 label = 'generator' id = 16 ++++++++ starting: prefetching before processing event for module: stream = 0 label = 'mix' id = 39 ++++++++ starting: prefetching before processing event for module: stream = 0 label = 'pfPileUpEI' id = 148 ++++++++++ starting: prefetching before processing event for module: stream = 0 label = 'offlinePrimaryVertices' id = 733 ++++++++++++ starting: prefetching before processing event for module: stream = 0 label = 'trackRefsForJetsBeforeSorting' id = 851 ++++++++++++++ starting: prefetching before processing event for module: stream = 0 label = 'trackWithVertexRefSelectorBeforeSorting' id = 853 ++++++++++++++++ starting: prefetching before processing event for module: stream = 0 label = 'unsortedOfflinePrimaryVertices' id = 860 ++++++++ starting: prefetching before processing event for module: stream = 0 label = 'bunchSpacingProducer' id = 79 ++++++++ starting: prefetching before processing event for module: stream = 0 label = 'l1extraParticles' id = 78 ++++++++ finished: prefetching before processing event for module: stream = 0 label = 'l1extraParticles' id = 78 ++++++++ starting: processing event for module: stream = 0 label = 'l1extraParticles' id = 78 ++++++++ finished: processing event for module: stream = 0 label = 'l1extraParticles' id = 78 ++++++++ starting: processing event for module: stream = 0 label = 'L1Reco_step' id = 8 ++++++++ finished: processing event for module: stream = 0 label = 'L1Reco_step' id = 8 ++++++ finished: processing path 'L1Reco_step' : stream = 0 ++++++++ starting: prefetching before processing event for module: stream = 0 label = 'caloLayer1RawFed1354' id = 63 ++++++++ starting: prefetching before processing event for module: stream = 0 label = 'simCaloStage2Layer1Digis' id = 51 ++++++++ starting: prefetching before processing event for module: stream = 0 label = 'generatorSmeared' id = 19 ++++++++ starting: prefetching before processing event for module: stream = 0 label = 'offlineBeamSpot' id = 33 ++++++++ finished: prefetching before processing event for module: stream = 0 label = 'offlineBeamSpot' id = 33 ++++++++ starting: processing event for module: stream = 0 label = 'offlineBeamSpot' id = 33 ++++++++ finished: processing event for module: stream = 0 label = 'offlineBeamSpot' id = 33 ++++++++ starting: prefetching before processing event for module: stream = 0 label = 'fastTrackerRecHits' id = 34 ++++++++ starting: prefetching before processing event for module: stream = 0 label = 'fastSimProducer' id = 31 ++++++++ starting: prefetching before processing event for module: stream = 0 label = 'randomEngineStateProducer' id = 17 ++++++++ finished: prefetching before processing event for module: stream = 0 label = 'randomEngineStateProducer' id = 17 ++++++++ starting: processing event for module: stream = 0 label = 'randomEngineStateProducer' id = 17 ++++++++ finished: processing event for module: stream = 0 label = 'randomEngineStateProducer' id = 17 ++++++++ starting: prefetching before processing event for module: stream = 0 label = 'VtxSmeared' id = 18 ++++++++ finished: prefetching before processing event for module: stream = 0 label = 'VtxSmeared' id = 18 ++++++++ starting: processing event for module: stream = 0 label = 'VtxSmeared' id = 18 ++++++++ finished: processing event for module: stream = 0 label = 'VtxSmeared' id = 18 ++++++++ finished: prefetching before processing event for module: stream = 0 label = 'generatorSmeared' id = 19 ++++++++ starting: processing event for module: stream = 0 label = 'generatorSmeared' id = 19 ++++++++ finished: processing event for module: stream = 0 label = 'generatorSmeared' id = 19 ++++++++ finished: prefetching before processing event for module: stream = 0 label = 'fastSimProducer' id = 31 ++++++++ starting: processing event for module: stream = 0 label = 'fastSimProducer' id = 31
It looks that fastsim needs some serious check about robustness in long runs, although IBs look generally clean.
The validation report at the PPD meeting of March 29 https://indico.cern.ch/event/718013/contributions/2950677/attachments/1625377/2588128/RelVal10_1_0_pre3_Mar29.pdf confirms that there is a ~ 10% rate of job failure for fastsim in the 10_1_0_pre3 relval production, possibly increased wrt previous releases.
Might be helpful to valgrind it.
@ssekmen @civanch do you have any update on this issue?
@fabiocos , sorry for the silence. This was a problem we had encountered earlier, and thought to be resolved by correctly setting simtrack IDs. We are investigating further.
Here is a backtrace and some gdb printouts for the hang on event 1735 in single-threaded mode:
#0 0x0000003644608dab in __ieee754_asin () from /lib64/libm.so.6
#1 0x0000003644624902 in asin () from /lib64/libm.so.6
#2 0x00007fffdbfa1029 in fastsim::HelixTrajectory::nextCrossingTimeC (this=0x7fff5fa8ee60, layer=..., onLayer=<optimized out>)
at /uscms_data/d3/pedrok/hf/fast/vg/CMSSW_10_2_0_pre1/src/FastSimulation/SimplifiedGeometryPropagator/src/HelixTrajectory.cc:109
#3 0x00007fffdbfa698d in fastsim::LayerNavigator::moveParticleToNextLayer (this=this@entry=0x7fffffff4ee0, particle=..., layer=@0x7fffffff4de0: 0x0)
at /uscms_data/d3/pedrok/hf/fast/vg/CMSSW_10_2_0_pre1/src/FastSimulation/SimplifiedGeometryPropagator/src/LayerNavigator.cc:234
#4 0x00007fffdbfee0bd in FastSimProducer::produce (this=0x7fffd0871000, iEvent=..., iSetup=...)
at /uscms_data/d3/pedrok/hf/fast/vg/CMSSW_10_2_0_pre1/src/FastSimulation/SimplifiedGeometryPropagator/plugins/FastSimProducer.cc:246
(gdb) frame 2
#2 0x00007fffdbfa1029 in fastsim::HelixTrajectory::nextCrossingTimeC (this=0x7fff5fa8ee60, layer=..., onLayer=<optimized out>)
at /uscms_data/d3/pedrok/hf/fast/vg/CMSSW_10_2_0_pre1/src/FastSimulation/SimplifiedGeometryPropagator/src/HelixTrajectory.cc:109
109 phi2 = std::asin((-b + sqrtDelta) / (2.*a));
(gdb) print phi2
$1 = 0
(gdb) print -b
value has been optimized out
(gdb) print sqrtDelta
$2 = <optimized out>
(gdb) print a
$3 = 1.6019137863755555e+17
(gdb) frame 3
#3 0x00007fffdbfa698d in fastsim::LayerNavigator::moveParticleToNextLayer (this=this@entry=0x7fffffff4ee0, particle=..., layer=@0x7fffffff4de0: 0x0)
at /uscms_data/d3/pedrok/hf/fast/vg/CMSSW_10_2_0_pre1/src/FastSimulation/SimplifiedGeometryPropagator/src/LayerNavigator.cc:234
234 double tempDeltaTime = trajectory->nextCrossingTimeC(*_layer, particle.isOnLayer(_layer->isForward(), _layer->index()));
(gdb) print _layer
$4 = (const fastsim::SimplifiedGeometry *) 0x7fff5ec5ed00
(gdb) print *_layer
$5 = {_vptr.SimplifiedGeometry = 0x7fffdbfb2708 <vtable for fastsim::BarrelSimplifiedGeometry+16>, geomProperty_ = 3.0030000000000001, index_ = 0,
detLayer_ = 0x0, magneticFieldHist_ = std::unique_ptr<TH1F> containing 0x7fff5fc4ac00, thicknessHist_ = std::unique_ptr<TH1F> containing 0x7fff5fc76800,
nuclearInteractionThicknessFactor_ = 1, interactionModels_ = std::vector of length 6, capacity 8 = {0x7fffd062c9c0, 0x7fffd0b71800, 0x7fffd062c980,
0x7fffd0a68100, 0x7fffd062ca00, 0x7fffd06267a0}, caloType_ = fastsim::SimplifiedGeometry::NONE}
(gdb) frame 4
#4 0x00007fffdbfee0bd in FastSimProducer::produce (this=0x7fffd0871000, iEvent=..., iSetup=...)
at /uscms_data/d3/pedrok/hf/fast/vg/CMSSW_10_2_0_pre1/src/FastSimulation/SimplifiedGeometryPropagator/plugins/FastSimProducer.cc:246
246 while(layerNavigator.moveParticleToNextLayer(*particle,layer))
(gdb) print particle
$6 = std::unique_ptr<fastsim::Particle> containing 0x7fff5fa8edc0
(gdb) print *particle
$7 = {pdgId_ = -13, charge_ = 1, position_ = {fCoordinates = {fX = 0.031826737219976059, fY = -0.00027647409131310462, fZ = -1.2745637744665146,
fT = 0.10884436539269637}}, momentum_ = {fCoordinates = {fX = 161.63149706586913, fY = 0.053018557368889621, fZ = -132.28076371441142,
fT = 208.86109084179623}}, remainingProperLifeTimeC_ = 116093.80875165977, simTrackIndex_ = 0, simVertexIndex_ = 0, genParticleIndex_ = 422,
isOnForwardLayer_ = false, isOnLayerIndex_ = -1, energyDeposit_ = 0, isLooper_ = false, motherDeltaR_ = -1, motherPpdId_ = 0, motherSimTrackIndex_ = -999}
This looks like an infinite loop in the propagator. I hope someone more familiar with the code can look at it further.
@kpedro88 thanks for the investogation. @ssekmen is there any progress here?
Not yet unfortunately. Kevin's valgrind tests pointed out that either the material interaction modules or the decayer might be causing the issue. @skurz and I are investigating further.
I'm making progress in both issues (unknown pdgId and infinite loop) and I expect to have a PR tomorrow.
Both issues are expected to be fixed in this PR https://github.com/cms-sw/cmssw/pull/23019
@skurz very welcome, thanks, I have a question in that PR about the proposed fix
In recent days a recurrent failure in the fastsim2 addOn test has been noticed, although this does not appear in the IB and it is not trivially reproducible. The pattern of the problem is:
----- Begin Fatal Exception 22-Mar-2018 21:58:34 CET----------------------- An exception of category 'fastsim::ParticleManager' occurred while [0] Processing Event run: 1 lumi: 1 event: 55 stream: 0 [1] Running path 'simulation_step' [2] Calling method for module FastSimProducer/'fastSimProducer' Exception Message: unknown pdg id: 9010221 ----- End Fatal Exception -------------------------------------------------
I am trying to reproduce it in a long run. The problem has become frequent enough to deserve an investigation.