Closed makortel closed 3 years ago
assign generators
New categories assigned: generators
@mkirsano,@SiewYan,@alberto-sanchez,@agrohsje,@GurpreetSinghChahal you have been requested to review this Pull request/Issue and eventually sign? Thanks
A new Issue was created by @makortel Matti Kortelainen.
@Dr15Jones, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks.
The following Error printouts are visible in CMSSW_12_0_X_2021-04-27-1100:
Error: The object '/Herwig/Partons/PDFSet_nnlo' was not created as another object with that name already exists.
Error: The object '/Herwig/Partons/PDFSet_lo' was not created as another object with that name already exists.
Error: The object '/Herwig/EventHandlers/LesHouchesHandler' was not created as another object with that name already exists.
Error: The object '/Herwig/Cuts/NoCuts' was not created as another object with that name already exists.
Error: The object '/Herwig/Partons/LHAPDF' was not created as another object with that name already exists.
Error: The object '/Herwig/EventHandlers/LesHouchesReader' was not created as another object with that name already exists.
where the job succeeded. It has one printout of Error: No such file or directory: cmsgrid_final.lhe, whereas the failing case has a total of three of them. Can these be related to the crash, or are they unrelated?
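For comparing a succeeding and a failing log, a small helper along these lines could tally the distinct Error printouts. This is only a sketch: it assumes each message sits on its own line starting with "Error:", as in the excerpts above, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Assumption: every printout of interest is a single line starting with "Error:".
ERROR_LINE = re.compile(r"^Error:\s*(.+)$")


def count_error_printouts(log_text):
    """Return a Counter mapping each distinct 'Error: ...' message to its count."""
    counts = Counter()
    for line in log_text.splitlines():
        match = ERROR_LINE.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts
```

Running it on both step1 logs and comparing the count for "No such file or directory: cmsgrid_final.lhe" would show the one-versus-three difference directly.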
Hi @Dominic-Stafford @theofil, is one of you available to follow up on the seg-fault?
I think the origin of the error is in #33516. Specifically, the change in Configuration/Generator/python/TTbar_Pow_LHE_13TeV_cff.py affects https://github.com/cms-sw/cmssw/blob/master/Configuration/Generator/python/TT_13TeV_Pow_Herwig7_cff.py#L3.
It looks like we can fix this by forcing externalLHEProducer.generateConcurrently = cms.untracked.bool(False) in TT_13TeV_Pow_Herwig7_cff.py
I don't know whether it was expected that Herwig7 works concurrently or not.
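As a rough sketch of that workaround, the override could also be applied to a full process through a customise-style function. The function name here is hypothetical, and it assumes the producer is attached to the process under the label externalLHEProducer; only the generateConcurrently parameter itself comes from the discussion above. It needs a CMSSW environment to run.

```python
import FWCore.ParameterSet.Config as cms


def disableConcurrentLHE(process):
    """Force serial LHE generation (hypothetical helper, not part of CMSSW)."""
    # Assumption: the LHE producer carries the usual label 'externalLHEProducer'.
    if hasattr(process, "externalLHEProducer"):
        process.externalLHEProducer.generateConcurrently = cms.untracked.bool(False)
    return process
```

Setting the parameter directly in the fragment, as proposed, has the same effect for workflows built from TT_13TeV_Pow_Herwig7_cff.py; the function form is only convenient when testing an already generated configuration.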
cc @SiewYan
Thanks. As far as I can see, the seg-fault is caused by the missing "cmsgrid_final.lhe". As far as I know, the concurrent externalLHEProducer produces the LHE files separately in each subfolder (thread*/cmsgrid_final.lhe) and combines them directly in CMSSW to produce an EDM LHE file. Therefore no final ./cmsgrid_final.lhe is produced.
It happens that H7 reads cmsgrid_final.lhe directly instead of the EDM LHE file, which causes this crash.
This indicates that "concurrent externalLHEProducer" is not compatible with H7GeneratorFilter (sorry didn't notice this). To my understanding the easiest solution would be to write out a cmsgrid_final.lhe in the former step. What do you think? If it's fine I can make this PR shortly.
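To illustrate the idea of writing out a single cmsgrid_final.lhe from the per-thread files, here is a minimal standalone sketch. It is not the implementation in externalLHEProducer (which is C++); the function names are invented, the preamble (header and init block) is simply copied from the first input, and no attempt is made to recombine cross sections across threads.

```python
import re

# Matches one complete <event> ... </event> block.
EVENT_BLOCK = re.compile(r"<event\b.*?</event>", re.DOTALL)
EVENT_START = re.compile(r"<event\b")


def split_lhe(text):
    """Split LHE text into (preamble before the first event, list of event blocks)."""
    events = EVENT_BLOCK.findall(text)
    start = EVENT_START.search(text)
    if start:
        cut = start.start()
    else:
        end_tag = text.find("</LesHouchesEvents>")
        cut = end_tag if end_tag >= 0 else len(text)
    return text[:cut], events


def merge_lhe_files(inputs, output):
    """Concatenate the events of several LHE files; return the number written.

    Sketch only: each file is read fully into memory, and the <init> block
    of the first input is kept unchanged.
    """
    if not inputs:
        raise ValueError("no input LHE files given")
    n_events = 0
    with open(output, "w") as out:
        for index, path in enumerate(inputs):
            with open(path) as handle:
                preamble, events = split_lhe(handle.read())
            if index == 0:
                out.write(preamble)
            for event in events:
                out.write(event + "\n")
                n_events += 1
        out.write("</LesHouchesEvents>\n")
    return n_events
```

It could be called with the list of thread*/cmsgrid_final.lhe paths and "cmsgrid_final.lhe" as the output, so that a reader expecting the single file finds it in the working directory.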
As @colizz said, Herwig directly reads in cmsgrid_final.lhe rather than the EDM LHE file. If the concurrent externalLHEProducer could be changed to also write out this file this should work, otherwise it would probably be simplest to not use this with Herwig 7. As @makortel mentioned, Herwig also produces some spurious error messages when running normally, which make diagnosing issues like these a bit harder. We believe this is because the CMSSW scheduler currently tries to call Herwig before the externalLHEProducer; @theofil is currently looking into this.
ok thanks! I'll handle the "concurrent externalLHEProducer" side.
Is there any chance to have the fix ready by CMSSW_12_0_0_pre1 (next Tuesday)?
Hi, I just submitted the PR to fix this: #33615. The Herwig7 errors still occur (as in the single-core case) but should be independent of the seg-fault raised here.
Hello,
I just observed a new segfault in H7. I use CMSSW_12_0_X_2021-05-24-2300 to test the process JME-RunIISummer20UL16wmLHEGEN-00003. Here are the commands:
cmsrel CMSSW_12_0_X_2021-05-24-2300
cd CMSSW_12_0_X_2021-05-24-2300/src
cmsenv
curl -s -k https://cms-pdmv.cern.ch/mcm/public/restapi/requests/get_fragment/JME-RunIISummer20UL16wmLHEGEN-00003 --retry 3 --create-dirs -o Configuration/GenProduction/python/JME-RunIISummer20UL16wmLHEGEN-00003-fragment.py
[ -s Configuration/GenProduction/python/JME-RunIISummer20UL16wmLHEGEN-00003-fragment.py ] || exit $?;
scram b -j8
cd ../..
cmsDriver.py Configuration/GenProduction/python/JME-RunIISummer20UL16wmLHEGEN-00003-fragment.py --python_filename JME-RunIISummer20UL16wmLHEGEN-00003_1_cfg.py --eventcontent RAWSIM,LHE --customise Configuration/DataProcessing/Utils.addMonitoring --datatier GEN,LHE --fileout file:JME-RunIISummer20UL16wmLHEGEN-00003.root --conditions 112X_mcRun2_asymptotic_v2 --beamspot Realistic25ns13TeV2016Collision --customise_commands process.source.numberEventsInLuminosityBlock="cms.untracked.uint32(101)" --step LHE,GEN --geometry DB:Extended --era Run2_2016 --no_exec --mc -n 5000
cmsRun JME-RunIISummer20UL16wmLHEGEN-00003_1_cfg.py
The segfault occurs at a random point, after around 500-2000 events have been processed:
...
Begin processing the 776th record. Run 1, Event 776, LumiSection 8 on stream 0 at 25-May-2021 09:41:10.100 CEST
Begin processing the 777th record. Run 1, Event 777, LumiSection 8 on stream 0 at 25-May-2021 09:41:10.324 CEST
Begin processing the 778th record. Run 1, Event 778, LumiSection 8 on stream 0 at 25-May-2021 09:41:10.734 CEST
A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.
Tue May 25 09:41:11 CEST 2021
Thread 2 (Thread 0x7f96dc1d3700 (LWP 4635)):
#0 0x00007f97007e51d9 in waitpid () from /lib64/libpthread.so.0
#1 0x00007f96f39dc387 in edm::service::cmssw_stacktrace_fork() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_X_2021-05-24-2300/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#2 0x00007f96f39dd01a in edm::service::InitRootHandlers::stacktraceHelperThread() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_X_2021-05-24-2300/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#3 0x00007f9700dddaf0 in std::execute_native_thread_routine (__p=0x7f96dcaf9520) at ../../../../../libstdc++-v3/src/c++11/thread.cc:80
#4 0x00007f97007ddea5 in start_thread () from /lib64/libpthread.so.0
#5 0x00007f97005069fd in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f96feade540 (LWP 4577)):
#0 0x00007f97004fbccd in poll () from /lib64/libc.so.6
#1 0x00007f96f39dc7b7 in full_read.constprop () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_X_2021-05-24-2300/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#2 0x00007f96f39dd0ec in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_X_2021-05-24-2300/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#3 0x00007f96f39e059b in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_X_2021-05-24-2300/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x00007f96caf5325a in Herwig::ColourReconnector::_isColour8(ThePEG::Pointer::TransientConstRCPtr<ThePEG::Particle>, ThePEG::Pointer::TransientConstRCPtr<ThePEG::Particle>) const () from /cvmfs/cms-ib.cern.ch/nweek-02682/slc7_amd64_gcc900/external/herwig7/7.2.2-2bfae0df6f5a8d9801ab4e178064f4d8/lib/Herwig/Herwig.so.27
#6 0x00007f96caf57976 in Herwig::ColourReconnector::_findPartnerBaryonic(__gnu_cxx::__normal_iterator<ThePEG::Pointer::RCPtr<Herwig::Cluster>*, std::vector<ThePEG::Pointer::RCPtr<Herwig::Cluster>, std::allocator<ThePEG::Pointer::RCPtr<Herwig::Cluster> > > >, std::vector<ThePEG::Pointer::RCPtr<Herwig::Cluster>, std::allocator<ThePEG::Pointer::RCPtr<Herwig::Cluster> > >&, bool&, std::vector<ThePEG::Pointer::RCPtr<Herwig::Cluster>, std::allocator<ThePEG::Pointer::RCPtr<Herwig::Cluster> > > const&, __gnu_cxx::__normal_iterator<ThePEG::Pointer::RCPtr<Herwig::Cluster>*, std::vector<ThePEG::Pointer::RCPtr<Herwig::Cluster>, std::allocator<ThePEG::Pointer::RCPtr<Herwig::Cluster> > > >&, __gnu_cxx::__normal_iterator<ThePEG::Pointer::RCPtr<Herwig::Cluster>*, std::vector<ThePEG::Pointer::RCPtr<Herwig::Cluster>, std::allocator<ThePEG::Pointer::RCPtr<Herwig::Cluster> > > >&) const () from /cvmfs/cms-ib.cern.ch/nweek-02682/slc7_amd64_gcc900/external/herwig7/7.2.2-2bfae0df6f5a8d9801ab4e178064f4d8/lib/Herwig/Herwig.so.27
#7 0x00007f96caf587b8 in Herwig::ColourReconnector::_doRecoBaryonic(std::vector<ThePEG::Pointer::RCPtr<Herwig::Cluster>, std::allocator<ThePEG::Pointer::RCPtr<Herwig::Cluster> > >&) const () from /cvmfs/cms-ib.cern.ch/nweek-02682/slc7_amd64_gcc900/external/herwig7/7.2.2-2bfae0df6f5a8d9801ab4e178064f4d8/lib/Herwig/Herwig.so.27
#8 0x00007f96caf44eb0 in Herwig::ClusterHadronizationHandler::handle(ThePEG::EventHandler&, std::vector<ThePEG::Pointer::TransientRCPtr<ThePEG::Particle>, std::allocator<ThePEG::Pointer::TransientRCPtr<ThePEG::Particle> > > const&, ThePEG::Hint const&) () from /cvmfs/cms-ib.cern.ch/nweek-02682/slc7_amd64_gcc900/external/herwig7/7.2.2-2bfae0df6f5a8d9801ab4e178064f4d8/lib/Herwig/Herwig.so.27
#9 0x00007f96cc0429b3 in ThePEG::EventHandler::performStep (this=0x7f96c82a7c00, handler=..., hint=...) at EventHandler.cc:196
#10 0x00007f96cc042cca in ThePEG::EventHandler::continueCollision (this=this@entry=0x7f96c82a7c00) at ../include/ThePEG/Pointer/RCPtr.h:879
#11 0x00007f96cbdb4912 in ThePEG::LesHouchesEventHandler::performCollision (this=0x7f96c82a7c00) at LesHouchesEventHandler.cc:334
#12 0x00007f96cbdb739f in ThePEG::LesHouchesEventHandler::generateEvent (this=0x7f96c82a7c00) at LesHouchesEventHandler.cc:256
#13 0x00007f96cbfe9bf6 in ThePEG::EventGenerator::doShoot (this=0x7f96d07e1800) at ../include/ThePEG/Pointer/RCPtr.h:879
#14 0x00007f96cbfe8eab in ThePEG::EventGenerator::shoot (this=0x7f96d07e1800) at EventGenerator.cc:432
#15 0x00007f96cdb43743 in Herwig7Hadronizer::generatePartonsAndHadronize() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_X_2021-05-24-2300/lib/slc7_amd64_gcc900/pluginGeneratorInterfaceHerwig7HadronizerPlugins.so
#16 0x00007f96cdb52137 in edm::GeneratorFilter<Herwig7Hadronizer, gen::ExternalDecayDriver>::filter(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_X_2021-05-24-2300/lib/slc7_amd64_gcc900/pluginGeneratorInterfaceHerwig7HadronizerPlugins.so
#17 0x00007f9702fe1b4b in edm::one::EDFilterBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_X_2021-05-24-2300/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#18 0x00007f9702fc700d in edm::WorkerT<edm::one::EDFilterBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_X_2021-05-24-2300/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#19 0x00007f9702f25995 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>(edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_X_2021-05-24-2300/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#20 0x00007f9702f25b4d in bool edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_X_2021-05-24-2300/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#21 0x00007f9702f25e56 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr const*, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_X_2021-05-24-2300/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#22 0x00007f9702f28440 in void edm::SerialTaskQueueChain::actionToRun<edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute()::{lambda()#1}&>(edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute()::{lambda()#1}&) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_X_2021-05-24-2300/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#23 0x00007f9702f28681 in edm::SerialTaskQueue::QueuedTask<edm::SerialTaskQueueChain::push<edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute()::{lambda()#1}&>(tbb::detail::d1::task_group&, edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute()::{lambda()#1}&)::{lambda()#1}>::execute() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_X_2021-05-24-2300/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#24 0x00007f97031202c9 in tbb::detail::d1::function_task<edm::SerialTaskQueue::spawn(edm::SerialTaskQueue::TaskBase&)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_X_2021-05-24-2300/lib/slc7_amd64_gcc900/libFWCoreConcurrency.so
#25 0x00007f9701827d0b in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (this=0x7f96fd55be00, t=0x7f96fd54e400, waiter=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_0_0_pre1-slc7_amd64_gcc900/build/CMSSW_12_0_0_pre1-build/BUILD/slc7_amd64_gcc900/external/tbb/v2021.2.0/tbb-v2021.2.0/src/tbb/task_dispatcher.h:396
#26 0x00007f97018247e5 in tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=0x0, this=0x7f96fd55be00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_0_0_pre1-slc7_amd64_gcc900/build/CMSSW_12_0_0_pre1-build/BUILD/slc7_amd64_gcc900/external/tbb/v2021.2.0/tbb-v2021.2.0/src/tbb/task_dispatcher.cpp:178
#27 tbb::detail::r1::task_dispatcher::execute_and_wait (t=0x0, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_0_0_pre1-slc7_amd64_gcc900/build/CMSSW_12_0_0_pre1-build/BUILD/slc7_amd64_gcc900/external/tbb/v2021.2.0/tbb-v2021.2.0/src/tbb/task_dispatcher.cpp:168
#28 0x00007f9702e95d8f in edm::EventProcessor::processLumis(std::shared_ptr<void> const&) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_X_2021-05-24-2300/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#29 0x00007f9702e9f115 in edm::EventProcessor::runToCompletion() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_X_2021-05-24-2300/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#30 0x000000000040bae6 in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#31 0x00007f970180c970 in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_0_0_pre1-slc7_amd64_gcc900/build/CMSSW_12_0_0_pre1-build/BUILD/slc7_amd64_gcc900/external/tbb/v2021.2.0/tbb-v2021.2.0/src/tbb/arena.cpp:674
#32 0x000000000040ca58 in main::{lambda()#1}::operator()() const ()
#33 0x000000000040b62c in main ()
Current Modules:
Module: Herwig7GeneratorFilter:generator (crashed)
A fatal system signal has occurred: segmentation violation
Segmentation fault (core dumped)
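Since the crash point varies from run to run, it may help to pull the last processed record out of each log automatically. The following is a sketch under the assumption that the log uses the "Begin processing the Nth record" and "segmentation violation" lines shown above; the function name is made up. With more than one stream, the last record started is not necessarily the one that crashed.

```python
import re

# Pattern taken from the "Begin processing" lines in the log excerpt above.
RECORD_LINE = re.compile(
    r"Begin processing the (\d+)(?:st|nd|rd|th) record\. "
    r"Run (\d+), Event (\d+), LumiSection (\d+)"
)


def last_record_before_crash(log_text):
    """Return (record, run, event, lumi) of the last record started before
    the first 'segmentation violation' line, or None if no record was seen."""
    last = None
    for line in log_text.splitlines():
        if "segmentation violation" in line:
            break
        match = RECORD_LINE.search(line)
        if match:
            last = tuple(int(group) for group in match.groups())
    return last
```

Collecting this tuple over several runs with the same seed and with different seeds would show whether the crash is tied to a particular event.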
@Dominic-Stafford @agrohsje Would you mind also taking a look at this? Many thanks.
Let me add @theofil. @Dominic-Stafford @theofil, will you follow up?
Thanks for bringing this up. It's not immediately obvious to me what's going wrong here, but I've started running it to have a look.
I've been able to reproduce the error just after "the 3278th record" but I haven't yet found any explanation for it.
Do we know the latest release in which the same fragment ran without problems?
best, Kostas
(it would be better to open a new issue for a new segfault, especially if the cause would likely be different)
@theofil Thanks for the test. I don't yet know when the issue started to appear. (It looks odd to me because my first test under the same conditions had no error, but then the segfault started to appear regularly at around event 50-2000 in my later runs.)
@makortel ok, sure.
Workflow 535.0 step 1 crashes in CMSSW_12_0_X_2021-04-27-1100
https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_amd64_gcc900/CMSSW_12_0_X_2021-04-27-1100/pyRelValMatrixLogs/run/535.0_TTbar_13TeV_Pow_herwig7+TTbar_13TeV_Pow_herwig7+HARVESTGEN/step1_TTbar_13TeV_Pow_herwig7+TTbar_13TeV_Pow_herwig7+HARVESTGEN.log#/