cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/

Segfault in Herwig7GeneratorFilter #33830

Open colizz opened 3 years ago

colizz commented 3 years ago

Hello,

(a new issue following up https://github.com/cms-sw/cmssw/issues/33544)

A segfault appears in Herwig7GeneratorFilter. I use CMSSW_12_0_X_2021-05-24-2300 to test the process JME-RunIISummer20UL16wmLHEGEN-00003. Here are the commands:

cmsrel CMSSW_12_0_X_2021-05-24-2300
cd CMSSW_12_0_X_2021-05-24-2300/src
cmsenv
curl -s -k https://cms-pdmv.cern.ch/mcm/public/restapi/requests/get_fragment/JME-RunIISummer20UL16wmLHEGEN-00003 --retry 3 --create-dirs -o Configuration/GenProduction/python/JME-RunIISummer20UL16wmLHEGEN-00003-fragment.py
[ -s Configuration/GenProduction/python/JME-RunIISummer20UL16wmLHEGEN-00003-fragment.py ] || exit $?;
scram b -j8
cd ../..
cmsDriver.py Configuration/GenProduction/python/JME-RunIISummer20UL16wmLHEGEN-00003-fragment.py --python_filename JME-RunIISummer20UL16wmLHEGEN-00003_1_cfg.py --eventcontent RAWSIM,LHE --customise Configuration/DataProcessing/Utils.addMonitoring --datatier GEN,LHE --fileout file:JME-RunIISummer20UL16wmLHEGEN-00003.root --conditions 112X_mcRun2_asymptotic_v2 --beamspot Realistic25ns13TeV2016Collision --customise_commands process.source.numberEventsInLuminosityBlock="cms.untracked.uint32(101)" --step LHE,GEN --geometry DB:Extended --era Run2_2016 --no_exec --mc -n 5000
cmsRun JME-RunIISummer20UL16wmLHEGEN-00003_1_cfg.py

The segfault occurs randomly, typically somewhere between event 500 and 2000:

...
Begin processing the 776th record. Run 1, Event 776, LumiSection 8 on stream 0 at 25-May-2021 09:41:10.100 CEST
Begin processing the 777th record. Run 1, Event 777, LumiSection 8 on stream 0 at 25-May-2021 09:41:10.324 CEST
Begin processing the 778th record. Run 1, Event 778, LumiSection 8 on stream 0 at 25-May-2021 09:41:10.734 CEST

A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.

Tue May 25 09:41:11 CEST 2021
Thread 2 (Thread 0x7f96dc1d3700 (LWP 4635)):
#0  0x00007f97007e51d9 in waitpid () from /lib64/libpthread.so.0
#1  0x00007f96f39dc387 in edm::service::cmssw_stacktrace_fork() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_X_2021-05-24-2300/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#2  0x00007f96f39dd01a in edm::service::InitRootHandlers::stacktraceHelperThread() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_X_2021-05-24-2300/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#3  0x00007f9700dddaf0 in std::execute_native_thread_routine (__p=0x7f96dcaf9520) at ../../../../../libstdc++-v3/src/c++11/thread.cc:80
#4  0x00007f97007ddea5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f97005069fd in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f96feade540 (LWP 4577)):
#0  0x00007f97004fbccd in poll () from /lib64/libc.so.6
#1  0x00007f96f39dc7b7 in full_read.constprop () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_X_2021-05-24-2300/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#2  0x00007f96f39dd0ec in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_X_2021-05-24-2300/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#3  0x00007f96f39e059b in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_X_2021-05-24-2300/lib/slc7_amd64_gcc900/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007f96caf5325a in Herwig::ColourReconnector::_isColour8(ThePEG::Pointer::TransientConstRCPtr<ThePEG::Particle>, ThePEG::Pointer::TransientConstRCPtr<ThePEG::Particle>) const () from /cvmfs/cms-ib.cern.ch/nweek-02682/slc7_amd64_gcc900/external/herwig7/7.2.2-2bfae0df6f5a8d9801ab4e178064f4d8/lib/Herwig/Herwig.so.27
#6  0x00007f96caf57976 in Herwig::ColourReconnector::_findPartnerBaryonic(__gnu_cxx::__normal_iterator<ThePEG::Pointer::RCPtr<Herwig::Cluster>*, std::vector<ThePEG::Pointer::RCPtr<Herwig::Cluster>, std::allocator<ThePEG::Pointer::RCPtr<Herwig::Cluster> > > >, std::vector<ThePEG::Pointer::RCPtr<Herwig::Cluster>, std::allocator<ThePEG::Pointer::RCPtr<Herwig::Cluster> > >&, bool&, std::vector<ThePEG::Pointer::RCPtr<Herwig::Cluster>, std::allocator<ThePEG::Pointer::RCPtr<Herwig::Cluster> > > const&, __gnu_cxx::__normal_iterator<ThePEG::Pointer::RCPtr<Herwig::Cluster>*, std::vector<ThePEG::Pointer::RCPtr<Herwig::Cluster>, std::allocator<ThePEG::Pointer::RCPtr<Herwig::Cluster> > > >&, __gnu_cxx::__normal_iterator<ThePEG::Pointer::RCPtr<Herwig::Cluster>*, std::vector<ThePEG::Pointer::RCPtr<Herwig::Cluster>, std::allocator<ThePEG::Pointer::RCPtr<Herwig::Cluster> > > >&) const () from /cvmfs/cms-ib.cern.ch/nweek-02682/slc7_amd64_gcc900/external/herwig7/7.2.2-2bfae0df6f5a8d9801ab4e178064f4d8/lib/Herwig/Herwig.so.27
#7  0x00007f96caf587b8 in Herwig::ColourReconnector::_doRecoBaryonic(std::vector<ThePEG::Pointer::RCPtr<Herwig::Cluster>, std::allocator<ThePEG::Pointer::RCPtr<Herwig::Cluster> > >&) const () from /cvmfs/cms-ib.cern.ch/nweek-02682/slc7_amd64_gcc900/external/herwig7/7.2.2-2bfae0df6f5a8d9801ab4e178064f4d8/lib/Herwig/Herwig.so.27
#8  0x00007f96caf44eb0 in Herwig::ClusterHadronizationHandler::handle(ThePEG::EventHandler&, std::vector<ThePEG::Pointer::TransientRCPtr<ThePEG::Particle>, std::allocator<ThePEG::Pointer::TransientRCPtr<ThePEG::Particle> > > const&, ThePEG::Hint const&) () from /cvmfs/cms-ib.cern.ch/nweek-02682/slc7_amd64_gcc900/external/herwig7/7.2.2-2bfae0df6f5a8d9801ab4e178064f4d8/lib/Herwig/Herwig.so.27
#9  0x00007f96cc0429b3 in ThePEG::EventHandler::performStep (this=0x7f96c82a7c00, handler=..., hint=...) at EventHandler.cc:196
#10 0x00007f96cc042cca in ThePEG::EventHandler::continueCollision (this=this@entry=0x7f96c82a7c00) at ../include/ThePEG/Pointer/RCPtr.h:879
#11 0x00007f96cbdb4912 in ThePEG::LesHouchesEventHandler::performCollision (this=0x7f96c82a7c00) at LesHouchesEventHandler.cc:334
#12 0x00007f96cbdb739f in ThePEG::LesHouchesEventHandler::generateEvent (this=0x7f96c82a7c00) at LesHouchesEventHandler.cc:256
#13 0x00007f96cbfe9bf6 in ThePEG::EventGenerator::doShoot (this=0x7f96d07e1800) at ../include/ThePEG/Pointer/RCPtr.h:879
#14 0x00007f96cbfe8eab in ThePEG::EventGenerator::shoot (this=0x7f96d07e1800) at EventGenerator.cc:432
#15 0x00007f96cdb43743 in Herwig7Hadronizer::generatePartonsAndHadronize() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_X_2021-05-24-2300/lib/slc7_amd64_gcc900/pluginGeneratorInterfaceHerwig7HadronizerPlugins.so
#16 0x00007f96cdb52137 in edm::GeneratorFilter<Herwig7Hadronizer, gen::ExternalDecayDriver>::filter(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_X_2021-05-24-2300/lib/slc7_amd64_gcc900/pluginGeneratorInterfaceHerwig7HadronizerPlugins.so
#17 0x00007f9702fe1b4b in edm::one::EDFilterBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_X_2021-05-24-2300/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#18 0x00007f9702fc700d in edm::WorkerT<edm::one::EDFilterBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_X_2021-05-24-2300/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#19 0x00007f9702f25995 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>(edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_X_2021-05-24-2300/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#20 0x00007f9702f25b4d in bool edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_X_2021-05-24-2300/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#21 0x00007f9702f25e56 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr const*, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_X_2021-05-24-2300/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#22 0x00007f9702f28440 in void edm::SerialTaskQueueChain::actionToRun<edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute()::{lambda()#1}&>(edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute()::{lambda()#1}&) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_X_2021-05-24-2300/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#23 0x00007f9702f28681 in edm::SerialTaskQueue::QueuedTask<edm::SerialTaskQueueChain::push<edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute()::{lambda()#1}&>(tbb::detail::d1::task_group&, edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute()::{lambda()#1}&)::{lambda()#1}>::execute() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_X_2021-05-24-2300/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#24 0x00007f97031202c9 in tbb::detail::d1::function_task<edm::SerialTaskQueue::spawn(edm::SerialTaskQueue::TaskBase&)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_X_2021-05-24-2300/lib/slc7_amd64_gcc900/libFWCoreConcurrency.so
#25 0x00007f9701827d0b in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (this=0x7f96fd55be00, t=0x7f96fd54e400, waiter=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_0_0_pre1-slc7_amd64_gcc900/build/CMSSW_12_0_0_pre1-build/BUILD/slc7_amd64_gcc900/external/tbb/v2021.2.0/tbb-v2021.2.0/src/tbb/task_dispatcher.h:396
#26 0x00007f97018247e5 in tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=0x0, this=0x7f96fd55be00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_0_0_pre1-slc7_amd64_gcc900/build/CMSSW_12_0_0_pre1-build/BUILD/slc7_amd64_gcc900/external/tbb/v2021.2.0/tbb-v2021.2.0/src/tbb/task_dispatcher.cpp:178
#27 tbb::detail::r1::task_dispatcher::execute_and_wait (t=0x0, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_0_0_pre1-slc7_amd64_gcc900/build/CMSSW_12_0_0_pre1-build/BUILD/slc7_amd64_gcc900/external/tbb/v2021.2.0/tbb-v2021.2.0/src/tbb/task_dispatcher.cpp:168
#28 0x00007f9702e95d8f in edm::EventProcessor::processLumis(std::shared_ptr<void> const&) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_X_2021-05-24-2300/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#29 0x00007f9702e9f115 in edm::EventProcessor::runToCompletion() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_X_2021-05-24-2300/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#30 0x000000000040bae6 in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#31 0x00007f970180c970 in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_0_0_pre1-slc7_amd64_gcc900/build/CMSSW_12_0_0_pre1-build/BUILD/slc7_amd64_gcc900/external/tbb/v2021.2.0/tbb-v2021.2.0/src/tbb/arena.cpp:674
#32 0x000000000040ca58 in main::{lambda()#1}::operator()() const ()
#33 0x000000000040b62c in main ()

Current Modules:

Module: Herwig7GeneratorFilter:generator (crashed)

A fatal system signal has occurred: segmentation violation
Segmentation fault (core dumped)

@theofil, @Dominic-Stafford, and @agrohsje are looking into this.

cmsbuild commented 3 years ago

A new Issue was created by @colizz Congqiao Li.

@Dr15Jones, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel commented 3 years ago

assign generators

cmsbuild commented 3 years ago

New categories assigned: generators

@mkirsano, @SiewYan, @alberto-sanchez, @agrohsje, @GurpreetSinghChahal: you have been requested to review this Pull request/Issue and eventually sign. Thanks

agrohsje commented 3 years ago

Hi @theofil , do you have any news on this segmentation violation?

Dominic-Stafford commented 3 years ago

Hi @agrohsje. @theofil and I had a look into this, but couldn't find the cause, though we did find it also occurs outside of CMSSW. We therefore reported the issue to the Herwig authors, who said they would look into it, but haven't got back to us with a solution yet.

agrohsje commented 3 years ago

Thanks @Dominic-Stafford for the update. That's good to know.

agrohsje commented 3 years ago

Any news about this? Maybe we should ping them again. It is now 3 months later; maybe it was forgotten over the summer.

Dominic-Stafford commented 3 years ago

Hi @agrohsje. Thank you for the reminder; I'd also forgotten about this. I've sent Patrick Kirschgaesser, who was looking at this, another email.

Dominic-Stafford commented 3 years ago

The response from Patrick is that, for the problematic events, a cluster is formed in the hadronisation step by two quarks rather than by a quark and an anti-quark. This seems to be because the colour connections for these events in the MadGraph LHE file are incorrect. @colizz, do you know of any unusual settings in this MadGraph gridpack which may be causing this? If not, the Herwig authors said they would investigate further on their side.

agrohsje commented 3 years ago

Hi @Dominic-Stafford. Thanks for following up. Would it be possible to produce an LHE file (as a text file) with 1000 events, use it as input to check whether the problem appears, and then cross-check against the LHE? I am aware that MG5 can drop color info (as for axion-gluon interferences), but this case seems odd to me.

Dominic-Stafford commented 3 years ago

Hi @agrohsje. I have put an LHE file with 2500 events which reproduces the crash here: /afs/desy.de/user/s/stafford/tmp/cmsgrid_final.lhe. However, I'm not sure which of the events produces the crash. I could try getting Herwig to print the number of the problematic event, but this would be quite fiddly due to events being skipped in the merging.

theofil commented 2 years ago

Appending here yet another occurrence of the same issue:

https://cms-pdmv.cern.ch/mcm/requests?prepid=JME-RunIISummer20UL16wmLHEGEN-00090&page=0&shown=127

https://cms-talk.web.cern.ch/t/jme-local-validation-failed-for-the-madgraphmlm-herwig7-sample/4896?u=theofil

kirschen commented 2 years ago

Hi all, we are still stuck with our JME request for the MadGraph+Herwig samples that Kostas mentioned in January. Has there been some progress, and/or can we revive the issue here? (Also adding @kdlong, @gqlcms.)

Cheers, Henning

Dominic-Stafford commented 2 years ago

Hi, sorry for the delay. We've contacted the Herwig authors multiple times about this, but haven't yet had a conclusive response. For now, the best solution we have would be to downgrade to Herwig 7.1.4, for which we don't think this issue occurred; we could do that by running with CMSSW_10_6_12. I'm currently trying to get this set-up to work, but I've encountered what I think is an unrelated bug; I'll let you know when I have a working set-up. Would this solution work for you, or do you foresee any problems from using this older CMSSW version?

Dominic-Stafford commented 2 years ago

By comparing with a previous request [1], I found that the segfault I was experiencing with Herwig 7.1.4 was coming from the MadGraph gridpack: the gridpack from this request [2] caused Herwig to segfault as soon as it was called, while the one from the previous request [3] ran fine. I then tried running this older gridpack in 7.2 and found it didn't cause the segfault described in this thread either (at least for my 40 tests of 5000 events each). The proc and run cards for these gridpacks seem to be identical; the only difference is the CMSSW and MadGraph versions. I don't know whether reproducing the gridpack with the latest version of CMSSW would fix the issue, but I would suggest trying this and, if that doesn't fix it, submitting this request with the old gridpack, provided the lower MadGraph version wouldn't cause problems.

[1] https://cms-pdmv.cern.ch/mcm/public/restapi/requests/get_test/JME-RunIIFall18wmLHEGS-00029
[2] /cvmfs/cms.cern.ch/phys_generator/gridpacks/2017/13TeV/madgraph/V5_2.6.5/QCD_HT_LO_MLM/QCD_HT100to200_slc6_amd64_gcc630_CMSSW_9_3_16_tarball.tar.xz
[3] /cvmfs/cms.cern.ch/phys_generator/gridpacks/2017/13TeV/madgraph/V5_2.4.2/QCD_HT100to200/v1/QCD_HT100to200_slc6_amd64_gcc481_CMSSW_7_1_30_tarball.tar.xz

agrohsje commented 2 years ago

Hi @Dominic-Stafford. Can you send the exact setup to reproduce?

Dominic-Stafford commented 2 years ago

Hi @agrohsje. I've put the cfg files I was using here: /afs/desy.de/user/s/stafford/public/herwig_madgraph_col_recon_seg_fault. There is an example for CMSSW_10_6_12 (using Herwig 7.1.4) and one for CMSSW_10_6_30_patch1 (using Herwig 7.2). Both currently have the older (MadGraph 2.4.2) gridpack enabled, but I've included the new (v2.6.2) one as a comment in the external_lhe_producer so one can easily swap them over. In CMSSW_10_6_12 the difference is very obvious, as the new gridpack segfaults immediately when Herwig starts while the old one doesn't. In CMSSW_10_6_30_patch1 it's harder to tell, since the segfault is intermittent: in my tests, 10 out of 100 jobs with 5000 events each failed with the new gridpack, while all 100 ran successfully with the old one.

HerrHorizontal commented 2 years ago

Hi @Dominic-Stafford, could you point me to one of the problematic events in LHE format? I am currently in contact with Stefan Gieseke, and he would be interested in taking a closer look at this. Could you also point me to the version of MadGraph used to produce this event?

Dominic-Stafford commented 2 years ago

Hi @HerrHorizontal. This LHE file [1] with 15 events is the smallest I've been able to produce. I've tried splitting it further, but removing the last event makes it run, as does running the last event on its own, so the issue could be caused by some sort of buffer overflow in Herwig. This LHE was produced with MadGraph v2.6.5 (though, as I mentioned, the issue also occurs for 2.6.2). If Stefan Gieseke is able to look into this, that would be great.

[1] her_col_recon_seg_fault.lhe.txt (Please ignore the .txt extension, I just added this to make github accept it as an attachment)
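
For reference, the splitting I tried was along these lines; a minimal sketch with hypothetical file names, assuming the usual LHE layout (header before the first <event> block, closing </LesHouchesEvents> tag):

import re

with open("cmsgrid_final.lhe") as f:
    text = f.read()

# Everything before the first <event> block is the header; keep it verbatim.
header = text[:text.index("<event>")]
events = re.findall(r"<event>.*?</event>\s*", text, flags=re.DOTALL)

# Write the first 15 events to a smaller, self-contained LHE file.
with open("her_col_recon_seg_fault.lhe", "w") as out:
    out.write(header)
    out.writelines(events[:15])
    out.write("</LesHouchesEvents>\n")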

HerrHorizontal commented 2 years ago

Just a short status update: Stefan and I checked the colour connections in the event file you have sent and also in the one the Herwig authors got a few months ago enclosed with the issue report. So the colour structure in all the events in both files seems to be okay.

In case you are interested, here is how we checked it:

grep "50[0-9] " cmsgrid_final.lhe | awk '{sum += $2*($5-$6)} END {print sum}'

If all event contributions sum up to 0, everything is fine.
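
For anyone who prefers the column meanings spelled out, here is a rough Python equivalent of that one-liner (my reading of the awk: field 2 is the LHE status ISTUP, fields 5 and 6 are the colour/anti-colour tags ICOLUP(1)/ICOLUP(2); treat it as a sketch, not validated beyond the files above):

import re

total = 0
with open("cmsgrid_final.lhe") as f:
    for line in f:
        cols = line.split()
        # Particle lines in an LHE <event> block have 13 whitespace-separated
        # fields; the grep above keys on colour tags of the form 50x.
        if len(cols) == 13 and re.search(r"50[0-9] ", line):
            status = int(cols[1])  # ISTUP: -1 incoming, +1 outgoing
            colour, anticolour = int(cols[4]), int(cols[5])  # ICOLUP(1), ICOLUP(2)
            total += status * (colour - anticolour)

print("colour balance:", total)  # 0 means the colour tags balance globally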

HerrHorizontal commented 2 years ago

Hi @kirschen, @agrohsje, @Dominic-Stafford. Stefan observed that, in the LHE file, events with anti-coloured quarks and coloured anti-quarks lead to clusters of two quarks or two anti-quarks. This is of course nonsense, and Herwig fails at this point. The issue is therefore in the ME-provider code, which produces globally unsuspicious events but assigns wrong colours to individual partons; we overlooked this in the first check.

Dominic-Stafford commented 2 years ago

Hi @HerrHorizontal, thanks for getting to the bottom of this. Will you or Stefan report this issue to the MadGraph authors so they can produce a fix? If not, I can do this if you send me the full details of an example. For this specific request, if we don't want to wait for a fix from the MadGraph authors, I guess we could write a script that goes through the LHE and deletes the offending events before Herwig runs over them, unless anyone has any better suggestions?

HerrHorizontal commented 2 years ago

Hey @Dominic-Stafford, Stefan proposes that someone in CMS contact the MadGraph authors with details of the MG version used and the other CMS-specific settings.

To identify the problematic events, I guess you could just check the sign of the PID of the quark in the event file and see if it has an associated colour or anti-colour. E.g. in this event:

 5      1 +5.1051472e+03 9.62799900e+01 7.54677100e-03 1.17041500e-01
       21 -1    0    0  501  503 +0.0000000000e+00 +0.0000000000e+00 +1.2926751321e+02 1.2926751321e+02 0.0000000000e+00 0.0000e+00 1.0000e+00
        5 -1    0    0  502    0 -0.0000000000e+00 -0.0000000000e+00 -2.2829382665e+02 2.2829382665e+02 0.0000000000e+00 0.0000e+00 -1.0000e+00
       -2  1    1    2  501    0 -5.8761279817e+01 -8.0882308940e+01 -1.8760620385e+02 2.1258156935e+02 0.0000000000e+00 0.0000e+00 -1.0000e+00
        5  1    1    2  502    0 +2.1251537327e+01 +7.2317232530e+01 +6.7099010829e+01 1.0091425674e+02 0.0000000000e+00 0.0000e+00 1.0000e+00
        2  1    1    2    0  503 +3.7509742490e+01 +8.5650764094e+00 +2.1480879586e+01 4.4065513765e+01 0.0000000000e+00 0.0000e+00 -1.0000e+00

Quark 5 has a colour (502) assigned, as expected, but anti-quark -2 carries a colour (501) instead of an anti-colour, and quark 2 carries an anti-colour (503) instead of a colour. So this event will be problematic.
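
To make that concrete, here is a hedged sketch of the check (and of the event-filtering idea discussed above), assuming the standard LHE particle-line layout used in the event shown; the script and file names are hypothetical:

import re

QUARK_PIDS = set(range(1, 7))  # d, u, s, c, b, t

def event_is_suspicious(event_text):
    for line in event_text.splitlines():
        cols = line.split()
        if len(cols) != 13:
            continue  # skip the <event> tags and the event header line
        try:
            pid = int(cols[0])
            colour, anticolour = int(cols[4]), int(cols[5])
        except ValueError:
            continue
        if abs(pid) not in QUARK_PIDS:
            continue
        # A quark should carry only a colour tag and an anti-quark only an
        # anti-colour tag; the broken events violate exactly this.
        if pid > 0 and anticolour != 0:
            return True
        if pid < 0 and colour != 0:
            return True
    return False

text = open("cmsgrid_final.lhe").read()
events = re.findall(r"<event>.*?</event>", text, flags=re.DOTALL)
bad = [i for i, ev in enumerate(events) if event_is_suspicious(ev)]
print(len(bad), "suspicious events:", bad)

(Whether simply dropping the flagged events is safe is a separate question; see the comment below about biasing the sample.)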

HerrHorizontal commented 2 years ago

@Dominic-Stafford concerning your question of whether it is a good idea to filter these kinds of events: I would say it is only a good idea if the issue is uniformly distributed across all initial- and final-state flavours and phase space, or is statistically insignificant. Otherwise you might end up with a sample that is biased at generation level.

theofil commented 2 years ago

Just out of curiosity, how are the same LHE events handled when they are passed to other hadronizers, e.g. Pythia? Do they also result in a crash, as they do with Herwig?

Dominic-Stafford commented 2 years ago

OK, I can submit this issue on the MadGraph Launchpad, unless you would like to do this. The issue is seemingly fairly rare (occurring for maybe 1 in 500 events), but I'll also ask the MadGraph authors whether they think filtering will have a biasing effect before implementing it.

HerrHorizontal commented 2 years ago

> OK, I can submit this issue on the MadGraph Launchpad, unless you would like to do this. The issue is seemingly fairly rare (occurring for maybe 1 in 500 events), but I'll also ask the MadGraph authors whether they think filtering will have a biasing effect before implementing it.

That would be nice, thanks. I think you have better knowledge of all the MadGraph subtleties used in CMS. Could you link my GitHub account to the issue, so I get notified when they answer?

Dominic-Stafford commented 2 years ago

Sorry, I'm not sure how to add you to a launchpad issue, but it's here and I think you can sign up for email updates if you like: https://bugs.launchpad.net/mg5amcnlo/+bug/1975733

gqlcms commented 2 years ago

Hi @Dominic-Stafford,

Thank you very much for fixing this! May I ask if you have a patch for our CMS 2.6.5 version?

Thanks

Dominic-Stafford commented 2 years ago

Hi @gqlcms

Yes, I've created a PR to genproductions with the patch here: https://github.com/cms-sw/genproductions/pull/3193

Dominic-Stafford commented 2 years ago

This patch is now merged, so if you create new gridpacks with the latest version of the genproductions repository, you should no longer encounter this issue.

makortel commented 1 month ago

@cms-sw/generators-l2 Can this issue be signed and closed?

cmsbuild commented 1 month ago

cms-bot internal usage