Open iarspider opened 9 months ago
A new Issue was created by @iarspider .
@smuzaffar, @makortel, @rappoccio, @antoniovilela, @Dr15Jones, @sextonkennedy can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
assign GeneratorInterface/Pythia6Interface
New categories assigned: generators
@alberto-sanchez,@bbilin,@GurpreetSinghChahal,@mkirsano,@menglu21,@SiewYan you have been requested to review this Pull request/Issue and eventually sign? Thanks
The problem is LHAPDF uses thread locals to store its 'cached' values. In particular there is a container ACTIVESETS
defined in LHAGlue.cc
/// Collection of active sets
static thread_local map<int, PDFSetHandler> ACTIVESETS;
In our code, this gets initialized at begin LuminosityBlock by the call to pdfset_
via the call stack
#0 0x00007fffc09a6b20 in pdfset_ () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-04-1100/external/el8_amd64_gcc12/lib/libLHAPDF.so
#1 0x00007fffc288b4a6 in pyinit (frame=..., beam=..., target=..., win=8000, _frame=3, _beam=1, _target=1) at pyinit.f:115
#2 0x00007fffc2c63aaf in gen::Pythia6Hadronizer::initializeForInternalPartons() ()
from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-04-1100/lib/el8_amd64_gcc12/pluginGeneratorInterfacePythia6Filters.so
#3 0x00007fffc2c64a60 in edm::GeneratorFilter<gen::Pythia6Hadronizer, gen::ExternalDecayDriver>::beginLuminosityBlockProduce(edm::LuminosityBlock&, edm::EventSetup const&) ()
from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-04-1100/lib/el8_amd64_gcc12/pluginGeneratorInterfacePythia6Filters.so
In the job I was running, this happened on thread 5.
Then later in the job, a call to PDFSetHandler::member(int)
is made on Thread 1 which leads to the exception being thrown with the call stack
#0 0x00007ffff5bc92f1 in __cxxabiv1::__cxa_throw (obj=0x7ffff0de36a0, tinfo=0x7fffc09ca818 <typeinfo for LHAPDF::ReadError>, dest=0x7fffc0927a90 <LHAPDF::ReadError::~ReadError()>) at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:81
#1 0x00007fffc0909425 in LHAPDF::PDFSet::PDFSet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) [clone .cold] ()
from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-04-1100/external/el8_amd64_gcc12/lib/libLHAPDF.so
#2 0x00007fffc0940cef in LHAPDF::getPDFSet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-04-1100/external/el8_amd64_gcc12/lib/libLHAPDF.so
#3 0x00007fffc094214d in LHAPDF::mkPDF(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long) ()
from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-04-1100/external/el8_amd64_gcc12/lib/libLHAPDF.so
#4 0x00007fffc09a0aed in (anonymous namespace)::PDFSetHandler::loadMember(int) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-04-1100/external/el8_amd64_gcc12/lib/libLHAPDF.so
#5 0x00007fffc09a0c56 in (anonymous namespace)::PDFSetHandler::member(int) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-04-1100/external/el8_amd64_gcc12/lib/libLHAPDF.so
#6 0x00007fffc09a3abb in structm_ () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-04-1100/external/el8_amd64_gcc12/lib/libLHAPDF.so
#7 0x00007fffc28bd1cf in pypdfu (kf=2212, x=<optimized out>, q2=<optimized out>, xpq=...) at pypdfu.f:356
#8 0x00007fffc290553c in pysigh (nchn=0, sigs=0) at pysigh.f:420
#9 0x00007fffc28d4e7a in pyrand () at pyrand.f:1005
#10 0x00007fffc287ea1b in pyevnw () at pyevnw.f:111
#11 0x00007fffc2c6310a in gen::Pythia6Hadronizer::generatePartonsAndHadronize() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-04-1100/lib/el8_amd64_gcc12/pluginGeneratorInterfacePythia6Filters.so
#12 0x00007fffc2c66a01 in edm::GeneratorFilter<gen::Pythia6Hadronizer, gen::ExternalDecayDriver>::filter(edm::Event&, edm::EventSetup const&) ()
from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-04-1100/lib/el8_amd64_gcc12/pluginGeneratorInterfacePythia6Filters.so
The problem is, the call to structm_
assumes that the ACTIVESETS
was already initialized on this thread but we only call pyinit
once on the thread that happens to call begin LuminosityBlock on the generator module.
Tagging a related LHAPDF update https://github.com/cms-sw/cmsdist/pull/8852
This was most likely caused by https://github.com/cms-sw/cmsdist/pull/8852
So it looks like the FORTRAN adapter layer for LHAPDF assumes all FORTRAN calls will occur on one and only one thread (or at least all FORTRAN calls will happen independently on each thread).
I can confirm that the change from LHAPDF 6.4.0 to 6.5.4 is the reason. Between those two version there was the change from
static map<int, PDFSetHandler> ACTIVESETS;
to
static thread_local map<int, PDFSetHandler> ACTIVESETS;
Probably the only safe change would be to modify GeneratorFilter
so it uses its own thread for all calls to FORTRAN. This is similar to how we handed the problems with Geant4's thread local use.
Would it be feasible to give feedback that this change didn't work for us?
I can undertake to communicate the problem to the authors
@mkirsano , should we revert the lhapdf change ?
I am trying to check if we can revert only lhapdf library.
I checked that it is possible to use new pdf sets with lhapdf 6.4.0 (checked on pythia8, Generator step, 1 thread). So I suggest to try to move back to lhapdf 6.4.0 and have only sets upgraded for the moment.
https://github.com/cms-sw/cmsdist/pull/8869 reverts LHAPDF to 6.4.0 while keeping the pdf sets to 6.5.1
The problem is reported to lhapdf-support. Will write also to the main developer.
This is the answer from lhapdf developers:
The initialization needs to be called explicitly on each thread: they cannot inherit their memory from the calling process, as shared memory is open to run conditions and cross-thread corruptions. So where you say that you only call pyinit once, you should change the calling code to do so thread-locally! Prof Andy Buckley MSc Convener, School of Physics & Astronomy Particle Physics Experiment Group, University of Glasgow, UK
Thanks @mkirsano, sounds like the way forward is then what @Dr15Jones suggested in https://github.com/cms-sw/cmssw/issues/43526#issuecomment-1847467513.
(and for what it's worth, from our standpoint this is a terrible way to handle multithreading)
Workflow 561.0 crashed in CMSSW_14_0_X_2023-12-13-2300 on el9_amd64_gcc12
Thread 1 (Thread 0x148801d8c640 (LWP 1351141) "cmsRun"):
#3 0x00001487fe3e17b0 in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02815/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/lib/el9_amd64_gcc12/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x00001487c690d31c in LHAPDF::LogBicubicInterpolator::_interpolateXQ2(LHAPDF::KnotArray const&, double, unsigned long, double, unsigned long, int) const () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02815/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el9_amd64_gcc12/lib/libLHAPDF.so
#6 0x00001487c68f92ca in LHAPDF::PDF::xfxQ2(int, double, double) const () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02815/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el9_amd64_gcc12/lib/libLHAPDF.so
#7 0x00001487aae34879 in Pythia8::LHAPDF6::xfUpdate(int, double, double) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02815/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el9_amd64_gcc12/lib/libpythia8lhapdf6.so
#8 0x00001487c6f213b9 in Pythia8::LHAPDF::xfUpdate(int, double, double) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02815/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el9_amd64_gcc12/lib/libpythia8.so
#9 0x00001487c6f24bae in Pythia8::PDF::xf(int, double, double) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02815/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el9_amd64_gcc12/lib/libpythia8.so
#10 0x00001487c6ee9ce4 in Pythia8::MultipartonInteractions::upperEnvelope() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02815/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el9_amd64_gcc12/lib/libpythia8.so
#11 0x00001487c6ef9ace in Pythia8::MultipartonInteractions::init(bool, int, Pythia8::BeamParticle*, Pythia8::BeamParticle*, std::shared_ptr<Pythia8::PartonVertex>, bool) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02815/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el9_amd64_gcc12/lib/libpythia8.so
#12 0x00001487c6f3b961 in Pythia8::PartonLevel::init(std::shared_ptr<Pythia8::TimeShower>, std::shared_ptr<Pythia8::TimeShower>, std::shared_ptr<Pythia8::SpaceShower>, Pythia8::RHadrons*, std::shared_ptr<Pythia8::MergingHooks>, std::shared_ptr<Pythia8::PartonVertex>, std::shared_ptr<Pythia8::StringInteractions>, bool) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02815/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el9_amd64_gcc12/lib/libpythia8.so
#13 0x00001487c6fc6ef4 in Pythia8::Pythia::init() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02815/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el9_amd64_gcc12/lib/libpythia8.so
#14 0x00001487c87a56f1 in Pythia8Hadronizer::initializeForInternalPartons() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02815/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/lib/el9_amd64_gcc12/pluginGeneratorInterfacePythia8Filters.so
#15 0x00001487c87c3917 in edm::ConcurrentGeneratorFilter<Pythia8Hadronizer, gen::ConcurrentExternalDecayDriver>::initLumi(edm::gen::GenStreamCache<Pythia8Hadronizer, gen::ConcurrentExternalDecayDriver>*, edm::LuminosityBlock const&, edm::EventSetup const&) const [clone .constprop.0] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02815/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/lib/el9_amd64_gcc12/pluginGeneratorInterfacePythia8Filters.so
#16 0x00001487c87be701 in edm::ConcurrentGeneratorFilter<Pythia8Hadronizer, gen::ConcurrentExternalDecayDriver>::globalBeginLuminosityBlockProduce(edm::LuminosityBlock&, edm::EventSetup const&) const () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02815/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/lib/el9_amd64_gcc12/pluginGeneratorInterfacePythia8Filters.so
#17 0x000014880444e2cd in edm::global::EDFilterBase::doBeginLuminosityBlock(edm::LumiTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02815/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/lib/el9_amd64_gcc12/libFWCoreFramework.so
Current Modules:
Module: Pythia8ConcurrentGeneratorFilter:generator (crashed)
Module: none
Module: none
Module: none
I'm not sure if it is related to this issue, or (way) earlier crashes https://github.com/cms-sw/cmssw/issues/35082 https://github.com/cms-sw/cmssw/issues/35251
Since CMSSW_14_0_X_2023-12-04-1100 RelVal 132.0 fails for all flavors/architectures:
Indeed, the code tries to open files like
/_0000.dat
,/.info
or even""
(empty string). This is the origin of one such call (extracted with strace -f -k`: https://gist.github.com/iarspider/fdd17a07202e6ff4bb884f3751e37cd5This seems to be a threading issue: running this relVal with single thread doesn't generate an exception, while running with 2+ threads does.