cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0
1.07k stars 4.29k forks source link

RelVal 132.0: (PDF) Info file not found #43526

Open iarspider opened 9 months ago

iarspider commented 9 months ago

Since CMSSW_14_0_X_2023-12-04-1100 RelVal 132.0 fails for all flavors/architectures:

----- Begin Fatal Exception 04-Dec-2023 16:45:52 CET-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 1 stream: 2
   [1] Running path 'simulation_step'
   [2] Calling method for module Pythia6GeneratorFilter/'generator'
Exception Message:
A std::exception was thrown.
Info file not found for PDF set ''
----- End Fatal Exception -------------------------------------------------

Indeed, the code tries to open files like /_0000.dat, /.info or even "" (empty string). This is the origin of one such call (extracted with strace -f -k`: https://gist.github.com/iarspider/fdd17a07202e6ff4bb884f3751e37cd5

This seems to be a threading issue: running this relVal with single thread doesn't generate an exception, while running with 2+ threads does.

cmsbuild commented 9 months ago

A new Issue was created by @iarspider .

@smuzaffar, @makortel, @rappoccio, @antoniovilela, @Dr15Jones, @sextonkennedy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

iarspider commented 9 months ago

assign GeneratorInterface/Pythia6Interface

cmsbuild commented 9 months ago

New categories assigned: generators

@alberto-sanchez,@bbilin,@GurpreetSinghChahal,@mkirsano,@menglu21,@SiewYan you have been requested to review this Pull request/Issue and eventually sign? Thanks

Dr15Jones commented 9 months ago

The problem is LHAPDF uses thread locals to store its 'cached' values. In particular there is a container ACTIVESETS defined in LHAGlue.cc

 /// Collection of active sets
  static thread_local map<int, PDFSetHandler> ACTIVESETS;

In our code, this gets initialized at begin LuminosityBlock by the call to pdfset_ via the call stack

#0  0x00007fffc09a6b20 in pdfset_ () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-04-1100/external/el8_amd64_gcc12/lib/libLHAPDF.so
#1  0x00007fffc288b4a6 in pyinit (frame=..., beam=..., target=..., win=8000, _frame=3, _beam=1, _target=1) at pyinit.f:115
#2  0x00007fffc2c63aaf in gen::Pythia6Hadronizer::initializeForInternalPartons() ()
   from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-04-1100/lib/el8_amd64_gcc12/pluginGeneratorInterfacePythia6Filters.so
#3  0x00007fffc2c64a60 in edm::GeneratorFilter<gen::Pythia6Hadronizer, gen::ExternalDecayDriver>::beginLuminosityBlockProduce(edm::LuminosityBlock&, edm::EventSetup const&) ()
   from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-04-1100/lib/el8_amd64_gcc12/pluginGeneratorInterfacePythia6Filters.so

In the job I was running, this happened on thread 5.

Then later in the job, a call to PDFSetHandler::member(int) is made on Thread 1 which leads to the exception being thrown with the call stack

#0  0x00007ffff5bc92f1 in __cxxabiv1::__cxa_throw (obj=0x7ffff0de36a0, tinfo=0x7fffc09ca818 <typeinfo for LHAPDF::ReadError>, dest=0x7fffc0927a90 <LHAPDF::ReadError::~ReadError()>) at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:81
#1  0x00007fffc0909425 in LHAPDF::PDFSet::PDFSet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) [clone .cold] ()
   from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-04-1100/external/el8_amd64_gcc12/lib/libLHAPDF.so
#2  0x00007fffc0940cef in LHAPDF::getPDFSet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
   from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-04-1100/external/el8_amd64_gcc12/lib/libLHAPDF.so
#3  0x00007fffc094214d in LHAPDF::mkPDF(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long) ()
   from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-04-1100/external/el8_amd64_gcc12/lib/libLHAPDF.so
#4  0x00007fffc09a0aed in (anonymous namespace)::PDFSetHandler::loadMember(int) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-04-1100/external/el8_amd64_gcc12/lib/libLHAPDF.so
#5  0x00007fffc09a0c56 in (anonymous namespace)::PDFSetHandler::member(int) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-04-1100/external/el8_amd64_gcc12/lib/libLHAPDF.so
#6  0x00007fffc09a3abb in structm_ () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-04-1100/external/el8_amd64_gcc12/lib/libLHAPDF.so
#7  0x00007fffc28bd1cf in pypdfu (kf=2212, x=<optimized out>, q2=<optimized out>, xpq=...) at pypdfu.f:356
#8  0x00007fffc290553c in pysigh (nchn=0, sigs=0) at pysigh.f:420
#9  0x00007fffc28d4e7a in pyrand () at pyrand.f:1005
#10 0x00007fffc287ea1b in pyevnw () at pyevnw.f:111
#11 0x00007fffc2c6310a in gen::Pythia6Hadronizer::generatePartonsAndHadronize() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-04-1100/lib/el8_amd64_gcc12/pluginGeneratorInterfacePythia6Filters.so
#12 0x00007fffc2c66a01 in edm::GeneratorFilter<gen::Pythia6Hadronizer, gen::ExternalDecayDriver>::filter(edm::Event&, edm::EventSetup const&) ()
   from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-04-1100/lib/el8_amd64_gcc12/pluginGeneratorInterfacePythia6Filters.so

The problem is, the call to structm_ assumes that the ACTIVESETS was already initialized on this thread but we only call pyinit once on the thread that happens to call begin LuminosityBlock on the generator module.

makortel commented 9 months ago

Tagging a related LHAPDF update https://github.com/cms-sw/cmsdist/pull/8852

Dr15Jones commented 9 months ago

This was most likely caused by https://github.com/cms-sw/cmsdist/pull/8852

Dr15Jones commented 9 months ago

So it looks like the FORTRAN adapter layer for LHAPDF assumes all FORTRAN calls will occur on one and only one thread (or at least all FORTRAN calls will happen independently on each thread).

Dr15Jones commented 9 months ago

I can confirm that the change from LHAPDF 6.4.0 to 6.5.4 is the reason. Between those two version there was the change from

 static map<int, PDFSetHandler> ACTIVESETS;

to

static thread_local map<int, PDFSetHandler> ACTIVESETS;
Dr15Jones commented 9 months ago

Probably the only safe change would be to modify GeneratorFilter so it uses its own thread for all calls to FORTRAN. This is similar to how we handed the problems with Geant4's thread local use.

makortel commented 9 months ago

Would it be feasible to give feedback that this change didn't work for us?

mkirsano commented 9 months ago

I can undertake to communicate the problem to the authors

smuzaffar commented 9 months ago

@mkirsano , should we revert the lhapdf change ?

mkirsano commented 9 months ago

I am trying to check if we can revert only lhapdf library.

mkirsano commented 9 months ago

I checked that it is possible to use new pdf sets with lhapdf 6.4.0 (checked on pythia8, Generator step, 1 thread). So I suggest to try to move back to lhapdf 6.4.0 and have only sets upgraded for the moment.

smuzaffar commented 9 months ago

https://github.com/cms-sw/cmsdist/pull/8869 reverts LHAPDF to 6.4.0 while keeping the pdf sets to 6.5.1

mkirsano commented 9 months ago

The problem is reported to lhapdf-support. Will write also to the main developer.

mkirsano commented 9 months ago

This is the answer from lhapdf developers:

The initialization needs to be called explicitly on each thread: they cannot inherit their memory from the calling process, as shared memory is open to run conditions and cross-thread corruptions. So where you say that you only call pyinit once, you should change the calling code to do so thread-locally! Prof Andy Buckley MSc Convener, School of Physics & Astronomy Particle Physics Experiment Group, University of Glasgow, UK

makortel commented 9 months ago

Thanks @mkirsano, sounds like the way forward is then what @Dr15Jones suggested in https://github.com/cms-sw/cmssw/issues/43526#issuecomment-1847467513.

(and for what it's worth, from our standpoint this is a terrible way to handle multithreading)

makortel commented 9 months ago

Workflow 561.0 crashed in CMSSW_14_0_X_2023-12-13-2300 on el9_amd64_gcc12

Thread 1 (Thread 0x148801d8c640 (LWP 1351141) "cmsRun"):
#3  0x00001487fe3e17b0 in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02815/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/lib/el9_amd64_gcc12/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00001487c690d31c in LHAPDF::LogBicubicInterpolator::_interpolateXQ2(LHAPDF::KnotArray const&, double, unsigned long, double, unsigned long, int) const () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02815/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el9_amd64_gcc12/lib/libLHAPDF.so
#6  0x00001487c68f92ca in LHAPDF::PDF::xfxQ2(int, double, double) const () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02815/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el9_amd64_gcc12/lib/libLHAPDF.so
#7  0x00001487aae34879 in Pythia8::LHAPDF6::xfUpdate(int, double, double) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02815/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el9_amd64_gcc12/lib/libpythia8lhapdf6.so
#8  0x00001487c6f213b9 in Pythia8::LHAPDF::xfUpdate(int, double, double) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02815/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el9_amd64_gcc12/lib/libpythia8.so
#9  0x00001487c6f24bae in Pythia8::PDF::xf(int, double, double) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02815/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el9_amd64_gcc12/lib/libpythia8.so
#10 0x00001487c6ee9ce4 in Pythia8::MultipartonInteractions::upperEnvelope() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02815/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el9_amd64_gcc12/lib/libpythia8.so
#11 0x00001487c6ef9ace in Pythia8::MultipartonInteractions::init(bool, int, Pythia8::BeamParticle*, Pythia8::BeamParticle*, std::shared_ptr<Pythia8::PartonVertex>, bool) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02815/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el9_amd64_gcc12/lib/libpythia8.so
#12 0x00001487c6f3b961 in Pythia8::PartonLevel::init(std::shared_ptr<Pythia8::TimeShower>, std::shared_ptr<Pythia8::TimeShower>, std::shared_ptr<Pythia8::SpaceShower>, Pythia8::RHadrons*, std::shared_ptr<Pythia8::MergingHooks>, std::shared_ptr<Pythia8::PartonVertex>, std::shared_ptr<Pythia8::StringInteractions>, bool) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02815/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el9_amd64_gcc12/lib/libpythia8.so
#13 0x00001487c6fc6ef4 in Pythia8::Pythia::init() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02815/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el9_amd64_gcc12/lib/libpythia8.so
#14 0x00001487c87a56f1 in Pythia8Hadronizer::initializeForInternalPartons() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02815/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/lib/el9_amd64_gcc12/pluginGeneratorInterfacePythia8Filters.so
#15 0x00001487c87c3917 in edm::ConcurrentGeneratorFilter<Pythia8Hadronizer, gen::ConcurrentExternalDecayDriver>::initLumi(edm::gen::GenStreamCache<Pythia8Hadronizer, gen::ConcurrentExternalDecayDriver>*, edm::LuminosityBlock const&, edm::EventSetup const&) const [clone .constprop.0] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02815/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/lib/el9_amd64_gcc12/pluginGeneratorInterfacePythia8Filters.so
#16 0x00001487c87be701 in edm::ConcurrentGeneratorFilter<Pythia8Hadronizer, gen::ConcurrentExternalDecayDriver>::globalBeginLuminosityBlockProduce(edm::LuminosityBlock&, edm::EventSetup const&) const () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02815/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/lib/el9_amd64_gcc12/pluginGeneratorInterfacePythia8Filters.so
#17 0x000014880444e2cd in edm::global::EDFilterBase::doBeginLuminosityBlock(edm::LumiTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02815/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/lib/el9_amd64_gcc12/libFWCoreFramework.so

Current Modules:
Module: Pythia8ConcurrentGeneratorFilter:generator (crashed)
Module: none
Module: none
Module: none

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/el9_amd64_gcc12/CMSSW_14_0_X_2023-12-13-2300/pyRelValMatrixLogs/run/561.0_WprimeTolNu_M3000_13TeV_pythia8/step1_WprimeTolNu_M3000_13TeV_pythia8.log#/

I'm not sure if it is related to this issue, or (way) earlier crashes https://github.com/cms-sw/cmssw/issues/35082 https://github.com/cms-sw/cmssw/issues/35251