cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0
1.05k stars 4.23k forks source link

Infinite recursion seen in TFile::Open #44997

Open Dr15Jones opened 2 weeks ago

Dr15Jones commented 2 weeks ago

A production job at RAL exceeded 70GB of memory and was killed by the site. A subsequent change to stop 'lazy download' appears to have avoided the problem.

Looking at the log file from one failed job, the trace back shows all threads but 1 waiting, where the only active thread looks like

14-May-2024 07:55:48 UTC  Initiating request to open file root://xrootd.echo.stfc.ac.uk//store/mc/RunIISummer20ULPrePremix/Neutrino_E-10_gun/PREMIX/UL18_106X_upgrade2018_realistic_v11_L1v1-v2/230003/012B7EF9-8FD3-6340-9AFF-5DD7830C2C60.root

A fatal system signal has occurred: external termination request
The following is the call stack containing the origin of the signal.

...

#4  <signal handler called>
#5  0x000014c44b7faaf5 in Matchs(char const*, unsigned long, unsigned short const*, char const**) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw-patch/CMSSW_10_6_17_patch1/external/slc7_amd64_gcc700/lib/libCore.so
#6  0x000014c44b84f83e in TRegexp::Index(TString const&, int*, int) const () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw-patch/CMSSW_10_6_17_patch1/external/slc7_amd64_gcc700/lib/libCore.so
#7  0x000014c44b84f8dc in TString::Index(TRegexp const&, int) const () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw-patch/CMSSW_10_6_17_patch1/external/slc7_amd64_gcc700/lib/libCore.so
#8  0x000014c44b8407a4 in TPluginHandler::CanHandle(char const*, char const*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw-patch/CMSSW_10_6_17_patch1/external/slc7_amd64_gcc700/lib/libCore.so
#9  0x000014c44b840886 in TPluginManager::FindHandler(char const*, char const*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw-patch/CMSSW_10_6_17_patch1/external/slc7_amd64_gcc700/lib/libCore.so
#10 0x000014c44bd27c2d in TFile::Open(char const*, char const*, char const*, int, int) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw-patch/CMSSW_10_6_17_patch1/external/slc7_amd64_gcc700/lib/libRIO.so
#11 0x000014c44bd28997 in TFile::Open(char const*, char const*, char const*, int, int) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw-patch/CMSSW_10_6_17_patch1/external/slc7_amd64_gcc700/lib/libRIO.so
#12 0x000014c44bd28997 in TFile::Open(char const*, char const*, char const*, int, int) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw-patch/CMSSW_10_6_17_patch1/external/slc7_amd64_gcc700/lib/libRIO.so
#13 0x000014c44bd28997 in TFile::Open(char const*, char const*, char const*, int, int) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw-patch/CMSSW_10_6_17_patch1/external/slc7_amd64_gcc700/lib/libRIO.so
...
#12267140 0x000014c44bd28997 in TFile::Open(char const*, char const*, char const*, int, int) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw-patch/CMSSW_10_6_17_patch1/external/slc7_amd64_gcc700/lib/libRIO.so
#12267141 0x000014c44bd28997 in TFile::Open(char const*, char const*, char const*, int, int) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw-patch/CMSSW_10_6_17_patch1/external/slc7_amd64_gcc700/lib/libRIO.so
#12267142 0x000014c44bd28997 in TFile::Open(char const*, char const*, char const*, int, int) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw-patch/CMSSW_10_6_17_patch1/external/slc7_amd64_gcc700/lib/libRIO.so
#12267143 0x000014c44bd28997 in TFile::Open(char const*, char const*, char const*, int, int) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw-patch/CMSSW_10_6_17_patch1/external/slc7_amd64_gcc700/lib/libRIO.so

note that the debugger just gave up after going 12 million deep in the stack (so the stack was likely larger).

This job was using CMSSW_10_6_17_patch1 and running

cmsbuild commented 2 weeks ago

cms-bot internal usage

cmsbuild commented 2 weeks ago

A new Issue was created by @Dr15Jones.

@makortel, @sextonkennedy, @smuzaffar, @antoniovilela, @Dr15Jones, @rappoccio can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

Dr15Jones commented 2 weeks ago

The full log information can be found

https://cernbox.cern.ch/s/CQvjzkONaprof65 you should be able to find the WMTaskSpace dir in the vocms0253.cern.ch-283595-1 tarball.

Dr15Jones commented 2 weeks ago

assign root

makortel commented 2 weeks ago

assign core

makortel commented 2 weeks ago

type root

makortel commented 2 weeks ago

@pcanal Please take a look

cmsbuild commented 2 weeks ago

New categories assigned: core

@Dr15Jones,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

Dr15Jones commented 2 weeks ago

As a guess, I'd say the recursion is from this line

https://github.com/root-project/root/blob/88c9a5e76f7e118e7c8a3edc1eb9cf6128b5e0a0/io/io/src/TFile.cxx#L4289

Dr15Jones commented 2 weeks ago

The file that appears to be being opened is a premix file. As far as I can tell, the Embedded root source (used by the mixing module) does NOT make use of the lazy-download system as that code does not call storage::StorageFactory::get()->stagein(...)

Dr15Jones commented 2 weeks ago

The recursion could also happen here:

https://github.com/root-project/root/blob/88c9a5e76f7e118e7c8a3edc1eb9cf6128b5e0a0/io/io/src/TFile.cxx#L4294-L4299

as TNetXNGFile does have the ability to return a non blank GetNewUrl.

pcanal commented 2 weeks ago

Indeed both places are missing a test that the new attempt is different from the current ...

pcanal commented 1 week ago

See https://github.com/root-project/root/issues/15590 and https://github.com/root-project/root/pull/15591

makortel commented 1 week ago

Do we want to backport the fix, and if so, how far back? 10_6_X (where the issue occurred) might be too far back, but maybe 14_0_X would make sense?