Open mpresill opened 1 month ago
cms-bot internal usage
A new Issue was created by @mpresill.
@makortel, @Dr15Jones, @sextonkennedy, @antoniovilela, @rappoccio, @smuzaffar can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
assign core
New categories assigned: core
@Dr15Jones,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks
I looked at the log of the failure whose exception was in the issue description. The exception is reported after the input file was closed, and the job ultimately dies in a segfault with the following stack trace
Thread 1 (Thread 0x153d50106640 (LWP 1030) "cmsRun"):
#3 0x0000153d4a5c2720 in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x0000153d50937a63 in TDirectoryFile::WriteKeys() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libRIO.so
#6 0x0000153d509348bc in TDirectoryFile::SaveSelf(bool) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libRIO.so
#7 0x0000153d50932d27 in TDirectoryFile::Save() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libRIO.so
#8 0x0000153d50932bfc in TDirectoryFile::Close(char const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libRIO.so
#9 0x0000153d509506d6 in TFile::Close(char const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libRIO.so
#10 0x0000153d4a4d013e in TStorageFactoryFile::~TStorageFactoryFile() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libIOPoolTFileAdaptor.so
#11 0x0000153d4a4d0169 in TStorageFactoryFile::~TStorageFactoryFile() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libIOPoolTFileAdaptor.so
#12 0x0000153cab9c0a97 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libIOPoolOutput.so
#13 0x0000153cab9c38e5 in edm::RootOutputFile::~RootOutputFile() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libIOPoolOutput.so
#14 0x0000153cab9c4deb in edm::PoolOutputModule::~PoolOutputModule() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libIOPoolOutput.so
#15 0x0000153cab9c51b8 in virtual thunk to edm::PoolOutputModule::~PoolOutputModule() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libIOPoolOutput.so
#16 0x0000153cab9fa6f3 in std::_Sp_counted_deleter<edm::one::OutputModuleBase*, std::default_delete<edm::one::OutputModuleBase>, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/pluginIOPoolOutputPlugins.so
#17 0x0000153cab9fa811 in std::_Sp_counted_ptr_inplace<edm::maker::ModuleHolderT<edm::one::OutputModuleBase>, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/pluginIOPoolOutputPlugins.so
#18 0x0000153d5132a2a7 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libFWCoreFramework.so
#19 0x0000153d513eccb2 in std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, edm::propagate_const<std::shared_ptr<edm::maker::ModuleHolder> > >, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, edm::propagate_const<std::shared_ptr<edm::maker::ModuleHolder> > > >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, edm::propagate_const<std::shared_ptr<edm::maker::ModuleHolder> > > > >::_M_erase(std::_Rb_tree_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, edm::propagate_const<std::shared_ptr<edm::maker::ModuleHolder> > > >*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libFWCoreFramework.so
#20 0x0000153d513f6d10 in std::_Sp_counted_ptr_inplace<edm::ModuleRegistry, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libFWCoreFramework.so
#21 0x0000153d5132a2a7 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libFWCoreFramework.so
#22 0x0000153d51354749 in edm::Schedule::~Schedule() [clone .lto_priv.0] () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libFWCoreFramework.so
#23 0x0000153d51346df7 in edm::EventProcessor::~EventProcessor() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libFWCoreFramework.so
#24 0x0000000000405891 in (anonymous namespace)::EventProcessorWithSentry::~EventProcessorWithSentry() ()
#25 0x0000000000405199 in main ()
Symptoms are the same as in https://github.com/cms-sw/cmssw/issues/40132#issuecomment-1323831885 .
Was any of the jobs re-tried in Tier0?
FYI @pcanal
That is strange. It reports problem with the TFile
object itself. I.e. one potential cause is that the TFile
was already closed (or is being closed by another thread).
I looked at the log of the failure whose exception was in the issue description. The exception is reported after the input file was closed, and the job ultimately dies in a segfault with the following stack trace
Thread 1 (Thread 0x153d50106640 (LWP 1030) "cmsRun"): #3 0x0000153d4a5c2720 in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so #4 <signal handler called> #5 0x0000153d50937a63 in TDirectoryFile::WriteKeys() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libRIO.so #6 0x0000153d509348bc in TDirectoryFile::SaveSelf(bool) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libRIO.so #7 0x0000153d50932d27 in TDirectoryFile::Save() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libRIO.so #8 0x0000153d50932bfc in TDirectoryFile::Close(char const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libRIO.so #9 0x0000153d509506d6 in TFile::Close(char const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libRIO.so #10 0x0000153d4a4d013e in TStorageFactoryFile::~TStorageFactoryFile() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libIOPoolTFileAdaptor.so #11 0x0000153d4a4d0169 in TStorageFactoryFile::~TStorageFactoryFile() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libIOPoolTFileAdaptor.so #12 0x0000153cab9c0a97 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libIOPoolOutput.so #13 0x0000153cab9c38e5 in edm::RootOutputFile::~RootOutputFile() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libIOPoolOutput.so #14 0x0000153cab9c4deb in edm::PoolOutputModule::~PoolOutputModule() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libIOPoolOutput.so #15 0x0000153cab9c51b8 in virtual thunk to edm::PoolOutputModule::~PoolOutputModule() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libIOPoolOutput.so #16 0x0000153cab9fa6f3 in std::_Sp_counted_deleter<edm::one::OutputModuleBase*, std::default_delete<edm::one::OutputModuleBase>, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/pluginIOPoolOutputPlugins.so #17 0x0000153cab9fa811 in std::_Sp_counted_ptr_inplace<edm::maker::ModuleHolderT<edm::one::OutputModuleBase>, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/pluginIOPoolOutputPlugins.so #18 0x0000153d5132a2a7 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libFWCoreFramework.so #19 0x0000153d513eccb2 in std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, edm::propagate_const<std::shared_ptr<edm::maker::ModuleHolder> > >, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, edm::propagate_const<std::shared_ptr<edm::maker::ModuleHolder> > > >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, edm::propagate_const<std::shared_ptr<edm::maker::ModuleHolder> > > > >::_M_erase(std::_Rb_tree_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, edm::propagate_const<std::shared_ptr<edm::maker::ModuleHolder> > > >*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libFWCoreFramework.so #20 0x0000153d513f6d10 in std::_Sp_counted_ptr_inplace<edm::ModuleRegistry, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libFWCoreFramework.so #21 0x0000153d5132a2a7 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libFWCoreFramework.so #22 0x0000153d51354749 in edm::Schedule::~Schedule() [clone .lto_priv.0] () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libFWCoreFramework.so #23 0x0000153d51346df7 in edm::EventProcessor::~EventProcessor() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libFWCoreFramework.so #24 0x0000000000405891 in (anonymous namespace)::EventProcessorWithSentry::~EventProcessorWithSentry() () #25 0x0000000000405199 in main ()
Symptoms are the same as in #40132 (comment) .
Was any of the jobs re-tried in Tier0?
It was not retried yet. Should this be re-tried?
Was any of the jobs re-tried in Tier0?
It was not retried yet. Should this be re-tried?
Probably not worth it. I tested the job locally (with a local input file), and it fails in the same way.
I noticed many printouts from PileupJetIdProducer
in the log (being responsible of 767 MB of the 844 MB of the cmsRun1-stdout.log
). I opened a separate issue https://github.com/cms-sw/cmssw/issues/45099 about that.
FYI, there is new failed job for the same run and workflow _PromptReco_Run381067JetMET0 with same error.
/afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2024E/FatalRootError/ByteCountTooLarge/vocms013.cern.ch-3587397-3-log.tar.gz
Following https://github.com/cms-sw/cmssw/issues/40132#issuecomment-1933671795 I tested 14_0_7 with the backport of that (https://github.com/cms-sw/cmssw/issues/40132#issuecomment-2141532075), but the behavior was the same
01-Jun-2024 00:21:10 CEST Closed file file:e186cde3-4166-4946-9343-f0c908376153.root
----- Begin Fatal Exception 01-Jun-2024 00:21:20 CEST-----------------------
An exception of category 'FatalRootError' occurred while
[0] Calling EventProcessor::runToCompletion (which does almost everything after beginJob and before endJob)
Additional Info:
[a] Fatal Root Error: @SUB=TBufferFile::WriteByteCount
bytecount too large (more than 1073741822)
----- End Fatal Exception -------------------------------------------------
<cut>
Thread 1 (Thread 0x7f98df012680 (LWP 316619) "cmsRun"):
#3 0x00007f98dab6a720 in sig_dostack_then_abort () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x00007f98e1e0a48a in TDirectoryFile::WriteKeys (this=0x7f98caf31580) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/io/io/src/TDirectoryFile.cxx:2187
#6 0x00007f98e1e08228 in TDirectoryFile::SaveSelf (this=0x7f98caf31580, force=false) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/io/io/src/TDirectoryFile.cxx:1634
#7 0x00007f98e1e07c3c in TDirectoryFile::Save (this=0x7f98caf31580) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/io/io/src/TDirectoryFile.cxx:1552
#8 0x00007f98e1e03c59 in TDirectoryFile::Close (this=0x7f98caf31580, option=0x7f98da1a6a26 "") at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/io/io/src/TDirectoryFile.cxx:568
#9 0x00007f98e1e1d640 in TFile::Close (this=0x7f98caf31580, option=0x7f98da1a6a26 "") at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/io/io/src/TFile.cxx:970
#10 0x00007f98da1a413e in TStorageFactoryFile::~TStorageFactoryFile() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libIOPoolTFileAdaptor.so
#11 0x00007f98da1a4169 in TStorageFactoryFile::~TStorageFactoryFile() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libIOPoolTFileAdaptor.so
#12 0x00007f9840175a97 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libIOPoolOutput.so
#13 0x00007f98401788e5 in edm::RootOutputFile::~RootOutputFile() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libIOPoolOutput.so
#14 0x00007f9840179deb in edm::PoolOutputModule::~PoolOutputModule() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libIOPoolOutput.so
#15 0x00007f984017a1b8 in virtual thunk to edm::PoolOutputModule::~PoolOutputModule() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libIOPoolOutput.so
#16 0x00007f98401af6f3 in std::_Sp_counted_deleter<edm::one::OutputModuleBase*, std::default_delete<edm::one::OutputModuleBase>, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/pluginIOPoolOutputPlugins.so
#17 0x00007f98401af811 in std::_Sp_counted_ptr_inplace<edm::maker::ModuleHolderT<edm::one::OutputModuleBase>, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/pluginIOPoolOutputPlugins.so
#18 0x00007f98e2bc72a7 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libFWCoreFramework.so
#19 0x00007f98e2c89cb2 in std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, edm::propagate_const<std::shared_ptr<edm::maker::ModuleHolder> > >, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, edm::propagate_const<std::shared_ptr<edm::maker::ModuleHolder> > > >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, edm::propagate_const<std::shared_ptr<edm::maker::ModuleHolder> > > > >::_M_erase(std::_Rb_tree_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, edm::propagate_const<std::shared_ptr<edm::maker::ModuleHolder> > > >*) () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libFWCoreFramework.so
#20 0x00007f98e2c93d10 in std::_Sp_counted_ptr_inplace<edm::ModuleRegistry, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libFWCoreFramework.so
#21 0x00007f98e2bc72a7 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libFWCoreFramework.so
#22 0x00007f98e2bf1749 in edm::Schedule::~Schedule() [clone .lto_priv.0] () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libFWCoreFramework.so
#23 0x00007f98e2be3df7 in edm::EventProcessor::~EventProcessor() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libFWCoreFramework.so
#24 0x0000000000405891 in (anonymous namespace)::EventProcessorWithSentry::~EventProcessorWithSentry() ()
#25 0x0000000000405199 in main ()
The first error in each job is the one reported by the (CMSSW) exception message, i.e.
* `Request to expand to a negative size, likely due to an integer overflow: 0x8001588e for a max of 0x7ffffffe.`
Please note that https://github.com/root-project/root/pull/14627 should only solve the bug [1] from https://github.com/cms-sw/cmssw/issues/40132#issuecomment-1933671795: (request to expand to a negative size).
The exception "TBufferFile::WriteByteCount bytecount too large (more than 1073741822)", ie bug [2] in the other issue, is not a bug, but rather a wanted exception, which tells you that your TFile contains a key that exceeds the maximum allowed size of 1 GB (https://github.com/root-project/root/issues/6734)
Workarounds would be to address https://github.com/root-project/root/issues/6734, or to make your object a bit smaller, or to reroot the exception with a custom error handler so that it does not throw a fatal error.
type root
to reroot the exception with a custom error handler so that it does not throw a fatal error.
What exactly would the ROOT state be then at the point where it issues the error message bytecount too large
but the execution would continue without the conversion of the error message into an exception?
I guess it would just skip saving that too-big-object into the TFile, and continue with the rest of objects. But it's just a guess, best thing would be try it out with a simple reproducer. Maybe trying to save a TH2 with a lot of bins in X and Y does the trick for the reproducer.
FWIW, here is a stack trace to the where the exception gets thrown
(gdb) where
#0 0x00007ffff55222f1 in __cxxabiv1::__cxa_throw (obj=0x7ffff1885280, tinfo=0x7ffff79a6610 <typeinfo for edm::Exception>, dest=0x7ffff7970020 <edm::Exception::~Exception()>) at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:81
#1 0x00007fffefb2c979 in (anonymous namespace)::RootErrorHandlerImpl(int, char const*, char const*) [clone .cold] () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#2 0x00007ffff690bc54 in ErrorHandler(Int_t, const char *, const char *, typedef __va_list_tag __va_list_tag *) (level=3000, location=0x7ffff4001f3c "TBufferFile::WriteByteCount", fmt=0x7ffff7026d48 "bytecount too large (more than %d)", ap=0x7fffffff2218)
at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/core/foundation/src/TError.cxx:148
#3 0x00007ffff6848126 in TObject::DoError (this=0x7fffe568ce80, level=3000, location=0x7ffff7026d6b "WriteByteCount", fmt=0x7ffff7026d48 "bytecount too large (more than %d)", va=0x7fffffff2218) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/core/base/src/TObject.cxx:943
#4 0x00007ffff684838f in TObject::Error (this=0x7fffe568ce80, location=0x7ffff7026d6b "WriteByteCount", fmt=0x7ffff7026d48 "bytecount too large (more than %d)") at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/core/base/src/TObject.cxx:980
#5 0x00007ffff6d898b7 in TBufferFile::SetByteCount (this=0x7fffe568ce80, cntpos=243, packInVersion=true) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/io/io/src/TBufferFile.cxx:349
#6 0x00007ffff68db51b in TObjArray::Streamer (this=0x7fffea5b7758, b=...) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/core/cont/src/TObjArray.cxx:489
#7 0x00007ffff692f31a in TClass::StreamerTObjectInitialized (pThis=0x7fffee723c00, object=0x7fffea5b7758, b=...) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/core/meta/src/TClass.cxx:6817
#8 0x00007ffff784270b in TClass::Streamer (this=0x7fffee723c00, obj=0x7fffea5b7758, b=..., onfile_class=0x0) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/core/meta/inc/TClass.h:610
#9 0x00007ffff6d8e832 in TBufferFile::WriteFastArray (this=0x7fffe568ce80, start=0x7fffea5b7758, cl=0x7fffee723c00, n=1, streamer=0x0) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/io/io/src/TBufferFile.cxx:2354
#10 0x00007ffff700808c in TStreamerInfo::WriteBufferAux<char**> (this=0x7fff9de73a00, b=..., arr=@0x7fffffff2f30: 0x7fffffff2f28, compinfo=0x7fff9e4397c8, first=0, last=1, narr=1, eoffset=0, arrayMode=0) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/io/io/src/TStreamerInfoWriteBuffer.cxx:628
#11 0x00007ffff6e7e814 in TStreamerInfoActions::GenericWriteAction (buf=..., addr=0x7fffea5b7600, config=0x7fff9e4397b0) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/io/io/src/TStreamerInfoActions.cxx:202
#12 0x00007ffff6d94a07 in TStreamerInfoActions::TConfiguredAction::operator() (this=0x7fffbab24100, buffer=..., object=0x7fffea5b7600) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/io/io/inc/TStreamerInfoActions.h:123
#13 0x00007ffff6d927ec in TBufferFile::ApplySequence (this=0x7fffe568ce80, sequence=..., obj=0x7fffea5b7600) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/io/io/src/TBufferFile.cxx:3679
#14 0x00007ffff6d925d4 in TBufferFile::WriteClassBuffer (this=0x7fffe568ce80, cl=0x7fff9df9ff00, pointer=0x7fffea5b7600) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/io/io/src/TBufferFile.cxx:3648
#15 0x00007ffff78c3650 in TTree::Streamer (this=0x7fffea5b7600, b=...) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/tree/tree/src/TTree.cxx:9623
#16 0x00007ffff6e4c82f in TKey::TKey (this=0x7fffde449940, obj=0x7fffea5b7600, name=0x7fffea5b7619 "Events", bufsize=265751, motherDir=0x7fffdff2d980) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/io/io/src/TKey.cxx:249
#17 0x00007ffff6e11b27 in TFile::CreateKey (this=0x7fffdff2d980, mother=0x7fffdff2d980, obj=0x7fffea5b7600, name=0x7fffea5b7619 "Events", bufsize=265751) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/io/io/src/TFile.cxx:1031
#18 0x00007ffff6dfd754 in TDirectoryFile::WriteTObject (this=0x7fffdff2d980, obj=0x7fffea5b7600, name=0x0, option=0x7ffff78e0f89 "", bufsize=0) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/io/io/src/TDirectoryFile.cxx:1965
#19 0x00007ffff78aae26 in TTree::AutoSave (this=0x7fffea5b7600, option=0x7fff551533e8 "FlushBaskets") at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/tree/tree/src/TTree.cxx:1522
#20 0x00007fff551452b9 in edm::RootOutputFile::finishEndFile() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libIOPoolOutput.so
#21 0x00007fff55130a55 in edm::PoolOutputModule::finishEndFile() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libIOPoolOutput.so
#22 0x00007fff55130ba3 in edm::PoolOutputModule::reallyCloseFile() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libIOPoolOutput.so
#23 0x00007ffff7c848e5 in edm::Schedule::closeOutputFiles() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libFWCoreFramework.so
#24 0x00007ffff7bda60d in edm::EventProcessor::closeOutputFiles() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libFWCoreFramework.so
#25 0x00007ffff7bdfc9a in edm::EventProcessor::runToCompletion() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libFWCoreFramework.so
#26 0x00000000004074ef in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#27 0x00007ffff5d719ad in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/arena.cpp:688
#28 0x0000000000408ed2 in main::{lambda()#1}::operator()() const ()
#29 0x000000000040517c in main ()
How could we get more context on what is causing the "too big object"?
There seems to be an object or class that is stored in your TTree, whose Streamer is too big, at some point when AutoSave is called.
For example, if I do this:
TFile a("/tmp/big.root","RECREATE");
TH2D h("h", "h", 100000, 0, 10, 100000, 0, 10);
h.Write();
I will get a similar crash as yours, though not exactly through the same path. If you'd try to store this histogram in the TTree directly and call AutoSave, you might get closer to what you are seeing.
So maybe, you would need to do a tree->Print()
or to revise what branches and classes you have stored in this tree. One of them has a Streamer that stores too much data in it.
One detail to be noted about the output are sizes of the output files at the time the job died
16M SKIMStreamEXODelayedJetMET.root
8.8M SKIMStreamEXODisappTrk.root
57G SKIMStreamEXOHighMET.root
6.6G SKIMStreamEXOSoftDisplacedVertices.root
1.3G SKIMStreamJetHTJetPlusHOFilter.root
1.3M SKIMStreamLogErrorMonitor.root
545M SKIMStreamLogError.root
56K SKIMStreamTeVJet.root
572M write_ALCARECO.root
6.0G write_AOD.root
21M write_DQMIO.root
829M write_MINIAOD.root
64M write_NANOAOD.root
57 GB of SKIMStreamEXOHighMET.root
is highly suspicious (@cms-sw/pdmv-l2)
Inspecting SKIMStreamEXOHighMET.root
it seems the job was writing that one when it failed
[mkortela@cmsdev42 /build/mkortela/debug/issue45089/CMSSW_14_0_7/src]$ root -l -n SKIMStreamEXOHighMET.root
root [0]
Attaching file SKIMStreamEXOHighMET.root as _file0...
Warning in <TFile::Init>: file SKIMStreamEXOHighMET.root probably not closed, trying to recover
Info in <TFile::Recover>: SKIMStreamEXOHighMET.root, recovered key TTree:MetaData at address 61136680973
Info in <TFile::Recover>: SKIMStreamEXOHighMET.root, recovered key TTree:ParameterSets at address 61136746357
Info in <TFile::Recover>: SKIMStreamEXOHighMET.root, recovered key TTree:Parentage at address 61136752149
Warning in <TFile::Init>: successfully recovered 3 keys
(TFile *) 0x4a153f0
root [1] .ls
TFile** SKIMStreamEXOHighMET.root
TFile* SKIMStreamEXOHighMET.root
KEY: TTree MetaData;1
KEY: TTree ParameterSets;1
KEY: TTree Parentage;1
root [2]
57 GB of
SKIMStreamEXOHighMET.root
is highly suspicious
The SKIMStreamEXOHighMET
selects 16084 events in the job, so the average event size is 3.6 MB. Is this really justified? @cms-sw/pdmv-l2
you would need to do a
tree->Print()
I tried to call TTree::Print()
a little bit before the TTree::AutoSave()
is called, but that leads to the error message (that we turn to an exception) as well
----- Begin Fatal Exception 06-Jun-2024 08:03:20 CEST-----------------------
An exception of category 'FatalRootError' occurred while
[0] Calling RootOutputFile::finishEndFile() while closing SKIMStreamEXOHighMET.root
Additional Info:
[a] While calling Print() for Events
[b] Fatal Root Error: @SUB=TBufferFile::WriteByteCount
bytecount too large (more than 1073741822)
----- End Fatal Exception -------------------------------------------------
This is as far as I got (until July). I have the input file and the job configuration on cmsdev42
in /build/mkortela/debug/issue45089/CMSSW_14_0_7/src
in case that would be of help.
How could we get more context on what is causing the "too big object"?
The stack trace seems to imply that it is the TTree
itself that is too large. This could happens in two cases
(a) (unlikely) Some large basket(s) are somehow not flush properly and store with the TTree
(b) (more likely) The TTree
reached the maximum numbers of baskets in total (each baskets cost the storage of its size and location). (50 millions-ish lead go over the 1GB threshold)
(c) something I am missing :)
Dear all, As reported in cms talk we have a paused job for the workflow
PromptReco_Run381067_JetMET1
in Run 381067, with the following error:/afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2024E/FatalRootError/ByteCountTooLarge/vocms013.cern.ch-3587397-3-log.tar.gz
/afs/cern.ch/user/m/mpresill/public/ORM_May29/CMSSW_14_0_7/src/job/WMTaskSpace/cmsRun1/cmsRun1-stdout_original.log
Matteo (ORM)