cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0
1.06k stars 4.24k forks source link

Workflow PromptReco_Run381067_JetMET1 error in CMSSW_14_0_7 #45089

Open mpresill opened 1 month ago

mpresill commented 1 month ago

Dear all, As reported in cms talk we have a paused job for the workflow PromptReco_Run381067_JetMET1in Run 381067, with the following error:

28-May-2024 14:59:57 UTC  Closed file root://eoscms.cern.ch//eos/cms/tier0/store/data/Run2024E/JetMET1/RAW/v1/000/381/067/00000/e186cde3-4166-4946-9343-f0c908376153.root?eos.app=cmst0
----- Begin Fatal Exception 28-May-2024 15:00:07 UTC-----------------------
An exception of category 'FatalRootError' occurred while
   [0] Calling EventProcessor::runToCompletion (which does almost everything after beginJob and before endJob)
   Additional Info:
      [a] Fatal Root Error: @SUB=TBufferFile::WriteByteCount
bytecount too large (more than 1073741822)

Matteo (ORM)

cmsbuild commented 1 month ago

cms-bot internal usage

cmsbuild commented 1 month ago

A new Issue was created by @mpresill.

@makortel, @Dr15Jones, @sextonkennedy, @antoniovilela, @rappoccio, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

francescobrivio commented 1 month ago

assign core

cmsbuild commented 1 month ago

New categories assigned: core

@Dr15Jones,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel commented 1 month ago

I looked at the log of the failure whose exception was in the issue description. The exception is reported after the input file was closed, and the job ultimately dies in a segfault with the following stack trace

Thread 1 (Thread 0x153d50106640 (LWP 1030) "cmsRun"):
#3  0x0000153d4a5c2720 in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x0000153d50937a63 in TDirectoryFile::WriteKeys() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libRIO.so
#6  0x0000153d509348bc in TDirectoryFile::SaveSelf(bool) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libRIO.so
#7  0x0000153d50932d27 in TDirectoryFile::Save() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libRIO.so
#8  0x0000153d50932bfc in TDirectoryFile::Close(char const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libRIO.so
#9  0x0000153d509506d6 in TFile::Close(char const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libRIO.so
#10 0x0000153d4a4d013e in TStorageFactoryFile::~TStorageFactoryFile() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libIOPoolTFileAdaptor.so
#11 0x0000153d4a4d0169 in TStorageFactoryFile::~TStorageFactoryFile() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libIOPoolTFileAdaptor.so
#12 0x0000153cab9c0a97 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libIOPoolOutput.so
#13 0x0000153cab9c38e5 in edm::RootOutputFile::~RootOutputFile() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libIOPoolOutput.so
#14 0x0000153cab9c4deb in edm::PoolOutputModule::~PoolOutputModule() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libIOPoolOutput.so
#15 0x0000153cab9c51b8 in virtual thunk to edm::PoolOutputModule::~PoolOutputModule() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libIOPoolOutput.so
#16 0x0000153cab9fa6f3 in std::_Sp_counted_deleter<edm::one::OutputModuleBase*, std::default_delete<edm::one::OutputModuleBase>, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/pluginIOPoolOutputPlugins.so
#17 0x0000153cab9fa811 in std::_Sp_counted_ptr_inplace<edm::maker::ModuleHolderT<edm::one::OutputModuleBase>, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/pluginIOPoolOutputPlugins.so
#18 0x0000153d5132a2a7 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libFWCoreFramework.so
#19 0x0000153d513eccb2 in std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, edm::propagate_const<std::shared_ptr<edm::maker::ModuleHolder> > >, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, edm::propagate_const<std::shared_ptr<edm::maker::ModuleHolder> > > >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, edm::propagate_const<std::shared_ptr<edm::maker::ModuleHolder> > > > >::_M_erase(std::_Rb_tree_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, edm::propagate_const<std::shared_ptr<edm::maker::ModuleHolder> > > >*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libFWCoreFramework.so
#20 0x0000153d513f6d10 in std::_Sp_counted_ptr_inplace<edm::ModuleRegistry, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libFWCoreFramework.so
#21 0x0000153d5132a2a7 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libFWCoreFramework.so
#22 0x0000153d51354749 in edm::Schedule::~Schedule() [clone .lto_priv.0] () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libFWCoreFramework.so
#23 0x0000153d51346df7 in edm::EventProcessor::~EventProcessor() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libFWCoreFramework.so
#24 0x0000000000405891 in (anonymous namespace)::EventProcessorWithSentry::~EventProcessorWithSentry() ()
#25 0x0000000000405199 in main ()

Symptoms are the same as in https://github.com/cms-sw/cmssw/issues/40132#issuecomment-1323831885 .

Was any of the jobs re-tried in Tier0?

makortel commented 1 month ago

FYI @pcanal

pcanal commented 1 month ago

That is strange. It reports problem with the TFile object itself. I.e. one potential cause is that the TFile was already closed (or is being closed by another thread).

mpresill commented 1 month ago

I looked at the log of the failure whose exception was in the issue description. The exception is reported after the input file was closed, and the job ultimately dies in a segfault with the following stack trace

Thread 1 (Thread 0x153d50106640 (LWP 1030) "cmsRun"):
#3  0x0000153d4a5c2720 in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x0000153d50937a63 in TDirectoryFile::WriteKeys() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libRIO.so
#6  0x0000153d509348bc in TDirectoryFile::SaveSelf(bool) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libRIO.so
#7  0x0000153d50932d27 in TDirectoryFile::Save() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libRIO.so
#8  0x0000153d50932bfc in TDirectoryFile::Close(char const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libRIO.so
#9  0x0000153d509506d6 in TFile::Close(char const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libRIO.so
#10 0x0000153d4a4d013e in TStorageFactoryFile::~TStorageFactoryFile() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libIOPoolTFileAdaptor.so
#11 0x0000153d4a4d0169 in TStorageFactoryFile::~TStorageFactoryFile() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libIOPoolTFileAdaptor.so
#12 0x0000153cab9c0a97 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libIOPoolOutput.so
#13 0x0000153cab9c38e5 in edm::RootOutputFile::~RootOutputFile() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libIOPoolOutput.so
#14 0x0000153cab9c4deb in edm::PoolOutputModule::~PoolOutputModule() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libIOPoolOutput.so
#15 0x0000153cab9c51b8 in virtual thunk to edm::PoolOutputModule::~PoolOutputModule() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libIOPoolOutput.so
#16 0x0000153cab9fa6f3 in std::_Sp_counted_deleter<edm::one::OutputModuleBase*, std::default_delete<edm::one::OutputModuleBase>, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/pluginIOPoolOutputPlugins.so
#17 0x0000153cab9fa811 in std::_Sp_counted_ptr_inplace<edm::maker::ModuleHolderT<edm::one::OutputModuleBase>, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/pluginIOPoolOutputPlugins.so
#18 0x0000153d5132a2a7 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libFWCoreFramework.so
#19 0x0000153d513eccb2 in std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, edm::propagate_const<std::shared_ptr<edm::maker::ModuleHolder> > >, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, edm::propagate_const<std::shared_ptr<edm::maker::ModuleHolder> > > >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, edm::propagate_const<std::shared_ptr<edm::maker::ModuleHolder> > > > >::_M_erase(std::_Rb_tree_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, edm::propagate_const<std::shared_ptr<edm::maker::ModuleHolder> > > >*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libFWCoreFramework.so
#20 0x0000153d513f6d10 in std::_Sp_counted_ptr_inplace<edm::ModuleRegistry, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libFWCoreFramework.so
#21 0x0000153d5132a2a7 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libFWCoreFramework.so
#22 0x0000153d51354749 in edm::Schedule::~Schedule() [clone .lto_priv.0] () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libFWCoreFramework.so
#23 0x0000153d51346df7 in edm::EventProcessor::~EventProcessor() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libFWCoreFramework.so
#24 0x0000000000405891 in (anonymous namespace)::EventProcessorWithSentry::~EventProcessorWithSentry() ()
#25 0x0000000000405199 in main ()

Symptoms are the same as in #40132 (comment) .

Was any of the jobs re-tried in Tier0?

It was not retried yet. Should this be re-tried?

makortel commented 1 month ago

Was any of the jobs re-tried in Tier0?

It was not retried yet. Should this be re-tried?

Probably not worth it. I tested the job locally (with a local input file), and it fails in the same way.

makortel commented 1 month ago

I noticed many printouts from PileupJetIdProducer in the log (being responsible of 767 MB of the 844 MB of the cmsRun1-stdout.log). I opened a separate issue https://github.com/cms-sw/cmssw/issues/45099 about that.

mpresill commented 1 month ago

FYI, there is new failed job for the same run and workflow _PromptReco_Run381067JetMET0 with same error.

makortel commented 1 month ago

Following https://github.com/cms-sw/cmssw/issues/40132#issuecomment-1933671795 I tested 14_0_7 with the backport of that (https://github.com/cms-sw/cmssw/issues/40132#issuecomment-2141532075), but the behavior was the same

01-Jun-2024 00:21:10 CEST  Closed file file:e186cde3-4166-4946-9343-f0c908376153.root
----- Begin Fatal Exception 01-Jun-2024 00:21:20 CEST-----------------------
An exception of category 'FatalRootError' occurred while
   [0] Calling EventProcessor::runToCompletion (which does almost everything after beginJob and before endJob)
   Additional Info:
      [a] Fatal Root Error: @SUB=TBufferFile::WriteByteCount
bytecount too large (more than 1073741822)

----- End Fatal Exception -------------------------------------------------

<cut>

Thread 1 (Thread 0x7f98df012680 (LWP 316619) "cmsRun"):
#3  0x00007f98dab6a720 in sig_dostack_then_abort () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007f98e1e0a48a in TDirectoryFile::WriteKeys (this=0x7f98caf31580) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/io/io/src/TDirectoryFile.cxx:2187
#6  0x00007f98e1e08228 in TDirectoryFile::SaveSelf (this=0x7f98caf31580, force=false) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/io/io/src/TDirectoryFile.cxx:1634
#7  0x00007f98e1e07c3c in TDirectoryFile::Save (this=0x7f98caf31580) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/io/io/src/TDirectoryFile.cxx:1552
#8  0x00007f98e1e03c59 in TDirectoryFile::Close (this=0x7f98caf31580, option=0x7f98da1a6a26 "") at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/io/io/src/TDirectoryFile.cxx:568
#9  0x00007f98e1e1d640 in TFile::Close (this=0x7f98caf31580, option=0x7f98da1a6a26 "") at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/io/io/src/TFile.cxx:970
#10 0x00007f98da1a413e in TStorageFactoryFile::~TStorageFactoryFile() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libIOPoolTFileAdaptor.so
#11 0x00007f98da1a4169 in TStorageFactoryFile::~TStorageFactoryFile() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libIOPoolTFileAdaptor.so
#12 0x00007f9840175a97 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libIOPoolOutput.so
#13 0x00007f98401788e5 in edm::RootOutputFile::~RootOutputFile() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libIOPoolOutput.so
#14 0x00007f9840179deb in edm::PoolOutputModule::~PoolOutputModule() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libIOPoolOutput.so
#15 0x00007f984017a1b8 in virtual thunk to edm::PoolOutputModule::~PoolOutputModule() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libIOPoolOutput.so
#16 0x00007f98401af6f3 in std::_Sp_counted_deleter<edm::one::OutputModuleBase*, std::default_delete<edm::one::OutputModuleBase>, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/pluginIOPoolOutputPlugins.so
#17 0x00007f98401af811 in std::_Sp_counted_ptr_inplace<edm::maker::ModuleHolderT<edm::one::OutputModuleBase>, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/pluginIOPoolOutputPlugins.so
#18 0x00007f98e2bc72a7 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libFWCoreFramework.so
#19 0x00007f98e2c89cb2 in std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, edm::propagate_const<std::shared_ptr<edm::maker::ModuleHolder> > >, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, edm::propagate_const<std::shared_ptr<edm::maker::ModuleHolder> > > >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, edm::propagate_const<std::shared_ptr<edm::maker::ModuleHolder> > > > >::_M_erase(std::_Rb_tree_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, edm::propagate_const<std::shared_ptr<edm::maker::ModuleHolder> > > >*) () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libFWCoreFramework.so
#20 0x00007f98e2c93d10 in std::_Sp_counted_ptr_inplace<edm::ModuleRegistry, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libFWCoreFramework.so
#21 0x00007f98e2bc72a7 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libFWCoreFramework.so
#22 0x00007f98e2bf1749 in edm::Schedule::~Schedule() [clone .lto_priv.0] () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libFWCoreFramework.so
#23 0x00007f98e2be3df7 in edm::EventProcessor::~EventProcessor() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libFWCoreFramework.so
#24 0x0000000000405891 in (anonymous namespace)::EventProcessorWithSentry::~EventProcessorWithSentry() ()
#25 0x0000000000405199 in main ()
ferdymercury commented 1 month ago

The first error in each job is the one reported by the (CMSSW) exception message, i.e.

* `Request to expand to a negative size, likely due to an integer overflow: 0x8001588e for a max of 0x7ffffffe.`

Please note that https://github.com/root-project/root/pull/14627 should only solve the bug [1] from https://github.com/cms-sw/cmssw/issues/40132#issuecomment-1933671795: (request to expand to a negative size).

The exception "TBufferFile::WriteByteCount bytecount too large (more than 1073741822)", ie bug [2] in the other issue, is not a bug, but rather a wanted exception, which tells you that your TFile contains a key that exceeds the maximum allowed size of 1 GB (https://github.com/root-project/root/issues/6734)

Workarounds would be to address https://github.com/root-project/root/issues/6734, or to make your object a bit smaller, or to reroot the exception with a custom error handler so that it does not throw a fatal error.

makortel commented 1 month ago

type root

makortel commented 1 month ago

to reroot the exception with a custom error handler so that it does not throw a fatal error.

What exactly would the ROOT state be then at the point where it issues the error message bytecount too large but the execution would continue without the conversion of the error message into an exception?

ferdymercury commented 1 month ago

I guess it would just skip saving that too-big-object into the TFile, and continue with the rest of objects. But it's just a guess, best thing would be try it out with a simple reproducer. Maybe trying to save a TH2 with a lot of bins in X and Y does the trick for the reproducer.

makortel commented 1 month ago

FWIW, here is a stack trace to the where the exception gets thrown

(gdb) where
#0  0x00007ffff55222f1 in __cxxabiv1::__cxa_throw (obj=0x7ffff1885280, tinfo=0x7ffff79a6610 <typeinfo for edm::Exception>, dest=0x7ffff7970020 <edm::Exception::~Exception()>) at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:81
#1  0x00007fffefb2c979 in (anonymous namespace)::RootErrorHandlerImpl(int, char const*, char const*) [clone .cold] () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#2  0x00007ffff690bc54 in ErrorHandler(Int_t, const char *, const char *, typedef __va_list_tag __va_list_tag *) (level=3000, location=0x7ffff4001f3c "TBufferFile::WriteByteCount", fmt=0x7ffff7026d48 "bytecount too large (more than %d)", ap=0x7fffffff2218)
    at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/core/foundation/src/TError.cxx:148
#3  0x00007ffff6848126 in TObject::DoError (this=0x7fffe568ce80, level=3000, location=0x7ffff7026d6b "WriteByteCount", fmt=0x7ffff7026d48 "bytecount too large (more than %d)", va=0x7fffffff2218) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/core/base/src/TObject.cxx:943
#4  0x00007ffff684838f in TObject::Error (this=0x7fffe568ce80, location=0x7ffff7026d6b "WriteByteCount", fmt=0x7ffff7026d48 "bytecount too large (more than %d)") at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/core/base/src/TObject.cxx:980
#5  0x00007ffff6d898b7 in TBufferFile::SetByteCount (this=0x7fffe568ce80, cntpos=243, packInVersion=true) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/io/io/src/TBufferFile.cxx:349
#6  0x00007ffff68db51b in TObjArray::Streamer (this=0x7fffea5b7758, b=...) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/core/cont/src/TObjArray.cxx:489
#7  0x00007ffff692f31a in TClass::StreamerTObjectInitialized (pThis=0x7fffee723c00, object=0x7fffea5b7758, b=...) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/core/meta/src/TClass.cxx:6817
#8  0x00007ffff784270b in TClass::Streamer (this=0x7fffee723c00, obj=0x7fffea5b7758, b=..., onfile_class=0x0) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/core/meta/inc/TClass.h:610
#9  0x00007ffff6d8e832 in TBufferFile::WriteFastArray (this=0x7fffe568ce80, start=0x7fffea5b7758, cl=0x7fffee723c00, n=1, streamer=0x0) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/io/io/src/TBufferFile.cxx:2354
#10 0x00007ffff700808c in TStreamerInfo::WriteBufferAux<char**> (this=0x7fff9de73a00, b=..., arr=@0x7fffffff2f30: 0x7fffffff2f28, compinfo=0x7fff9e4397c8, first=0, last=1, narr=1, eoffset=0, arrayMode=0) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/io/io/src/TStreamerInfoWriteBuffer.cxx:628
#11 0x00007ffff6e7e814 in TStreamerInfoActions::GenericWriteAction (buf=..., addr=0x7fffea5b7600, config=0x7fff9e4397b0) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/io/io/src/TStreamerInfoActions.cxx:202
#12 0x00007ffff6d94a07 in TStreamerInfoActions::TConfiguredAction::operator() (this=0x7fffbab24100, buffer=..., object=0x7fffea5b7600) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/io/io/inc/TStreamerInfoActions.h:123
#13 0x00007ffff6d927ec in TBufferFile::ApplySequence (this=0x7fffe568ce80, sequence=..., obj=0x7fffea5b7600) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/io/io/src/TBufferFile.cxx:3679
#14 0x00007ffff6d925d4 in TBufferFile::WriteClassBuffer (this=0x7fffe568ce80, cl=0x7fff9df9ff00, pointer=0x7fffea5b7600) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/io/io/src/TBufferFile.cxx:3648
#15 0x00007ffff78c3650 in TTree::Streamer (this=0x7fffea5b7600, b=...) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/tree/tree/src/TTree.cxx:9623
#16 0x00007ffff6e4c82f in TKey::TKey (this=0x7fffde449940, obj=0x7fffea5b7600, name=0x7fffea5b7619 "Events", bufsize=265751, motherDir=0x7fffdff2d980) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/io/io/src/TKey.cxx:249
#17 0x00007ffff6e11b27 in TFile::CreateKey (this=0x7fffdff2d980, mother=0x7fffdff2d980, obj=0x7fffea5b7600, name=0x7fffea5b7619 "Events", bufsize=265751) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/io/io/src/TFile.cxx:1031
#18 0x00007ffff6dfd754 in TDirectoryFile::WriteTObject (this=0x7fffdff2d980, obj=0x7fffea5b7600, name=0x0, option=0x7ffff78e0f89 "", bufsize=0) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/io/io/src/TDirectoryFile.cxx:1965
#19 0x00007ffff78aae26 in TTree::AutoSave (this=0x7fffea5b7600, option=0x7fff551533e8 "FlushBaskets") at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/tree/tree/src/TTree.cxx:1522
#20 0x00007fff551452b9 in edm::RootOutputFile::finishEndFile() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libIOPoolOutput.so
#21 0x00007fff55130a55 in edm::PoolOutputModule::finishEndFile() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libIOPoolOutput.so
#22 0x00007fff55130ba3 in edm::PoolOutputModule::reallyCloseFile() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libIOPoolOutput.so
#23 0x00007ffff7c848e5 in edm::Schedule::closeOutputFiles() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libFWCoreFramework.so
#24 0x00007ffff7bda60d in edm::EventProcessor::closeOutputFiles() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libFWCoreFramework.so
#25 0x00007ffff7bdfc9a in edm::EventProcessor::runToCompletion() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libFWCoreFramework.so
#26 0x00000000004074ef in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#27 0x00007ffff5d719ad in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/arena.cpp:688
#28 0x0000000000408ed2 in main::{lambda()#1}::operator()() const ()
#29 0x000000000040517c in main ()

How could we get more context on what is causing the "too big object"?

ferdymercury commented 1 month ago

There seems to be an object or class that is stored in your TTree, whose Streamer is too big, at some point when AutoSave is called.

For example, if I do this:

TFile a("/tmp/big.root","RECREATE");
TH2D h("h", "h", 100000, 0, 10, 100000, 0, 10);
h.Write();

I will get a similar crash as yours, though not exactly through the same path. If you'd try to store this histogram in the TTree directly and call AutoSave, you might get closer to what you are seeing.

So maybe, you would need to do a tree->Print() or to revise what branches and classes you have stored in this tree. One of them has a Streamer that stores too much data in it.

makortel commented 1 month ago

One detail to be noted about the output are sizes of the output files at the time the job died

16M     SKIMStreamEXODelayedJetMET.root
8.8M    SKIMStreamEXODisappTrk.root
57G     SKIMStreamEXOHighMET.root
6.6G    SKIMStreamEXOSoftDisplacedVertices.root
1.3G    SKIMStreamJetHTJetPlusHOFilter.root
1.3M    SKIMStreamLogErrorMonitor.root
545M    SKIMStreamLogError.root
56K     SKIMStreamTeVJet.root
572M    write_ALCARECO.root
6.0G    write_AOD.root
21M     write_DQMIO.root
829M    write_MINIAOD.root
64M     write_NANOAOD.root

57 GB of SKIMStreamEXOHighMET.root is highly suspicious (@cms-sw/pdmv-l2)

makortel commented 1 month ago

Inspecting SKIMStreamEXOHighMET.root it seems the job was writing that one when it failed

[mkortela@cmsdev42 /build/mkortela/debug/issue45089/CMSSW_14_0_7/src]$ root -l -n SKIMStreamEXOHighMET.root
root [0]
Attaching file SKIMStreamEXOHighMET.root as _file0...
Warning in <TFile::Init>: file SKIMStreamEXOHighMET.root probably not closed, trying to recover
Info in <TFile::Recover>: SKIMStreamEXOHighMET.root, recovered key TTree:MetaData at address 61136680973
Info in <TFile::Recover>: SKIMStreamEXOHighMET.root, recovered key TTree:ParameterSets at address 61136746357
Info in <TFile::Recover>: SKIMStreamEXOHighMET.root, recovered key TTree:Parentage at address 61136752149
Warning in <TFile::Init>: successfully recovered 3 keys
(TFile *) 0x4a153f0
root [1] .ls
TFile**         SKIMStreamEXOHighMET.root
 TFile*         SKIMStreamEXOHighMET.root
  KEY: TTree    MetaData;1
  KEY: TTree    ParameterSets;1
  KEY: TTree    Parentage;1
root [2]
makortel commented 1 month ago

57 GB of SKIMStreamEXOHighMET.root is highly suspicious

The SKIMStreamEXOHighMET selects 16084 events in the job, so the average event size is 3.6 MB. Is this really justified? @cms-sw/pdmv-l2

makortel commented 1 month ago

you would need to do a tree->Print()

I tried to call TTree::Print() a little bit before the TTree::AutoSave() is called, but that leads to the error message (that we turn to an exception) as well

----- Begin Fatal Exception 06-Jun-2024 08:03:20 CEST-----------------------
An exception of category 'FatalRootError' occurred while
   [0] Calling RootOutputFile::finishEndFile() while closing SKIMStreamEXOHighMET.root
   Additional Info:
      [a] While calling Print() for Events
      [b] Fatal Root Error: @SUB=TBufferFile::WriteByteCount
bytecount too large (more than 1073741822)

----- End Fatal Exception -------------------------------------------------

This is as far as I got (until July). I have the input file and the job configuration on cmsdev42 in /build/mkortela/debug/issue45089/CMSSW_14_0_7/src in case that would be of help.

pcanal commented 1 month ago

How could we get more context on what is causing the "too big object"?

The stack trace seems to imply that it is the TTree itself that is too large. This could happens in two cases (a) (unlikely) Some large basket(s) are somehow not flush properly and store with the TTree (b) (more likely) The TTree reached the maximum numbers of baskets in total (each baskets cost the storage of its size and location). (50 millions-ish lead go over the 1GB threshold) (c) something I am missing :)