Closed aandvalenzuela closed 1 year ago
A new Issue was created by @aandvalenzuela Andrea Valenzuela.
@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
assign core
New categories assigned: core
@Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks
gdb shows this
Events processed = 1
[Thread 0x7fffd26c1700 (LWP 30936) exited]
malloc_consolidate(): invalid chunk size
Thread 1 "python3" received signal SIGABRT, Aborted.
0x00007ffff6ad2aff in raise () from /lib64/libc.so.6
(gdb) where
#0 0x00007ffff6ad2aff in raise () from /lib64/libc.so.6
#1 0x00007ffff6aa5ea5 in abort () from /lib64/libc.so.6
#2 0x00007ffff6b15097 in __libc_message () from /lib64/libc.so.6
#3 0x00007ffff6b1c4ec in malloc_printerr () from /lib64/libc.so.6
#4 0x00007ffff6b1ceb8 in malloc_consolidate () from /lib64/libc.so.6
#5 0x00007ffff6b1f268 in _int_malloc () from /lib64/libc.so.6
#6 0x00007ffff6b20b72 in malloc () from /lib64/libc.so.6
#7 0x00007ffff5ec19ac in operator new (sz=5560) at ../../../../libstdc++-v3/libsupc++/new_op.cc:50
#8 0x00007fffef7b72e4 in cling::IncrementalParser::getAllTransactions() ()
from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc11/cms/cmssw/CMSSW_13_1_ROOT628_X_2023-03-28-2300/external/el8_amd64_gcc11/lib/libCling.so
#9 0x00007fffef734d82 in cling::Interpreter::runAndRemoveStaticDestructors() ()
from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc11/cms/cmssw/CMSSW_13_1_ROOT628_X_2023-03-28-2300/external/el8_amd64_gcc11/lib/libCling.so
#10 0x00007fffef507b8c in TCling::ResetGlobals (this=0x57e350)
at /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc11/lcg/root/6.28.01-24fc933274de02f0e1bb17b71bafaf30/root-6.28.01/core/metacling/src/TCling.cxx:3718
#11 0x00007fffef4ff9f7 in TCling::ShutDown (this=0x57e350)
at /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc11/lcg/root/6.28.01-24fc933274de02f0e1bb17b71bafaf30/root-6.28.01/core/metacling/src/TCling.cxx:1692
#12 0x00007ffff62de1a2 in TROOT::EndOfProcessCleanups (this=0x7ffff6655ba0 <ROOT::Internal::GetROOT1()::alloc>)
at /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc11/lcg/root/6.28.01-24fc933274de02f0e1bb17b71bafaf30/root-6.28.01/core/base/src/TROOT.cxx:1236
#13 0x00007fffc6835010 in ?? ()
#14 0x00007fffffff19e0 in ?? ()
#15 0x00007ffff66bd0fd in WrapperCall (method=0, nargs=140737188853984, args_=0x7ffff7fb1040, self=0x7fffffff1880, result=0x5f4fc0)
at /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc11/lcg/root/6.28.01-24fc933274de02f0e1bb17b71bafaf30/root-6.28.01/bindings/pyroot/cppyy/cppyy-backend/clingwrapper/src/clingwrapper.cxx:774
@vgvassilev , these tests are failing for root 6.28 and root master based builds but work for root 6.26. tests fail at the end when python (https://github.com/cms-sw/cmssw/blob/master/DataFormats/Scouting/test/scoutingCollectionsDumper.py) process is finished. Any idea why this is failing for root 628/master?
This area has been quite problematic due to the way that ROOT is being shut down. @pcanal, do you know how we can work this issue with EndOfProcessCleanups
around?
The destruction order is delicate and I remember we had some issue with the tear down triggering from python. Any chance to have a standalone reproducer (so that I can start by bisecting where/when it started failing)?
Also could you run the failing example with valgrind --suppressions=$ROOTSYS/etc/valgrind-root.supp ...
to get more information about this (seemingly odd because it seems the allocation is failing) problem.
A number of reports like this seem related..
==19874== Invalid free() / delete / delete[] / realloc() ==19874== at 0x403C4CD: operator delete(void) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/external/valgrind/3.17.0-7ca83817e7379e83453f913e11e14834/libexec/valgrind/vgpreload_memcheck-amd64-linux.so) ==19874== by 0x4ECCA891: ??? ==19874== by 0x9EDBF06: cling::IncrementalExecutor::runAndRemoveStaticDestructors(cling::Transaction) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/lcg/root/6.28.01-24fc933274de02f0e1bb17b71bafaf30/lib/libCling.so) ==19874== by 0x9E65DB8: cling::Interpreter::runAndRemoveStaticDestructors() (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/lcg/root/6.28.01-24fc933274de02f0e1bb17b71bafaf30/lib/libCling.so) ==19874== by 0x9C38B8B: TCling::ResetGlobals() (TCling.cxx:3718) ==19874== by 0x8026FE9: TROOT::EndOfProcessCleanups() (TROOT.cxx:1215) ==19874== by 0x551E600F: ??? ==19874== by 0x7D400FC: WrapperCall(long, unsigned long, void, void, void) (clingwrapper.cxx:774) ==19874== by 0x7D403EA: Cppyy::CallV(long, void, unsigned long, void) (clingwrapper.cxx:825) ==19874== by 0x7C82290: GILCallV(long, void, CPyCppyy::CallContext) (Executors.cxx:68) ==19874== by 0x7C846EA: CPyCppyy::(anonymous namespace)::VoidExecutor::Execute(long, void, CPyCppyy::CallContext) (Executors.cxx:410) ==19874== by 0x7C6730B: CPyCppyy::CPPMethod::ExecuteFast(void, long, CPyCppyy::CallContext*) (CPPMethod.cxx:74) ==19874== Address 0x495f7cc0 is 0 bytes inside a block of size 20 free'd
On Mar 31, 2023, at 5:49 AM, Malik Shahzad Muzaffar @.***> wrote:
@pcanal , you can find the valgrid output here — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you are subscribed to this thread.Message ID: @.***>
Also could you run the failing example with
valgrind --suppressions=$ROOTSYS/etc/valgrind-root.supp ...
to get more information about this (seemingly odd because it seems the allocation is failing) problem.
@pcanal probably meant valgrind --suppressions=$ROOTSYS/etc/valgrind-root-python.supp
. This would avoid the false positives coming from the python interpreter.
valgrind --suppressions=$ROOTSYS/etc/valgrind-root-python.supp
log is available here
@pcanal probably meant valgrind --suppressions=$ROOTSYS/etc/valgrind-root-python.supp. This would avoid the false positives coming from the python interpreter.
Actually you need both.
Also this seem eerily similar to https://github.com/cms-sw/root/issues/136
Just providing the full context for @davidlange6 's (imo correct) issue diagnosis:
==19874== Invalid free() / delete / delete[] / realloc()
==19874== at 0x403C4CD: operator delete(void*) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/external/valgrind/3.17.0-7ca83817e7379e83453f913e11e14834/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==19874== by 0x4ECCA85E: ???
==19874== by 0x9EDBF06: cling::IncrementalExecutor::runAndRemoveStaticDestructors(cling::Transaction*) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/lcg/root/6.28.01-24fc933274de02f0e1bb17b71bafaf30/lib/libCling.so)
==19874== by 0x9E65DB8: cling::Interpreter::runAndRemoveStaticDestructors() (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/lcg/root/6.28.01-24fc933274de02f0e1bb17b71bafaf30/lib/libCling.so)
==19874== by 0x9C38B8B: TCling::ResetGlobals() (TCling.cxx:3718)
==19874== by 0x8026FE9: TROOT::EndOfProcessCleanups() (TROOT.cxx:1215)
==19874== by 0x551E600F: ???
==19874== by 0x7D400FC: WrapperCall(long, unsigned long, void*, void*, void*) (clingwrapper.cxx:774)
==19874== by 0x7D403EA: Cppyy::CallV(long, void*, unsigned long, void*) (clingwrapper.cxx:825)
==19874== by 0x7C82290: GILCallV(long, void*, CPyCppyy::CallContext*) (Executors.cxx:68)
==19874== by 0x7C846EA: CPyCppyy::(anonymous namespace)::VoidExecutor::Execute(long, void*, CPyCppyy::CallContext*) (Executors.cxx:410)
==19874== by 0x7C6730B: CPyCppyy::CPPMethod::ExecuteFast(void*, long, CPyCppyy::CallContext*) (CPPMethod.cxx:74)
==19874== Address 0x4c4ca1c0 is 0 bytes inside a block of size 29 free'd
==19874== at 0x403C4CD: operator delete(void*) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/external/valgrind/3.17.0-7ca83817e7379e83453f913e11e14834/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==19874== by 0x4ECCA85E: ???
==19874== by 0x9EDBF06: cling::IncrementalExecutor::runAndRemoveStaticDestructors(cling::Transaction*) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/lcg/root/6.28.01-24fc933274de02f0e1bb17b71bafaf30/lib/libCling.so)
==19874== by 0x9E65DB8: cling::Interpreter::runAndRemoveStaticDestructors() (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/lcg/root/6.28.01-24fc933274de02f0e1bb17b71bafaf30/lib/libCling.so)
==19874== by 0x9C38B8B: TCling::ResetGlobals() (TCling.cxx:3718)
==19874== by 0x8026FE9: TROOT::EndOfProcessCleanups() (TROOT.cxx:1215)
==19874== by 0x551E600F: ???
==19874== by 0x7D400FC: WrapperCall(long, unsigned long, void*, void*, void*) (clingwrapper.cxx:774)
==19874== by 0x7D403EA: Cppyy::CallV(long, void*, unsigned long, void*) (clingwrapper.cxx:825)
==19874== by 0x7C82290: GILCallV(long, void*, CPyCppyy::CallContext*) (Executors.cxx:68)
==19874== by 0x7C846EA: CPyCppyy::(anonymous namespace)::VoidExecutor::Execute(long, void*, CPyCppyy::CallContext*) (Executors.cxx:410)
==19874== by 0x7C6730B: CPyCppyy::CPPMethod::ExecuteFast(void*, long, CPyCppyy::CallContext*) (CPPMethod.cxx:74)
==19874== Block was alloc'd at
==19874== at 0x4039F01: operator new(unsigned long) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/external/valgrind/3.17.0-7ca83817e7379e83453f913e11e14834/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==19874== by 0x4ECCAAA1: ???
==19874== by 0x4ECCA9C1: ???
==19874== by 0x4ECCA92F: ???
==19874== by 0x4ECCA2B2: ???
==19874== by 0x4ECCA90B: ???
==19874== by 0x4ECCA91C: ???
==19874== by 0x4ECCA90F: ???
==19874== by 0xB445061: (anonymous namespace)::GenericLLVMIRPlatformSupport::initialize(llvm::orc::JITDylib&) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/lcg/root/6.28.01-24fc933274de02f0e1bb17b71bafaf30/lib/libCling.so)
==19874== by 0x9EDAEE8: cling::IncrementalExecutor::runStaticInitializersOnce(cling::Transaction&) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/lcg/root/6.28.01-24fc933274de02f0e1bb17b71bafaf30/lib/libCling.so)
==19874== by 0x9E664A7: cling::Interpreter::executeTransaction(cling::Transaction&) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/lcg/root/6.28.01-24fc933274de02f0e1bb17b71bafaf30/lib/libCling.so)
==19874== by 0x9EE97F9: cling::IncrementalParser::commitTransaction(llvm::PointerIntPair<cling::Transaction*, 2u, cling::IncrementalParser::EParseResult, llvm::PointerLikeTypeTraits<cling::Transaction*>, llvm::PointerIntPairInfo<cling::Transaction*, 2u, llvm::PointerLikeTypeTraits<cling::Transaction*> > >&, bool) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/lcg/root/6.28.01-24fc933274de02f0e1bb17b71bafaf30/lib/libCling.so)
@pcanal any progress on this?
It's a traditional double delete, here as a global static in the interpreter memory. We need to find out which object it is, and why it's registered twice as static global. @pcanal do you think you can try to reproduce this? If so please take it. If not please assign it to me and I will debug next week (I'm off now).
@Dr15Jones Can you remind me the reproducer?
@makortel could you help Philippe to reproduce the problem?
something like
scram p CMSSW_13_2_ROOT628_X_2023-05-22-2300
cd CMSSW_13_2_ROOT628_X_2023-05-22-2300/src
cmsenv
git cms-addpkg DataFormats/Scouting
scram b
scram b runtests
Just for the record, I shared a simpler recipe to Philippe.
https://github.com/cms-sw/cmssw/pull/41964 removes the failing tests (after better tests were developed).
@pcanal Is there a need to preserve a setup to reproduce the crash?
Yes, this is not resolved.
@smuzaffar Would it be feasible to have a longer-than-IB-lifetime build for ROOT628? Or do we keep the tests failing for now?
I would say, let the test fail for now.
So I did a small amount of testing and I see the failure using the following minimum python
import ROOT
f = ROOT.TFile.Open("testDataFormatsScoutingRun2_step1.root")
e = ROOT.fwlite.Event(f)
the job fails as it is attempting to shutdown.
I believe this may also be fixed via https://github.com/cms-sw/cmssw/pull/42628, can you confirm?
It seems so. The last failures were in CMSSW_13_3_ROOT6_X_2023-08-28-2300, and https://github.com/cms-sw/cmssw/pull/42628 was merged in CMSSW_13_3_X_2023-08-29-1100. So far there has been 6 IBs (x2 to cover both 6.28 and master) without these failures. I also tested the simple reproducer of https://github.com/cms-sw/cmssw/issues/41222#issuecomment-1629517319 in CMSSW_13_3_ROOT628_X_2023-09-03-2300, and the job succeeded now.
+core
This issue is fully signed and ready to be closed.
@cmsbuild, please close
I'm going to now proceed with the deletion of these unit tests (as they were superseded earlier, and were kept only to the investigation of this issue). If anyone subscribed thinks there would still be value in keeping the tests, scream soon.
Hello,
There are two Unit Tests (
testDataFormatsScoutingRun2
andtestDataFormatsScoutingRun3
from moduleDataFormats/Scouting
) failing in both ROOT6 and ROOT628 IBs.I have reproduced the failure using
CMSSW_13_1_ROOT628_X_2023-03-27-2300
:They fail when running
scoutingCollectionsDumper.py
(https://github.com/cms-sw/cmssw/blob/master/DataFormats/Scouting/test/testDataFormatsScoutingRun2.sh#L23) and I would say after analyzing the first event (https://github.com/cms-sw/cmssw/blob/master/DataFormats/Scouting/test/scoutingCollectionsDumper.py#L174).Find the full log here.
Let's follow-up the discussion started on https://github.com/cms-sw/cmssw/pull/41189#issuecomment-1486406281.
FYI, @Dr15Jones, @smuzaffar, @iarspider