cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0
1.08k stars 4.32k forks source link

[ROOT] Unit Tests `testDataFormatsScoutingRunX` failing in ROOT IBs #41222

Closed aandvalenzuela closed 1 year ago

aandvalenzuela commented 1 year ago

Hello,

There are two Unit Tests (testDataFormatsScoutingRun2 and testDataFormatsScoutingRun3 from module DataFormats/Scouting) failing in both ROOT6 and ROOT628 IBs.

I have reproduced the failure using CMSSW_13_1_ROOT628_X_2023-03-27-2300:

28-Mar-2023 10:02:46 CEST  Closed file root://eoscms.cern.ch//eos/cms/store/user/cmsbuild/store/mc/RunIISummer20UL18RECO/DoubleElectron_Pt-1To300-gun/AODSIM/FlatPU0to70EdalIdealGT_EdalIdealGT_106X_upgrade2018_realistic_v11_L1v1_EcalIdealIC-v2/270000/4CDD9457-E14C-D84A-9BD4-3140CB6AEEB6.root
/tmp/avalenzu/CMSSW_13_1_ROOT628_X_2023-03-27-2300/src/DataFormats/Scouting/test/testDataFormatsScoutingRun2.sh: line 25: 19155 Segmentation fault      "${SCRAM_TEST_PATH}"/scoutingCollectionsDumper.py -v 1 -n 1 --skip 81 -i testDataFormatsScoutingRun2_step1.root -k Scouting &> testDataFormatsScoutingRun2_step2.log

malloc_consolidate(): invalid chunk size
 *** Break *** abort
malloc(): unsorted double linked list corrupted
 *** Break *** abort

They fail when running scoutingCollectionsDumper.py (https://github.com/cms-sw/cmssw/blob/master/DataFormats/Scouting/test/testDataFormatsScoutingRun2.sh#L23) and I would say after analyzing the first event (https://github.com/cms-sw/cmssw/blob/master/DataFormats/Scouting/test/scoutingCollectionsDumper.py#L174).

Find the full log here.

Let's follow-up the discussion started on https://github.com/cms-sw/cmssw/pull/41189#issuecomment-1486406281.

FYI, @Dr15Jones, @smuzaffar, @iarspider

cmsbuild commented 1 year ago

A new Issue was created by @aandvalenzuela Andrea Valenzuela.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

aandvalenzuela commented 1 year ago

assign core

cmsbuild commented 1 year ago

New categories assigned: core

@Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

smuzaffar commented 1 year ago

gdb shows this

Events processed = 1
[Thread 0x7fffd26c1700 (LWP 30936) exited]
malloc_consolidate(): invalid chunk size

Thread 1 "python3" received signal SIGABRT, Aborted.
0x00007ffff6ad2aff in raise () from /lib64/libc.so.6
(gdb) where
#0  0x00007ffff6ad2aff in raise () from /lib64/libc.so.6
#1  0x00007ffff6aa5ea5 in abort () from /lib64/libc.so.6
#2  0x00007ffff6b15097 in __libc_message () from /lib64/libc.so.6
#3  0x00007ffff6b1c4ec in malloc_printerr () from /lib64/libc.so.6
#4  0x00007ffff6b1ceb8 in malloc_consolidate () from /lib64/libc.so.6
#5  0x00007ffff6b1f268 in _int_malloc () from /lib64/libc.so.6
#6  0x00007ffff6b20b72 in malloc () from /lib64/libc.so.6
#7  0x00007ffff5ec19ac in operator new (sz=5560) at ../../../../libstdc++-v3/libsupc++/new_op.cc:50
#8  0x00007fffef7b72e4 in cling::IncrementalParser::getAllTransactions() ()
   from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc11/cms/cmssw/CMSSW_13_1_ROOT628_X_2023-03-28-2300/external/el8_amd64_gcc11/lib/libCling.so
#9  0x00007fffef734d82 in cling::Interpreter::runAndRemoveStaticDestructors() ()
   from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc11/cms/cmssw/CMSSW_13_1_ROOT628_X_2023-03-28-2300/external/el8_amd64_gcc11/lib/libCling.so
#10 0x00007fffef507b8c in TCling::ResetGlobals (this=0x57e350)
    at /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc11/lcg/root/6.28.01-24fc933274de02f0e1bb17b71bafaf30/root-6.28.01/core/metacling/src/TCling.cxx:3718
#11 0x00007fffef4ff9f7 in TCling::ShutDown (this=0x57e350)
    at /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc11/lcg/root/6.28.01-24fc933274de02f0e1bb17b71bafaf30/root-6.28.01/core/metacling/src/TCling.cxx:1692
#12 0x00007ffff62de1a2 in TROOT::EndOfProcessCleanups (this=0x7ffff6655ba0 <ROOT::Internal::GetROOT1()::alloc>)
    at /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc11/lcg/root/6.28.01-24fc933274de02f0e1bb17b71bafaf30/root-6.28.01/core/base/src/TROOT.cxx:1236
#13 0x00007fffc6835010 in ?? ()
#14 0x00007fffffff19e0 in ?? ()
#15 0x00007ffff66bd0fd in WrapperCall (method=0, nargs=140737188853984, args_=0x7ffff7fb1040, self=0x7fffffff1880, result=0x5f4fc0)
    at /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc11/lcg/root/6.28.01-24fc933274de02f0e1bb17b71bafaf30/root-6.28.01/bindings/pyroot/cppyy/cppyy-backend/clingwrapper/src/clingwrapper.cxx:774
smuzaffar commented 1 year ago

@vgvassilev , these tests are failing for root 6.28 and root master based builds but work for root 6.26. tests fail at the end when python (https://github.com/cms-sw/cmssw/blob/master/DataFormats/Scouting/test/scoutingCollectionsDumper.py) process is finished. Any idea why this is failing for root 628/master?

vgvassilev commented 1 year ago

This area has been quite problematic due to the way that ROOT is being shut down. @pcanal, do you know how we can work this issue with EndOfProcessCleanups around?

pcanal commented 1 year ago

The destruction order is delicate and I remember we had some issue with the tear down triggering from python. Any chance to have a standalone reproducer (so that I can start by bisecting where/when it started failing)?

pcanal commented 1 year ago

Also could you run the failing example with valgrind --suppressions=$ROOTSYS/etc/valgrind-root.supp ... to get more information about this (seemingly odd because it seems the allocation is failing) problem.

smuzaffar commented 1 year ago

@pcanal , you can find the valgrid output here

davidlange6 commented 1 year ago

A number of reports like this seem related..

==19874== Invalid free() / delete / delete[] / realloc() ==19874== at 0x403C4CD: operator delete(void) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/external/valgrind/3.17.0-7ca83817e7379e83453f913e11e14834/libexec/valgrind/vgpreload_memcheck-amd64-linux.so) ==19874== by 0x4ECCA891: ??? ==19874== by 0x9EDBF06: cling::IncrementalExecutor::runAndRemoveStaticDestructors(cling::Transaction) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/lcg/root/6.28.01-24fc933274de02f0e1bb17b71bafaf30/lib/libCling.so) ==19874== by 0x9E65DB8: cling::Interpreter::runAndRemoveStaticDestructors() (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/lcg/root/6.28.01-24fc933274de02f0e1bb17b71bafaf30/lib/libCling.so) ==19874== by 0x9C38B8B: TCling::ResetGlobals() (TCling.cxx:3718) ==19874== by 0x8026FE9: TROOT::EndOfProcessCleanups() (TROOT.cxx:1215) ==19874== by 0x551E600F: ??? ==19874== by 0x7D400FC: WrapperCall(long, unsigned long, void, void, void) (clingwrapper.cxx:774) ==19874== by 0x7D403EA: Cppyy::CallV(long, void, unsigned long, void) (clingwrapper.cxx:825) ==19874== by 0x7C82290: GILCallV(long, void, CPyCppyy::CallContext) (Executors.cxx:68) ==19874== by 0x7C846EA: CPyCppyy::(anonymous namespace)::VoidExecutor::Execute(long, void, CPyCppyy::CallContext) (Executors.cxx:410) ==19874== by 0x7C6730B: CPyCppyy::CPPMethod::ExecuteFast(void, long, CPyCppyy::CallContext*) (CPPMethod.cxx:74) ==19874== Address 0x495f7cc0 is 0 bytes inside a block of size 20 free'd

On Mar 31, 2023, at 5:49 AM, Malik Shahzad Muzaffar @.***> wrote:

@pcanal , you can find the valgrid output here — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you are subscribed to this thread.Message ID: @.***>

vgvassilev commented 1 year ago

Also could you run the failing example with valgrind --suppressions=$ROOTSYS/etc/valgrind-root.supp ... to get more information about this (seemingly odd because it seems the allocation is failing) problem.

@pcanal probably meant valgrind --suppressions=$ROOTSYS/etc/valgrind-root-python.supp. This would avoid the false positives coming from the python interpreter.

smuzaffar commented 1 year ago

valgrind --suppressions=$ROOTSYS/etc/valgrind-root-python.supp log is available here

pcanal commented 1 year ago

@pcanal probably meant valgrind --suppressions=$ROOTSYS/etc/valgrind-root-python.supp. This would avoid the false positives coming from the python interpreter.

Actually you need both.

pcanal commented 1 year ago

Also this seem eerily similar to https://github.com/cms-sw/root/issues/136

Axel-Naumann commented 1 year ago

Just providing the full context for @davidlange6 's (imo correct) issue diagnosis:

==19874== Invalid free() / delete / delete[] / realloc()
==19874==    at 0x403C4CD: operator delete(void*) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/external/valgrind/3.17.0-7ca83817e7379e83453f913e11e14834/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==19874==    by 0x4ECCA85E: ???
==19874==    by 0x9EDBF06: cling::IncrementalExecutor::runAndRemoveStaticDestructors(cling::Transaction*) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/lcg/root/6.28.01-24fc933274de02f0e1bb17b71bafaf30/lib/libCling.so)
==19874==    by 0x9E65DB8: cling::Interpreter::runAndRemoveStaticDestructors() (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/lcg/root/6.28.01-24fc933274de02f0e1bb17b71bafaf30/lib/libCling.so)
==19874==    by 0x9C38B8B: TCling::ResetGlobals() (TCling.cxx:3718)
==19874==    by 0x8026FE9: TROOT::EndOfProcessCleanups() (TROOT.cxx:1215)
==19874==    by 0x551E600F: ???
==19874==    by 0x7D400FC: WrapperCall(long, unsigned long, void*, void*, void*) (clingwrapper.cxx:774)
==19874==    by 0x7D403EA: Cppyy::CallV(long, void*, unsigned long, void*) (clingwrapper.cxx:825)
==19874==    by 0x7C82290: GILCallV(long, void*, CPyCppyy::CallContext*) (Executors.cxx:68)
==19874==    by 0x7C846EA: CPyCppyy::(anonymous namespace)::VoidExecutor::Execute(long, void*, CPyCppyy::CallContext*) (Executors.cxx:410)
==19874==    by 0x7C6730B: CPyCppyy::CPPMethod::ExecuteFast(void*, long, CPyCppyy::CallContext*) (CPPMethod.cxx:74)
==19874==  Address 0x4c4ca1c0 is 0 bytes inside a block of size 29 free'd
==19874==    at 0x403C4CD: operator delete(void*) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/external/valgrind/3.17.0-7ca83817e7379e83453f913e11e14834/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==19874==    by 0x4ECCA85E: ???
==19874==    by 0x9EDBF06: cling::IncrementalExecutor::runAndRemoveStaticDestructors(cling::Transaction*) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/lcg/root/6.28.01-24fc933274de02f0e1bb17b71bafaf30/lib/libCling.so)
==19874==    by 0x9E65DB8: cling::Interpreter::runAndRemoveStaticDestructors() (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/lcg/root/6.28.01-24fc933274de02f0e1bb17b71bafaf30/lib/libCling.so)
==19874==    by 0x9C38B8B: TCling::ResetGlobals() (TCling.cxx:3718)
==19874==    by 0x8026FE9: TROOT::EndOfProcessCleanups() (TROOT.cxx:1215)
==19874==    by 0x551E600F: ???
==19874==    by 0x7D400FC: WrapperCall(long, unsigned long, void*, void*, void*) (clingwrapper.cxx:774)
==19874==    by 0x7D403EA: Cppyy::CallV(long, void*, unsigned long, void*) (clingwrapper.cxx:825)
==19874==    by 0x7C82290: GILCallV(long, void*, CPyCppyy::CallContext*) (Executors.cxx:68)
==19874==    by 0x7C846EA: CPyCppyy::(anonymous namespace)::VoidExecutor::Execute(long, void*, CPyCppyy::CallContext*) (Executors.cxx:410)
==19874==    by 0x7C6730B: CPyCppyy::CPPMethod::ExecuteFast(void*, long, CPyCppyy::CallContext*) (CPPMethod.cxx:74)
==19874==  Block was alloc'd at
==19874==    at 0x4039F01: operator new(unsigned long) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/external/valgrind/3.17.0-7ca83817e7379e83453f913e11e14834/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==19874==    by 0x4ECCAAA1: ???
==19874==    by 0x4ECCA9C1: ???
==19874==    by 0x4ECCA92F: ???
==19874==    by 0x4ECCA2B2: ???
==19874==    by 0x4ECCA90B: ???
==19874==    by 0x4ECCA91C: ???
==19874==    by 0x4ECCA90F: ???
==19874==    by 0xB445061: (anonymous namespace)::GenericLLVMIRPlatformSupport::initialize(llvm::orc::JITDylib&) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/lcg/root/6.28.01-24fc933274de02f0e1bb17b71bafaf30/lib/libCling.so)
==19874==    by 0x9EDAEE8: cling::IncrementalExecutor::runStaticInitializersOnce(cling::Transaction&) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/lcg/root/6.28.01-24fc933274de02f0e1bb17b71bafaf30/lib/libCling.so)
==19874==    by 0x9E664A7: cling::Interpreter::executeTransaction(cling::Transaction&) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/lcg/root/6.28.01-24fc933274de02f0e1bb17b71bafaf30/lib/libCling.so)
==19874==    by 0x9EE97F9: cling::IncrementalParser::commitTransaction(llvm::PointerIntPair<cling::Transaction*, 2u, cling::IncrementalParser::EParseResult, llvm::PointerLikeTypeTraits<cling::Transaction*>, llvm::PointerIntPairInfo<cling::Transaction*, 2u, llvm::PointerLikeTypeTraits<cling::Transaction*> > >&, bool) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/lcg/root/6.28.01-24fc933274de02f0e1bb17b71bafaf30/lib/libCling.so)
Dr15Jones commented 1 year ago

@pcanal any progress on this?

Axel-Naumann commented 1 year ago

It's a traditional double delete, here as a global static in the interpreter memory. We need to find out which object it is, and why it's registered twice as static global. @pcanal do you think you can try to reproduce this? If so please take it. If not please assign it to me and I will debug next week (I'm off now).

pcanal commented 1 year ago

@Dr15Jones Can you remind me the reproducer?

Dr15Jones commented 1 year ago

@makortel could you help Philippe to reproduce the problem?

davidlange6 commented 1 year ago

something like

scram p CMSSW_13_2_ROOT628_X_2023-05-22-2300
cd CMSSW_13_2_ROOT628_X_2023-05-22-2300/src
cmsenv
git cms-addpkg DataFormats/Scouting
scram b
scram b runtests
makortel commented 1 year ago

Just for the record, I shared a simpler recipe to Philippe.

makortel commented 1 year ago

https://github.com/cms-sw/cmssw/pull/41964 removes the failing tests (after better tests were developed).

@pcanal Is there a need to preserve a setup to reproduce the crash?

pcanal commented 1 year ago

Yes, this is not resolved.

makortel commented 1 year ago

@smuzaffar Would it be feasible to have a longer-than-IB-lifetime build for ROOT628? Or do we keep the tests failing for now?

smuzaffar commented 1 year ago

I would say, let the test fail for now.

Dr15Jones commented 1 year ago

So I did a small amount of testing and I see the failure using the following minimum python

import ROOT

f = ROOT.TFile.Open("testDataFormatsScoutingRun2_step1.root")
e = ROOT.fwlite.Event(f)

the job fails as it is attempting to shutdown.

hahnjo commented 1 year ago

I believe this may also be fixed via https://github.com/cms-sw/cmssw/pull/42628, can you confirm?

makortel commented 1 year ago

It seems so. The last failures were in CMSSW_13_3_ROOT6_X_2023-08-28-2300, and https://github.com/cms-sw/cmssw/pull/42628 was merged in CMSSW_13_3_X_2023-08-29-1100. So far there has been 6 IBs (x2 to cover both 6.28 and master) without these failures. I also tested the simple reproducer of https://github.com/cms-sw/cmssw/issues/41222#issuecomment-1629517319 in CMSSW_13_3_ROOT628_X_2023-09-03-2300, and the job succeeded now.

makortel commented 1 year ago

+core

cmsbuild commented 1 year ago

This issue is fully signed and ready to be closed.

makortel commented 1 year ago

@cmsbuild, please close

makortel commented 1 year ago

I'm going to now proceed with the deletion of these unit tests (as they were superseded earlier, and were kept only to the investigation of this issue). If anyone subscribed thinks there would still be value in keeping the tests, scream soon.