cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0
1.08k stars 4.32k forks source link

DeepTauId throws (in event with zero taus?) #42444

Open VinInn opened 1 year ago

VinInn commented 1 year ago

The original issue #40437 has taken a different path so I reopen here just for this.

Reminder of relevant post: https://github.com/cms-sw/cmssw/issues/40437#issue-1522624055 https://github.com/cms-sw/cmssw/issues/40437#issuecomment-1373734266 https://github.com/cms-sw/cmssw/issues/28358 (closed. WHY?)

a log file https://cms-unified.web.cern.ch/cms-unified/joblogs/haozturk_ACDC0_Run2022D_BTagMu_10Dec2022_221221_171338_6693/8001/DataProcessing/020da37f-6871-4688-a86e-2b7e8a6bc683-26-0-logArchive/job/WMTaskSpace/cmsRun1/cmsRun1-stdout.log

here the "shooting gun": trying to reproduce the event happen to have ZERO taus https://github.com/cms-sw/cmssw/issues/40437#issuecomment-1633730153

cmsbuild commented 1 year ago

A new Issue was created by @VinInn Vincenzo Innocente.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

VinInn commented 1 year ago

I still do not understand how a memory corruption can modify the values in an otherwise empty vector (pointer, size, capacity, all 0) w/o causing a segfault before reaching the throw. I'n my opinion there is a legit vector there with "something" in it that reasonably fake a tau.

VinInn commented 1 year ago

the tau collection is a view: no way to survive even the first deference

slava77 commented 1 year ago

it sounds also there is a reproducibility problem. The PR tests cover the single thread cases somewhat (even though not on the right SSE-only hardware)

Could it be multithreaded reproducibility issue? Perhaps just running reco comparisons on MT job outputs is enough to get an idea.

makortel commented 1 year ago

assign reconstruction

cmsbuild commented 1 year ago

New categories assigned: reconstruction

@mandrenguyen,@clacaputo you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel commented 11 months ago

Let me tag @cms-sw/tau-pog-l2 also in this issue

makortel commented 11 months ago

Just for the record, I'm hitting into this issue while trying to run an HLT job through VTune. Without VTune the configuration works, but within VTune the job crashes in a few different ways, this exception being one of them.

makortel commented 11 months ago

Running VTune on a single-thread job in ASAN build resulted in

AddressSanitizer: CHECK failed: asan_allocator.cpp:188 "((old)) == ((kAllocBegMagic))" (0x302030303a303020, 0xcc6e96b9cc6e96b9) (tid=2772)
    #0 0x7fddac36d4ba in CheckUnwind ../../../../libsanitizer/asan/asan_rtl.cpp:67
    #1 0x7fddac38de65 in __sanitizer::CheckFailed(char const*, int, char const*, unsigned long long, unsigned long long) ../../../../libsanitizer/sanitizer_common/sanitizer_termination.cpp:86
    #2 0x7fddac2d845b in __asan::LargeChunkHeader::Set(__asan::AsanChunk*) ../../../../libsanitizer/asan/asan_allocator.cpp:188
    #3 0x7fddac2d845b in __asan::LargeChunkHeader::Set(__asan::AsanChunk*) ../../../../libsanitizer/asan/asan_allocator.cpp:178
    #4 0x7fddac2d845b in __asan::QuarantineCallback::Recycle(__asan::AsanChunk*) ../../../../libsanitizer/asan/asan_allocator.cpp:204
    #5 0x7fddac2d8655 in __sanitizer::Quarantine<__asan::QuarantineCallback, __asan::AsanChunk>::DoRecycle(__sanitizer::QuarantineCache<__asan::QuarantineCallback>*, __asan::QuarantineCallback) ../../../../libsanitizer/sanitizer_common/sanitizer_quarantine.h:193
    #6 0x7fddac2d8b3d in __sanitizer::Quarantine<__asan::QuarantineCallback, __asan::AsanChunk>::Recycle(unsigned long, __asan::QuarantineCallback) ../../../../libsanitizer/sanitizer_common/sanitizer_quarantine.h:181
    #7 0x7fddac2d411c in __sanitizer::Quarantine<__asan::QuarantineCallback, __asan::AsanChunk>::Put(__sanitizer::QuarantineCache<__asan::QuarantineCallback>*, __asan::QuarantineCallback, __asan::AsanChunk*, unsigned long) ../../../../libsanitizer/sanitizer_common/sanitizer_quarantine.h:112
    #8 0x7fddac2d411c in __asan::Allocator::QuarantineChunk(__asan::AsanChunk*, void*, __sanitizer::BufferedStackTrace*) ../../../../libsanitizer/asan/asan_allocator.cpp:665
    #9 0x7fddac3655ad in operator delete(void*, unsigned long) ../../../../libsanitizer/asan/asan_new_delete.cpp:164
    #10 0x7fdd6f44320f in std::_Sp_counted_ptr_inplace<BasicSingleTrajectoryState, churn_allocator<BasicSingleTrajectoryState>, (__gnu_cxx::_Lock_policy)2>::_M_destroy() (/cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_ASAN_X_2023-11-15-2300/lib/el8_amd64_gcc12/libTrackingToolsTrajectoryState.so+0x5e20f)
    #11 0x7fdc6843beed in CkfTrajectoryBuilder::limitedCandidates(std::shared_ptr<TrajectorySeed const> const&, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&, std::vector<Trajectory, std::allocator<Trajectory> >&) const (/cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_ASAN_X_2023-11-15-2300/lib/el8_amd64_gcc12/libRecoTrackerCkfPattern.so+0xf4eed)
    #12 0x7fdc6843fbb6 in CkfTrajectoryBuilder::limitedCandidates(TrajectorySeed const&, TempTrajectory&, std::vector<Trajectory, std::allocator<Trajectory> >&) const (/cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_ASAN_X_2023-11-15-2300/lib/el8_amd64_gcc12/libRecoTrackerCkfPattern.so+0xf8bb6)
    #13 0x7fdc6844079c in CkfTrajectoryBuilder::buildTrajectories(TrajectorySeed const&, std::vector<Trajectory, std::allocator<Trajectory> >&, unsigned int&, TrajectoryFilter const*) const (/cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_ASAN_X_2023-11-15-2300/lib/el8_amd64_gcc12/libRecoTrackerCkfPattern.so+0xf979c)
    #14 0x7fdc683c9306 in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&)::{lambda(unsigned long)#1}::operator()(unsigned long) const (/cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_ASAN_X_2023-11-15-2300/lib/el8_amd64_gcc12/libRecoTrackerCkfPattern.so+0x82306)
    #15 0x7fdc683d04fe in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&) (/cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_ASAN_X_2023-11-15-2300/lib/el8_amd64_gcc12/libRecoTrackerCkfPattern.so+0x894fe)
    #16 0x7fddac0f6dd0 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) (/cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_ASAN_X_2023-11-15-2300/lib/el8_amd64_gcc12/libFWCoreFramework.so+0x986dd0)

that hints towards ASAN own data structures being overwritten

makortel commented 11 months ago

Running VTune on a single-thread job of cmsRunGlibC resulted in

*** Error in `cmsRunGlibC': corrupted size vs. prev_size: 0x0000000047f1cc00 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x7f474)[0x7f8d1b99f474]
/lib64/libc.so.6(+0x816a4)[0x7f8d1b9a16a4]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(+0x9eadc44)[0x7f8c7f1a0c44]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(+0x9eae8bb)[0x7f8c7f1a18bb]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(+0x9fd2efd)[0x7f8c7f2c5efd]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(+0x9af7e27)[0x7f8c7edeae27]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(+0x97995d3)[0x7f8c7ea8c5d3]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(+0x71123b7)[0x7f8c7c4053b7]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(_ZNK5Eigen30TensorContractionEvaluatorBaseINS_15TensorEvaluatorIKNS_19TensorContractionOpIKNS_5arrayINS_9IndexPairIlEELm1EEEKNS_17TensorReshapingOpIKNS_6DSizesIlLi2EEEKNS_18TensorImagePatchOpILln1ELln1EKNS_9TensorMapINS_6TensorIKfLi4ELi1ElEELi16ENS_11MakePointerEEEEEEEKNS8_ISB_SJ_EEKN10tensorflow33LaunchFusedConv2DWithOutputKernelIfE19OutputKernelWrapperEEENS_16ThreadPoolDeviceEEEE15evalGemmPartialILb1ELb1ELb0ELi0ELb1EEEvPflli+0x7ad)[0x7f8c7c455c6d]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(_ZN5Eigen8internal14TensorExecutorIKNS_14TensorAssignOpINS_9TensorMapINS_6TensorIfLi4ELi1ElEELi16ENS_11MakePointerEEEKNS_17TensorReshapingOpIKNS_6DSizesIlLi4EEEKNS_19TensorContractionOpIKNS_5arrayINS_9IndexPairIlEELm1EEEKNS8_IKNS9_IlLi2EEEKNS_18TensorImagePatchOpILln1ELln1EKNS3_INS4_IKfLi4ELi1ElEELi16ES6_EEEEEEKNS8_ISJ_SO_EEKN10tensorflow33LaunchFusedConv2DWithOutputKernelIfE19OutputKernelWrapperEEEEEEENS_16ThreadPoolDeviceELb1ELNS0_15TiledEvaluationE0EE3runERS15_RKS16_+0x95)[0x7f8c7c487365]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(+0x7195164)[0x7f8c7c488164]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(_ZN10tensorflow33LaunchFusedConv2DWithOutputKernelIfEclINS_19BiasAddOutputKernelIfNS_8IdentityEEEEEvRKT_PNS_15OpKernelContextERKNS_6TensorESD_PSB_+0x1ed)[0x7f8c7c4896ad]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(_ZN10tensorflow13FusedConv2DOpIN5Eigen16ThreadPoolDeviceEfE7ComputeEPNS_15OpKernelContext[39/1803]0x7f8c7c48e1e5]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(_ZN10tensorflow16ThreadPoolDevice7ComputeEPNS_8OpKernelEPNS_15OpKernelContextE+0x4b)[0x7f8c74ae2f6b]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(+0x1258ef1)[0x7f8c74bd0ef1]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(+0x1259fd6)[0x7f8c74bd1fd6]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(_ZN3tsl6thread10ThreadPool8ScheduleESt8functionIFvvEE+0xfc)[0x7f8c81c8c99c]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(+0xcad7d78)[0x7f8c81dcad78]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(+0x123e924)[0x7f8c74bb6924]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(+0x124436e)[0x7f8c74bbc36e]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(+0x124dfa5)[0x7f8c74bc5fa5]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(+0x125820c)[0x7f8c74bd020c]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(+0x1259fd6)[0x7f8c74bd1fd6]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(_ZN3tsl6thread10ThreadPool8ScheduleESt8functionIFvvEE+0xfc)[0x7f8c81c8c99c]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(+0xcad7d78)[0x7f8c81dcad78]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(+0x1244271)[0x7f8c74bbc271]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(+0x1248f08)[0x7f8c74bc0f08]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(_ZN10tensorflow13DirectSession11RunInternalElRKNS_10RunOptionsEPNS_18CallFrameInterfaceEPNS0_16ExecutorsAndKeysEPNS_11RunMetadataERKN3tsl6thread17ThreadPoolOptionsE+0x7fa)[0x7f8c81ddd61a]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(_ZN10tensorflow13DirectSession3RunERKNS_10RunOptionsERKSt6vectorISt4pairINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS_6TensorEESaISD_EERKS4_ISB_SaISB_EESL_PS4_ISC_SaISC_EEPNS_11RunMetadataERKN3tsl6thread17ThreadPoolOptionsE+0x96e)[0x7f8c81de034e]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libPhysicsToolsTensorFlow.so(_ZN10tensorflow3runEPNS_7SessionERKSt6vectorISt4pairINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS_6TensorEESaISB_EERKS2_IS9_SaIS9_EEPS2_ISA_SaISA_EERKN3tsl6thread17ThreadPoolOptionsE+0xa5)[0x7f8c876b54d5]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libPhysicsToolsTensorFlow.so(_ZN10tensorflow3runEPNS_7SessionERKSt6vectorISt4pairINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS_6TensorEESaISB_EERKS2_IS9_SaIS9_EEPS2_ISA_SaISA_EEPN3tsl6thread19ThreadPoolInterfaceE+0x19)[0x7f8c876b5589]
/build/mkortela/debug/hlt/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginRecoTauTagRecoTauPlugins.so(+0xa729a)[0x7f8bf985f29a]
/build/mkortela/debug/hlt/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginRecoTauTagRecoTauPlugins.so(+0x9bb74)[0x7f8bf9853b74]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so(_ZN3edm6stream21EDProducerAdaptorBase7doEventERKNS_19EventTransitionInfoEPNS_16ActivityRegistryEPKNS_20ModuleCallingContextE+0x141)[0x7f8d1e8f54b1]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so(+0x1b8a19)[0x7f8d1e865a19]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so(+0x1b8fb4)[0x7f8d1e865fb4]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreConcurrency.so(+0x5f28)[0x7f8d253faf28]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtbb.so.12(+0x28083)[0x7f8d1cb4c083]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so(+0x13d21b)[0x7f8d1e7ea21b]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so(_ZN3edm14EventProcessor11processRunsEv+0x205)[0x7f8d1e7f3bc5]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so(_ZN3edm14EventProcessor15runToCompletionEv+0x23f)[0x7f8d1e7f418f]
cmsRunGlibC[0x4074ef]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtbb.so.12(+0x149bd)[0x7f8d1cb389bd]
cmsRunGlibC[0x408ed2]
cmsRunGlibC(main+0x14c)[0x40518c]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f8d1b942555]
cmsRunGlibC[0x405451]
======= Memory map: ========
00400000-00404000 r--p 00000000 00:31 245399302                          /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/bin/slc7_amd64_gcc12/cmsRunGlibC
00404000-00405000 r-xp 00004000 00:31 245399302                          /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/bin/slc7_amd64_gcc12/cmsRunGlibC
00405000-0040a000 r-xp 00005000 00:31 245399302                          /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/bin/slc7_amd64_gcc12/cmsRunGlibC
0040a000-0040d000 r--p 0000a000 00:31 245399302                          /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/bin/slc7_amd64_gcc12/cmsRunGlibC
0040d000-0040e000 r--p 0000c000 00:31 245399302                          /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/bin/slc7_amd64_gcc12/cmsRunGlibC
0040e000-0040f000 rw-p 0000d000 00:31 245399302                          /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/bin/slc7_amd64_gcc12/cmsRunGlibC
0040f000-00410000 rwxp 00000000 00:00 0
00419000-4d534000 rw-p 00000000 00:00 0                                  [heap]
7f8bd8000000-7f8bd8021000 rw-p 00000000 00:00 0
7f8bd8021000-7f8bdc000000 ---p 00000000 00:00 0
7f8bdee6c000-7f8be100b000 rw-p 00000000 00:00 0
7f8be305c000-7f8be314c000 rwxp 00000000 00:00 0
7f8be314c000-7f8be8000000 rw-p 00000000 00:00 0
7f8be8000000-7f8be8021000 rw-p 00000000 00:00 0
7f8be8021000-7f8bec000000 ---p 00000000 00:00 0
7f8bec03d000-7f8bec35d000 rwxp 00000000 00:00 0
7f8bec35d000-7f8bec379000 rw-p 00000000 00:00 0
7f8bec379000-7f8bec3c9000 rwxp 00000000 00:00 0
7f8bec3e5000-7f8bec435000 rwxp 00000000 00:00 0
7f8bec444000-7f8bec44a000 r--p 00000000 00:31 231860432                  /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginTrackingToolsTrajectoryCleaningPlugins.so
7f8bec450000-7f8bec4a0000 rwxp 00000000 00:00 0
7f8bec4a0000-7f8bec4a2000 r--p 00000000 00:31 231860432                  /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginTrackingToolsTrajectoryCleaningPlugins.so
7f8bec4a2000-7f8bec4a3000 r-xp 00002000 00:31 231860432                  /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginTrackingToolsTrajectoryCleaningPlugins.so
7f8bec4a3000-7f8bec4a4000 r--p 00003000 00:31 231860432                  /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginTrackingToolsTrajectoryCleaningPlugins.so
7f8bec4a4000-7f8bec4a5000 r--p 00003000 00:31 231860432                  /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginTrackingToolsTrajectoryCleaningPlugins.so
7f8bec4a5000-7f8bec4a6000 rw-p 00004000 00:31 231860432                  /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginTrackingToolsTrajectoryCleaningPlugins.so
7f8bec4a6000-7f8bec507000 rwxp 00000000 00:00 0
makortel commented 11 months ago

Looks like memory corruption inside Tensorflow is not unheard of

(no idea if any of these are in any way related)

makortel commented 11 months ago

We updated tensorflow to 2.12 in 13_3_0_pre1 (built on Aug 3). I wonder if there would be any temporal correlation with the RECO profiling crashes?

makortel commented 11 months ago

At least in CMSSW_13_2_X_2023-11-17-1100 the IgProf job on step3 of workflow 136.889 succeeds, whereas in CMSSW_13_3_X_2023-11-17-1100 the job crashes (but of course there are many other changes in between)

makortel commented 11 months ago

(maybe I'm going in a wrong direction, given that the DeepTauId throwing an exception on nan is known for several years https://github.com/cms-sw/cmssw/issues/28358)

makortel commented 11 months ago

In my HLT menu test case with VTune, if I "disable" the two DeepTauId modules by adding an always-rejecting EDFilter before them, the job succeeds. Enabling either one, or both, of the DeepTauId modules makes the job crash most of the time.

kandrosov commented 11 months ago

Hi @makortel. By "succeed", do you mean just to finish without crashing, or the outputs are always identical? Especially for other taggers that use TF. In the past, we couldn't conclude with 100% certainty whether memory corruption happens in the DeepTau module or it occures elsewhere and corrupts DeepTau-related memory. Is the machine on which one could reproduce the crash accessible by normal users? If yes, could you please share instructions on how one can reproduce the crash?

makortel commented 11 months ago

Hi @kandrosov

By "succeed", do you mean just to finish without crashing, or the outputs are always identical?

I mean "finishes without crashing". I'm not checking the outputs in any way.

I sent you privately the HLT recipe that I used. By adding an event rejecting EDFilter before/after the DeepTauId modules in the menu I see the behavior that

I started also look into the step3 of the workflow 136.889, that is used in the IgProf profiling in IBs (which is currently crashing in 13_3_X and 14_0_X). Running the step3 as it is indeed crashes also in VTune, in a way that looks like memory corruption. If I run everything up to and including the DeepTauId modules there (*), the job still crashes. But, contrary to the HLT test case, if I run everything up to the modules DeepTauId consumes (**), but not DeepTauId, the job crashes as well. This behavior would indicate that something else than (or maybe in addition to?) DeepTauId would cause memory corruption when the job is run through VTune.

I'll continue to investigate (albeit slowly).


(*) by adding a snippet

process.deepTauSequence = cms.Sequence(process.deepTau2017v2p1ForMini+process.deepTau2018v2p5ForMini
process.tauPath = cms.Path(process.deepTauSequence, process.patAlgosToolsTask, process.patTask)
process.schedule = cms.Schedule(
    process.raw2digi_step,
    process.reconstruction_step,
    process.tauPath
)

to the end of the step3 configuration


(**) by adding a snippet

process.deepTauSequence = cms.Sequence(
    process.offlineSlimmedPrimaryVertices
    + process.packedPFCandidates
    + process.hpsPFTauTransverseImpactParameters
    + process.slimmedTausNoDeepIDs
    + process.slimmedElectrons
    + process.slimmedMuons
)
process.tauPath = cms.Path(process.deepTauSequence, process.patAlgosToolsTask, process.patTask)
process.schedule = cms.Schedule(
    process.raw2digi_step,
    process.reconstruction_step,
    process.tauPath
)

to the end of the step3 configuration

makortel commented 11 months ago

With the step3 of 136.889 I got to the point that running initialStep and everything it consumes crashes (under VTune), while running any of the modules initialStep consumes individually technically works. The smoking gun here is that initialStep module is of type TrackTfClassifier that uses Tensorflow. The stack trace of the working thread contains a few interesting frames, even if it is mostly corrupted

Thread 1 (Thread 0x7f2b0d5aa740 (LWP 21048) "cmsRun"):
#0  0x00007f2b0f464ddd in poll () from /lib64/libc.so.6
#1  0x00007f2b16f75841 in ?? ()
#2  0x01007f2b0d5a4858 in ?? ()
#3  0x00007ffc13af4400 in ?? ()
#4  0x00007ffc13af4920 in ?? ()
#5  0x00007ffc13af4410 in ?? ()
#6  0x00007ffc13af44a0 in ?? ()
#7  0x00007f2b176b1060 in ?? ()
#8  0x0000002100000001 in ?? ()
#9  0x00007ffc13af4390 in ?? ()
#10 0x00007f2b0d582240 in ?? ()
#11 0x00007f2b13d241c0 in ?? ()
#12 0x000000006567babf in ?? ()
#13 0x0000000000098c43 in ?? ()
#14 0x00007f2a1c780320 in ?? ()
#15 0x00007f2a71cc753b in tensorflow::SimplePropagatorState::SimplePropagatorState(tensorflow::ImmutableExecutorState const&, long, tensorflow::ImmutableExecutorState::FrameInfo const&, bool) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2
#16 0x00007f2ad712db5f in full_read.constprop () from /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginFWCoreServicesPlugins.so
#17 0x00007f2ad70e650c in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginFWCoreServicesPlugins.so
#18 0x00007f2ad70e6e70 in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginFWCoreServicesPlugins.so
#19 0x00007f2b1715b0fe in ?? ()
#20 0x00007ffc13af5730 in ?? ()
#21 0x00007f2b15106698 in ?? ()
#22 0x000000000000000b in ?? ()
#23 0x00007ffc13af5730 in ?? ()
#24 0x00007ffc13af5170 in ?? ()
#25 0x00007f2b15106060 in ?? ()
#26 0x00007ffc13af5270 in ?? ()
#27 0x00007f2b1715c416 in ?? ()
#28 0x000000000000000b in ?? ()
#29 0x00007ffc13af5600 in ?? ()
#30 0x0000000000000000 in ?? ()

Recipe to reproduce (at CERN) is more or less

cmsrel CMSSW_14_0_0_pre0
cd CMSSW_14_0_0_pre0/src
cmsenv
runTheMatrix -l 136.889 --ibeos
cd 136.889_RunMET2018D
cat >>step3_RAW2DIGI_L1Reco_RECO_SKIM_PAT_ALCA_DQM.py <<EOF
def make(name):
    print(f"Only up to {name}")
    process.testSequence = cms.Sequence(getattr(process, name))
make("initialStep") # should crash
# immediate dependencies of initialStep
#make("offlineBeamSpot")
#make("firstStepPrimaryVertices")
#make("initialStepTracks")
process.testPath = cms.Path(process.testSequence, process.patAlgosToolsTask, process.patTask)
process.schedule = cms.Schedule(process.raw2digi_step, process.reconstruction_step, process.testPath)
EOF

source /cvmfs/projects.cern.ch/intelsw/oneAPI/linux/x86_64/2023/vtune/2023.2.0/vtune-vars.sh
vtune -collect hotspots -- cmsRun step3_RAW2DIGI_L1Reco_RECO_SKIM_PAT_ALCA_DQM.py
makortel commented 11 months ago

To be again more specific for DeepTauId, I took workflow 140.201 that runs re-MINI, and was able to reproduce the behavior I saw in the earlier HLT job, i.e. including DeepTauId crashes (under VTune), but running only up to the modules consumed by the DeepTauId modules technically works.

Recipe to reproduce at CERN

cmsrel CMSSW_14_0_0_pre0
cd CMSSW_14_0_0_pre0/src
cmsenv
runTheMatrix -l 140.201 --ibeos
cd 140.201_RunJetMET2022D_reMINI
cat >>step2_PAT.py <<EOF
process.deepTauSequence = cms.Sequence(process.deepTau2017v2p1ForMini+process.deepTau2018v2p5ForMini)
# uncomment for the modules consumed by DeepTauId
#process.deepTauSequence = cms.Sequence(process.offlineSlimmedPrimaryVertices+process.packedPFCandidates+process.slimmedTausNoDeepIDs+process.slimmedElectrons+process.slimmedMuons)
process.tauPath = cms.Path(process.deepTauSequence, process.patAlgosToolsTask, process.patTask)
process.schedule = cms.Schedule(process.tauPath)
EOF

source /cvmfs/projects.cern.ch/intelsw/oneAPI/linux/x86_64/2023/vtune/2023.2.0/vtune-vars.sh
export SITECONFIG_PATH=/cvmfs/cms-ib.cern.ch/SITECONF/local
vtune -collect hotspots -- cmsRun step2_PAT.py
makortel commented 11 months ago

I repeated my tests in https://github.com/cms-sw/cmssw/issues/42444#issuecomment-1832860800 and https://github.com/cms-sw/cmssw/issues/42444#issuecomment-1832866146 with IgProf (*), and see exactly the same behavior. I.e. running up to the modules that initialStep / DeepTauId consume technically works, but running initialStep / DeepTauId results in crash.

(*) replace

source /cvmfs/projects.cern.ch/intelsw/oneAPI/linux/x86_64/2023/vtune/2023.2.0/vtune-vars.sh
vtune -collect hotspots -- <command>

with

igprof -d -t cmsRun -pp -z -o igprof.perf.gz <command>

FYI @gartung . I think we should add crash detection to the IB profiling jobs. Something along after detecting a crash in some step of the workflow, not running the subsequent steps, and communicating the failure to the IB dashboard (e.g. similarly to the HLT validation tests that show a red backround when there are failures).

makortel commented 11 months ago

With IgProf I managed to run it together with Valgrind, and got (on 136.889 step 3 customized to run initialStep and its dependencies)

valgrind: m_mallocfree.c:278 (mk_plain_bszB): Assertion 'bszB != 0' failed.
valgrind: This is probably caused by your program erroneously writing past the
end of a heap block and corrupting heap metadata.  If you fix any
invalid writes reported by Memcheck, this assertion failure will
probably go away.  Please try that before reporting this as a bug.

host stacktrace:
==11218==    at 0x5803F638: ??? (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/valgrind/3.17.0-cc798bb086888e3df1cae9920139b307/libexec/valgrind/memcheck-amd64-linux)
==11218==    by 0x5803F757: ??? (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/valgrind/3.17.0-cc798bb086888e3df1cae9920139b307/libexec/valgrind/memcheck-amd64-linux)
==11218==    by 0x5803F8DE: ??? (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/valgrind/3.17.0-cc798bb086888e3df1cae9920139b307/libexec/valgrind/memcheck-amd64-linux)
==11218==    by 0x58046B96: ??? (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/valgrind/3.17.0-cc798bb086888e3df1cae9920139b307/libexec/valgrind/memcheck-amd64-linux)
==11218==    by 0x58047679: ??? (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/valgrind/3.17.0-cc798bb086888e3df1cae9920139b307/libexec/valgrind/memcheck-amd64-linux)
==11218==    by 0x5800485C: ??? (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/valgrind/3.17.0-cc798bb086888e3df1cae9920139b307/libexec/valgrind/memcheck-amd64-linux)
==11218==    by 0x58004CE7: ??? (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/valgrind/3.17.0-cc798bb086888e3df1cae9920139b307/libexec/valgrind/memcheck-amd64-linux)
==11218==    by 0x58004EB9: ??? (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/valgrind/3.17.0-cc798bb086888e3df1cae9920139b307/libexec/valgrind/memcheck-amd64-linux)
==11218==    by 0x580933DB: ??? (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/valgrind/3.17.0-cc798bb086888e3df1cae9920139b307/libexec/valgrind/memcheck-amd64-linux)
==11218==    by 0x580DC945: ??? (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/valgrind/3.17.0-cc798bb086888e3df1cae9920139b307/libexec/valgrind/memcheck-amd64-linux)

sched status:
  running_tid=1

Thread 1: status = VgTs_Runnable (lwpid 11218)
==11218==    at 0x402EF11: operator new(unsigned long) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/valgrind/3.17.0-cc798bb086888e3df1cae9920139b307/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==11218==    by 0x55DEC210: tensorflow::OpKernelContext::allocate_output(int, tensorflow::TensorShape const&, tensorflow::Tensor**, tsl::AllocatorAttributes) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_framework.so.2.12.0)
==11218==    by 0x55DEC6D0: tensorflow::OpKernelContext::allocate_output(int, tensorflow::TensorShape const&, tensorflow::Tensor**) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_framework.so.2.12.0)
==11218==    by 0x4A65B582: tensorflow::FusedMatMulOp<Eigen::ThreadPoolDevice, float>::Compute(tensorflow::OpKernelContext*) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_cc.so.2.12.0)
==11218==    by 0x5606CF6A: tensorflow::ThreadPoolDevice::Compute(tensorflow::OpKernel*, tensorflow::OpKernelContext*) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_framework.so.2.12.0)
==11218==    by 0x5615AEF0: tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::ProcessInline(tensorflow::SimplePropagatorState::TaggedNodeReadyQueue*, long) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_framework.so.2.12.0)
==11218==    by 0x5615BFD5: tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::Process(tensorflow::SimplePropagatorState::TaggedNode, long) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_framework.so.2.12.0)
==11218==    by 0x4F8BC99B: tsl::thread::ThreadPool::Schedule(std::function<void ()>) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_cc.so.2.12.0)
==11218==    by 0x4F9FAD77: std::_Function_handler<void (std::function<void ()>), tensorflow::DirectSession::RunInternal(long, tensorflow::RunOptions const&, tensorflow::CallFrameInterface*, tensorflow::DirectSession::ExecutorsAndKeys*, tensorflow::RunMetadata*, tsl::thread::ThreadPoolOptions const&)::{lambda(std::function<void ()>)#6}>::_M_invoke(std::_Any_data const&, std::function<void ()>&&) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_cc.so.2.12.0)
==11218==    by 0x56140923: void tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::RunTask<std::_Bind<void (tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::*(tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>*, tensorflow::SimplePropagatorState::TaggedNode, long))(tensorflow::SimplePropagatorState::TaggedNode, long)> >(std::_Bind<void (tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::*(tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>*, tensorflow::SimplePropagatorState::TaggedNode, long))(tensorflow::SimplePropagatorState::TaggedNode, long)>&&, int) [clone .constprop.0] (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_framework.so.2.12.0)
==11218==    by 0x5614636D: tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::ScheduleReady(absl::lts_20220623::InlinedVector<tensorflow::SimplePropagatorState::TaggedNode, 8ul, std::allocator<tensorflow::SimplePropagatorState::TaggedNode> >*, tensorflow::SimplePropagatorState::TaggedNodeReadyQueue*) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_framework.so.2.12.0)
==11218==    by 0x5614FFA4: tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::NodeDone(tsl::Status const&, absl::lts_20220623::InlinedVector<tensorflow::SimplePropagatorState::TaggedNode, 8ul, std::allocator<tensorflow::SimplePropagatorState::TaggedNode> >*, tensorflow::NodeExecStatsInterface*, tensorflow::SimplePropagatorState::TaggedNodeReadyQueue*) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_framework.so.2.12.0)
==11218==    by 0x5615A20B: tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::ProcessInline(tensorflow::SimplePropagatorState::TaggedNodeReadyQueue*, long) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_framework.so.2.12.0)
==11218==    by 0x5615BFD5: tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::Process(tensorflow::SimplePropagatorState::TaggedNode, long) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_framework.so.2.12.0)
==11218==    by 0x4F8BC99B: tsl::thread::ThreadPool::Schedule(std::function<void ()>) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_cc.so.2.12.0)
==11218==    by 0x4F9FAD77: std::_Function_handler<void (std::function<void ()>), tensorflow::DirectSession::RunInternal(long, tensorflow::RunOptions const&, tensorflow::CallFrameInterface*, tensorflow::DirectSession::ExecutorsAndKeys*, tensorflow::RunMetadata*, tsl::thread::ThreadPoolOptions const&)::{lambda(std::function<void ()>)#6}>::_M_invoke(std::_Any_data const&, std::function<void ()>&&) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_cc.so.2.12.0)
==11218==    by 0x56146270: tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::ScheduleReady(absl::lts_20220623::InlinedVector<tensorflow::SimplePropagatorState::TaggedNode, 8ul, std::allocator<tensorflow::SimplePropagatorState::TaggedNode> >*, tensorflow::SimplePropagatorState::TaggedNodeReadyQueue*) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_framework.so.2.12.0)
==11218==    by 0x5614AF07: tensorflow::(anonymous namespace)::ExecutorImpl::RunAsync(tensorflow::Executor::Args const&, std::function<void (tsl::Status const&)>) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_framework.so.2.12.0)
==11218==    by 0x4FA0D619: tensorflow::DirectSession::RunInternal(long, tensorflow::RunOptions const&, tensorflow::CallFrameInterface*, tensorflow::DirectSession::ExecutorsAndKeys*, tensorflow::RunMetadata*, tsl::thread::ThreadPoolOptions const&) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_cc.so.2.12.0)
==11218==    by 0x4FA1034D: tensorflow::DirectSession::Run(tensorflow::RunOptions const&, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor> > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*, tensorflow::RunMetadata*, tsl::thread::ThreadPoolOptions const&) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/tensorflow/2.12.0-548965f892775762ca90491ffd4d9cd6/lib/libtensorflow_cc.so.2.12.0)
==11218==    by 0x42F1D4D4: tensorflow::run(tensorflow::Session*, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor> > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*, tsl::thread::ThreadPoolOptions const&) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libPhysicsToolsTensorFlow.so)
==11218==    by 0x42F1D588: tensorflow::run(tensorflow::Session*, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor> > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*, tsl::thread::ThreadPoolInterface*) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libPhysicsToolsTensorFlow.so)
==11218==    by 0x5F36B736: (anonymous namespace)::TfDnn::operator()(std::vector<reco::Track, std::allocator<reco::Track> > const&, reco::BeamSpot const&, std::vector<reco::Vertex, std::allocator<reco::Vertex> > const&) const (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginRecoTrackerFinalTrackSelectorsPlugins.so)
==11218==    by 0x5F36C085: TrackMVAClassifier<(anonymous namespace)::TfDnn, void>::computeMVA(std::vector<reco::Track, std::allocator<reco::Track> > const&, reco::BeamSpot const&, std::vector<reco::Vertex, std::allocator<reco::Vertex> > const&, std::vector<std::pair<float, bool>, std::allocator<std::pair<float, bool> > >&) const (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginRecoTrackerFinalTrackSelectorsPlugins.so)
==11218==    by 0x5F3E9AF7: TrackMVAClassifierBase::produce(edm::Event&, edm::EventSetup const&) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libRecoTrackerFinalTrackSelectors.so)
==11218==    by 0x4C6C4B0: edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so)
==11218==    by 0x4C50F1D: edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so)
==11218==    by 0x4BDCA18: std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so)
==11218==    by 0x4BDCFB3: edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so)
==11218==    by 0x41E7F27: tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreConcurrency.so)
==11218==    by 0x640B290: tbb::detail::r1::task_dispatcher::execute_and_wait(tbb::detail::d1::task*, tbb::detail::d1::wait_context&, tbb::detail::d1::task_group_context&) (task_dispatcher.h:322)
==11218==    by 0x4B6121A: edm::FinalWaitingTask::wait() (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so)
==11218==    by 0x4B6ABC4: edm::EventProcessor::processRuns() (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so)
==11218==    by 0x4B6B18E: edm::EventProcessor::runToCompletion() (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so)
==11218==    by 0x4074EE: tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/bin/slc7_amd64_gcc12/cmsRun)
==11218==    by 0x63F79BC: tbb::detail::r1::task_arena_impl::execute(tbb::detail::d1::task_arena_base&, tbb::detail::d1::delegate_base&) (arena.cpp:688)
==11218==    by 0x408ED1: main::{lambda()#1}::operator()() const (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/bin/slc7_amd64_gcc12/cmsRun)
==11218==    by 0x40518B: main (in /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/bin/slc7_amd64_gcc12/cmsRun)
client stack range: [0x1FFEFAF000 0x1FFEFF6FFF] client SP: 0x1FFEFF1060
valgrind stack range: [0x1008AA6000 0x1008BA5FFF] top usage: 15736 of 1048576
makortel commented 11 months ago

Just shooting in the dark, I increased the stack size per thread from 10 MB to 20 MB with process.options.sizeOfStackForThreadsInKB = 20*1024, and with that I was able to run workflow 136.889 step 3 customized to run initialStep through IgProf. The full step 3 configuration, however, still crashed. So maybe stack size does not play a role.

VinInn commented 11 months ago

the heap-allocator "meta-data" are most probably corrupted. is this with JeMalloc? Maybe Tensorflow makes assumptions that JeMalloc does not comply.( alignment?)

makortel commented 11 months ago

the heap-allocator "meta-data" are most probably corrupted. is this with JeMalloc? Maybe Tensorflow makes assumptions that JeMalloc does not comply.( alignment?)

My recent tests were indeed with jemalloc. My HLT test case crashed with glibc malloc as well (https://github.com/cms-sw/cmssw/issues/42444#issuecomment-1817056942) when ran through VTune. I could test the offline workflows and IgProf with glibc malloc too.

VinInn commented 11 months ago

my tool shows that tensoerflow calls aligned_alloc function_task<edm:WaitingTaskList:announce():{lambda()#1}>:execute;edm:Worker:RunModuleTask<edm:OccurrenceTraits<edm:EventPrincipal,(edm:BranchActionType)1>>:execute;exception_ptr:exception_ptredm:Worker:runModuleAfterAsyncPrefetch<edm:OccurrenceTraits<edm:EventPrincipal,(edm:BranchActionType)1>>(exception_ptr:exception_ptr,edm:OccurrenceTraits<edm:EventPrincipal,(edm:BranchActionType)1>:TransitionInfoType,edm:StreamID,edm:ParentContext,edm:OccurrenceTraits<edm:EventPrincipal,;edm:WorkerT:implDo;edm:stream:EDProducerAdaptorBase:doEvent;DeepMETProducer:produce;tensorflow:run;tensorflow:run;tensorflow:DirectSession:Run;tensorflow:DirectSession:GetOrCreateExecutors;tensorflow:DirectSession:CreateExecutors;tensorflow:DirectSession:CreateGraphs;tensorflow:GraphExecutionState:BuildGraph;tensorflow:GraphExecutionState:OptimizeGraph;tensorflow:grappler:RunMetaOptimizer;tensorflow:grappler:MetaOptimizer:OptimizeConsumeItem;tensorflow:grappler:MetaOptimizer:OptimizeGraph;tensorflow:grappler:MetaOptimizer:OptimizeGraph;tensorflow:grappler:MetaOptimizer:RunOptimizer;mlir:tfg:TFGGrapplerOptimizer:Optimize;mlir:tfg:ImportGraphDef;mlir:tfg:(anonymousnamespace):GraphDefImporter:ConvertNodes(mlir:OpBuilder,mlir:tfg:;mlir:OperationState:OperationState;mlir:OperationName:OperationName;opnew;aligned_alloc;

VinInn commented 11 months ago

do the AVX2 and AVX512 builds crash as well?

makortel commented 11 months ago

Workflow 136.889 step 3 (customized to run initialStep and its dependencies) crashes with cmsRunGlibC on both IgProf and VTune. IgProf crash stack trace didn't have any useful additional information, the VTune crash is a tiny bit different, but still points to Tensorflow

*** Error in `cmsRunGlibC': malloc(): smallbin double linked list corrupted: 0x00000000459a97d0 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x7f474)[0x7f69fda4d474]
/lib64/libc.so.6(+0x82bb0)[0x7f69fda50bb0]
/lib64/libc.so.6(__libc_malloc+0x4c)[0x7f69fda5378c]                                                                                                                                                           [30/74]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/lib64/libstdc++.so.6(_Znwm+0x1c)[0x7f69fe38486c]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(+0x358bca7)[0x7f695e3a4ca7]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(_ZN10tensorflow12ConcatBaseOpIN5Eigen16ThreadPoolDeviceEfLNS_16AxisArgumentNameE0EE7ComputeEPNS_15OpKernelContextE+0xff)[0x7f695e3b4cbf]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(_ZN10tensorflow16ThreadPoolDevice7ComputeEPNS_8OpKernelEPNS_15OpKernelContextE+0x4b)[0x7f695a63cf6b]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(+0x1258ef1)[0x7f695a72aef1]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(+0x1259fd6)[0x7f695a72bfd6]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(_ZN3tsl6thread10ThreadPool8ScheduleESt8functionIFvvEE+0xfc)[0x7f69677b299c]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(+0xcad7d78)[0x7f69678f0d78]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(+0x1244271)[0x7f695a716271]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_framework.so.2(+0x1248f08)[0x7f695a71af08]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(_ZN10tensorflow13DirectSession11RunInternalElRKNS_10RunOptionsEPNS_18CallFrameInterfaceEPNS0_16ExecutorsAndKeysEPNS_11RunMetadataERKN3tsl6thread17ThreadPoolOptionsE+0x7fa)[0x7f696790361a]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtensorflow_cc.so.2(_ZN10tensorflow13DirectSession3RunERKNS_10RunOptionsERKSt6vectorISt4pairINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS_6TensorEESaISD_EERKS4_ISB_SaISB_EESL_PS4_ISC_SaISC_EEPNS_11RunMetadataERKN3tsl6thread17ThreadPoolOptionsE+0x96e)[0x7f696790634e]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libPhysicsToolsTensorFlow.so(_ZN10tensorflow3runEPNS_7SessionERKSt6vectorISt4pairINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS_6TensorEESaISB_EERKS2_IS9_SaIS9_EEPS2_ISA_SaISA_EERKN3tsl6thread17ThreadPoolOptionsE+0xa5)[0x7f6975f794d5]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libPhysicsToolsTensorFlow.so(_ZN10tensorflow3runEPNS_7SessionERKSt6vectorISt4pairINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS_6TensorEESaISB_EERKS2_IS9_SaIS9_EEPS2_ISA_SaISA_EEPN3tsl6thread19ThreadPoolInterfaceE+0x19)[0x7f6975f79589]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginRecoTrackerFinalTrackSelectorsPlugins.so(+0xba737)[0x7f6945080737]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginRecoTrackerFinalTrackSelectorsPlugins.so(+0xbb086)[0x7f6945081086]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libRecoTrackerFinalTrackSelectors.so(_ZN22TrackMVAClassifierBase7produceERN3edm5EventERKNS0_10EventSetupE+0xaa8)[0x7f694549caf8]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so(_ZN3edm6stream21EDProducerAdaptorBase7doEventERKNS_19EventTransitionInfoEPNS_16ActivityRegistryEPKNS_20ModuleCallingContextE+0x141)[0x7f6a009a44b1]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so(_ZN3edm7WorkerTINS_6stream21EDProducerAdaptorBaseEE6implDoERKNS_19EventTransitionInfoEPKNS_20ModuleCallingContextE+0x5e)[0x7f6a00988f1e]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so(+0x1b8a19)[0x7f6a00914a19]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so(+0x1b8fb4)[0x7f6a00914fb4]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreConcurrency.so(+0x5f28)[0x7f6a074a8f28]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtbb.so.12(+0x28083)[0x7f69febfb083]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so(+0x13d21b)[0x7f6a0089921b]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so(_ZN3edm14EventProcessor11processRunsEv+0x205)[0x7f6a008a2bc5]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/libFWCoreFramework.so(_ZN3edm14EventProcessor15runToCompletionEv+0x23f)[0x7f6a008a318f]
cmsRunGlibC[0x4074ef]
/cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/external/slc7_amd64_gcc12/lib/libtbb.so.12(+0x149bd)[0x7f69febe79bd]
cmsRunGlibC[0x408ed2]
cmsRunGlibC(main+0x14c)[0x40518c]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f69fd9f0555]
cmsRunGlibC[0x405451]
======= Memory map: ========
00400000-00404000 r--p 00000000 00:31 449008759                          /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/bin/slc7_amd64_gcc12/cmsRunGlibC
00404000-00405000 r-xp 00004000 00:31 449008759                          /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/bin/slc7_amd64_gcc12/cmsRunGlibC
00405000-0040a000 r-xp 00005000 00:31 449008759                          /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/bin/slc7_amd64_gcc12/cmsRunGlibC
0040a000-0040d000 r--p 0000a000 00:31 449008759                          /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/bin/slc7_amd64_gcc12/cmsRunGlibC
0040d000-0040e000 r--p 0000c000 00:31 449008759                          /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/bin/slc7_amd64_gcc12/cmsRunGlibC
0040e000-0040f000 rw-p 0000d000 00:31 449008759                          /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/bin/slc7_amd64_gcc12/cmsRunGlibC
0040f000-00410000 rwxp 00000000 00:00 0
014d1000-46ef0000 rw-p 00000000 00:00 0                                  [heap]
7f68d6da9000-7f68eb73c000 rw-p 00000000 00:00 0
7f68f9c51000-7f68fa061000 rwxp 00000000 00:00 0
7f68fa061000-7f68fa07d000 rw-p 00000000 00:00 0
7f68fa07d000-7f68fa0cd000 rwxp 00000000 00:00 0
7f68fa0e9000-7f68fa139000 rwxp 00000000 00:00 0
7f68fa154000-7f68fa1d6000 rwxp 00000000 00:00 0
7f68fa1d6000-7f68fe09d000 rw-p 00000000 00:00 0
7f68fe09d000-7f68fe0db000 r--p 00000000 00:31 443261081                  /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginEventFilterL1TXRawToDigiAuto.so
7f68fe0ef000-7f68fe119000 rwxp 00000000 00:00 0
7f68fe119000-7f68fe125000 r--p 00000000 00:31 443261081                  /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginEventFilterL1TXRawToDigiAuto.so
7f68fe125000-7f68fe13a000 r-xp 0000c000 00:31 443261081                  /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginEventFilterL1TXRawToDigiAuto.so
7f68fe13a000-7f68fe142000 r--p 00021000 00:31 443261081                  /cvmfs/cms.cern.ch/slc7_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre0/lib/slc7_amd64_gcc12/pluginEventFilterL1TXRawToDigiAuto.so
makortel commented 11 months ago

do the AVX2 and AVX512 builds crash as well?

I haven't tested. On the machine I'm testing Tensorflow prints at the beginning of the job

2023-12-01 15:40:42.725466: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-01 15:40:42.758009: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:353] MLIR V1 optimization pass is not enabled
VinInn commented 11 months ago

I know, and do not understand what "This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations" really mean. Tensorflow does some jitting and maybe produce vector code for the specific machine. Maybe vtune ((or igprof) can be used to see if it is the case or not)

makortel commented 11 months ago

Running workflow 140.201 step 2 (customized to run DeepTauId and its dependencies) with cmsRunGlibC resulted the nan exception with IgProf

----- Begin Fatal Exception 01-Dec-2023 16:14:20 CET-----------------------
An exception of category 'DeepTauId' occurred while
   [0] Processing  Event run: 357735 lumi: 12 event: 18233675 stream: 0
   [1] Running path 'tauPath'
   [2] Calling method for module DeepTauId/'deepTau2018v2p5ForMini'
Exception Message:
invalid prediction = nan for tau_index = 0, pred_index = 0
----- End Fatal Exception -------------------------------------------------

and on VTune the job crashed inside DeepTauId module with a corrupted stack.

VinInn commented 11 months ago

I run under perf perf record cmsRunGlibC step2_PAT.py no crash. inspected the profiler: already difficult to find tensorflow functions! there are a couple of Eigen functions: all SSE code, with unaligned move instructions. nothing suspicious

VinInn commented 11 months ago

in tensorflow there are plenty of checkes of Eign alignment (for instance https://github.com/petewarden/tensorflow_makefile/blob/master/tensorflow/core/framework/tensor.h#L131 ) and it is known that on avx2/avx512 things can go wrong https://github.com/tensorflow/tensorflow/issues/15588 but as I said it seems we run sse code

th eigen code is in the tensorflow library (accoridng to perf)

 cmsRunGlibC      libtensorflow_cc.so.2.12.0                                      [.] Eigen::internal::gemm_pack_colmajor_block<float, long, Eigen::internal::TensorContractionSubMapper<float, long, 0, Eigen::TensorEvaluator<Eigen::TensorReshapingOp<Eigen::DSizes<long, 2> const, Eigen::TensorIm▒

I do not think that we have issues with our versions of eigen and tensorflow (or not?)

VinInn commented 11 months ago

running

 igprof -d -t cmsRunGlibC -pp -z -o igprof.perf.gz cmsRunGlibC step2_PAT.py

it crashes here

#5  0x00007f5abb3edacf in raise () from /usr/lib64/libc.so.6
#6  0x00007f5abb3c0ea5 in abort () from /usr/lib64/libc.so.6
#7  0x00007f5abb42ecd7 in __libc_message () from /usr/lib64/libc.so.6
#8  0x00007f5abb435fdc in malloc_printerr () from /usr/lib64/libc.so.6
#9  0x00007f5abb43811c in _int_free () from /usr/lib64/libc.so.6
#10 0x00007f5a27efe384 in dnnl::impl::cpu::x64::avx_gemm_f32::sgemm_nocopy_driver(char const*, char const*, long, long, long, float const*, float const*, long, float const*, long, float const*, float*, long, float const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre1/external/el8_amd64_gcc12/lib/libtensorflow_cc.so.2

notice cpu::x64::avx_gemm_f32:

VinInn commented 11 months ago

I overloadd malloc to always align to 64 [1] crashes in the same place

[1]

void *malloc(std::size_t size) {
  auto p  = aligned_alloc(64,size);
  return p;
}

void *aligned_alloc(std::size_t alignment, std::size_t size ) {
  alignment = 64;
  if (!origA) origA = (callocSym)dlsym(RTLD_NEXT,"aligned_alloc");
  assert(origA);
  auto p  = origA(alignment, size);
  return p;
}

so most probably is not a trivial alignment issue.

VinInn commented 11 months ago

btw there is also an error message "corrupted size vs. prev_size while consolidating"

I also managed to crash elsewhere with message "corrupted double-linked list"

VinInn commented 11 months ago

used CMSSW_14_0_SKYLAKEAVX512_X_2023-11-30-2300 (it uses scram_haswell libraries) did not manage to crash (with w/o igprof cmsRun cmsRunGlibC whatever) used perf record / perf report to inspect the actual code that run and I see in tensorlfow (EIGEN) clear use of AVX2 instructions (and YMM registers)

run 1000 events: no crash

VinInn commented 11 months ago

1) discovered (by trial and error) than tensorflow (and some others lib as well) uses obsolete API such as

int posix_memalign(void **memptr, size_t alignment, size_t size);
void *aligned_alloc(size_t alignment, size_t size);

did not manage to find WHERE tensorflow does that (alias in https://github.com/tensorflow/tensorflow/blob/c93b77f1740066e07ee56db6ad7f41ee42637686/tensorflow/core/platform/mem.h )

Have a new version of the malloc wrapper in https://github.com/VinInn/MallocProfiler/blob/main/plugins/mallocGuard.cc

situation still the same: AVX512 does not crash, standard release crashes in igprof usually with "corrupted size vs. prev_size while consolidating" message

VinInn commented 11 months ago

so this location

#10 0x00007f27ebefe384 in dnnl::impl::cpu::x64::avx_gemm_f32::sgemm_nocopy_driver(char const*, char const*, long, long, long, float const*, float const*, long, float const*, long, float const*, float*, long, float const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre1/external/el8_amd64_gcc12/lib/libtensorflow_cc.so.2
#11 0x00007f27ebefeffb in dnnl::impl::cpu::x64::jit_avx_gemm_f32(int, char const*, char const*, long const*, long const*, long const*, float const*, float const*, long const*, float const*, long const*, float const*, float*, long const*, float const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre1/external/el8_amd64_gcc12/lib/libtensorflow_cc.so.2
#12 0x00007f27ec02363d in dnnl_status_t dnnl::impl::cpu::x64::gemm_driver<float, float, float>(char const*, char const*, char const*, long const*, long const*, long const*, float const*, float const*, long const*, float const*, float const*, long const*, float const*, float const*, float*, long const*, float const*, bool, dnnl::impl::cpu::x64::pack_type, dnnl::impl::cpu::x64::gemm_pack_storage_t*, bool) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre1/external/el8_amd64_gcc12/lib/libtensorflow_cc.so.2
#13 0x00007f27ebb48567 in dnnl::impl::cpu::extended_sgemm(char const*, char const*, long const*, long const*, long const*, float const*, float const*, long const*, float const*, long const*, float const*, float*, long const*, float const*, bool) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre1/external/el8_amd64_gcc12/lib/libtensorflow_cc.so.2
#14 0x00007f27eb7e9d23 in dnnl_sgemm () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_0_pre1/external/el8_amd64_gcc12/lib/libtensorflow_cc.so.2

is from oneAPI? https://github.com/oneapi-src/oneDNN/blob/v2.6/src/cpu/x64/gemm/f32/jit_avx_gemm_f32.cpp

VinInn commented 11 months ago

using perf record --freq=250 --call-graph=lbr cmsRun step2_PAT.py on standard release I managed to identify some JIT code + 2.77% 0.00% cmsRun [JIT] tid 371255 [.] 00007f2b9ba24000 but I did not manage to inspect the code to verify that is indeed AVX2 code.

So we are back to the usual question: can we test this on a non avx machine?

VinInn commented 11 months ago

My conclusion is that it is UNSAFE to run tensorflow built for sse on avx machines. As we have a multi-platform build we should use that.

VinInn commented 11 months ago

One alternative is to build tensorflow DISABLING oneDNN

makortel commented 11 months ago

On CMSSW_14_0_SKYLAKEAVX512_X_2023-12-03-2300 running workflow 136.889 step 3 through VTune still results in a crash (in similar way as in https://github.com/cms-sw/cmssw/issues/42444#issuecomment-1832860800), with both cmsRun and cmsRunGlibC.

smuzaffar commented 11 months ago

@makortel @VinInn , OneDNN is disabled by default e.g. looking at https://github.com/tensorflow/tensorflow/blob/v2.12.0/tensorflow/core/util/port.cc#L99-L122, I see that setting TF_ENABLE_ONEDNN_OPTS shows the following messages

`

VinInn commented 11 months ago

well with setenv TF_ENABLE_ONEDNN_OPTS 1 does not crash under igprof.... please try yourself

VinInn commented 11 months ago

@makortel could you verify on your side with vtune if you see activity in oneDNN and JIT code with and w/o explicit enabling onednn?

VinInn commented 11 months ago

The code it runs seems different thinks like this appear 4.18% 0.38% cmsRunGlibC libtensorflow_cc.so.2.12.0 [.] dnnl::impl::cpu::x64::jit_avx2_convolution_fwd_t::execute_forward(dnnl::impl::exec_ctx_t const&) const::{lambda(int, int)#1}::operator▒

and well down in the 0..5% range there is the usual dnnl::impl::cpu::x64::avx_gemm_f32::sgemm_nocopy_driver and the JIT code...

so not sure what TF_ENABLE_ONEDNN_OPTS really enables....

VinInn commented 11 months ago

so this is (according to perf) what is run with TF_ENABLE_ONEDNN_OPTS=0

  - 4.98% tensorflow::LaunchFusedConv2DWithOutputKernel<float>::operator()<tensorflow::BiasAddOutputKernel<float, tensorflow::Identity> >                                                                                                   ▒
      - 4.31% tensorflow::functor::SpatialConvolutionFunc<Eigen::ThreadPoolDevice, Eigen::TensorMap<Eigen::Tensor<float const, 4, 1, long>, 16, Eigen::MakePointer>, Eigen::TensorMap<Eigen::Tensor<float const, 4, 1, long>, 16, Eigen::Make▒
         - 4.29% Eigen::internal::TensorExecutor<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 4, 1, long>, 16, Eigen::MakePointer>, Eigen::TensorReshapingOp<Eigen::DSizes<long, 4> const, Eigen::TensorContractionOp<Eigen::a▒
            - 4.28% Eigen::TensorContractionEvaluatorBase<Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorReshapingOp<Eigen::DSizes<long, 2> const, Eigen::TensorImagePatchOp▒
               - 3.59% Eigen::internal::TensorContractionKernel<float, float, float, long, Eigen::internal::blas_data_mapper<float, long, 0, 0, 1>, Eigen::internal::TensorContractionInputMapper<float, long, 1, Eigen::TensorEvaluator<Eige▒
                  - 3.58% dnnl_sgemm                                                                                                                                                                                                         ▒
                     - dnnl::impl::cpu::extended_sgemm                                                                                                                                                                                       ▒
                        - 3.58% dnnl::impl::cpu::x64::gemm_driver<float, float, float>                                                                                                                                                       ▒
                           - 3.55% dnnl::impl::cpu::x64::jit_avx_gemm_f32                                                                                                                                                                    ▒
                              - dnnl::impl::cpu::x64::avx_gemm_f32::sgemm_nocopy_driver                                                                                                                                                      ▒
                                   1.95% 0x7f0809880000                                                                                                                                                                                      ▒
                                   1.55% 0x7f08098d0000
   - 4.53% dnnl::impl::cpu::x64::avx_gemm_f32::sgemm_nocopy_driver                                                                                                                                                                           ▒
        2.51% 0x7f08098d0000                                                                                                                                                                                                                 ▒
        1.95% 0x7f0809880000                                                                                                                                                                                                                 ▒
      - 0.04% dnnl::impl::malloc                                                                                                                                                                                                             ▒
         + __posix_memalign                                                                                                                                                                                                                  ▒
        0.03% dnnl::impl::cpu::x64::avx_gemm_f32::get_xbyak_gemm                                                                                                                                                                             ▒
VinInn commented 11 months ago

with TF_ENABLE_ONEDNN_OPTS=1

there is this stuff

   - 4.22% dnnl::impl::cpu::x64::jit_avx2_convolution_fwd_t::execute_forward                                                                                                                                                                 ▒
      - 4.22% dnnl::impl::parallel                                                                                                                                                                                                           ▒
         - 4.20% tensorflow::MklDnnThreadPool::parallel_for                                                                                                                                                                                  ▒
            - 4.20% tensorflow::NoThreadPool::ScheduleWithHint                                                                                                                                                                               ▒
               - 4.19% std::_Function_handler<void (), tensorflow::MklDnnThreadPool::parallel_for(int, std::function<void (int, int)> const&)::{lambda()#1}>::_M_invoke                                                                      ▒
                  - std::_Function_handler<void (int, int), dnnl::impl::parallel(int, std::function<void (int, int)> const&)::{lambda(int, int)#1}>::_M_invoke                                                                               ▒
                     - 3.82% dnnl::impl::cpu::x64::jit_avx2_convolution_fwd_t::execute_forward(dnnl::impl::exec_ctx_t const&) const::{lambda(int, int)#1}::operator()                                                                        ▒
                          0.79% 0x7fad57a44000                                                                                                                                                                                               ▒
                          0.74% 0x7fad57984000                                                                                                                                                                                               ▒

and later on

   - 0.72% dnnl::impl::cpu::extended_sgemm                                                                                                                                                                                                   ▒
      - 0.71% dnnl::impl::cpu::x64::gemm_driver<float, float, float>                                                                                                                                                                         ▒
         - 0.53% dnnl::impl::cpu::x64::jit_avx_gemm_f32                                                                                                                                                                                      ▒
              0.52% dnnl::impl::cpu::x64::avx_gemm_f32::sgemm_nocopy_driver                                                                                                                                                                  ▒
   - 0.52% dnnl::impl::cpu::x64::avx_gemm_f32::sgemm_nocopy_driver                                                                                                                                                                           ▒
        0.48% 0x7fad580b4000                                                                                                                                                                                                                 ▒
      + 0.02% dnnl::impl::malloc                                                                                                                                                                                                             ▒
        0.01% dnnl::impl::cpu::x64::avx_gemm_f32::get_xbyak_gemm                                                                                                                                                                             ▒
makortel commented 11 months ago

@makortel could you verify on your side with vtune if you see activity in oneDNN and JIT code with and w/o explicit enabling onednn?

I ran the (customized) step3 of workflow 136.889 as

TF_ENABLE_ONEDNN_OPTS=1 vtune -collect hotspots -- cmsRunGlibC step3_RAW2DIGI_L1Reco_RECO_SKIM_PAT_ALCA_DQM.py

and that resulted a crash (omitting the backtrace)

*** Error in `cmsRunGlibC': corrupted double-linked list: 0x00000000457e6920 ***

Also default cmsRun crashes, with a corrupted stack trace.

VinInn commented 11 months ago

Ok. We should try to find out if is possible to disable the jitting...