cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

Segfaults in RecHitsSortedInPhi constructor in GPU workflows #40604

Closed makortel closed 1 year ago

makortel commented 1 year ago

Step 3 in a subset of the 10824.59x and 11634.59x workflows has been segfaulting in GPU IBs since CMSSW_13_0_X_2023-01-18-2300. Example stack trace:

Thread 7 (Thread 0x14e9347ff700 (LWP 3429302) "cmsRun"):
#2  0x000014e9b5c68360 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x000014e9bbff0129 in __xstat64 () from /lib64/libc.so.6
#5  0x000014e9bc9be6c4 in stat (__statbuf=0x14e9347f7800, __path=<optimized out>) at /usr/include/sys/stat.h:455
#6  std::filesystem::status (p=..., ec=...) at ../../../../../libstdc++-v3/src/c++17/fs_ops.cc:1513
#7  0x000014e9bc9bedfc in std::filesystem::status (p=...) at ../../../../../libstdc++-v3/src/c++17/fs_ops.cc:1578
#8  0x000014e9be66f169 in (anonymous namespace)::locateFile(std::filesystem::__cxx11::path, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) [clone .isra.0] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libFWCoreUtilities.so
#9  0x000014e9be670d39 in edm::FileInPath::initialize_() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libFWCoreUtilities.so
#10 0x000014e9be672530 in edm::FileInPath::FileInPath(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libFWCoreUtilities.so
#11 0x000014e905faca10 in SectorProcessorLUT::read_cppf_file(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<unsigned int, std::allocator<unsigned int> >&, std::vector<unsigned int, std::allocator<unsigned int> >&, bool) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libL1TriggerL1TMuonEndCap.so
#12 0x000014e905fad9f0 in SectorProcessorLUT::read(bool, int) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libL1TriggerL1TMuonEndCap.so
#13 0x000014e905f72014 in EMTFSetup::reload(edm::Event const&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libL1TriggerL1TMuonEndCap.so
#14 0x000014e905fb0d6a in TrackFinder::process(edm::Event const&, edm::EventSetup const&, std::vector<l1t::EMTFHit, std::allocator<l1t::EMTFHit> >&, std::vector<l1t::EMTFTrack, std::allocator<l1t::EMTFTrack> >&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libL1TriggerL1TMuonEndCap.so
#15 0x000014e905f20f97 in L1TMuonEndCapTrackProducer::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/pluginL1TriggerL1TMuonEndCapPlugins.so
#16 0x000014e9beb4259d in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libFWCoreFramework.so

Thread 6 (Thread 0x14e935643700 (LWP 3429295) "cmsRun"):
#2  0x000014e9b5c68360 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x000014e9bbfa02e9 in __memmove_avx_unaligned_erms () from /lib64/libc.so.6
#5  0x000014e938882188 in SiPixelTemplate2D::pushfile(SiPixel2DTemplateDBObject const&, std::vector<SiPixelTemplateStore2D, std::allocator<SiPixelTemplateStore2D> >&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libCondFormatsSiPixelTransient.so
#6  0x000014e9388d1354 in PixelCPEClusterRepair::PixelCPEClusterRepair(edm::ParameterSet const&, MagneticField const*, TrackerGeometry const&, TrackerTopology const&, SiPixelLorentzAngle const*, SiPixelTemplateDBObject const*, SiPixel2DTemplateDBObject const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libRecoLocalTrackerSiPixelRecHits.so
#7  0x000014e9389937ae in PixelCPEClusterRepairESProducer::produce(TkPixelCPERecord const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/pluginRecoLocalTrackerSiPixelRecHitsPlugins.so
#8  0x000014e9389a0879 in edm::eventsetup::Callback<edm::ESProducer, edm::ESProducer::setWhatProduced<PixelCPEClusterRepairESProducer, std::unique_ptr<PixelClusterParameterEstimator, std::default_delete<PixelClusterParameterEstimator> >, TkPixelCPERecord, edm::eventsetup::CallbackSimpleDecorator<TkPixelCPERecord> >(PixelCPEClusterRepairESProducer*, std::unique_ptr<PixelClusterParameterEstimator, std::default_delete<PixelClusterParameterEstimator> > (PixelCPEClusterRepairESProducer::*)(TkPixelCPERecord const&), edm::eventsetup::CallbackSimpleDecorator<TkPixelCPERecord> const&, edm::es::Label const&)::{lambda(TkPixelCPERecord const&)#1}, std::unique_ptr<PixelClusterParameterEstimator, std::default_delete<PixelClusterParameterEstimator> >, TkPixelCPERecord, edm::eventsetup::CallbackSimpleDecorator<TkPixelCPERecord> >::runProducerAsync(tbb::detail::d1::task_group*, std::__exception_ptr::exception_ptr const*, edm::eventsetup::EventSetupRecordImpl const*, edm::EventSetupImpl const*, edm::ServiceToken const&)::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/pluginRecoLocalTrackerSiPixelRecHitsPlugins.so

Thread 5 (Thread 0x14e936044700 (LWP 3429294) "cmsRun"):
#2  0x000014e9b5c68360 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x000014e9bbfa02e4 in __memmove_avx_unaligned_erms () from /lib64/libc.so.6
#5  0x000014e938882188 in SiPixelTemplate2D::pushfile(SiPixel2DTemplateDBObject const&, std::vector<SiPixelTemplateStore2D, std::allocator<SiPixelTemplateStore2D> >&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libCondFormatsSiPixelTransient.so
#6  0x000014e9388d1354 in PixelCPEClusterRepair::PixelCPEClusterRepair(edm::ParameterSet const&, MagneticField const*, TrackerGeometry const&, TrackerTopology const&, SiPixelLorentzAngle const*, SiPixelTemplateDBObject const*, SiPixel2DTemplateDBObject const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libRecoLocalTrackerSiPixelRecHits.so
#7  0x000014e9389937ae in PixelCPEClusterRepairESProducer::produce(TkPixelCPERecord const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/pluginRecoLocalTrackerSiPixelRecHitsPlugins.so
#8  0x000014e9389a0879 in edm::eventsetup::Callback<edm::ESProducer, edm::ESProducer::setWhatProduced<PixelCPEClusterRepairESProducer, std::unique_ptr<PixelClusterParameterEstimator, std::default_delete<PixelClusterParameterEstimator> >, TkPixelCPERecord, edm::eventsetup::CallbackSimpleDecorator<TkPixelCPERecord> >(PixelCPEClusterRepairESProducer*, std::unique_ptr<PixelClusterParameterEstimator, std::default_delete<PixelClusterParameterEstimator> > (PixelCPEClusterRepairESProducer::*)(TkPixelCPERecord const&), edm::eventsetup::CallbackSimpleDecorator<TkPixelCPERecord> const&, edm::es::Label const&)::{lambda(TkPixelCPERecord const&)#1}, std::unique_ptr<PixelClusterParameterEstimator, std::default_delete<PixelClusterParameterEstimator> >, TkPixelCPERecord, edm::eventsetup::CallbackSimpleDecorator<TkPixelCPERecord> >::runProducerAsync(tbb::detail::d1::task_group*, std::__exception_ptr::exception_ptr const*, edm::eventsetup::EventSetupRecordImpl const*, edm::EventSetupImpl const*, edm::ServiceToken const&)::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/pluginRecoLocalTrackerSiPixelRecHitsPlugins.so

Thread 1 (Thread 0x14e9bb42c640 (LWP 3428961) "cmsRun"):
#3  0x000014e9b5c6bb1b in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x000014e94c94a724 in RecHitsSortedInPhi::RecHitsSortedInPhi(std::vector<BaseTrackerRecHit const*, std::allocator<BaseTrackerRecHit const*> > const&, Point3DBase<float, GlobalTag> const&, DetLayer const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libRecoTrackerTkHitPairs.so
#6  0x000014e94c94662c in LayerHitMapCache::operator()(SeedingLayerSetsHits::SeedingLayer const&, TrackingRegion const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libRecoTrackerTkHitPairs.so
#7  0x000014e94c9445ca in HitPairGeneratorFromLayerPair::doublets(TrackingRegion const&, edm::Event const&, edm::EventSetup const&, SeedingLayerSetsHits::SeedingLayer const&, SeedingLayerSetsHits::SeedingLayer const&, LayerHitMapCache&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libRecoTrackerTkHitPairs.so
#8  0x000014e8ebadb559 in (anonymous namespace)::Impl<(anonymous namespace)::DoNothing, (anonymous namespace)::ImplIntermediateHitDoublets, (anonymous namespace)::RegionsLayersSeparate>::produce(bool, edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/pluginRecoTrackerTkHitPairsPlugins.so
#9  0x000014e9beb4259d in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libFWCoreFramework.so

Current Modules:
Module: HitPairEDProducer:initialStepHitDoubletsPreSplitting (crashed)
Module: none
Module: L1TMuonEndCapTrackProducer:valEmtfStage2Digis
Module: none

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/el8_amd64_gcc11/CMSSW_13_0_GPU_X_2023-01-23-2300/pyRelValMatrixLogs/run/10824.592_TTbar_13+2018_Patatrack_FullRecoGPU/step3_TTbar_13+2018_Patatrack_FullRecoGPU.log#/

cmsbuild commented 1 year ago

A new Issue was created by @makortel Matti Kortelainen.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel commented 1 year ago

Here is another one that points more clearly to the crash occurring in the sort:

#3  0x0000148464e82b1b in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00001483fbbaed27 in void std::__introsort_loop<__gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<RecHitsSortedInPhi::HitLessPhi> >(__gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, __gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<RecHitsSortedInPhi::HitLessPhi>) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libRecoTrackerTkHitPairs.so
#6  0x00001483fbbaed7b in void std::__introsort_loop<__gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<RecHitsSortedInPhi::HitLessPhi> >(__gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, __gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<RecHitsSortedInPhi::HitLessPhi>) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libRecoTrackerTkHitPairs.so
#7  0x00001483fbbaed7b in void std::__introsort_loop<__gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<RecHitsSortedInPhi::HitLessPhi> >(__gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, __gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<RecHitsSortedInPhi::HitLessPhi>) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libRecoTrackerTkHitPairs.so
#8  0x00001483fbbaed7b in void std::__introsort_loop<__gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<RecHitsSortedInPhi::HitLessPhi> >(__gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, __gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<RecHitsSortedInPhi::HitLessPhi>) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libRecoTrackerTkHitPairs.so
#9  0x00001483fbbaed7b in void std::__introsort_loop<__gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<RecHitsSortedInPhi::HitLessPhi> >(__gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, __gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<RecHitsSortedInPhi::HitLessPhi>) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libRecoTrackerTkHitPairs.so
#10 0x00001483fbbaed7b in void std::__introsort_loop<__gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<RecHitsSortedInPhi::HitLessPhi> >(__gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, __gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<RecHitsSortedInPhi::HitLessPhi>) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libRecoTrackerTkHitPairs.so
#11 0x00001483fbbaed7b in void std::__introsort_loop<__gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<RecHitsSortedInPhi::HitLessPhi> >(__gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, __gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<RecHitsSortedInPhi::HitLessPhi>) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libRecoTrackerTkHitPairs.so
#12 0x00001483fbbaed7b in void std::__introsort_loop<__gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<RecHitsSortedInPhi::HitLessPhi> >(__gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, __gnu_cxx::__normal_iterator<RecHitsSortedInPhi::HitWithPhi*, std::vector<RecHitsSortedInPhi::HitWithPhi, std::allocator<RecHitsSortedInPhi::HitWithPhi> > >, long, __gnu_cxx::__ops::_Iter_comp_iter<RecHitsSortedInPhi::HitLessPhi>) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libRecoTrackerTkHitPairs.so
#13 0x00001483fbbae2ec in RecHitsSortedInPhi::RecHitsSortedInPhi(std::vector<BaseTrackerRecHit const*, std::allocator<BaseTrackerRecHit const*> > const&, Point3DBase<float, GlobalTag> const&, DetLayer const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libRecoTrackerTkHitPairs.so
#14 0x00001483fbbaa62c in LayerHitMapCache::operator()(SeedingLayerSetsHits::SeedingLayer const&, TrackingRegion const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libRecoTrackerTkHitPairs.so
#15 0x00001483fbba85ca in HitPairGeneratorFromLayerPair::doublets(TrackingRegion const&, edm::Event const&, edm::EventSetup const&, SeedingLayerSetsHits::SeedingLayer const&, SeedingLayerSetsHits::SeedingLayer const&, LayerHitMapCache&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libRecoTrackerTkHitPairs.so
#16 0x000014839ad3b559 in (anonymous namespace)::Impl<(anonymous namespace)::DoNothing, (anonymous namespace)::ImplIntermediateHitDoublets, (anonymous namespace)::RegionsLayersSeparate>::produce(bool, edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/pluginRecoTrackerTkHitPairsPlugins.so
#17 0x000014846dd6259d in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02769/el8_amd64_gcc11/cms/cmssw/CMSSW_13_0_GPU_X_2023-01-22-2300/lib/el8_amd64_gcc11/libFWCoreFramework.so

Current Modules:
Module: HitPairEDProducer:initialStepHitDoubletsPreSplitting (crashed)
Module: SiStripRecHitsValid:stripRecHitsValid
Module: SiStripRecHitConverter:siStripMatchedRecHits
Module: none

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/el8_amd64_gcc11/CMSSW_13_0_GPU_X_2023-01-23-2300/pyRelValMatrixLogs/run/10824.593_TTbar_13+2018_Patatrack_FullRecoGPU_Validation/step3_TTbar_13+2018_Patatrack_FullRecoGPU_Validation.log#/

makortel commented 1 year ago

Assign reconstruction,heterogeneous

cmsbuild commented 1 year ago

New categories assigned: heterogeneous,reconstruction

@mandrenguyen,@fwyzard,@clacaputo,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel commented 1 year ago

https://github.com/cms-sw/cmssw/pull/40465 looks like a plausible culprit. Let me tag also @AdrianoDee.

AdrianoDee commented 1 year ago

Let me have a look.

makortel commented 1 year ago

@AdrianoDee Have you had a chance to take a look? In principle it would be good to have the crashes fixed for 13_0_0.

AdrianoDee commented 1 year ago

@makortel you are right. I had a look but didn't reach a conclusion. I'll be on it in the next few days.

AdrianoDee commented 1 year ago

I still don't understand what's happening, but one strange thing is that I can't reproduce this with a single thread: the crash occurs when any of the threads moves on to its next event (so at the 5th event with 4 threads, the 9th with 8, and so on). If this rings a bell for anybody, please let me know. Debugging is getting nasty without being able to run single-threaded (also, any suggestion on how to better debug this is very welcome).

Dr15Jones commented 1 year ago

suggestion on how to better debug this is very welcome

Have you tried valgrind? It will also work with multiple threads.

Another thing to try would be to see if using 2 streams and 1 thread also leads to a crash.

Dr15Jones commented 1 year ago

After taking a look at the code (which ultimately is just sorting on floats which are stored as member data) it seems the most likely culprit is a NaN value as at least one of the phi values. A NaN breaks sorting since

   // to a sort, 1 must be equivalent to nan since
   (1 < nan) == false;
   (nan < 1) == false;
   // to a sort, 2 must be equivalent to nan since
   (2 < nan) == false;
   (nan < 2) == false;

so, because the sort treats this equivalence as transitive, it would assume 1 == 2 as well and expect the following

   (1 < 2) == false;

which is not the case, so this breaks the sorting algorithm.

AdrianoDee commented 1 year ago

Thanks @Dr15Jones, I was also seeing NaNs in the hits' phi. I'm trying to track down why they appear.
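For readers following the NaN argument, here is a minimal, standalone sketch of why a NaN key violates what std::sort requires of its comparator, plus a cheap check that can be run on the keys before sorting while debugging. The names HitWithPhiSketch, lessPhi, incomparable and hasNaNPhi are hypothetical stand-ins written for this illustration; they are not the CMSSW classes, and the check is a debugging aid, not the fix that was eventually applied.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// Hypothetical stand-in for a (hit, phi) pair; only the phi key matters here.
struct HitWithPhiSketch {
  float phi;
  int index;
};

// Plain "less than on phi" comparator, the kind std::sort would be given.
inline bool lessPhi(const HitWithPhiSketch& a, const HitWithPhiSketch& b) {
  return a.phi < b.phi;
}

// std::sort requires a strict weak ordering: "neither a < b nor b < a"
// must behave like an equivalence relation, and in particular be transitive.
inline bool incomparable(const HitWithPhiSketch& a, const HitWithPhiSketch& b) {
  return !lessPhi(a, b) && !lessPhi(b, a);
}

// Debugging aid: report whether any key is NaN before handing the range to std::sort.
inline bool hasNaNPhi(const std::vector<HitWithPhiSketch>& hits) {
  return std::any_of(hits.begin(), hits.end(),
                     [](const HitWithPhiSketch& h) { return std::isnan(h.phi); });
}
```

With phi values 1, NaN and 2, incomparable(1, NaN) and incomparable(NaN, 2) both hold while incomparable(1, 2) does not, so transitivity fails and sorting such a range is undefined behaviour. Calling hasNaNPhi on the vector just before the sort is one way to confirm this diagnosis in a debug build.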

AdrianoDee commented 1 year ago

The problem is that localCoordToHostAsync does not take into account that the SoA layout pads each column to a 128-byte alignment, so the single cudaMemcpyAsync of 4 * rowSize bytes copies the wrong portion of memory. I still don't understand how this went unnoticed. My quick fix would be:


--- a/CUDADataFormats/TrackingRecHit/interface/TrackingRecHitSoADevice.h
+++ b/CUDADataFormats/TrackingRecHit/interface/TrackingRecHitSoADevice.h
@@ -48,7 +48,11 @@ public:
   cms::cuda::host::unique_ptr<float[]> localCoordToHostAsync(cudaStream_t stream) const {
     auto ret = cms::cuda::make_host_unique<float[]>(4 * nHits(), stream);
     size_t rowSize = sizeof(float) * nHits();
-    cudaCheck(cudaMemcpyAsync(ret.get(), view().xLocal(), rowSize * 4, cudaMemcpyDefault, stream));
+    
+    cudaCheck(cudaMemcpyAsync(ret.get(), view().xLocal(), rowSize, cudaMemcpyDefault, stream));
+    cudaCheck(cudaMemcpyAsync(ret.get() + nHits(), view().yLocal(), rowSize, cudaMemcpyDefault, stream));
+    cudaCheck(cudaMemcpyAsync(ret.get() + nHits() * 2, view().xerrLocal(), rowSize, cudaMemcpyDefault, stream));
+    cudaCheck(cudaMemcpyAsync(ret.get() + nHits() * 3, view().yerrLocal(), rowSize, cudaMemcpyDefault, stream));

     return ret;
   }  //move to utilities

AdrianoDee commented 1 year ago

Proposed the fix in https://github.com/cms-sw/cmssw/pull/40869
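To illustrate the padding problem on the host only, here is a minimal sketch that mimics four float columns whose starts are each rounded up to an aligned offset, and compares one bulk copy with four per-column copies. Everything in it is an assumption made for illustration: PaddedColumns, paddedBytes, copyAssumingContiguous and copyPerColumn are hypothetical names, the 128-byte alignment is taken from the comment above, std::memcpy stands in for cudaMemcpyAsync, and the padding is filled with NaN as a stand-in for whatever the real padding happens to contain.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstring>
#include <limits>
#include <vector>

// Assumed column alignment in bytes for this sketch.
constexpr std::size_t kAlignment = 128;

// Round a byte count up to the next multiple of kAlignment.
constexpr std::size_t paddedBytes(std::size_t bytes) {
  return (bytes + kAlignment - 1) / kAlignment * kAlignment;
}

// Hypothetical buffer with four float columns, each starting at an aligned offset.
struct PaddedColumns {
  std::size_t nHits;
  std::size_t stride;          // number of floats between two column starts
  std::vector<float> storage;  // padding is pre-filled with NaN as a sentinel

  explicit PaddedColumns(std::size_t n)
      : nHits(n),
        stride(paddedBytes(n * sizeof(float)) / sizeof(float)),
        storage(4 * stride, std::numeric_limits<float>::quiet_NaN()) {}

  float* column(std::size_t c) { return storage.data() + c * stride; }
  const float* column(std::size_t c) const { return storage.data() + c * stride; }
};

// One bulk copy that assumes the columns are back to back (the pattern being replaced).
inline std::vector<float> copyAssumingContiguous(const PaddedColumns& src) {
  std::vector<float> out(4 * src.nHits);
  std::memcpy(out.data(), src.column(0), 4 * src.nHits * sizeof(float));
  return out;
}

// One copy per column, each read from the column's own start (the pattern in the quick fix).
inline std::vector<float> copyPerColumn(const PaddedColumns& src) {
  std::vector<float> out(4 * src.nHits);
  for (std::size_t c = 0; c < 4; ++c) {
    std::memcpy(out.data() + c * src.nHits, src.column(c), src.nHits * sizeof(float));
  }
  return out;
}
```

In this sketch, with 3 hits the stride is 32 floats, so the bulk copy returns the first column followed by padding, while the per-column copy returns all four columns packed. When nHits * sizeof(float) is already a multiple of the alignment (for example 32 hits), there is no padding and the two functions return the same data.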

AdrianoDee commented 1 year ago

Solved by https://github.com/cms-sw/cmssw/pull/40869 (and https://github.com/cms-sw/cmssw/pull/40870).

makortel commented 1 year ago

+heterogeneous