cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0
1.09k stars 4.33k forks source link

SiPixelClusterProducer causes heap-buffer-overflow #30215

Closed Dr15Jones closed 4 years ago

Dr15Jones commented 4 years ago

In the CMSSW_11_2_ASAN_X_2020-06-11-1100 IB when running RelVal workflow 26234.0 step 3 we see a heap-buffer-overflow.

cmsbuild commented 4 years ago

A new Issue was created by @Dr15Jones Chris Jones.

@Dr15Jones, @silviodonato, @dpiparo, @smuzaffar, @makortel can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel commented 4 years ago

assign reconstruction

Dr15Jones commented 4 years ago

Running the job after compiling with debug symbols for the module gives the following information

==28910==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x7f9adfed4878 at pc 0x7f9b549e9a80 bp 0x7f9b9a7d90b0 sp 0x7f9b9a7d90a8
WRITE of size 4 at 0x7f9adfed4878 thread T3
    #0 0x7f9b549e9a7f in SiPixelArrayBuffer::set_adc(int, int, int) (/build/chrjones/asan/CMSSW_11_2_ASAN_X_2020-06-11-1100/lib/slc7_amd64_gcc820/pluginRecoLocalTrackerSiPixelClusterizerPlugins.so+0xa5a7f)
    #1 0x7f9b549e3e44 in PixelThresholdClusterizer::copy_to_buffer(__gnu_cxx::__normal_iterator<PixelDigi const*, std::vector<PixelDigi, std::allocator<PixelDigi> > >, __gnu_cxx::__normal_iterator<PixelDigi const*, std::vector<PixelDigi, std::allocator<PixelDigi> > >) /build/chrjones/asan/CMSSW_11_2_ASAN_X_2020-06-11-1100/src/RecoLocalTracker/SiPixelClusterizer/plugins/PixelThresholdClusterizer.cc:279
    #2 0x7f9b549ebdca in void PixelThresholdClusterizer::clusterizeDetUnitT<edm::DetSet<PixelDigi> >(edm::DetSet<PixelDigi> const&, PixelGeomDetUnit const*, TrackerTopology const*, std::vector<short, std::allocator<short> > const&, edmNew::DetSetVector<SiPixelCluster>::FastFiller&) (/build/chrjones/asan/CMSSW_11_2_ASAN_X_2020-06-11-1100/lib/slc7_amd64_gcc820/pluginRecoLocalTrackerSiPixelClusterizerPlugins.so+0xa7dca)
    #3 0x7f9b549e9bd9 in PixelThresholdClusterizer::clusterizeDetUnit(edm::DetSet<PixelDigi> const&, PixelGeomDetUnit const*, TrackerTopology const*, std::vector<short, std::allocator<short> > const&, edmNew::DetSetVector<SiPixelCluster>::FastFiller&) (/build/chrjones/asan/CMSSW_11_2_ASAN_X_2020-06-11-1100/lib/slc7_amd64_gcc820/pluginRecoLocalTrackerSiPixelClusterizerPlugins.so+0xa5bd9)
    #4 0x7f9b54a1a5cc in void SiPixelClusterProducer::run<edm::DetSetVector<PixelDigi> >(edm::DetSetVector<PixelDigi> const&, edm::ESHandle<TrackerGeometry> const&, edmNew::DetSetVector<SiPixelCluster>&) /build/chrjones/asan/CMSSW_11_2_ASAN_X_2020-06-11-1100/src/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelClusterProducer.cc:181
    #5 0x7f9b54a0ae8f in SiPixelClusterProducer::produce(edm::Event&, edm::EventSetup const&) /build/chrjones/asan/CMSSW_11_2_ASAN_X_2020-06-11-1100/src/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelClusterProducer.cc:121
[cut for clarity]

0x7f9adfed4878 is located 680 bytes to the right of 2350544-byte region [0x7f9adfc96800,0x7f9adfed45d0)
allocated by thread T3 here:
    #0 0x7f9bdca2fdb0 in operator new(unsigned long) ../../../../libsanitizer/asan/asan_new_delete.cc:90
    #1 0x7f9bdbedb5be in std::vector<int, std::allocator<int> >::_M_fill_insert(__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, unsigned long, int const&) (/cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_ASAN_X_2020-06-11-1100/lib/slc7_amd64_gcc820/libDataFormatsCommon.so+0x7a5be)
    #2 0x7f9b7c99d11f in std::vector<int, std::allocator<int> >::resize(unsigned long, int const&) (/cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_ASAN_X_2020-06-11-1100/lib/slc7_amd64_gcc820/pluginRecoLocalCaloHGCalRecProducersPlugins.so+0x8a11f)
    #3 0x7f9b549e981a in SiPixelArrayBuffer::setSize(int, int) (/build/chrjones/asan/CMSSW_11_2_ASAN_X_2020-06-11-1100/lib/slc7_amd64_gcc820/pluginRecoLocalTrackerSiPixelClusterizerPlugins.so+0xa581a)
    #4 0x7f9b549e2f41 in PixelThresholdClusterizer::setup(PixelGeomDetUnit const*) /build/chrjones/asan/CMSSW_11_2_ASAN_X_2020-06-11-1100/src/RecoLocalTracker/SiPixelClusterizer/plugins/PixelThresholdClusterizer.cc:110
    #5 0x7f9b549ebae1 in void PixelThresholdClusterizer::clusterizeDetUnitT<edm::DetSet<PixelDigi> >(edm::DetSet<PixelDigi> const&, PixelGeomDetUnit const*, TrackerTopology const*, std::vector<short, std::allocator<short> > const&, edmNew::DetSetVector<SiPixelCluster>::FastFiller&) (/build/chrjones/asan/CMSSW_11_2_ASAN_X_2020-06-11-1100/lib/slc7_amd64_gcc820/pluginRecoLocalTrackerSiPixelClusterizerPlugins.so+0xa7ae1)
    #6 0x7f9b549e9bd9 in PixelThresholdClusterizer::clusterizeDetUnit(edm::DetSet<PixelDigi> const&, PixelGeomDetUnit const*, TrackerTopology const*, std::vector<short, std::allocator<short> > const&, edmNew::DetSetVector<SiPixelCluster>::FastFiller&) (/build/chrjones/asan/CMSSW_11_2_ASAN_X_2020-06-11-1100/lib/slc7_amd64_gcc820/pluginRecoLocalTrackerSiPixelClusterizerPlugins.so+0xa5bd9)
    #7 0x7f9b54a1a5cc in void SiPixelClusterProducer::run<edm::DetSetVector<PixelDigi> >(edm::DetSetVector<PixelDigi> const&, edm::ESHandle<TrackerGeometry> const&, edmNew::DetSetVector<SiPixelCluster>&) /build/chrjones/asan/CMSSW_11_2_ASAN_X_2020-06-11-1100/src/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelClusterProducer.cc:181
    #8 0x7f9b54a0ae8f in SiPixelClusterProducer::produce(edm::Event&, edm::EventSetup const&) /build/chrjones/asan/CMSSW_11_2_ASAN_X_2020-06-11-1100/src/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelClusterProducer.cc:121
cmsbuild commented 4 years ago

New categories assigned: reconstruction

@slava77,@perrotta you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel commented 4 years ago

assign upgrade

cmsbuild commented 4 years ago

New categories assigned: upgrade

@kpedro88 you have been requested to review this Pull request/Issue and eventually sign? Thanks

Dr15Jones commented 4 years ago

The most likely cause of the problem is SiPixelArrayBuffer is configured with the wrong size or the arguments passed to SiPixelArrayBuffer::set_adc are inappropriate for the configured hardware.

mmusich commented 4 years ago

Since it's recent and it affects only a very specific geometry i.e. 3D pixels (@Dr15Jones please confirm it's the only wf affected), it must be linked somehow to https://github.com/cms-sw/cmssw/pull/30170. @duartej

duartej commented 4 years ago

@mmusich, I will try to check. Could you point me from where did you guess the heap-buffer-overflow is caused by the 3D?

mmusich commented 4 years ago

The only failing workflow is a 3D pixel one.

Dr15Jones commented 4 years ago

There is only one workflow which is failing this way.

duartej commented 4 years ago

It is not failing in the workflow 24834.0 as well? Or that workflow was not run?

makortel commented 4 years ago

24834.0 is run and succeeds.

duartej commented 4 years ago

This is weird, because 24834.0 is using 3D as well, right @mmusich?
Anyway, I'm looking at it, but it's kind of tricky issue...

kpedro88 commented 4 years ago

@duartej in this case, you can change the vector accessor here: https://github.com/cms-sw/cmssw/blob/3c07daea7fca7cc40dfe03570a6096800920093e/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelArrayBuffer.h#L67-L71 If you use .at(), it will check the bounds and throw an exception for out-of-bounds indices.

duartej commented 4 years ago

ok, I'll check, thanks.

I'm able to generate a crash when I compile with debug symbols, but without them there is no crash... (I'm trying to print out the relevant lines of the crash, but EOS seems to be failing...)

duartej commented 4 years ago

By just compiling with debug symbols (I have RecoLocalTracker/SiPixelRecHits and SimTracker/SiPhase2Digitizer packages), the crash appearing is:

Crash message ``` A fatal system signal has occurred: segmentation violation The following is the call stack containing the origin of the signal. Fri Jun 12 22:06:02 CEST 2020 Thread 5 (Thread 0x7f85771ff700 (LWP 30508)): #0 0x00007f8690e46a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007f869142962c in __gthread_cond_wait (__mutex=, __cond=) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_11_1_0_pre7-slc7_amd64_gcc820/build/CMSSW_11_1_0_pre7-build/BUILD/slc7_amd64_gcc820/external/gcc/8.2.0-bcolbf/gcc-8.2.0/obj/x86_64-unknown-linux-gnu/libstdc++-v3/include/x86_64-unknown-linux-gnu/bits/gthr-default.h:864 #2 std::condition_variable::wait (this=, __lock=...) at ../../../../../libstdc++-v3/src/c++11/condition_variable.cc:53 #3 0x00007f8650be3a3c in Eigen::ThreadPoolTempl::WorkerLoop(int) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-06-07-0000/external/slc7_amd64_gcc820/lib/libtensorflow_framework.so.2 #4 0x00007f8650be0963 in std::_Function_handler)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-06-07-0000/external/slc7_amd64_gcc820/lib/libtensorflow_framework.so.2 #5 0x00007f869142ed2f in execute_native_thread_routine () at ../../../../../libstdc++-v3/src/c++11/thread.cc:80 #6 0x00007f8690e42ea5 in start_thread () from /lib64/libpthread.so.0 #7 0x00007f8690b6b8dd in clone () from /lib64/libc.so.6 Thread 4 (Thread 0x7f85782b1700 (LWP 30507)): #0 0x00007f8690e46a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007f869142962c in __gthread_cond_wait (__mutex=, __cond=) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_11_1_0_pre7-slc7_amd64_gcc820/build/CMSSW_11_1_0_pre7-build/BUILD/slc7_amd64_gcc820/external/gcc/8.2.0-bcolbf/gcc-8.2.0/obj/x86_64-unknown-linux-gnu/libstdc++-v3/include/x86_64-unknown-linux-gnu/bits/gthr-default.h:864 #2 std::condition_variable::wait (this=, __lock=...) at ../../../../../libstdc++-v3/src/c++11/condition_variable.cc:53 #3 0x00007f8650be3a3c in Eigen::ThreadPoolTempl::WorkerLoop(int) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-06-07-0000/external/slc7_amd64_gcc820/lib/libtensorflow_framework.so.2 #4 0x00007f8650be0963 in std::_Function_handler)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-06-07-0000/external/slc7_amd64_gcc820/lib/libtensorflow_framework.so.2 #5 0x00007f869142ed2f in execute_native_thread_routine () at ../../../../../libstdc++-v3/src/c++11/thread.cc:80 #6 0x00007f8690e42ea5 in start_thread () from /lib64/libpthread.so.0 #7 0x00007f8690b6b8dd in clone () from /lib64/libc.so.6 Thread 3 (Thread 0x7f8578ab2700 (LWP 30506)): #0 0x00007f8690e46a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007f869142962c in __gthread_cond_wait (__mutex=, __cond=) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_11_1_0_pre7-slc7_amd64_gcc820/build/CMSSW_11_1_0_pre7-build/BUILD/slc7_amd64_gcc820/external/gcc/8.2.0-bcolbf/gcc-8.2.0/obj/x86_64-unknown-linux-gnu/libstdc++-v3/include/x86_64-unknown-linux-gnu/bits/gthr-default.h:864 #2 std::condition_variable::wait (this=, __lock=...) at ../../../../../libstdc++-v3/src/c++11/condition_variable.cc:53 #3 0x00007f8650be3a3c in Eigen::ThreadPoolTempl::WorkerLoop(int) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-06-07-0000/external/slc7_amd64_gcc820/lib/libtensorflow_framework.so.2 #4 0x00007f8650be0963 in std::_Function_handler)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-06-07-0000/external/slc7_amd64_gcc820/lib/libtensorflow_framework.so.2 #5 0x00007f869142ed2f in execute_native_thread_routine () at ../../../../../libstdc++-v3/src/c++11/thread.cc:80 #6 0x00007f8690e42ea5 in start_thread () from /lib64/libpthread.so.0 #7 0x00007f8690b6b8dd in clone () from /lib64/libc.so.6 Thread 2 (Thread 0x7f8677064700 (LWP 29270)): #0 0x00007f8690e4a1d9 in waitpid () from /lib64/libpthread.so.0 #1 0x00007f8683a3fa47 in edm::service::cmssw_stacktrace_fork() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-06-07-0000/lib/slc7_amd64_gcc820/pluginFWCoreServicesPlugins.so #2 0x00007f8683a4050a in edm::service::InitRootHandlers::stacktraceHelperThread() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-06-07-0000/lib/slc7_amd64_gcc820/pluginFWCoreServicesPlugins.so #3 0x00007f869142ed2f in execute_native_thread_routine () at ../../../../../libstdc++-v3/src/c++11/thread.cc:80 #4 0x00007f8690e42ea5 in start_thread () from /lib64/libpthread.so.0 #5 0x00007f8690b6b8dd in clone () from /lib64/libc.so.6 Thread 1 (Thread 0x7f868f01f4c0 (LWP 29185)): #0 0x00007f8690b60c3d in poll () from /lib64/libc.so.6 #1 0x00007f8683a3feaf in full_read.constprop () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-06-07-0000/lib/slc7_amd64_gcc820/pluginFWCoreServicesPlugins.so #2 0x00007f8683a405ec in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-06-07-0000/lib/slc7_amd64_gcc820/pluginFWCoreServicesPlugins.so #3 0x00007f8683a414c9 in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-06-07-0000/lib/slc7_amd64_gcc820/pluginFWCoreServicesPlugins.so #4 #5 0x00007f869208fc32 in atomic_load_p (mo=atomic_memory_order_acquire, a=) at include/jemalloc/internal/atomic_gcc_atomic.h:31 #6 extent_arena_get (extent=0x0) at include/jemalloc/internal/extent_inlines.h:51 #7 je_large_dalloc (tsdn=tsdn@entry=0x7f868f018878, extent=0x0) at src/large.c:361 #8 0x00007f869204dcbd in arena_dalloc_large (slow_path=, szind=, tcache=, ptr=, tsdn=) at include/jemalloc/internal/arena_inlines_b.h:281 #9 arena_dalloc (slow_path=, alloc_ctx=, tcache=, ptr=, tsdn=) at include/jemalloc/internal/arena_inlines_b.h:323 #10 idalloctm (slow_path=, is_internal=, alloc_ctx=, tcache=, ptr=, tsdn=) at include/jemalloc/internal/jemalloc_internal_inlines_c.h:118 #11 ifree (slow_path=, tcache=, ptr=, tsd=) at src/jemalloc.c:2589 #12 je_free_default (ptr=) at src/jemalloc.c:2799 #13 0x00007f869204e367 in free (ptr=) at src/jemalloc.c:2867 #14 0x00007f86920b44a5 in operator delete (ptr=) at src/jemalloc_cpp.cpp:107 #15 0x00007f8669eed63e in __gnu_cxx::new_allocator::deallocate (this=0x7f8583774e70, __p=) at /cvmfs/cms-ib.cern.ch/nweek-02632/slc7_amd64_gcc820/external/gcc/8.2.0-bcolbf/include/c++/8.4.0/ext/new_allocator.h:116 #16 std::allocator_traits >::deallocate (__a=..., __n=, __p=) at /cvmfs/cms-ib.cern.ch/nweek-02632/slc7_amd64_gcc820/external/gcc/8.2.0-bcolbf/include/c++/8.4.0/bits/alloc_traits.h:462 #17 std::_Vector_base >::_M_deallocate (this=0x7f8583774e70, __n=, __p=) at /cvmfs/cms-ib.cern.ch/nweek-02632/slc7_amd64_gcc820/external/gcc/8.2.0-bcolbf/include/c++/8.4.0/bits/stl_vector.h:304 #18 std::_Vector_base >::~_Vector_base (this=0x7f8583774e70, __in_chrg=) at /cvmfs/cms-ib.cern.ch/nweek-02632/slc7_amd64_gcc820/external/gcc/8.2.0-bcolbf/include/c++/8.4.0/bits/stl_vector.h:285 #19 std::vector >::~vector (this=0x7f8583774e70, __in_chrg=) at /cvmfs/cms-ib.cern.ch/nweek-02632/slc7_amd64_gcc820/external/gcc/8.2.0-bcolbf/include/c++/8.4.0/bits/stl_vector.h:570 #20 PixelCPEBase::~PixelCPEBase (this=0x7f8583774e00, __in_chrg=) at /afs/cern.ch/user/d/duarte/CMSSW/CMSSW_11_2_X_2020-06-07-0000/src/RecoLocalTracker/SiPixelRecHits/interface/PixelCPEBase.h:54 #21 PixelCPEGeneric::PixelCPEGeneric (this=0x7f8583774e00, conf=..., mag=, geom=..., ttopo=..., lorentzAngle=, genErrorDBObject=0x0, lorentzAngleWidth=0x7f858375aaa0) at /afs/cern.ch/user/d/duarte/CMSSW/CMSSW_11_2_X_2020-06-07-0000/src/RecoLocalTracker/SiPixelRecHits/src/PixelCPEGeneric.cc:37 #22 0x00007f8669f7c02f in PixelCPEGenericESProducer::produce(TkPixelCPERecord const&) () from /afs/cern.ch/user/d/duarte/CMSSW/CMSSW_11_2_X_2020-06-07-0000/lib/slc7_amd64_gcc820/pluginRecoLocalTrackerSiPixelRecHitsPlugins.so #23 0x00007f8669f7ecb7 in edm::eventsetup::CallbackProxy >, TkPixelCPERecord, edm::eventsetup::CallbackSimpleDecorator >, TkPixelCPERecord, std::unique_ptr > >::getImpl(edm::eventsetup::EventSetupRecordImpl const&, edm::eventsetup::DataKey const&, edm::EventSetupImpl const*) () from /afs/cern.ch/user/d/duarte/CMSSW/CMSSW_11_2_X_2020-06-07-0000/lib/slc7_amd64_gcc820/pluginRecoLocalTrackerSiPixelRecHitsPlugins.so #24 0x00007f86935ca3e7 in edm::eventsetup::DataProxy::get(edm::eventsetup::EventSetupRecordImpl const&, edm::eventsetup::DataKey const&, bool, edm::ActivityRegistry const*, edm::EventSetupImpl const*) const () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-06-07-0000/lib/slc7_amd64_gcc820/libFWCoreFramework.so #25 0x00007f8693629d45 in edm::eventsetup::EventSetupRecordImpl::getFromProxy(edm::ESProxyIndex, bool, edm::eventsetup::ComponentDescription const*&, edm::eventsetup::DataKey const*&, edm::EventSetupImpl const*) const () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-06-07-0000/lib/slc7_amd64_gcc820/libFWCoreFramework.so #26 0x00007f8669f84eb9 in cms::SiPixelRecHitConverter::produce(edm::Event&, edm::EventSetup const&) () from /afs/cern.ch/user/d/duarte/CMSSW/CMSSW_11_2_X_2020-06-07-0000/lib/slc7_amd64_gcc820/pluginRecoLocalTrackerSiPixelRecHitsPlugins.so #27 0x00007f8693713416 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventPrincipal const&, edm::EventSetupImpl const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-06-07-0000/lib/slc7_amd64_gcc820/libFWCoreFramework.so #28 0x00007f86936f0be3 in edm::WorkerT::implDo(edm::EventPrincipal const&, edm::EventSetupImpl const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-06-07-0000/lib/slc7_amd64_gcc820/libFWCoreFramework.so #29 0x00007f869365af0a in decltype ({parm#1}()) edm::convertException::wrap >(edm::OccurrenceTraits::MyPrincipal const&, edm::EventSetupImpl const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*)::{lambda()#1}>(bool edm::Worker::runModule >(edm::OccurrenceTraits::MyPrincipal const&, edm::EventSetupImpl const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*)::{lambda()#1}) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-06-07-0000/lib/slc7_amd64_gcc820/libFWCoreFramework.so #30 0x00007f869365b0de in bool edm::Worker::runModule >(edm::OccurrenceTraits::MyPrincipal const&, edm::EventSetupImpl const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-06-07-0000/lib/slc7_amd64_gcc820/libFWCoreFramework.so #31 0x00007f869365b43b in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch >(std::__exception_ptr::exception_ptr const*, edm::OccurrenceTraits::MyPrincipal const&, edm::EventSetupImpl const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-06-07-0000/lib/slc7_amd64_gcc820/libFWCoreFramework.so #32 0x00007f869365cadb in edm::Worker::RunModuleTask >::execute() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-06-07-0000/lib/slc7_amd64_gcc820/libFWCoreFramework.so #33 0x00007f8691e1525d in tbb::internal::custom_scheduler::process_bypass_loop (this=this@entry=0x7f868d9b3200, context_guard=..., t=t@entry=0x7f8576218d40, isolation=isolation@entry=0) at ../../src/tbb/custom_scheduler.h:388 #34 0x00007f8691e15555 in tbb::internal::custom_scheduler::local_wait_for_all (this=0x7f868d9b3200, parent=..., child=) at ../../include/tbb/task.h:992 #35 0x00007f86935ea165 in edm::EventProcessor::processLumis(std::shared_ptr const&) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-06-07-0000/lib/slc7_amd64_gcc820/libFWCoreFramework.so #36 0x00007f86935f1bde in edm::EventProcessor::runToCompletion() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-06-07-0000/lib/slc7_amd64_gcc820/libFWCoreFramework.so #37 0x0000000000412d9b in main::{lambda()#1}::operator()() const () #38 0x0000000000411362 in main () Current Modules: Module: SiPixelRecHitConverter:siPixelRecHits (crashed) A fatal system signal has occurred: segmentation violation Segmentation fault (core dumped) ```

I checked the .at() method in SiPixelArrayBuffer.h as @kpedro88 suggested and the crash is exactly the same as before.

void SiPixelArrayBuffer::set_adc(int row, int col, int adc) { pixel_vec.at(index(row, col)) = adc; }
void SiPixelArrayBuffer::set_adc(const SiPixelCluster::PixelPos& pix, int adc) { pixel_vec.at(index(pix)) = adc; }
void SiPixelArrayBuffer::add_adc(int row, int col, int adc) { pixel_vec.at(index(row, col)) += adc; }

I will continue investigating..

duartej commented 4 years ago

I changed the geometry in my config file to use one without 3D:

#process.load('Configuration.Geometry.GeometryExtended2026D54Reco_cff')
process.load('Configuration.Geometry.GeometryExtended2026D57Reco_cff')
...
#process.GlobalTag = GlobalTag(process.GlobalTag, 'auto:phase2_realistic_T19', '')
process.GlobalTag = GlobalTag(process.GlobalTag, 'auto:phase2_realistic_T15', '')

and there is a segmentation violation as well, exactly the same than in the previous comment.

If I compile without USER_CXXFLAGS="-g -D=EDM_ML_DEBUG" flag, there is no crash in any of the tested geometries...

mmusich commented 4 years ago

@duartej I am confused by your tests. Let me state some facts:

What kind of test config did you use?

duartej commented 4 years ago

@mmusich, the #30170 as you can check, involves 1 line of code and a very precise change. Despite it could be always collateral effects, even more I'm not an expert on that software; I would be very surprise that this change is direct cause of the reported crash. However, and as you said, is failing in a workflow of 3D and in the first IB after the #30170 merging:

  1. I've try to track back the potential incidence of the change introduced in #30170, and I didn't manage to find anything.
  2. Next, I included the debug compilation to try to reproduce the error quoted in https://github.com/cms-sw/cmssw/issues/30215#issuecomment-643349013 and before downloading the SiPixelClusterizer package, I found a crash (segmentation violation present in the RecoLocalTracker/SiPixelRecHits package. Without debug symbols, the job worked fine. This was suspicious...
  3. However, I downloaded the RecoLocalTracker/SiPixelClusterizer to try to reproduce exactly the error, but I still found the segmentation violation in SiPixelRecHits. So, I concentrate myself in trying to resolve this issue because I suspect should be related with the clusterize one. In any case I need to resolve it first to get rid of the error.
  4. I change the vector operator[] to use the safest at (as suggested in https://github.com/cms-sw/cmssw/issues/30215#issuecomment-643389702) but no bound error appears, still segmentation violation
  5. I switched the geometry to a fully planar one (T15), and the segmentation violation remains
  6. Given that the error appears with the debug compilation, the LogDebug code present should be triggering the error... I'm checking this right now

I'm using a digitization + reco job over a 4Mu generated sample, I can post here the config if you want.

Of course, please feel free to suggest me any other approach you consider.

makortel commented 4 years ago

6. Given that the error appears with the debug compilation, the LogDebug code present should be triggering the error... I'm checking this right now

Debug symbols and LogDebug should be independent of each other. That is, LogDebug compiles to nothing unless EDM_ML_DEBUG macro is enabled. Did you enable it?

Of course, please feel free to suggest me any other approach you consider.

26234.0 is ttbar, I would try to reproduce the problem and investigate by running it.

VinInn commented 4 years ago

what about valgrind? looks lke a Major (random) memory corruption

Dr15Jones commented 4 years ago

The error is causing the code to write to memory it does not own. Building with debug symbols shifts memory around and therefore can cause this problem to manifest differently. That can lead to a crash going away or a crash appearing. The only tools I’ve found that can reliably help with the problem are ASAN or valgrind.

Dr15Jones commented 4 years ago

I made the following change

diff --git a/RecoLocalTracker/SiPixelClusterizer/plugins/PixelThresholdClusterizer.cc b/RecoLocalTracker/SiPixelClusterizer/plugins/PixelThresholdClusterizer.cc
index 0871fececa2..993799c2269 100644
--- a/RecoLocalTracker/SiPixelClusterizer/plugins/PixelThresholdClusterizer.cc
+++ b/RecoLocalTracker/SiPixelClusterizer/plugins/PixelThresholdClusterizer.cc
@@ -276,7 +276,12 @@ void PixelThresholdClusterizer::copy_to_buffer(DigiIterator begin, DigiIterator
     */

     if (adc >= thePixelThreshold) {
-      theBuffer.set_adc(row, col, adc);
+      if(theBuffer.inside(row,col)) {
+          theBuffer.set_adc(row, col, adc);
+        } else {
+          std::cout <<" OUTSIDE OF BOUNDS row: "<<row<<" col: "<<col<<std::endl;
+          assert(theBuffer.inside(row,col));
+        }
       if (adc >= theSeedThreshold)
         theSeeds.push_back(SiPixelCluster::PixelPos(row, col));
     }

then running the workflow in question in the ASAN IB I got

Begin processing the 1st record. Run 1, Event 3, LumiSection 1 on stream 0 at 13-Jun-2020 16:27:02.414 CEST
 OUTSIDE OF BOUNDS row: 170 col: 434
cmsRun: /build/chrjones/asan/CMSSW_11_2_ASAN_X_2020-06-11-1100/src/RecoLocalTracker/SiPixelClusterizer/plugins/PixelThresholdClusterizer.cc:283: void PixelThresholdClusterizer::copy_to_buffer(PixelClusterizerBase::DigiIterator, PixelClusterizerBase::DigiIterator): Assertion `theBuffer.inside(row,col)' failed.
mmusich commented 4 years ago

@duartej FYI I could reproduce the error by doing:

cmsrel CMSSW_11_2_ASAN_X_2020-06-11-1100
cd CMSSW_11_2_ASAN_X_2020-06-11-1100/src/
cmsenv
runTheMatrix.py -l 26234.0 -t 4 -j 8

that fails at step3 as reported in the issue. Also about what reported by @Dr15Jones

 OUTSIDE OF BOUNDS row: 170 col: 434

Normally a pixel phase-2 module should have either: 1354*434 or 672*434 pixels in it (see https://github.com/cms-sw/cmssw/blob/master/Geometry/TrackerCommonData/data/PhaseII/TiltedTracker613/pixelStructureTopology.xml), so that digi at col 434 should not exist.

@emiglior

duartej commented 4 years ago

Thanks for the catch @Dr15Jones and for the instructions to reproduce it @mmusich.

Indeed the problem is an out of bounds digi created at the pixel endcap. I think the problem appears because the digitization for the forward region is not properly implemented yet, as it is explained at the description (and slides linked there) of the PR-https://github.com/cms-sw/cmssw/pull/29018 , and probably the digitization of the energy deposit was made with wrong assumptions.

This is the module were the digi is outside bounds:

PixelThresholdCLusterizerPixelEndcap Side   1- Disk   5 Blade  1 Panel  1 Module 1 (345248772)
OUTSIDE BOUNDS row:170 col: 434

But I'm not fully sure about that because if the issue is related with the Forward implementation, why the problem wasn't spotted initially instead of now? In fact the tag CMSSW_11_2_ASAN_X_2020-06-11-1100 doesn't seem to contain the PR https://github.com/cms-sw/cmssw/pull/30170 (at least I'm adding the SiTracker/SiPhase2DIgitizer package in that tag and I cannot find my changes...). Could you confirm that is the case?

So I'm a bit lost here, I can try to include a proper implementation of the Pixel3D digitizer for the Forward region and I can prioritize this work, but I'm not sure when it could be finished. In parallel, I would try to follow up the digi creation and see if I can understand exactly why is laying outside the boundaries of the chip.

VinInn commented 4 years ago

In any case I think one shall implement a boundary check in the digitizer (numerical precision can always cause a "overflow"). Corrective action can be either skip or assign to "closest"

duartej commented 4 years ago

I can confirm that the issue appears when the simulated particle pass through exactly the edge of the sensitive area (y=21.725 mm):

PSimHit-entry point:  (-0.415714,2.1725,-0.00476743)  exit point:  (-0.398674,2.1725,-0.00744566)  Track id: 222212 detUnitId: 345248772

See in tkLayout cfg the sensor width (43.45 mm)/(2 ROCs) = 21.725 mm

The digitizer takes the column number provided by the RectangularPixelTopology function std::pair<float, float> pixel(const LocalPoint& p) const override;, and the 3D digitizer assumes that function provides a valid bounded number, which is not the case.

Why the planar digitizer is not propagating that wrong column number while 3D digitizer does? The planar digitizer introduces a check for the diffused charge not to be outside the boundary pixels (see L454-L457). This check is introduced because there is a pixel charge migration on the digitizer (Lorentz angle, diffusion), but this check absorbs any initial wrong pixel assignation, as it was due to the digitizer charge migration.

The 3D digitizer does not migrate charge by Lorentz angle (although it does by diffusion, the actual acceptance for doing it is very low), therefore no check was introduced, although this check must be introduced because of the diffusion. However, even with that check the pathological case we are discussing would be failing as well because there is no diffusion involved.

Therefore, I can introduce a check as @VinInn is suggesting in the digitizer. However, I think this would be hiding the real issue which is why the topological classes did not correct that boundary particles. Unless that's not the job of those classes and by design should be the digitizer job to check for boundaries, even if the digitizer does not change the pixel position. I suppose this could be the case if the digitizer is the only client of those topological classes, otherwise anyone using those classes, should check about boundaries (which is fine, but I'm not sure if this is intended, at least I did not find anything explicitly saying so, and I did find a kind of logic to avoid outside pixel which clearly is not working)

Please, let me know then if you want me to introduce this check in the Pixel 3D digitizer (which I would do for the diffusion case anyway, although depending your answer, checking as well for the initial pixel) or follow up with this discussion.

VinInn commented 4 years ago

One can easily verify the clients of RectPixTop with https://cmssdt.cern.ch/dxr/CMSSW/search?q=%2Bcallers%3A%22RectangularPixelTopology%3A%3Apixel%28const+LocalPoint+%26%29+const%22

In reality the main client is reconstruction.

Notably from here https://cmssdt.cern.ch/dxr/CMSSW/source/RecoTracker/MeasurementDet/plugins/TkPixelMeasurementDet.cc#103

SO it is a feature to return an out-of-bound (pixel is a unit of measurement as any other, can be negative as well) and any check shall be implemented in the client

VinInn commented 4 years ago

I shall admin that this one https://cmssdt.cern.ch/dxr/CMSSW/source/Geometry/TrackerGeometryBuilder/interface/RectangularPixelTopology.h#100 will of course produce bizarre results in case of negative or too large values... (apparently used in simulation or validation only https://cmssdt.cern.ch/dxr/CMSSW/search?q=%2Bcallers%3A%22RectangularPixelTopology%3A%3Achannel%28const+LocalPoint+%26%29+const%22 ) maybe one wish to introduce min and max (in any case shifting and masking negative int does not necessarily produce the result one naively expect)

duartej commented 4 years ago

@VinInn also https://cmssdt.cern.ch/dxr/CMSSW/source/SimFastTiming/FastTimingCommon/src/ETLDeviceSim.cc#58 , where it will introduce a wrong pixel number into the collection of hits...

duartej commented 4 years ago

SO it is a feature to return an out-of-bound (pixel is a unit of measurement as any other, can be negative as well) and any check shall be implemented in the client

Sorry, I disagree, pixel is a ROC unit, and the limit are defined by the number of channels of the ROC.

duartej commented 4 years ago

So, what's the decision here? Should I include the bound check in the 3d digitizer?

VinInn commented 4 years ago

You may disagree, still this is how it is used in reconstruction and RecoLocalTracker belongs to reconstruction (and you may have noticed that the return type is float, so it is a measurement unit, not an index in the ROC and in any case "pixel" is at detector level, not at ROC level). My suggest is to introduce a protection in the digitization algorithm that is bound to return valid digis.

duartej commented 4 years ago

I still disagree with your arguments, but as you implied, and I agree with that, this is not my job to decide how is pixel function used or interpreted.

OK, I will implement the check and post here the changes and the tests before request to merge.

duartej commented 4 years ago

I've included the following code in the digitizer to move back to the edge col/row any out-of-bounds signal:

diff --git a/SimTracker/SiPhase2Digitizer/plugins/Pixel3DDigitizerAlgorithm.cc b/SimTracker/SiPhase2Digitizer/plugins/Pixel3DDigitizerAlgorithm.cc
index 9a1111b4176..735cc026b02 100644
--- a/SimTracker/SiPhase2Digitizer/plugins/Pixel3DDigitizerAlgorithm.cc
+++ b/SimTracker/SiPhase2Digitizer/plugins/Pixel3DDigitizerAlgorithm.cc
@@ -268,6 +268,9 @@ std::vector<DigitizerUtility::SignalPoint> Pixel3DDigitizerAlgorithm::drift(
   const auto pitch = pixdet->specificTopology().pitch();
   const auto half_pitch = std::make_pair<float, float>(pitch.first * 0.5, pitch.second * 0.5);
   const float thickness = pixdet->specificSurface().bounds().thickness();
+  const int nrows = pixdet->specificTopology().nrows();
+  const int ncolumns = pixdet->specificTopology().ncolumns();
+  const float pix_rounding = 0.99;

   // the maximum radial distance is going to be use to evaluate radiation damage XXX?
   const float max_radial_distance =
@@ -300,7 +303,29 @@ std::vector<DigitizerUtility::SignalPoint> Pixel3DDigitizerAlgorithm::drift(

   for (const auto& super_charge : ionization_points) {
     // Extract the pixel cell
-    const auto current_pixel = pixdet->specificTopology().pixel(LocalPoint(super_charge.x(), super_charge.y()));
+    auto current_pixel = pixdet->specificTopology().pixel(LocalPoint(super_charge.x(), super_charge.y()));
+    // `pixel` function does not check to be in the ROC bounds,
+    // so check it here and fix potential rounding problems.
+    // Careful, this is assuming a rounding problem (1 unit), more than 1 pixel
+    // away is probably showing some backward problem worth it to track.
+    // This is also correcting out of bounds migrated charge from diffusion.
+    // The charge will be moved to the edge of the row/column.
+    if (current_pixel.first >= nrows)
+    {
+        current_pixel.first = (nrows-1)+pix_rounding;
+    }
+    if (current_pixel.first < 0)
+    {
+        current_pixel.first = 0.0;
+    }
+    if (current_pixel.second >= ncolumns)
+    {
+        current_pixel.second = (ncolumns-1)+pix_rounding;
+    }
+    if (current_pixel.second < 0)
+    {
+        current_pixel.second = 0.0;
+    }
     const auto current_pixel_int = std::make_pair(std::floor(current_pixel.first), std::floor(current_pixel.second));

And tested with the same workflow/CMSSW than https://github.com/cms-sw/cmssw/issues/30215#issue-637848361 with no problems:

runTheMatrix.py -l 26234.0 -t 4 -j 8
...
26234.0_TTbar_14TeV+TTbar_14TeV_TuneCP5_2026D55_GenSimHLBeamSpotFull14+DigiFullTrigger_2026D55+RecoFullGlobal_2026D55+HARVESTFullGlobal_2026D55 Step0-PASSED Step1-PASSED Step2-PASSED Step3-PASSED  - time date Tue Jun 16 09:20:46 2020-date Tue Jun 16 09:07:00 2020; exit: 0 0 0 0
1 1 1 1 tests passed, 0 0 0 0 failed

Please let me know if you want me to create a PR with this commit.

mmusich commented 4 years ago

@duartej what about using std::clamp instead of the ifs

duartej commented 4 years ago

@mmusich sure, I suppose, but it's available since C++17 standard, is that ok?

mmusich commented 4 years ago

but it's available since C++17 standard, is that ok?

yes it's already widely in use in cmssw https://github.com/cms-sw/cmssw/search?q=std%3A%3Aclamp&unscoped_q=std%3A%3Aclamp

duartej commented 4 years ago

Ok, I've re-implemented with std::clamp (thanks @mmusich, much more compact for sure). Also I tested successfully with runTheMatrix.py -l 26234.0 -t 4 -j 8.

diff --git a/SimTracker/SiPhase2Digitizer/plugins/Pixel3DDigitizerAlgorithm.cc b/SimTracker/SiPhase2Digitizer/plugins/Pixel3DDigitizerAlgorithm.cc
index 9a1111b4176..b5e107d4f78 100644
--- a/SimTracker/SiPhase2Digitizer/plugins/Pixel3DDigitizerAlgorithm.cc
+++ b/SimTracker/SiPhase2Digitizer/plugins/Pixel3DDigitizerAlgorithm.cc
@@ -268,6 +268,9 @@ std::vector<DigitizerUtility::SignalPoint> Pixel3DDigitizerAlgorithm::drift(
   const auto pitch = pixdet->specificTopology().pitch();
   const auto half_pitch = std::make_pair<float, float>(pitch.first * 0.5, pitch.second * 0.5);
   const float thickness = pixdet->specificSurface().bounds().thickness();
+  const int nrows = pixdet->specificTopology().nrows();
+  const int ncolumns = pixdet->specificTopology().ncolumns();
+  const float pix_rounding = 0.99;

   // the maximum radial distance is going to be use to evaluate radiation damage XXX?
   const float max_radial_distance =
@@ -300,7 +303,16 @@ std::vector<DigitizerUtility::SignalPoint> Pixel3DDigitizerAlgorithm::drift(

   for (const auto& super_charge : ionization_points) {
     // Extract the pixel cell
-    const auto current_pixel = pixdet->specificTopology().pixel(LocalPoint(super_charge.x(), super_charge.y()));
+    auto current_pixel = pixdet->specificTopology().pixel(LocalPoint(super_charge.x(), super_charge.y()));
+    // `pixel` function does not check to be in the ROC bounds,
+    // so check it here and fix potential rounding problems.
+    // Careful, this is assuming a rounding problem (1 unit), more than 1 pixel
+    // away is probably showing some backward problem worth it to track.
+    // This is also correcting out of bounds migrated charge from diffusion.
+    // The charge will be moved to the edge of the row/column.
+    current_pixel.first  = std::clamp(current_pixel.first, float(0.0), (nrows-1)+pix_rounding);
+    current_pixel.second = std::clamp(current_pixel.second, float(0.0), (ncolumns-1)+pix_rounding);
+
     const auto current_pixel_int = std::make_pair(std::floor(current_pixel.first), 

Please, let me know if you want me to PR it.

mmusich commented 4 years ago

@duartej yes please go ahead. Thanks

mmusich commented 4 years ago

Dear all, as far as I can see in CMSSW_11_2_ASAN_X_2020-06-19-2300 wf 26234.0 step 3 runs fine, see log. Can please the relevant CMSSW L2s sign this issue and close it?

kpedro88 commented 4 years ago

+upgrade

perrotta commented 4 years ago

+1

cmsbuild commented 4 years ago

This issue is fully signed and ready to be closed.