LDMX-Software / ldmx-sw

The Light Dark Matter eXperiment simulation and reconstruction framework.
https://ldmx-software.github.io
GNU General Public License v3.0
22 stars 20 forks source link

std::bad_alloc exception when running it_pileup #1111

Closed tomeichlersmith closed 2 years ago

tomeichlersmith commented 2 years ago

Describe the bug The it_pileup validation sample failed during run number 5 almost immediately after processing began. The full workflow log is available online for the next month (ish). The relevant parts are copied below.

Interestingly, this is dependent on some randomness since runs 1, 3, and 4 of the it_pileup sample were able to complete successfully.

  ---- LDMXSW: Loading configuration --------
  [2022-11-01 18:09:25.375908] [0x00007fab5b76abc0] [info]    Running configure() 
  [2022-11-01 18:09:31.470089] [0x00007fab5b76abc0] [info]    ElectronCounter is using parameters:  
    input_collection = TriggerPadTracks 
    input_pass_name =  
    output_collection = BeamElectronCount 
    simulated_electron_number = 1 
    use_simulated_electron_number = 0
  ---- LDMXSW: Configuration load complete  --------
  ---- LDMXSW: Starting event processing --------
  terminate called after throwing an instance of 'std::bad_alloc'
    what():  std::bad_alloc
  /etc/entry.sh: line 60:    77 Aborted                 (core dumped) fire config.py
bryngemark commented 2 years ago

Dependent on some randomness, or some deterministic thing?

To me this looks like there might not have been a track, and thus no track collection, in the first event. The electron counter is configured not to use the simulated electron number but the counted (the number of tracks).

I suspect there is no graceful handling of the case when there is no track input collection. If this is caused by not pushing empty collections, I'm not sure (I thought we moved away from that).

bryngemark commented 2 years ago

Also... is this a pileup run with 1e?

tomeichlersmith commented 2 years ago

I think this is pileup run with one PU electron and one ECal PN electron.

It is deterministic in the sense that I think we can replicate it - "random" in the sense of Geant4 happened to create an event that caused this issue.

bryngemark commented 2 years ago

I see. Then we should have 2e as the simulated electron number, if we use it, rather than the number of tracks.

It is a little surprising if we have 0 tracks in a 2e event, but not impossible. One way to check if this is where we fail, is to setup the electron counter to use the simulated electron number.

tomeichlersmith commented 2 years ago

Alright, first I wanted to verify that I'm seeing this issue locally.

# compile and install v3.2.0-beta
cd .github/validation_samples/it_pileup/
ldmx setenv LDMX_RUN_NUMBER=5
ldmx setenv LDMX_NUM_EVENTS=20000
ldmx fire gen_main.py
ldmx fire gen_pileup.py
ldmx fire config.py

which produces the same result as the GitHub action.

Running with using the simulated electron number does not resolve the issue.

[ldmx] eichl008@spa-cms017 ~/ldmx/ldmx-sw/.github/validation_samples/it_pileup> git diff
diff --git a/.github/validation_samples/it_pileup/config.py b/.github/validation_samples/it_pileup/config.py
index a340b82..db0b9f4 100644
--- a/.github/validation_samples/it_pileup/config.py
+++ b/.github/validation_samples/it_pileup/config.py
@@ -82,8 +82,9 @@ ecalDigiVerify.ecalSimHitColl = ecalDigiVerify.ecalSimHitColl+overlayStr
 from LDMX.Recon.electronCounter import ElectronCounter
 from LDMX.Recon.simpleTrigger import TriggerProcessor

-count = ElectronCounter(1,'ElectronCounter')
+count = ElectronCounter(2,'ElectronCounter')
 count.input_pass_name = ''
+count.use_simulated_electron_number = True

 p.sequence.extend([
     ecalDigi, ecalReco, ecalVeto,
[ldmx] eichl008@spa-cms017 ~/ldmx/ldmx-sw/.github/validation_samples/it_pileup> ldmx fire config.py 
---- LDMXSW: Loading configuration --------
[2022-11-02 12:56:01.125173] [0x00007fa0f4231bc0] [info]    Running configure() 
[2022-11-02 12:56:07.277999] [0x00007fa0f4231bc0] [info]    ElectronCounter is using parameters:  
        input_collection = TriggerPadTracks 
        input_pass_name =  
        output_collection = BeamElectronCount 
        simulated_electron_number = 2 
        use_simulated_electron_number = 1
---- LDMXSW: Configuration load complete  --------
---- LDMXSW: Starting event processing --------
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
/etc/entry.sh: line 60: 29682 Aborted                 (core dumped) fire config.py

so I'm not sure what is going on anymore...

bryngemark commented 2 years ago

OK I looked at the config and have a few observations.

  1. you're not overlaying pileup on the TS collections. So there, there will only be the 1e sim hits available to do tracking
  2. you're not adding the electron counter process to the sequence. Then the number of electrons will be -1 (this is a bug that I've flagged before; I think it should default to 1 and not -1) which will lead to not looking up the right ecal sum limit in the trigger thresholds list

I'm surprised this runs for 4 jobs out of 5.

I'll bump the electron count default issue to milestone level.

bryngemark commented 2 years ago

Oh I guess it runs for the other jobs because they do see one electron track in the first event.

tomeichlersmith commented 2 years ago

Apologies, but can you point out the config lines needed to be changed to resolve (1)? I thought the lines below were telling the OverlayProducer to include the TS collections.

https://github.com/LDMX-Software/ldmx-sw/blob/973ac5b757892e73576aa7bb63b6ac0706cb0803/.github/validation_samples/it_pileup/config.py#L29-L31

https://github.com/LDMX-Software/ldmx-sw/blob/973ac5b757892e73576aa7bb63b6ac0706cb0803/.github/validation_samples/it_pileup/config.py#L54-L61

And for (2), I think count is being included

https://github.com/LDMX-Software/ldmx-sw/blob/973ac5b757892e73576aa7bb63b6ac0706cb0803/.github/validation_samples/it_pileup/config.py#L88-L99

bryngemark commented 2 years ago

mh, no don't apologise, you're right, the config I looked at was not updated.

bryngemark commented 2 years ago

I suppose, if we're not dealing gracefully with the non-existence of digi collections in clustering, or clusters in tracking, you'd see this too. Try bumping up the verbosity of the cluster and track producer to 3 so they spit out every step.

tomeichlersmith commented 2 years ago

Turning up the verbosity did not produce the error, so I'm thinking there is a memory-related bug. I re-built with CMAKE_BUILD_TYPE=Debug and ran it through gdb

[ldmx] eichl008@spa-cms017 ~/ldmx/ldmx-sw/.github/validation_samples/it_pileup> ldmx gdb fire
<gdb header>
(gdb) run config.py
<fire output>
 [ Process ] 1 : Processing 1 Run 5 Event 1  (2022-11-07 10:57:15.168801000-0600)
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc

Program received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007ffff64877f1 in __GI_abort () at abort.c:79
#2  0x00007ffff6adc957 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007ffff6ae2ae6 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007ffff6ae2b21 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007ffff6ae2d54 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007ffff6b0b012 in std::__throw_bad_alloc() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007ffff7afaf2c in __gnu_cxx::new_allocator<ldmx::SimTrackerHit>::allocate (
    this=0x7fffffffd520, __n=4099276460821455642) at /usr/include/c++/7/ext/new_allocator.h:102
#8  0x00007ffff7af9aa6 in std::allocator_traits<std::allocator<ldmx::SimTrackerHit> >::allocate (
    __a=..., __n=4099276460821455642) at /usr/include/c++/7/bits/alloc_traits.h:436
#9  0x00007ffff7af6ccc in std::_Vector_base<ldmx::SimTrackerHit, std::allocator<ldmx::SimTrackerHit> >::_M_allocate (this=0x7fffffffd520, __n=4099276460821455642)
    at /usr/include/c++/7/bits/stl_vector.h:172
#10 0x00007fffef92611f in std::_Vector_base<ldmx::SimTrackerHit, std::allocator<ldmx::SimTrackerHit> >::_M_create_storage (this=0x7fffffffd520, __n=4099276460821455642)
    at /usr/include/c++/7/bits/stl_vector.h:187
#11 0x00007fffef922473 in std::_Vector_base<ldmx::SimTrackerHit, std::allocator<ldmx::SimTrackerHit> >::_Vector_base (this=0x7fffffffd520, __n=4099276460821455642, __a=...)
    at /usr/include/c++/7/bits/stl_vector.h:138
#12 0x00007fffef91ae81 in std::vector<ldmx::SimTrackerHit, std::allocator<ldmx::SimTrackerHit> >::vector (this=0x7fffffffd520, __x=std::vector of length -1954685473395, capacity 0 = {...})
    at /usr/include/c++/7/bits/stl_vector.h:328
#13 0x00007fffef91540f in recon::OverlayProducer::produce (this=0x5555558ab7e0, event=...)
    at /export/scratch/users/eichl008/ldmx/ldmx-sw/Recon/src/Recon/OverlayProducer.cxx:308
#14 0x00007ffff7b5062c in framework::Process::process (this=0x555555890860, n=0, event=...)
    at /export/scratch/users/eichl008/ldmx/ldmx-sw/Framework/src/Framework/Process.cxx:402
#15 0x00007ffff7b4f1bb in framework::Process::run (this=0x555555890860)
    at /export/scratch/users/eichl008/ldmx/ldmx-sw/Framework/src/Framework/Process.cxx:286
#16 0x000055555557194f in main (argc=2, argv=0x7fffffffe078)
    at /export/scratch/users/eichl008/ldmx/ldmx-sw/Framework/src/Framework/fire.cxx:113

This looks like the SimTrackerHit collection is literally overflowing the memory register with a size of n=4099276460821455642. Going to look into how those collections are juggled, maybe there is an accidental extra loop somewhere...

Edit: It actually looks like a signed int to unsigned int issue (this=0x7fffffffd520, __x=std::vector of length -1954685473395, capacity 0 = {...}) Now I'm going to look at how the OverlayProducer handles the case where there happens to be no hits in one of the SimTrackerHit collections...

tomeichlersmith commented 2 years ago

Slowing down and stepping through this process with gdb cannot recreate the error :angry: I'm starting to worry that the allocation size is being mis-managed by ROOT.

tomeichlersmith commented 2 years ago

#12 in the back-trace above is showing that the copy constructor is receiving a vector with a negative length and zero capacity. This is a very weird situation for a vector to be in...

tomeichlersmith commented 2 years ago

After some googling, I was able to find how to set a breakpint for std::vector copy constructor:

break std::_Vector_base<ldmx::SimTrackerHit, std::allocator<ldmx::SimTrackerHit> >::_Vector_impl::_Vector_impl(std::allocator<ldmx::SimTrackerHit> const&)

Then I started stepping through. Below I've copied my gdb output and highlighted some stuff with by adding + and triggering diff highlighting.

#3  0x00007fffef913f73 in recon::OverlayProducer::produce (this=0x5555558ab7e0, event=...)
+    at /export/scratch/users/eichl008/ldmx/ldmx-sw/Recon/src/Recon/OverlayProducer.cxx:154 # get from normal event
#4  0x00007ffff7b5062c in framework::Process::process (this=0x555555890860, n=0, event=...)
    at /export/scratch/users/eichl008/ldmx/ldmx-sw/Framework/src/Framework/Process.cxx:402
#5  0x00007ffff7b4f1bb in framework::Process::run (this=0x555555890860)
    at /export/scratch/users/eichl008/ldmx/ldmx-sw/Framework/src/Framework/Process.cxx:286
#6  0x000055555557194f in main (argc=2, argv=0x7fffffffe078)
    at /export/scratch/users/eichl008/ldmx/ldmx-sw/Framework/src/Framework/fire.cxx:113
(gdb) c
Continuing.

Breakpoint 2, std::_Vector_base<ldmx::SimTrackerHit, std::allocator<ldmx::SimTrackerHit> >::_Vector_impl::_Vector_impl (this=0x7fffffffd520, __a=...) at /usr/include/c++/7/bits/stl_vector.h:93
93              : _Tp_alloc_type(__a), _M_start(), _M_finish(), _M_end_of_storage()
(gdb) bt 
#0  std::_Vector_base<ldmx::SimTrackerHit, std::allocator<ldmx::SimTrackerHit> >::_Vector_impl::_Vector_impl (this=0x7fffffffd520, __a=...) at /usr/include/c++/7/bits/stl_vector.h:93
#1  0x00007fffef922460 in std::_Vector_base<ldmx::SimTrackerHit, std::allocator<ldmx::SimTrackerHit> >::_Vector_base (this=0x7fffffffd520, __n=16, __a=...) at /usr/include/c++/7/bits/stl_vector.h:137
#2  0x00007fffef91ae81 in std::vector<ldmx::SimTrackerHit, std::allocator<ldmx::SimTrackerHit> >::vector (this=0x7fffffffd520,
+      __x=std::vector of length 16, capacity 16 = {...}) # reasonable length and capacity
    at /usr/include/c++/7/bits/stl_vector.h:328
#3  0x00007fffef91540f in recon::OverlayProducer::produce (this=0x5555558ab7e0, event=...)
+    at /export/scratch/users/eichl008/ldmx/ldmx-sw/Recon/src/Recon/OverlayProducer.cxx:308 # get from overlayEvent_
#4  0x00007ffff7b5062c in framework::Process::process (this=0x555555890860, n=0, event=...)
    at /export/scratch/users/eichl008/ldmx/ldmx-sw/Framework/src/Framework/Process.cxx:402
#5  0x00007ffff7b4f1bb in framework::Process::run (this=0x555555890860)
    at /export/scratch/users/eichl008/ldmx/ldmx-sw/Framework/src/Framework/Process.cxx:286
#6  0x000055555557194f in main (argc=2, argv=0x7fffffffe078)
    at /export/scratch/users/eichl008/ldmx/ldmx-sw/Framework/src/Framework/fire.cxx:113
(gdb) c
Continuing.

Breakpoint 2, std::_Vector_base<ldmx::SimTrackerHit, std::allocator<ldmx::SimTrackerHit> >::_Vector_impl::_Vector_impl (this=0x7fffffffd520, __a=...) at /usr/include/c++/7/bits/stl_vector.h:93
93              : _Tp_alloc_type(__a), _M_start(), _M_finish(), _M_end_of_storage()
(gdb) bt
#0  std::_Vector_base<ldmx::SimTrackerHit, std::allocator<ldmx::SimTrackerHit> >::_Vector_impl::_Vector_impl (this=0x7fffffffd520, __a=...) at /usr/include/c++/7/bits/stl_vector.h:93
#1  0x00007fffef922460 in std::_Vector_base<ldmx::SimTrackerHit, std::allocator<ldmx::SimTrackerHit> >::_Vector_base (this=0x7fffffffd520, __n=16397105843294490108, __a=...)
    at /usr/include/c++/7/bits/stl_vector.h:137
#2  0x00007fffef91ae81 in std::vector<ldmx::SimTrackerHit, std::allocator<ldmx::SimTrackerHit> >::vector (this=0x7fffffffd520,
+      __x=std::vector of length -2889106, capacity -1303127813010 = {...}) # negative length and capacity
    at /usr/include/c++/7/bits/stl_vector.h:328
#3  0x00007fffef91540f in recon::OverlayProducer::produce (this=0x5555558ab7e0, event=...)
+    at /export/scratch/users/eichl008/ldmx/ldmx-sw/Recon/src/Recon/OverlayProducer.cxx:308 # SECOND get from overlayEvent_
#4  0x00007ffff7b5062c in framework::Process::process (this=0x555555890860, n=0, event=...)
    at /export/scratch/users/eichl008/ldmx/ldmx-sw/Framework/src/Framework/Process.cxx:402
#5  0x00007ffff7b4f1bb in framework::Process::run (this=0x555555890860)
    at /export/scratch/users/eichl008/ldmx/ldmx-sw/Framework/src/Framework/Process.cxx:286
#6  0x000055555557194f in main (argc=2, argv=0x7fffffffe078)
    at /export/scratch/users/eichl008/ldmx/ldmx-sw/Framework/src/Framework/fire.cxx:113
(gdb) c
Continuing.
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
tomeichlersmith commented 2 years ago

For future reference, the below break-point is much more helpful because it only breaks when the vector input to be copied has a negative size.

break /usr/include/c++/7/bits/stl_vector.h:328 if __x.size() < 0
tomeichlersmith commented 2 years ago

In an incredible stroke of weirdness, switching the order of the SimTrackerHit overlay collections makes it work.

Still think there is a memory bug somewhere, got to figure it out.

tomeichlersmith commented 2 years ago

Able to run with only TaggerSimHits, only RecoilSimHits, and [ RecoilSimHits, TaggerSimHits ] (in that order). Something about the transition from TaggerSimHits -> RecoilSimHits is borked.

On this bad event TaggerSimHits does have a size of 16 (a power of two :eyes: ) but this is just numerology

tomeichlersmith commented 2 years ago

Ok, I've used a non-implemented struct to have the compiler printout what it deduced for auto. This confirmed that it is not doing a weird type.

Turning on debug (p.termLogLevel = 0) allows it to run all the way through.

tomeichlersmith commented 2 years ago

Looking at the vector destructor..


Breakpoint 2, std::vector<ldmx::SimTrackerHit, std::allocator<ldmx::SimTrackerHit> >::~vector (      
    this=0x55556869a6e0, __in_chrg=<optimized out>) at /usr/include/c++/7/bits/stl_vector.h:435      
435                           _M_get_Tp_allocator()); }                                              
(gdb) bt                                                                                             
#0  std::vector<ldmx::SimTrackerHit, std::allocator<ldmx::SimTrackerHit> >::~vector (                
    this=0x55556869a6e0, __in_chrg=<optimized out>) at /usr/include/c++/7/bits/stl_vector.h:435      
#1  0x00007ffff7acc8d2 in ROOT::delete_vectorlEldmxcLcLSimTrackerHitgR (p=0x55556869a6e0)            
    at /export/scratch/users/eichl008/ldmx/ldmx-sw/build/Framework/EventDic.cxx:3273                 
#2  0x00007ffff4b1f321 in TBranchElement::ReleaseObject() () from /usr/local/lib/root/libTree.so     
#3  0x00007ffff4b28c10 in TBranchElement::SetAddressImpl(void*, bool) ()                             
   from /usr/local/lib/root/libTree.so                                                               
#4  0x00007fffef92edac in framework::Bus::Passenger<std::vector<ldmx::SimTrackerHit, std::allocator<ldmx::SimTrackerHit> > >::attach<std::vector<ldmx::SimTrackerHit, std::allocator<ldmx::SimTrackerHit> >
 > (this=0x5555689b0910, t=..., tree=0x555556383160, branch_name="RecoilSimHits_test",               
    can_create=false)                                                                                
    at /export/scratch/users/eichl008/ldmx/ldmx-sw/Framework/include/Framework/Bus.h:442             
#5  0x00007fffef92ec4b in framework::Bus::Passenger<std::vector<ldmx::SimTrackerHit, std::allocator<ldmx::SimTrackerHit> > >::attach (this=0x5555689b0910, tree=0x555556383160,
    branch_name="RecoilSimHits_test", can_create=false)                                              
    at /export/scratch/users/eichl008/ldmx/ldmx-sw/Framework/include/Framework/Bus.h:342             
#6  0x00007ffff7b1cd7e in framework::Bus::attach (this=0x5555558aba00, tree=0x555556383160,
    name="RecoilSimHits_test", can_create=false)                                
    at /export/scratch/users/eichl008/ldmx/ldmx-sw/Framework/include/Framework/Bus.h:107
#7  0x00007fffef9215de in framework::Event::getObject<std::vector<ldmx::SimTrackerHit, std::allocator<ldmx::SimTrackerHit> > > (this=0x5555558ab908, collectionName="RecoilSimHits", passName="test")
    at /export/scratch/users/eichl008/ldmx/ldmx-sw/Framework/include/Framework/Event.h:299
#8  0x00007fffef91ae17 in framework::Event::getCollection<ldmx::SimTrackerHit> (          
    this=0x5555558ab908, collectionName="RecoilSimHits", passName="test")                  
    at /export/scratch/users/eichl008/ldmx/ldmx-sw/Framework/include/Framework/Event.h:369
#9  0x00007fffef9153fa in recon::OverlayProducer::produce (this=0x5555558ab7e0, event=...)
    at /export/scratch/users/eichl008/ldmx/ldmx-sw/Recon/src/Recon/OverlayProducer.cxx:310
#10 0x00007ffff7b5062c in framework::Process::process (this=0x555555890860, n=0, event=...)
    at /export/scratch/users/eichl008/ldmx/ldmx-sw/Framework/src/Framework/Process.cxx:402
#11 0x00007ffff7b4f1bb in framework::Process::run (this=0x555555890860)
    at /export/scratch/users/eichl008/ldmx/ldmx-sw/Framework/src/Framework/Process.cxx:286
#12 0x000055555557194f in main (argc=2, argv=0x7fffffffe078)   
    at /export/scratch/users/eichl008/ldmx/ldmx-sw/Framework/src/Framework/fire.cxx:113
(gdb) c
Continuing.                              
terminate called after throwing an instance of 'std::bad_alloc'

But the ROOT delete function is also called earlier and that doesn't lead to a negative size.

ReleaseObject is called in SetAddressImpl if fObject is defined (it is, we define in in Bus::attach) and if the kDeleteObject status bit is set. I have no idea why the kDeleteObject is set, but I am going to try to turn it off.

tomeichlersmith commented 2 years ago

Patching the framework::Bus to explicitly tell ROOT to not touch our :shit: seems to resolve this issue...