Closed tomeichlersmith closed 2 years ago
Dependent on some randomness, or some deterministic thing?
To me this looks like there might not have been a track, and thus no track collection, in the first event. The electron counter is configured not to use the simulated electron number but the counted (the number of tracks).
I suspect there is no graceful handling of the case when there is no track input collection. If this is caused by not pushing empty collections, I'm not sure (I thought we moved away from that).
Also... is this a pileup run with 1e?
I think this is pileup run with one PU electron and one ECal PN electron.
It is deterministic in the sense that I think we can replicate it - "random" in the sense of Geant4 happened to create an event that caused this issue.
I see. Then we should have 2e as the simulated electron number, if we use it, rather than the number of tracks.
It is a little surprising if we have 0 tracks in a 2e event, but not impossible. One way to check if this is where we fail, is to setup the electron counter to use the simulated electron number.
Alright, first I wanted to verify that I'm seeing this issue locally.
# compile and install v3.2.0-beta
cd .github/validation_samples/it_pileup/
ldmx setenv LDMX_RUN_NUMBER=5
ldmx setenv LDMX_NUM_EVENTS=20000
ldmx fire gen_main.py
ldmx fire gen_pileup.py
ldmx fire config.py
which produces the same result as the GitHub action.
Running with using the simulated electron number does not resolve the issue.
[ldmx] eichl008@spa-cms017 ~/ldmx/ldmx-sw/.github/validation_samples/it_pileup> git diff
diff --git a/.github/validation_samples/it_pileup/config.py b/.github/validation_samples/it_pileup/config.py
index a340b82..db0b9f4 100644
--- a/.github/validation_samples/it_pileup/config.py
+++ b/.github/validation_samples/it_pileup/config.py
@@ -82,8 +82,9 @@ ecalDigiVerify.ecalSimHitColl = ecalDigiVerify.ecalSimHitColl+overlayStr
from LDMX.Recon.electronCounter import ElectronCounter
from LDMX.Recon.simpleTrigger import TriggerProcessor
-count = ElectronCounter(1,'ElectronCounter')
+count = ElectronCounter(2,'ElectronCounter')
count.input_pass_name = ''
+count.use_simulated_electron_number = True
p.sequence.extend([
ecalDigi, ecalReco, ecalVeto,
[ldmx] eichl008@spa-cms017 ~/ldmx/ldmx-sw/.github/validation_samples/it_pileup> ldmx fire config.py
---- LDMXSW: Loading configuration --------
[2022-11-02 12:56:01.125173] [0x00007fa0f4231bc0] [info] Running configure()
[2022-11-02 12:56:07.277999] [0x00007fa0f4231bc0] [info] ElectronCounter is using parameters:
input_collection = TriggerPadTracks
input_pass_name =
output_collection = BeamElectronCount
simulated_electron_number = 2
use_simulated_electron_number = 1
---- LDMXSW: Configuration load complete --------
---- LDMXSW: Starting event processing --------
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
/etc/entry.sh: line 60: 29682 Aborted (core dumped) fire config.py
so I'm not sure what is going on anymore...
OK I looked at the config and have a few observations.
I'm surprised this runs for 4 jobs out of 5.
I'll bump the electron count default issue to milestone level.
Oh I guess it runs for the other jobs because they do see one electron track in the first event.
Apologies, but can you point out the config lines needed to be changed to resolve (1)? I thought the lines below were telling the OverlayProducer to include the TS collections.
And for (2), I think count
is being included
mh, no don't apologise, you're right, the config I looked at was not updated.
I suppose, if we're not dealing gracefully with the non-existence of digi collections in clustering, or clusters in tracking, you'd see this too. Try bumping up the verbosity of the cluster and track producer to 3 so they spit out every step.
Turning up the verbosity did not produce the error, so I'm thinking there is a memory-related bug. I re-built with CMAKE_BUILD_TYPE=Debug
and ran it through gdb
[ldmx] eichl008@spa-cms017 ~/ldmx/ldmx-sw/.github/validation_samples/it_pileup> ldmx gdb fire
<gdb header>
(gdb) run config.py
<fire output>
[ Process ] 1 : Processing 1 Run 5 Event 1 (2022-11-07 10:57:15.168801000-0600)
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Program received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007ffff64877f1 in __GI_abort () at abort.c:79
#2 0x00007ffff6adc957 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3 0x00007ffff6ae2ae6 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4 0x00007ffff6ae2b21 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5 0x00007ffff6ae2d54 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6 0x00007ffff6b0b012 in std::__throw_bad_alloc() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7 0x00007ffff7afaf2c in __gnu_cxx::new_allocator<ldmx::SimTrackerHit>::allocate (
this=0x7fffffffd520, __n=4099276460821455642) at /usr/include/c++/7/ext/new_allocator.h:102
#8 0x00007ffff7af9aa6 in std::allocator_traits<std::allocator<ldmx::SimTrackerHit> >::allocate (
__a=..., __n=4099276460821455642) at /usr/include/c++/7/bits/alloc_traits.h:436
#9 0x00007ffff7af6ccc in std::_Vector_base<ldmx::SimTrackerHit, std::allocator<ldmx::SimTrackerHit> >::_M_allocate (this=0x7fffffffd520, __n=4099276460821455642)
at /usr/include/c++/7/bits/stl_vector.h:172
#10 0x00007fffef92611f in std::_Vector_base<ldmx::SimTrackerHit, std::allocator<ldmx::SimTrackerHit> >::_M_create_storage (this=0x7fffffffd520, __n=4099276460821455642)
at /usr/include/c++/7/bits/stl_vector.h:187
#11 0x00007fffef922473 in std::_Vector_base<ldmx::SimTrackerHit, std::allocator<ldmx::SimTrackerHit> >::_Vector_base (this=0x7fffffffd520, __n=4099276460821455642, __a=...)
at /usr/include/c++/7/bits/stl_vector.h:138
#12 0x00007fffef91ae81 in std::vector<ldmx::SimTrackerHit, std::allocator<ldmx::SimTrackerHit> >::vector (this=0x7fffffffd520, __x=std::vector of length -1954685473395, capacity 0 = {...})
at /usr/include/c++/7/bits/stl_vector.h:328
#13 0x00007fffef91540f in recon::OverlayProducer::produce (this=0x5555558ab7e0, event=...)
at /export/scratch/users/eichl008/ldmx/ldmx-sw/Recon/src/Recon/OverlayProducer.cxx:308
#14 0x00007ffff7b5062c in framework::Process::process (this=0x555555890860, n=0, event=...)
at /export/scratch/users/eichl008/ldmx/ldmx-sw/Framework/src/Framework/Process.cxx:402
#15 0x00007ffff7b4f1bb in framework::Process::run (this=0x555555890860)
at /export/scratch/users/eichl008/ldmx/ldmx-sw/Framework/src/Framework/Process.cxx:286
#16 0x000055555557194f in main (argc=2, argv=0x7fffffffe078)
at /export/scratch/users/eichl008/ldmx/ldmx-sw/Framework/src/Framework/fire.cxx:113
This looks like the SimTrackerHit
collection is literally overflowing the memory register with a size of n=4099276460821455642
. Going to look into how those collections are juggled, maybe there is an accidental extra loop somewhere...
Edit: It actually looks like a signed int to unsigned int issue (this=0x7fffffffd520, __x=std::vector of length -1954685473395, capacity 0 = {...})
Now I'm going to look at how the OverlayProducer handles the case where there happens to be no hits in one of the SimTrackerHit collections...
Slowing down and stepping through this process with gdb
cannot recreate the error :angry: I'm starting to worry that the allocation size is being mis-managed by ROOT.
#12
in the back-trace above is showing that the copy constructor is receiving a vector with a negative length and zero capacity. This is a very weird situation for a vector to be in...
After some googling, I was able to find how to set a breakpint for std::vector copy constructor:
break std::_Vector_base<ldmx::SimTrackerHit, std::allocator<ldmx::SimTrackerHit> >::_Vector_impl::_Vector_impl(std::allocator<ldmx::SimTrackerHit> const&)
Then I started stepping through. Below I've copied my gdb
output and highlighted some stuff with by adding +
and triggering diff
highlighting.
#3 0x00007fffef913f73 in recon::OverlayProducer::produce (this=0x5555558ab7e0, event=...)
+ at /export/scratch/users/eichl008/ldmx/ldmx-sw/Recon/src/Recon/OverlayProducer.cxx:154 # get from normal event
#4 0x00007ffff7b5062c in framework::Process::process (this=0x555555890860, n=0, event=...)
at /export/scratch/users/eichl008/ldmx/ldmx-sw/Framework/src/Framework/Process.cxx:402
#5 0x00007ffff7b4f1bb in framework::Process::run (this=0x555555890860)
at /export/scratch/users/eichl008/ldmx/ldmx-sw/Framework/src/Framework/Process.cxx:286
#6 0x000055555557194f in main (argc=2, argv=0x7fffffffe078)
at /export/scratch/users/eichl008/ldmx/ldmx-sw/Framework/src/Framework/fire.cxx:113
(gdb) c
Continuing.
Breakpoint 2, std::_Vector_base<ldmx::SimTrackerHit, std::allocator<ldmx::SimTrackerHit> >::_Vector_impl::_Vector_impl (this=0x7fffffffd520, __a=...) at /usr/include/c++/7/bits/stl_vector.h:93
93 : _Tp_alloc_type(__a), _M_start(), _M_finish(), _M_end_of_storage()
(gdb) bt
#0 std::_Vector_base<ldmx::SimTrackerHit, std::allocator<ldmx::SimTrackerHit> >::_Vector_impl::_Vector_impl (this=0x7fffffffd520, __a=...) at /usr/include/c++/7/bits/stl_vector.h:93
#1 0x00007fffef922460 in std::_Vector_base<ldmx::SimTrackerHit, std::allocator<ldmx::SimTrackerHit> >::_Vector_base (this=0x7fffffffd520, __n=16, __a=...) at /usr/include/c++/7/bits/stl_vector.h:137
#2 0x00007fffef91ae81 in std::vector<ldmx::SimTrackerHit, std::allocator<ldmx::SimTrackerHit> >::vector (this=0x7fffffffd520,
+ __x=std::vector of length 16, capacity 16 = {...}) # reasonable length and capacity
at /usr/include/c++/7/bits/stl_vector.h:328
#3 0x00007fffef91540f in recon::OverlayProducer::produce (this=0x5555558ab7e0, event=...)
+ at /export/scratch/users/eichl008/ldmx/ldmx-sw/Recon/src/Recon/OverlayProducer.cxx:308 # get from overlayEvent_
#4 0x00007ffff7b5062c in framework::Process::process (this=0x555555890860, n=0, event=...)
at /export/scratch/users/eichl008/ldmx/ldmx-sw/Framework/src/Framework/Process.cxx:402
#5 0x00007ffff7b4f1bb in framework::Process::run (this=0x555555890860)
at /export/scratch/users/eichl008/ldmx/ldmx-sw/Framework/src/Framework/Process.cxx:286
#6 0x000055555557194f in main (argc=2, argv=0x7fffffffe078)
at /export/scratch/users/eichl008/ldmx/ldmx-sw/Framework/src/Framework/fire.cxx:113
(gdb) c
Continuing.
Breakpoint 2, std::_Vector_base<ldmx::SimTrackerHit, std::allocator<ldmx::SimTrackerHit> >::_Vector_impl::_Vector_impl (this=0x7fffffffd520, __a=...) at /usr/include/c++/7/bits/stl_vector.h:93
93 : _Tp_alloc_type(__a), _M_start(), _M_finish(), _M_end_of_storage()
(gdb) bt
#0 std::_Vector_base<ldmx::SimTrackerHit, std::allocator<ldmx::SimTrackerHit> >::_Vector_impl::_Vector_impl (this=0x7fffffffd520, __a=...) at /usr/include/c++/7/bits/stl_vector.h:93
#1 0x00007fffef922460 in std::_Vector_base<ldmx::SimTrackerHit, std::allocator<ldmx::SimTrackerHit> >::_Vector_base (this=0x7fffffffd520, __n=16397105843294490108, __a=...)
at /usr/include/c++/7/bits/stl_vector.h:137
#2 0x00007fffef91ae81 in std::vector<ldmx::SimTrackerHit, std::allocator<ldmx::SimTrackerHit> >::vector (this=0x7fffffffd520,
+ __x=std::vector of length -2889106, capacity -1303127813010 = {...}) # negative length and capacity
at /usr/include/c++/7/bits/stl_vector.h:328
#3 0x00007fffef91540f in recon::OverlayProducer::produce (this=0x5555558ab7e0, event=...)
+ at /export/scratch/users/eichl008/ldmx/ldmx-sw/Recon/src/Recon/OverlayProducer.cxx:308 # SECOND get from overlayEvent_
#4 0x00007ffff7b5062c in framework::Process::process (this=0x555555890860, n=0, event=...)
at /export/scratch/users/eichl008/ldmx/ldmx-sw/Framework/src/Framework/Process.cxx:402
#5 0x00007ffff7b4f1bb in framework::Process::run (this=0x555555890860)
at /export/scratch/users/eichl008/ldmx/ldmx-sw/Framework/src/Framework/Process.cxx:286
#6 0x000055555557194f in main (argc=2, argv=0x7fffffffe078)
at /export/scratch/users/eichl008/ldmx/ldmx-sw/Framework/src/Framework/fire.cxx:113
(gdb) c
Continuing.
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
For future reference, the below break-point is much more helpful because it only breaks when the vector input to be copied has a negative size.
break /usr/include/c++/7/bits/stl_vector.h:328 if __x.size() < 0
In an incredible stroke of weirdness, switching the order of the SimTrackerHit overlay collections makes it work.
Still think there is a memory bug somewhere, got to figure it out.
Able to run with only TaggerSimHits, only RecoilSimHits, and [ RecoilSimHits, TaggerSimHits ] (in that order). Something about the transition from TaggerSimHits -> RecoilSimHits is borked.
On this bad event TaggerSimHits does have a size of 16 (a power of two :eyes: ) but this is just numerology
Ok, I've used a non-implemented struct to have the compiler printout what it deduced for auto. This confirmed that it is not doing a weird type.
Turning on debug (p.termLogLevel = 0
) allows it to run all the way through.
Looking at the vector
destructor..
Breakpoint 2, std::vector<ldmx::SimTrackerHit, std::allocator<ldmx::SimTrackerHit> >::~vector (
this=0x55556869a6e0, __in_chrg=<optimized out>) at /usr/include/c++/7/bits/stl_vector.h:435
435 _M_get_Tp_allocator()); }
(gdb) bt
#0 std::vector<ldmx::SimTrackerHit, std::allocator<ldmx::SimTrackerHit> >::~vector (
this=0x55556869a6e0, __in_chrg=<optimized out>) at /usr/include/c++/7/bits/stl_vector.h:435
#1 0x00007ffff7acc8d2 in ROOT::delete_vectorlEldmxcLcLSimTrackerHitgR (p=0x55556869a6e0)
at /export/scratch/users/eichl008/ldmx/ldmx-sw/build/Framework/EventDic.cxx:3273
#2 0x00007ffff4b1f321 in TBranchElement::ReleaseObject() () from /usr/local/lib/root/libTree.so
#3 0x00007ffff4b28c10 in TBranchElement::SetAddressImpl(void*, bool) ()
from /usr/local/lib/root/libTree.so
#4 0x00007fffef92edac in framework::Bus::Passenger<std::vector<ldmx::SimTrackerHit, std::allocator<ldmx::SimTrackerHit> > >::attach<std::vector<ldmx::SimTrackerHit, std::allocator<ldmx::SimTrackerHit> >
> (this=0x5555689b0910, t=..., tree=0x555556383160, branch_name="RecoilSimHits_test",
can_create=false)
at /export/scratch/users/eichl008/ldmx/ldmx-sw/Framework/include/Framework/Bus.h:442
#5 0x00007fffef92ec4b in framework::Bus::Passenger<std::vector<ldmx::SimTrackerHit, std::allocator<ldmx::SimTrackerHit> > >::attach (this=0x5555689b0910, tree=0x555556383160,
branch_name="RecoilSimHits_test", can_create=false)
at /export/scratch/users/eichl008/ldmx/ldmx-sw/Framework/include/Framework/Bus.h:342
#6 0x00007ffff7b1cd7e in framework::Bus::attach (this=0x5555558aba00, tree=0x555556383160,
name="RecoilSimHits_test", can_create=false)
at /export/scratch/users/eichl008/ldmx/ldmx-sw/Framework/include/Framework/Bus.h:107
#7 0x00007fffef9215de in framework::Event::getObject<std::vector<ldmx::SimTrackerHit, std::allocator<ldmx::SimTrackerHit> > > (this=0x5555558ab908, collectionName="RecoilSimHits", passName="test")
at /export/scratch/users/eichl008/ldmx/ldmx-sw/Framework/include/Framework/Event.h:299
#8 0x00007fffef91ae17 in framework::Event::getCollection<ldmx::SimTrackerHit> (
this=0x5555558ab908, collectionName="RecoilSimHits", passName="test")
at /export/scratch/users/eichl008/ldmx/ldmx-sw/Framework/include/Framework/Event.h:369
#9 0x00007fffef9153fa in recon::OverlayProducer::produce (this=0x5555558ab7e0, event=...)
at /export/scratch/users/eichl008/ldmx/ldmx-sw/Recon/src/Recon/OverlayProducer.cxx:310
#10 0x00007ffff7b5062c in framework::Process::process (this=0x555555890860, n=0, event=...)
at /export/scratch/users/eichl008/ldmx/ldmx-sw/Framework/src/Framework/Process.cxx:402
#11 0x00007ffff7b4f1bb in framework::Process::run (this=0x555555890860)
at /export/scratch/users/eichl008/ldmx/ldmx-sw/Framework/src/Framework/Process.cxx:286
#12 0x000055555557194f in main (argc=2, argv=0x7fffffffe078)
at /export/scratch/users/eichl008/ldmx/ldmx-sw/Framework/src/Framework/fire.cxx:113
(gdb) c
Continuing.
terminate called after throwing an instance of 'std::bad_alloc'
But the ROOT delete function is also called earlier and that doesn't lead to a negative size.
ReleaseObject
is called in SetAddressImpl
if fObject
is defined (it is, we define in in Bus::attach) and if the kDeleteObject
status bit is set. I have no idea why the kDeleteObject
is set, but I am going to try to turn it off.
Patching the framework::Bus
to explicitly tell ROOT to not touch our :shit: seems to resolve this issue...
Describe the bug The it_pileup validation sample failed during run number
5
almost immediately after processing began. The full workflow log is available online for the next month (ish). The relevant parts are copied below.Interestingly, this is dependent on some randomness since runs 1, 3, and 4 of the it_pileup sample were able to complete successfully.