DUNE / dune-tms

DUNE ND Temporary Muon Spectrometer

ConvertToTMSTree.exe crashes a lot #158

Closed: jdkio closed this issue 2 weeks ago

jdkio commented 3 weeks ago

Running ConvertToTMSTree.exe over 10 files from /pnfs/dune/persistent/users/abooth/nd-production/MicroProdN1p2/output/run-spill-build/MicroProdN1p2_NDLAr_1E18_RHC.spill.nu/EDEPSIM_SPILLS/0000000/0000100 results in 3 crashes. At least it did on 9/12, with a slightly modified version of main: most of the modifications have since gone into PR #152, and the rest only printed extra info. With the current main I'm now getting a higher failure rate.

Try this and see what happens:

# Convert runs 100 to 110 one at a time; stdout and stderr of each run go to its own log
for run in $(seq 100 110); do
  echo "Running run ${run}:"
  ConvertToTMSTree.exe /pnfs/dune/persistent/users/abooth/nd-production/MicroProdN1p2/output/run-spill-build/MicroProdN1p2_NDLAr_1E18_RHC.spill.nu/EDEPSIM_SPILLS/0000000/0000100/MicroProdN1p2_NDLAr_1E18_RHC.spill.nu.0000${run}.EDEPSIM_SPILLS.root &> regular_run_${run}.log
done

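
To put a number on the failure rate, the loop above could be wrapped so that it records each exit status. This is only a sketch: `count_failures` and the `LOG_DIR` variable are hypothetical names, not part of dune-tms, and it assumes that any non-zero exit status (for example after an abort or a segmentation fault) counts as a crash.

```shell
# Hypothetical helper: run a command once per input file and report how many runs failed.
# Usage: count_failures <command> <file> [<file> ...]
# Logs go to $LOG_DIR (defaults to the current directory).
count_failures() {
  local cmd="$1"
  shift
  local log_dir="${LOG_DIR:-.}"
  local total=0
  local failed=0
  local status=0
  local f
  for f in "$@"; do
    total=$((total + 1))
    # Keep stdout and stderr of every run in its own log file
    "$cmd" "$f" > "${log_dir}/run_${total}.log" 2>&1
    status=$?
    if [ "$status" -ne 0 ]; then
      failed=$((failed + 1))
      echo "FAILED (exit ${status}): ${f}"
    fi
  done
  echo "${failed}/${total} runs failed"
}
```

It would be called with the executable followed by the list of input files, e.g. `count_failures ConvertToTMSTree.exe <input files>`.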
jdkio commented 3 weeks ago

This seems to be an AL9 issue: I'm getting a 100% crash rate on AL9. The SL7 version also crashes sometimes (which we have to work on), but not 100% of the time.

jdkio commented 3 weeks ago

Here's the gdb backtrace for one of the errors:

#0  0x00007ffff4a8b94c in __pthread_kill_implementation () from /lib64/libc.so.6
#1  0x00007ffff4a3e646 in raise () from /lib64/libc.so.6
#2  0x00007ffff4a287f3 in abort () from /lib64/libc.so.6
#3  0x00007ffff4a29130 in __libc_message.cold () from /lib64/libc.so.6
#4  0x00007ffff4a959f7 in malloc_printerr () from /lib64/libc.so.6
#5  0x00007ffff4a9755c in _int_free () from /lib64/libc.so.6
#6  0x00007ffff4a99d35 in free () from /lib64/libc.so.6
#7  0x00007ffff7e67652 in TG4Event::~TG4Event (this=0x7ffffff9f4e0, __in_chrg=<optimized out>) at /tmp/losulliv/spack-stage/spack-stage-edep-sim-3.2.0-fnwu3lpodqtb6zh6eqko6uaxtg5fajo5/spack-src/io/TG4Event.cxx:7
#8  0x000000000040e6a5 in ConvertToTMSTree (filename=..., output_filename=...) at ConvertToTMSTree.cpp:143
#9  0x000000000040919d in main (argc=<optimized out>, argv=<optimized out>) at ConvertToTMSTree.cpp:292
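
The frames above show glibc aborting inside free() (via malloc_printerr), reached from the TG4Event destructor at ConvertToTMSTree.cpp:143. That usually means glibc found an invalid or already damaged heap chunk at the moment of freeing, so the frame that aborts is not necessarily where the damage was done. The thread does not say how this trace was taken; one way to capture such a trace non-interactively is gdb's batch mode. In the sketch below, `run_with_backtrace` and the `GDB` override variable are hypothetical names.

```shell
# Hypothetical helper: run a program under gdb in batch mode and print a
# backtrace once it stops (for example on SIGABRT or SIGSEGV).
# GDB can be overridden, e.g. GDB=echo to print the gdb arguments as a dry run.
# Usage: run_with_backtrace <program> [<args> ...]
run_with_backtrace() {
  "${GDB:-gdb}" -batch -ex run -ex bt --args "$@"
}
```

For example, `run_with_backtrace ConvertToTMSTree.exe <input file> > bt.log 2>&1` would leave the program output and the backtrace in `bt.log`. If the program exits normally, gdb has no stack to print.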
jdkio commented 3 weeks ago

For SL7, there's a wider variety of errors.

ConvertToTMSTree.exe /pnfs/dune/persistent/users/abooth/nd-production/MicroProdN1p2/output/run-spill-build/MicroProdN1p2_NDLAr_1E18_RHC.spill.nu/EDEPSIM_SPILLS/0000000/0000100/MicroProdN1p2_NDLAr_1E18_RHC.spill.nu.0000123.EDEPSIM_SPILLS.root

 *** Break *** segmentation violation

#6  0x000000000040f041 in std::_Destroy<TVector3> (__pointer=0x12d42920) at /cvmfs/larsoft.opensciencegrid.org/products/gcc/v9_3_0/Linux64bit+3.10-2.17/include/c++/9.3.0/bits/stl_construct.h:98
#7  std::_Destroy_aux<false>::__destroy<TVector3*> (__last=<optimized out>, __first=0x12d42920) at /cvmfs/larsoft.opensciencegrid.org/products/gcc/v9_3_0/Linux64bit+3.10-2.17/include/c++/9.3.0/bits/stl_construct.h:108
#8  std::_Destroy<TVector3*> (__last=<optimized out>, __first=<optimized out>) at /cvmfs/larsoft.opensciencegrid.org/products/gcc/v9_3_0/Linux64bit+3.10-2.17/include/c++/9.3.0/bits/stl_construct.h:137
#9  std::_Destroy<TVector3*, TVector3> (__last=0x12d429e8, __first=<optimized out>) at /cvmfs/larsoft.opensciencegrid.org/products/gcc/v9_3_0/Linux64bit+3.10-2.17/include/c++/9.3.0/bits/stl_construct.h:206
#10 std::vector<TVector3, std::allocator<TVector3> >::~vector (this=0x1387fa08, __in_chrg=<optimized out>) at /cvmfs/larsoft.opensciencegrid.org/products/gcc/v9_3_0/Linux64bit+3.10-2.17/include/c++/9.3.0/bits/stl_vector.h:677
#11 TMS_TrueParticle::~TMS_TrueParticle (this=0x1387f9d8, __in_chrg=<optimized out>) at ../src/TMS_TrueParticle.h:20
#12 std::_Destroy<TMS_TrueParticle> (__pointer=0x1387f9d8) at /cvmfs/larsoft.opensciencegrid.org/products/gcc/v9_3_0/Linux64bit+3.10-2.17/include/c++/9.3.0/bits/stl_construct.h:98
#13 std::_Destroy_aux<false>::__destroy<TMS_TrueParticle*> (__last=<optimized out>, __first=0x1387f9d8) at /cvmfs/larsoft.opensciencegrid.org/products/gcc/v9_3_0/Linux64bit+3.10-2.17/include/c++/9.3.0/bits/stl_construct.h:108
#14 std::_Destroy<TMS_TrueParticle*> (__last=<optimized out>, __first=<optimized out>) at /cvmfs/larsoft.opensciencegrid.org/products/gcc/v9_3_0/Linux64bit+3.10-2.17/include/c++/9.3.0/bits/stl_construct.h:137
#15 std::_Destroy<TMS_TrueParticle*, TMS_TrueParticle> (__last=0x13887dd0, __first=<optimized out>) at /cvmfs/larsoft.opensciencegrid.org/products/gcc/v9_3_0/Linux64bit+3.10-2.17/include/c++/9.3.0/bits/stl_construct.h:206
#16 std::vector<TMS_TrueParticle, std::allocator<TMS_TrueParticle> >::~vector (this=0x7ffc3649e820, __in_chrg=<optimized out>) at /cvmfs/larsoft.opensciencegrid.org/products/gcc/v9_3_0/Linux64bit+3.10-2.17/include/c++/9.3.0/bits/stl_vector.h:677
#17 TMS_Event::~TMS_Event (this=0x7ffc3649e800, __in_chrg=<optimized out>) at ../src/TMS_Event.h:19
#18 0x000000000040acb3 in ConvertToTMSTree (filename=..., output_filename=...) at ConvertToTMSTree.cpp:216
#19 0x00000000004083e6 in main (argc=<optimized out>, argv=<optimized out>) at ConvertToTMSTree.cpp:292

ConvertToTMSTree.exe /pnfs/dune/persistent/users/abooth/nd-production/MicroProdN1p2/output/run-spill-build/MicroProdN1p2_NDLAr_1E18_RHC.spill.nu/EDEPSIM_SPILLS/0000000/0000100/MicroProdN1p2_NDLAr_1E18_RHC.spill.nu.0000101.EDEPSIM_SPILLS.root

*** Error in `ConvertToTMSTree.exe': free(): invalid next size (normal): 0x00000000112b7240 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x81329)[0x7f78d2e97329]
/cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux64bit+3.10-2.17-e20-p392-prof/lib/libCore.so(_ZN7TBufferD1Ev+0x2f)[0x7f78d91b540f]
/cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux64bit+3.10-2.17-e20-p392-prof/lib/libRIO.so(_ZN11TBufferFileD0Ev+0x12)[0x7f78d86da372]
/cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux64bit+3.10-2.17-e20-p392-prof/lib/libTree.so(_ZN7TBranchD2Ev+0x268)[0x7f78d66b63b8]
/cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux64bit+3.10-2.17-e20-p392-prof/lib/libTree.so(_ZN14TBranchElementD0Ev+0x12)[0x7f78d66c26e2]
/cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux64bit+3.10-2.17-e20-p392-prof/lib/libCore.so(_ZN9TObjArray6DeleteEPKc+0x74)[0x7f78d925da84]
/cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux64bit+3.10-2.17-e20-p392-prof/lib/libTree.so(_ZN7TBranchD2Ev+0x164)[0x7f78d66b62b4]
/cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux64bit+3.10-2.17-e20-p392-prof/lib/libTree.so(_ZN14TBranchElementD0Ev+0x12)[0x7f78d66c26e2]
/cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux64bit+3.10-2.17-e20-p392-prof/lib/libCore.so(_ZN9TObjArray6DeleteEPKc+0x74)[0x7f78d925da84]
/cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux64bit+3.10-2.17-e20-p392-prof/lib/libTree.so(_ZN7TBranchD2Ev+0x164)[0x7f78d66b62b4]
/cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux64bit+3.10-2.17-e20-p392-prof/lib/libTree.so(_ZN14TBranchElementD0Ev+0x12)[0x7f78d66c26e2]
/cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux64bit+3.10-2.17-e20-p392-prof/lib/libCore.so(_ZN9TObjArray6DeleteEPKc+0x74)[0x7f78d925da84]
/cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux64bit+3.10-2.17-e20-p392-prof/lib/libTree.so(_ZN5TTreeD2Ev+0x1f1)[0x7f78d672d2f1]
/cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux64bit+3.10-2.17-e20-p392-prof/lib/libTree.so(_ZN5TTreeD0Ev+0x12)[0x7f78d672d902]
ConvertToTMSTree.exe(_Z16ConvertToTMSTreeNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES4_+0x310e)[0x40b78e]
ConvertToTMSTree.exe(main+0xc6)[0x4083e6]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f78d2e38555]
ConvertToTMSTree.exe[0x40852d]
======= Memory map: ========
00400000-00419000 r-xp 00000000 00:3a 3298814962713                      /exp/dune/app/users/kleykamp/tms_validation/dune-tms/bin/ConvertT
oTMSTree.exe
00619000-0061a000 r--p 00019000 00:3a 3298814962713                      /exp/dune/app/users/kleykamp/tms_validation/dune-tms/bin/ConvertT
oTMSTree.exe
0061a000-0061b000 rw-p 0001a000 00:3a 3298814962713                      /exp/dune/app/users/kleykamp/tms_validation/dune-tms/bin/ConvertT
oTMSTree.exe
0061b000-01d2e000 rw-p 00000000 00:00 0 
03a14000-1967f000 rw-p 00000000 00:00 0                                  [heap]
7f78b0000000-7f78b0021000 rw-p 00000000 00:00 0 
7f78b0021000-7f78b4000000 ---p 00000000 00:00 0 
7f78b69d7000-7f78b8674000 rw-p 00000000 00:00 0 
7f78bb052000-7f78bb074000 r-xp 00000000 00:9a 18879                      /usr/lib64/ld-2.17.so
7f78bb074000-7f78bb273000 ---p 00022000 00:9a 18879                      /usr/lib64/ld-2.17.so
7f78bb273000-7f78bb274000 r--p 00021000 00:9a 18879                      /usr/lib64/ld-2.17.so
7f78bb274000-7f78bb275000 rw-p 00022000 00:9a 18879                      /usr/lib64/ld-2.17.so
7f78bb275000-7f78bb477000 rw-p 00000000 00:00 0 
7f78bb477000-7f78bb539000 r--p 00000000 00:6e 15047868                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/SessionViewer.pcm
7f78bb539000-7f78bb660000 r--p 00000000 00:6e 15047941                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/ROOTFitPanelv7.pcm
7f78bb660000-7f78bb714000 r--p 00000000 00:6e 15047900                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/ROOTGraphicsPrimitives.pcm
7f78bb714000-7f78bb799000 r--p 00000000 00:6e 15047874                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/ROOTTPython.pcm
7f78bb799000-7f78bb835000 r--p 00000000 00:6e 15047924                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/Gviz3d.pcm
7f78bb835000-7f78bb8e2000 r--p 00000000 00:6e 15047919                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/ProofBench.pcm
7f78bb8e2000-7f78bba27000 r--p 00000000 00:6e 15047912                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/Gdml.pcm
7f78bba27000-7f78bbaba000 r--p 00000000 00:6e 15047699                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/RootAuth.pcm
7f78bbaba000-7f78bbc00000 rw-p 00000000 00:00 0 
7f78bbc00000-7f78bbed3000 r--p 00000000 00:6e 15047907                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/Minuit2.pcm
7f78bbed8000-7f78bbf60000 r--p 00000000 00:6e 15047986                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/ASImageGui.pcm
7f78bbf60000-7f78bc000000 r--p 00000000 00:6e 15047865                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/NetxNG.pcm
7f78bc000000-7f78bc31a000 r--p 00000000 00:6e 15047883                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/ROOTHist.pcm
7f78bc375000-7f78bc400000 r--p 00000000 00:6e 15047943                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/SpectrumPainter.pcm
7f78bc400000-7f78bc811000 r--p 00000000 00:6e 15047935                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/ROOTHistDraw.pcm
7f78bc859000-7f78bc8eb000 r--p 00000000 00:6e 15047876                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/ASImage.pcm
7f78bc8eb000-7f78bc97b000 r--p 00000000 00:6e 15047859                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/GX11.pcm
7f78bc97b000-7f78bca00000 r--p 00000000 00:6e 15047884                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/GX11TTF.pcm
7f78bca00000-7f78bd0a7000 r--p 00000000 00:6e 15047881                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/ROOTGpadv7.pcm
7f78bd11f000-7f78bd1d9000 r--p 00000000 00:6e 15047862                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/SQLIO.pcm
7f78bd1d9000-7f78bd27e000 r--p 00000000 00:6e 15047913                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/GuiBld.pcm
7f78bd27e000-7f78bd400000 r--p 00000000 00:6e 15047871                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/WebGui6.pcm
7f78bd400000-7f78bd971000 r--p 00000000 00:6e 15047942                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/ROOTEve.pcm
7f78bd973000-7f78bda00000 r--p 00000000 00:6e 15047877                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/FFTW.pcm
7f78bda00000-7f78bddb1000 r--p 00000000 00:6e 15047947                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/RooStats.pcm
7f78bde00000-7f78bed94000 r--p 00000000 00:6e 15047931                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/RooFitCore.pcm
7f78bee00000-7f78bfbe0000 r--p 00000000 00:6e 15050471                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/RooFit.pcm
7f78bfc00000-7f78bff54000 r--p 00000000 00:6e 15047960                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/HistFactory.pcm
7f78bffc2000-7f78c00b9000 r--p 00000000 00:6e 15047898                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/GeomBuilder.pcm
7f78c00b9000-7f78c0600000 rw-p 00000000 00:00 0 
7f78c0600000-7f78c0b28000 r--p 00000000 00:6e 15047995                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/RGL.pcm
7f78c0b5a000-7f78c0c00000 r--p 00000000 00:6e 15047946                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/Recorder.pcm
7f78c0c00000-7f78c1162000 r--p 00000000 00:6e 15047928                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/Eve.pcm
7f78c1172000-7f78c1200000 r--p 00000000 00:6e 15047978                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/Hbook.pcm
7f78c1200000-7f78c1550000 r--p 00000000 00:6e 15047984                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/Gui.pcm
7f78c1577000-7f78c1600000 r--p 00000000 00:6e 15047980                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/SPlot.pcm
7f78c1600000-7f78c1d08000 r--p 00000000 00:6e 15047886                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/ROOTNTuple.pcm
7f78c1d61000-7f78c1e00000 r--p 00000000 00:6e 15047923                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/EGPythia6.pcm
7f78c1e00000-7f78c20a6000 r--p 00000000 00:6e 15047901                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/ROOTVecOps.pcm
7f78c20b1000-7f78c2200000 r--p 00000000 00:6e 15047789                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/MathMore.pcm
7f78c2200000-7f78c2da1000 r--p 00000000 00:6e 15048099                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/ROOTDataFrame.pcm
7f78c2e00000-7f78c3ea8000 r--p 00000000 00:6e 15048079                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/TMVA.pcm
7f78c3ede000-7f78c3f67000 r--p 00000000 00:6e 15047882                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/X3d.pcm
7f78c3f67000-7f78c4000000 r--p 00000000 00:6e 15047891                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/RooFitMore.pcm
7f78c4000000-7f78c435b000 r--p 00000000 00:6e 15047952                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/Geom.pcm
7f78c4374000-7f78c4400000 r--p 00000000 00:6e 15047908                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/Fumili.pcm
7f78c4400000-7f78c4848000 r--p 00000000 00:6e 15048021                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/TreePlayer.pcm
7f78c4861000-7f78c4a00000 r--p 00000000 00:6e 15047976                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/ROOTWebDisplay.pcm
7f78c4a00000-7f78c4c5e000 r--p 00000000 00:6e 15047424                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/Tree.pcm
7f78c4c69000-7f78c4e00000 r--p 00000000 00:6e 15047970                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/Html.pcm
7f78c4e00000-7f78c50da000 r--p 00000000 00:6e 15047863                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/Smatrix.pcm
7f78c50ed000-7f78c5200000 r--p 00000000 00:6e 15047927                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/ROOTBrowserv7.pcm
7f78c5200000-7f78c54d3000 r--p 00000000 00:6e 15047870                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/GenVector.pcm
7f78c54d8000-7f78c5559000 r--p 00000000 00:6e 15047902                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/Rint.pcm
7f78c5559000-7f78c5600000 r--p 00000000 00:6e 15047926                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/Postscript.pcm
7f78c5600000-7f78c5b26000 r--p 00000000 00:6e 15047861                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/Hist.pcm
7f78c5b51000-7f78c5c00000 r--p 00000000 00:6e 15047959                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/TMVAGui.pcm
7f78c5c00000-7f78c63ef000 r--p 00000000 00:6e 15047915                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/MathCore.pcm
7f78c6400000-7f78c6715000 r--p 00000000 00:6e 15047847                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/RIO.pcm
7f78c6719000-7f78c67ca000 r--p 00000000 00:6e 15047937                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/GuiHtml.pcm
7f78c67ca000-7f78c6876000 r--p 00000000 00:6e 15047852                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/Spectrum.pcm
7f78c6876000-7f78c6919000 r--p 00000000 00:6e 15047714                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/Foam.pcm
7f78c6919000-7f78c6a00000 r--p 00000000 00:6e 15047728                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/PgSQL.pcm
7f78c6a00000-7f78c73e2000 r--p 00000000 00:6e 15047892                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/Core.pcm
7f78c7400000-7f78c7812000 r--p 00000000 00:6e 15047867                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/_Builtin_intrinsics.pcm
7f78c782a000-7f78c78f4000 r--p 00000000 00:6e 15047873                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/TreeViewer.pcm
7f78c78f4000-7f78c7a00000 r--p 00000000 00:6e 15047889                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/Ged.pcm
7f78c7a00000-7f78c8e5e000 r--p 00000000 00:6e 15047910                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/std.pcm
7f78c8eda000-7f78c8fce000 r--p 00000000 00:6e 15047904                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/ProofDraw.pcm
7f78c8fce000-7f78c8ff3000 r-xp 00000000 00:9a 22616                      /usr/lib64/libtinfo.so.5.9
7f78c8ff3000-7f78c91f3000 ---p 00025000 00:9a 22616                      /usr/lib64/libtinfo.so.5.9
7f78c91f3000-7f78c91f7000 r--p 00025000 00:9a 22616                      /usr/lib64/libtinfo.so.5.9
7f78c91f7000-7f78c91f8000 rw-p 00029000 00:9a 22616                      /usr/lib64/libtinfo.so.5.9
7f78c91f8000-7f78c91ff000 r-xp 00000000 00:9a 20798                      /usr/lib64/librt-2.17.so
7f78c91ff000-7f78c93fe000 ---p 00007000 00:9a 20798                      /usr/lib64/librt-2.17.so
7f78c93fe000-7f78c93ff000 r--p 00006000 00:9a 20798                      /usr/lib64/librt-2.17.so
7f78c93ff000-7f78c9400000 rw-p 00007000 00:9a 20798                      /usr/lib64/librt-2.17.so
7f78c9400000-7f78ccec1000 r-xp 00000000 00:6e 15053249                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/libCling.so
7f78ccec1000-7f78cd0c1000 ---p 03ac1000 00:6e 15053249                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/libCling.so
7f78cd0c1000-7f78cd302000 r--p 03ac1000 00:6e 15053249                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/libCling.so
7f78cd302000-7f78cd312000 rw-p 03d02000 00:6e 15053249                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/libCling.so
7f78cd312000-7f78cd340000 rw-p 00000000 00:00 0 
7f78cd36a000-7f78cd47d000 r--p 00000000 00:6e 15047869                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/RMySQL.pcm
7f78cd47d000-7f78cd489000 r-xp 00000000 00:9a 17981                      /usr/lib64/libnss_files-2.17.so
7f78cd489000-7f78cd688000 ---p 0000c000 00:9a 17981                      /usr/lib64/libnss_files-2.17.so
7f78cd688000-7f78cd689000 r--p 0000b000 00:9a 17981                      /usr/lib64/libnss_files-2.17.so
7f78cd689000-7f78cd68a000 rw-p 0000c000 00:9a 17981                      /usr/lib64/libnss_files-2.17.so
7f78cd68a000-7f78cd690000 rw-p 00000000 00:00 0 
7f78cd690000-7f78cd6b4000 r-xp 00000000 00:9a 20875                      /usr/lib64/libselinux.so.1
7f78cd6b4000-7f78cd8b3000 ---p 00024000 00:9a 20875                      /usr/lib64/libselinux.so.1
7f78cd8b3000-7f78cd8b4000 r--p 00023000 00:9a 20875                      /usr/lib64/libselinux.so.1
7f78cd8b4000-7f78cd8b5000 rw-p 00024000 00:9a 20875                      /usr/lib64/libselinux.so.1
7f78cd8b5000-7f78cd8b7000 rw-p 00000000 00:00 0 
7f78cd8b7000-7f78cd8cd000 r-xp 00000000 00:9a 18012                      /usr/lib64/libresolv-2.17.so
7f78cd8cd000-7f78cdacd000 ---p 00016000 00:9a 18012                      /usr/lib64/libresolv-2.17.so
7f78cdacd000-7f78cdace000 r--p 00016000 00:9a 18012                      /usr/lib64/libresolv-2.17.so
7f78cdace000-7f78cdacf000 rw-p 00017000 00:9a 18012                      /usr/lib64/libresolv-2.17.so
7f78cdacf000-7f78cdad1000 rw-p 00000000 00:00 0 
7f78cdad1000-7f78cdad4000 r-xp 00000000 00:9a 19762                      /usr/lib64/libkeyutils.so.1.5
7f78cdad4000-7f78cdcd3000 ---p 00003000 00:9a 19762                      /usr/lib64/libkeyutils.so.1.5
7f78cdcd3000-7f78cdcd4000 r--p 00002000 00:9a 19762                      /usr/lib64/libkeyutils.so.1.5
7f78cdcd4000-7f78cdcd5000 rw-p 00003000 00:9a 19762                      /usr/lib64/libkeyutils.so.1.5
7f78cdcd5000-7f78cdce3000 r-xp 00000000 00:9a 24854                      /usr/lib64/libkrb5support.so.0.1
7f78cdce3000-7f78cdee3000 ---p 0000e000 00:9a 24854                      /usr/lib64/libkrb5support.so.0.1
7f78cdee3000-7f78cdee4000 r--p 0000e000 00:9a 24854                      /usr/lib64/libkrb5support.so.0.1
7f78cdee4000-7f78cdee5000 rw-p 0000f000 00:9a 24854                      /usr/lib64/libkrb5support.so.0.1
7f78cdee5000-7f78cdee7000 r-xp 00000000 00:9a 22098                      /usr/lib64/libXau.so.6.0.0
7f78cdee7000-7f78ce0e7000 ---p 00002000 00:9a 22098                      /usr/lib64/libXau.so.6.0.0
7f78ce0e7000-7f78ce0e8000 r--p 00002000 00:9a 22098                      /usr/lib64/libXau.so.6.0.0
7f78ce0e8000-7f78ce0e9000 rw-p 00003000 00:9a 22098                      /usr/lib64/libXau.so.6.0.0
7f78ce0e9000-7f78ce0f8000 r-xp 00000000 00:9a 22896                      /usr/lib64/libbz2.so.1.0.6
7f78ce0f8000-7f78ce2f7000 ---p 0000f000 00:9a 22896                      /usr/lib64/libbz2.so.1.0.6
7f78ce2f7000-7f78ce2f8000 r--p 0000e000 00:9a 22896                      /usr/lib64/libbz2.so.1.0.6
7f78ce2f8000-7f78ce2f9000 rw-p 0000f000 00:9a 22896                      /usr/lib64/libbz2.so.1.0.6
7f78ce2f9000-7f78ce32a000 r-xp 00000000 00:9a 20873                      /usr/lib64/libk5crypto.so.3.1
7f78ce32a000-7f78ce529000 ---p 00031000 00:9a 20873                      /usr/lib64/libk5crypto.so.3.1
7f78ce529000-7f78ce52b000 r--p 00030000 00:9a 20873                      /usr/lib64/libk5crypto.so.3.1
7f78ce52b000-7f78ce52c000 rw-p 00032000 00:9a 20873                      /usr/lib64/libk5crypto.so.3.1
7f78ce52c000-7f78ce52f000 r-xp 00000000 00:9a 18839                      /usr/lib64/libcom_err.so.2.1
7f78ce52f000-7f78ce72e000 ---p 00003000 00:9a 18839                      /usr/lib64/libcom_err.so.2.1
7f78ce72e000-7f78ce72f000 r--p 00002000 00:9a 18839                      /usr/lib64/libcom_err.so.2.1
7f78ce72f000-7f78ce730000 rw-p 00003000 00:9a 18839                      /usr/lib64/libcom_err.so.2.1
7f78ce730000-7f78ce809000 r-xp 00000000 00:9a 18263                      /usr/lib64/libkrb5.so.3.3
7f78ce809000-7f78cea08000 ---p 000d9000 00:9a 18263                      /usr/lib64/libkrb5.so.3.3
7f78cea08000-7f78cea16000 r--p 000d8000 00:9a 18263                      /usr/lib64/libkrb5.so.3.3
7f78cea16000-7f78cea19000 rw-p 000e6000 00:9a 18263                      /usr/lib64/libkrb5.so.3.3
7f78cea19000-7f78cea63000 r-xp 00000000 00:9a 21848                      /usr/lib64/libgssapi_krb5.so.2.2
7f78cea63000-7f78cec63000 ---p 0004a000 00:9a 21848                      /usr/lib64/libgssapi_krb5.so.2.2
7f78cec63000-7f78cec64000 r--p 0004a000 00:9a 21848                      /usr/lib64/libgssapi_krb5.so.2.2
7f78cec64000-7f78cec66000 rw-p 0004b000 00:9a 21848                      /usr/lib64/libgssapi_krb5.so.2.2
7f78cec66000-7f78cec8d000 r-xp 00000000 00:9a 18906                      /usr/lib64/libxcb.so.1.1.0
7f78cec8d000-7f78cee8c000 ---p 00027000 00:9a 18906                      /usr/lib64/libxcb.so.1.1.0
7f78cee8c000-7f78cee8d000 r--p 00026000 00:9a 18906                      /usr/lib64/libxcb.so.1.1.0
7f78cee8d000-7f78cee8e000 rw-p 00027000 00:9a 18906                      /usr/lib64/libxcb.so.1.1.0
7f78cee8e000-7f78cee92000 r-xp 00000000 00:9a 25081                      /usr/lib64/libuuid.so.1.3.0
7f78cee92000-7f78cf091000 ---p 00004000 00:9a 25081                      /usr/lib64/libuuid.so.1.3.0
7f78cf091000-7f78cf092000 r--p 00003000 00:9a 25081                      /usr/lib64/libuuid.so.1.3.0
7f78cf092000-7f78cf093000 rw-p 00004000 00:9a 25081                      /usr/lib64/libuuid.so.1.3.0
7f78cf093000-7f78cf0bc000 r-xp 00000000 00:9a 20853                      /usr/lib64/libpng15.so.15.13.0
7f78cf0bc000-7f78cf2bc000 ---p 00029000 00:9a 20853                      /usr/lib64/libpng15.so.15.13.0
7f78cf2bc000-7f78cf2bd000 r--p 00029000 00:9a 20853                      /usr/lib64/libpng15.so.15.13.0
7f78cf2bd000-7f78cf2be000 rw-p 0002a000 00:9a 20853                      /usr/lib64/libpng15.so.15.13.0
7f78cf2be000-7f78cf32c000 r-xp 00000000 00:9a 22210                      /usr/lib64/libGLdispatch.so.0.0.0
7f78cf32c000-7f78cf52b000 ---p 0006e000 00:9a 22210                      /usr/lib64/libGLdispatch.so.0.0.0
7f78cf52b000-7f78cf553000 r--p 0006d000 00:9a 22210                      /usr/lib64/libGLdispatch.so.0.0.0
7f78cf553000-7f78cf554000 rw-p 00095000 00:9a 22210                      /usr/lib64/libGLdispatch.so.0.0.0
7f78cf554000-7f78cf574000 rw-p 00000000 00:00 0 
7f78cf574000-7f78cf5e5000 r-xp 00000000 00:9a 19243                      /usr/lib64/libGL.so.1.7.0
7f78cf5e5000-7f78cf7e4000 ---p 00071000 00:9a 19243                      /usr/lib64/libGL.so.1.7.0
7f78cf7e4000-7f78cf7fe000 r--p 00070000 00:9a 19243                      /usr/lib64/libGL.so.1.7.0
7f78cf7fe000-7f78cf7ff000 rw-p 0008a000 00:9a 19243                      /usr/lib64/libGL.so.1.7.0
7f78cf7ff000-7f78cf800000 rw-p 00000000 00:00 0 
7f78cf800000-7f78cf861000 r-xp 00000000 00:6e 15048124                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/libROOTNTuple.so
7f78cf861000-7f78cfa61000 ---p 00061000 00:6e 15048124                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/libROOTNTuple.so
7f78cfa61000-7f78cfa63000 r--p 00061000 00:6e 15048124                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/libROOTNTuple.so
7f78cfa63000-7f78cfa64000 rw-p 00063000 00:6e 15048124                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux6
4bit+3.10-2.17-e20-p392-prof/lib/libROOTNTuple.so
7f78cfa64000-7f78cfa65000 rw-p 00000000 00:00 0 
7f78cfa6c000-7f78cfb23000 r-xp 00000000 00:9a 23082                      /usr/lib64/libfreetype.so.6.14.0
7f78cfb23000-7f78cfd23000 ---p 000b7000 00:9a 23082                      /usr/lib64/libfreetype.so.6.14.0
7f78cfd23000-7f78cfd2a000 r--p 000b7000 00:9a 23082                      /usr/lib64/libfreetype.so.6.14.0
7f78cfd2a000-7f78cfd2b000 rw-p 000be000 00:9a 23082                      /usr/lib64/libfreetype.so.6.14.0
7f78cfd2b000-7f78cff62000 r-xp 00000000 00:9a 27602                      /usr/lib64/libcrypto.so.1.0.2k
7f78cff62000-7f78d0161000 ---p 00237000 00:9a 27602                      /usr/lib64/libcrypto.so.1.0.2k
7f78d0161000-7f78d017d000 r--p 00236000 00:9a 27602                      /usr/lib64/libcrypto.so.1.0.2k
7f78d017d000-7f78d018a000 rw-p 00252000 00:9a 27602                      /usr/lib64/libcrypto.so.1.0.2k
7f78d018a000-7f78d018e000 rw-p 00000000 00:00 0 
7f78d018e000-7f78d01f5000 r-xp 00000000 00:9a 22029                      /usr/lib64/libssl.so.1.0.2k
7f78d01f5000-7f78d03f5000 ---p 00067000 00:9a 22029                      /usr/lib64/libssl.so.1.0.2k
7f78d03f5000-7f78d03f9000 r--p 00067000 00:9a 22029                      /usr/lib64/libssl.so.1.0.2k
7f78d03f9000-7f78d0400000 rw-p 0006b000 00:9a 22029                      /usr/lib64/libssl.so.1.0.2k
7f78d0400000-7f78d043b000 r-xp 00000000 00:6e 15069358                   /cvmfs/larsoft.opensciencegrid.org/products/tbb/v2021_1_1/Linux64bit+3.10-2.17-e20/lib/libtbb.so.12.1
7f78d043b000-7f78d063b000 ---p 0003b000 00:6e 15069358                   /cvmfs/larsoft.opensciencegrid.org/products/tbb/v2021_1_1/Linux64bit+3.10-2.17-e20/lib/libtbb.so.12.1
7f78d063b000-7f78d063c000 r--p 0003b000 00:6e 15069358                   /cvmfs/larsoft.opensciencegrid.org/products/tbb/v2021_1_1/Linux64bit+3.10-2.17-e20/lib/libtbb.so.12.1
7f78d063c000-7f78d063f000 rw-p 0003c000 00:6e 15069358                   /cvmfs/larsoft.opensciencegrid.org/products/tbb/v2021_1_1/Linux64bit+3.10-2.17-e20/lib/libtbb.so.12.1
7f78d063f000-7f78d0641000 rw-p 00000000 00:00 0 
7f78d0656000-7f78d0702000 r--p 00000000 00:6e 15047987                   /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux64bit+3.10-2.17-e20-p392-prof/lib/EG.pcm
7f78d0702000-7f78d07bc000 r-xp 00000000 00:9a 22263                      /usr/lib64/libzstd.so.1.5.5
7f78d07bc000-7f78d09bb000 ---p 000ba000 00:9a 22263                      /usr/lib64/libzstd.so.1.5.5
7f78d09bb000-7f78d09bc000 r--p 000b9000 00:9a 22263                      /usr/lib64/libzstd.so.1.5.5
7f78d09bc000-7f78d09bd000 rw-p 000ba000 00:9a 22263                      /usr/lib64/libzstd.so.1.5.5
7f78d09bd000-7f78d09d2000 r-xp 00000000 00:9a 21846                      /usr/lib64/libz.so.1.2.7
7f78d09d2000-7f78d0bd1000 ---p 00015000 00:9a 21846                      /usr/lib64/libz.so.1.2.7
7f78d0bd1000-7f78d0bd2000 r--p 00014000 00:9a 21846                      /usr/lib64/libz.so.1.2.7
7f78d0bd2000-7f78d0bd3000 rw-p 00015000 00:9a 21846                      /usr/lib64/libz.so.1.2.7
7f78d0bd3000-7f78d0be1000 r-xp 00000000 00:9a 27344                      /usr/lib64/liblz4.so.1.8.3
7f78d0be1000-7f78d0de0000 ---p 0000e000 00:9a 27344                      /usr/lib64/liblz4.so.1.8.3
7f78d0de0000-7f78d0de1000 r--p 0000d000 00:9a 27344                      /usr/lib64/liblz4.so.1.8.3
7f78d0de1000-7f78d0de2000 rw-p 0000e000 00:9a 27344                      /usr/lib64/liblz4.so.1.8.3
Aborted (core dumped)

jdkio commented 3 weeks ago

I think part of these issues are when values are saved using a pass by reference. Like https://github.com/DUNE/dune-tms/blob/main/src/TMS_TrueParticle.h#L69-L78

jdkio commented 3 weeks ago

But modifying TrueParticle to not have those didn't fix it

LiamOS commented 3 weeks ago

Have you had a chance to try an earlier commit? (e.g. 62232f6ab4b312d6ab62af5f676091cee41e5a84) It's hard to believe we didn't break something in the last while if it stopped working on AL9 and SL7 simultaneously.

I don't have that much time this week to focus on this. If we have to revert and split the desired commits into more manageable and reviewable parts, we can do that. I can still update CAFMaker even if this side is still reviewing and merging.

jdkio commented 3 weeks ago

Switched to liam_lmao. Writing notes here, feel free to head straight to my confirmation of the issue below

Starting with AL9, just to see. First cherry picked the setup fix.

On AL9, it finishes all the way with a bunch of "Not all hits in separated hit groups!" warnings.

Overlaying 15 events
Sliced event 1951 into 10 slices
Not all hits in separated hit groups!
Not all hits in separated hit groups!

Then it crashes

#5  0x00007f8ad077d060 in TBufferFile::WriteFastArray(int const*, long long) () from /exp/dune/data/users/losulliv/september-spack/opt/spack/linux-almalinux9-x86_64_v3/gcc-11.4.1/root-6.30.08-mtmdkre34xhzbn3qya6uqf3vo6jcobjx/lib/root/libRIO.so.6.30
#6  0x00007f8ad0a2c782 in int TStreamerInfo::WriteBufferAux<char**>(TBuffer&, char** const&, TStreamerInfo::TCompInfo* const*, int, int, int, int, int) () from /exp/dune/data/users/losulliv/september-spack/opt/spack/linux-almalinux9-x86_64_v3/gcc-11.4.1/root-6.30.08-mtmdkre34xhzbn3qya6uqf3vo6jcobjx/lib/root/libRIO.so.6.30
#7  0x00007f8ad085ae3d in TStreamerInfoActions::GenericWriteAction(TBuffer&, void*, TStreamerInfoActions::TConfiguration const*) () from /exp/dune/data/users/losulliv/september-spack/opt/spack/linux-almalinux9-x86_64_v3/gcc-11.4.1/root-6.30.08-mtmdkre34xhzbn3qya6uqf3vo6jcobjx/lib/root/libRIO.so.6.30
#8  0x00007f8ad078443d in TBufferFile::WriteClassBuffer(TClass const*, void*) () from /exp/dune/data/users/losulliv/september-spack/opt/spack/linux-almalinux9-x86_64_v3/gcc-11.4.1/root-6.30.08-mtmdkre34xhzbn3qya6uqf3vo6jcobjx/lib/root/libRIO.so.6.30
#9  0x00007f8acfc903ee in TBranch::Streamer(TBuffer&) () from /exp/dune/data/users/losulliv/september-spack/opt/spack/linux-almalinux9-x86_64_v3/gcc-11.4.1/root-6.30.08-mtmdkre34xhzbn3qya6uqf3vo6jcobjx/lib/root/libTree.so.6.30
#10 0x00007f8ad078373b in TBufferFile::WriteObjectClass(void const*, TClass const*, bool) () from /exp/dune/data/users/losulliv/september-spack/opt/spack/linux-almalinux9-x86_64_v3/gcc-11.4.1/root-6.30.08-mtmdkre34xhzbn3qya6uqf3vo6jcobjx/lib/root/libRIO.so.6.30
#11 0x00007f8ad078b4f4 in TBufferIO::WriteObjectAny(void const*, TClass const*, bool) () from /exp/dune/data/users/losulliv/september-spack/opt/spack/linux-almalinux9-x86_64_v3/gcc-11.4.1/root-6.30.08-mtmdkre34xhzbn3qya6uqf3vo6jcobjx/lib/root/libRIO.so.6.30
#12 0x00007f8ad0d925ed in TObjArray::Streamer(TBuffer&) () from /exp/dune/data/users/losulliv/september-spack/opt/spack/linux-almalinux9-x86_64_v3/gcc-11.4.1/root-6.30.08-mtmdkre34xhzbn3qya6uqf3vo6jcobjx/lib/root/libCore.so.6.30
#13 0x00007f8ad077e994 in TBufferFile::WriteFastArray(void*, TClass const*, long long, TMemberStreamer*) () from /exp/dune/data/users/losulliv/september-spack/opt/spack/linux-almalinux9-x86_64_v3/gcc-11.4.1/root-6.30.08-mtmdkre34xhzbn3qya6uqf3vo6jcobjx/lib/root/libRIO.so.6.30
#14 0x00007f8ad0a2ae12 in int TStreamerInfo::WriteBufferAux<char**>(TBuffer&, char** const&, TStreamerInfo::TCompInfo* const*, int, int, int, int, int) () from /exp/dune/data/users/losulliv/september-spack/opt/spack/linux-almalinux9-x86_64_v3/gcc-11.4.1/root-6.30.08-mtmdkre34xhzbn3qya6uqf3vo6jcobjx/lib/root/libRIO.so.6.30
#15 0x00007f8ad085ae3d in TStreamerInfoActions::GenericWriteAction(TBuffer&, void*, TStreamerInfoActions::TConfiguration const*) () from /exp/dune/data/users/losulliv/september-spack/opt/spack/linux-almalinux9-x86_64_v3/gcc-11.4.1/root-6.30.08-mtmdkre34xhzbn3qya6uqf3vo6jcobjx/lib/root/libRIO.so.6.30
#16 0x00007f8ad078443d in TBufferFile::WriteClassBuffer(TClass const*, void*) () from /exp/dune/data/users/losulliv/september-spack/opt/spack/linux-almalinux9-x86_64_v3/gcc-11.4.1/root-6.30.08-mtmdkre34xhzbn3qya6uqf3vo6jcobjx/lib/root/libRIO.so.6.30
#17 0x00007f8ad08344ac in TKey::TKey(TObject const*, char const*, int, TDirectory*) () from /exp/dune/data/users/losulliv/september-spack/opt/spack/linux-almalinux9-x86_64_v3/gcc-11.4.1/root-6.30.08-mtmdkre34xhzbn3qya6uqf3vo6jcobjx/lib/root/libRIO.so.6.30
#18 0x00007f8ad07e8ea5 in TFile::CreateKey(TDirectory*, TObject const*, char const*, int) () from /exp/dune/data/users/losulliv/september-spack/opt/spack/linux-almalinux9-x86_64_v3/gcc-11.4.1/root-6.30.08-mtmdkre34xhzbn3qya6uqf3vo6jcobjx/lib/root/libRIO.so.6.30
#19 0x00007f8ad07da2fe in TDirectoryFile::WriteTObject(TObject const*, char const*, char const*, int) () from /exp/dune/data/users/losulliv/september-spack/opt/spack/linux-almalinux9-x86_64_v3/gcc-11.4.1/root-6.30.08-mtmdkre34xhzbn3qya6uqf3vo6jcobjx/lib/root/libRIO.so.6.30
#20 0x00007f8ad0d0fb15 in TObject::Write(char const*, int, int) const () from /exp/dune/data/users/losulliv/september-spack/opt/spack/linux-almalinux9-x86_64_v3/gcc-11.4.1/root-6.30.08-mtmdkre34xhzbn3qya6uqf3vo6jcobjx/lib/root/libCore.so.6.30
#21 0x000000000040ed07 in TMS_TreeWriter::Write (this=0xbd4de0 <TMS_TreeWriter::GetWriter()::Instance>) at ../src/TMS_TreeWriter.h:37
#22 ConvertToTMSTree (filename=..., output_filename=...) at ConvertToTMSTree.cpp:243
#23 0x000000000040921d in main (argc=<optimized out>, argv=<optimized out>) at ConvertToTMSTree.cpp:272

Reading the output file gives Error in <TFile::ReadKeys>: reading illegal key, exiting after 2 keys. It only has the Reco_Tree and Truth_Info trees, and they only have 1000 entries. This is from the auto save, which is set at 1000. Branch_Lines (aka Line_Candidates) was never filled during the fill command; it's commented out: //Branch_Lines->Fill(); And there are fewer than 1000 spills, so Truth_Spill is empty too.

Branch_Lines was removed by Liam to fix one of the crashes I guess. So this makes me think that Branch_Lines has some sort of issue and the first time it tries to write to disk, it crashes. By commenting it out, it stops crashing every 1000 entries and instead it crashes later.

Adding back Branch_Lines doesn't crash sooner. Instead, it still crashes at the end with

double free or corruption (!prev)
Aborted (core dumped)

Running it through gdb was not very useful

Event loop took 74.9501s for 1952 entries (0.0383966 s/entries)
TMS_TreeWriter wrote output to MicroProdN1p2_NDLAr_1E18_RHC.spill.nu.0000003.EDEPSIM_SPILLS_TMS_RecoCandidates_Hough_Cluster1.root
TMS_ReadoutTreeWriter wrote output to MicroProdN1p2_NDLAr_1E18_RHC.spill.nu.0000003.EDEPSIM_SPILLS_TMS_Readout.root
double free or corruption (!prev)

Program received signal SIGABRT, Aborted.
0x00007ffff4a8b94c in __pthread_kill_implementation () from /lib64/libc.so.6
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.34-100.el9_4.3.x86_64 libglvnd-1.3.4-1.el9.x86_64 libglvnd-glx-1.3.4-1.el9.x86_64 libglvnd-opengl-1.3.4-1.el9.x86_64 nss-altfiles-2.18.1-20.el9.x86_64 openssl-libs-3.0.7-27.el9.x86_64
(gdb) bt
#0  0x00007ffff4a8b94c in __pthread_kill_implementation () from /lib64/libc.so.6
#1  0x00007ffff4a3e646 in raise () from /lib64/libc.so.6
#2  0x00007ffff4a287f3 in abort () from /lib64/libc.so.6
#3  0x00007ffff4a29130 in __libc_message.cold () from /lib64/libc.so.6
#4  0x00007ffff4a959f7 in malloc_printerr () from /lib64/libc.so.6
#5  0x00007ffff4a976ec in _int_free () from /lib64/libc.so.6
#6  0x00007ffff4a99d35 in free () from /lib64/libc.so.6
#7  0x00007ffff4a412a7 in __cxa_finalize () from /lib64/libc.so.6
#8  0x00007ffff3940707 in __do_global_dtors_aux ()
   from /cvmfs/larsoft.opensciencegrid.org/spack-packages/opt/spack/linux-almalinux9-x86_64_v2/gcc-12.2.0/root-6.28.06-jhpj2jsdlwoxbvpnwmxvzkntrxcgw5of/lib/root/libROOTDataFrame.so.6.28
#9  0x00007ffffffe2290 in ?? ()
#10 0x00007ffff7fcbe2e in _dl_fini () at dl-fini.c:142
Backtrace stopped: frame did not save the PC

Branch_Lines->Fill() is being run before all variables are set. That may be a problem if any of them are in Branch_Lines. Moving it to the end of the function with Reco_Tree->Fill()

Stopping at 150 events gives very strange behavior. It'll randomly crash at various points of the code. After a few tries, it now only crashes with double free or corruption (!prev) at the end of the 150 events. But I saw one crash at 19 events, and one at ~100 events. I think I've seen something like this before. I think there's some uninitialized variable. If it's somewhat valid, the crash happens one way, and if it's not, then it crashes sooner.

Switching to SL7, setting up and making. Running a single event in AL9 crashed, but doing it in SL7 doesn't. Doing 150 events gives:

Processed 26/150 (17.3333%)
*** Error in `ConvertToTMSTree.exe': free(): invalid next size (normal): 0x00000000107f4120 ***
======= Backtrace: =========
...
/cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux64bit+3.10-2.17-e20-p392-prof/lib/libTree.so(_ZN5TTree8GetEntryExi+0xbc)[0x7f546a92a31c]
ConvertToTMSTree.exe(_Z16ConvertToTMSTreeNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES4_+0x867)[0x408de7]
ConvertToTMSTree.exe(main+0xc6)[0x4082e6]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f5467038555] 
ConvertToTMSTree.exe[0x40842d]

addr2line -e ConvertToTMSTree.exe 0x408de7 -> ConvertToTMSTree.cpp:114, which is if (gRoo), but doing it again in gdb and running bt, I get

#24 0x0000000000408de7 in ConvertToTMSTree (filename=..., output_filename=...) at ConvertToTMSTree.cpp:112
#25 0x00000000004082e6 in main (argc=<optimized out>, argv=<optimized out>) at ConvertToTMSTree.cpp:272

Line 112 is events->GetEntry(i);. This is the behavior I saw previously; it's trying to load the next entry but the TG4Event is corrupted from previous stuff. Here's some more info from the bt

#10 0x00007ffff7822490 in std::_Destroy<TG4Trajectory> (__pointer=0xe528880)
    at /cvmfs/larsoft.opensciencegrid.org/products/gcc/v9_3_0/Linux64bit+3.10-2.17/include/c++/9.3.0/bits/stl_construct.h:98
#11 std::_Destroy_aux<false>::__destroy<TG4Trajectory*> (__last=<optimized out>, __first=0xe528880)
    at /cvmfs/larsoft.opensciencegrid.org/products/gcc/v9_3_0/Linux64bit+3.10-2.17/include/c++/9.3.0/bits/stl_construct.h:108
#12 std::_Destroy<TG4Trajectory*> (__last=<optimized out>, __first=<optimized out>)
    at /cvmfs/larsoft.opensciencegrid.org/products/gcc/v9_3_0/Linux64bit+3.10-2.17/include/c++/9.3.0/bits/stl_construct.h:137
#13 std::_Destroy<TG4Trajectory*, TG4Trajectory> (__last=<optimized out>, __first=<optimized out>)
    at /cvmfs/larsoft.opensciencegrid.org/products/gcc/v9_3_0/Linux64bit+3.10-2.17/include/c++/9.3.0/bits/stl_construct.h:206

TLorentzVector is also saved in TMS_Event for neutrino & lepton location and momentum info. After TMS_Events are added for the overlay, these might be getting deleted. I'm going to start by removing them. Yes, that fixes things. I don't get a crash anymore. Changing to run over all entries to make sure, and yes, it doesn't crash.

Solution

So yes, TLorentzVector (and possibly TVector3) are not deep copied. So when we delete an object that has one, it will crash when TG4Event wants to delete its TG4Trajectory. It'll cause a double free crash, sometimes a bus error, simple seg faults, and possibly others

TMS_Events are deleted (including their TLorentzVectors) after we combine several of them. Then when the next TG4Event is loaded, it crashes because the TLorentzVectors were corrupted. This isn't super consistent but it happens. Deleting all TLorentzVector info in them gets rid of the issue. Another fix that would allow us to keep the info is to add a destructor that clears those vectors. This fix worked for TMS_TrueParticle
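
One way to rule out lifetime problems of this kind is to have each event own plain copies of the values it needs, rather than anything tied to a buffer that is refilled on the next read. The following is a minimal sketch under stated assumptions: FourVector, ReadBuffer, and OwnedEvent are hypothetical stand-ins (for TLorentzVector, the TG4Event refilled by GetEntry, and TMS_Event respectively), not dune-tms code, and whether this addresses the root cause here is not established.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for TLorentzVector: four plain doubles.
struct FourVector {
  double x, y, z, t;
};

// Hypothetical stand-in for the input event that is refilled in place
// each time the next entry is read.
struct ReadBuffer {
  std::vector<FourVector> positions;
};

// Event that owns copies of what it needs instead of pointing into the buffer.
class OwnedEvent {
 public:
  explicit OwnedEvent(const ReadBuffer& buffer)
      : positions_(buffer.positions) {}  // element-wise copy, no shared storage

  const FourVector& Position(std::size_t i) const {
    assert(i < positions_.size());  // catch bad indices in debug builds
    return positions_[i];
  }
  std::size_t Size() const { return positions_.size(); }

 private:
  std::vector<FourVector> positions_;
};
```

With this layout, overwriting or clearing the buffer after constructing an OwnedEvent leaves the event's values untouched, and deleting the event cannot free anything the buffer still owns.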

I'm running on 20 files to test a larger sample (so far I'm 5/5). If that works, I'll try using a TMS_Event destructor instead so we can keep the neutrino info. Either way, I'll try to commit something tonight

jdkio commented 3 weeks ago

Continuation of the notes. See the end section

I ran over 20 files. For non-zero exit codes, I moved the output to /exp/dune/data/users/kleykamp/dune-tms/2024-09-19_test_liam_lmao. Logs are stored in logs. This is running over the current microprod /pnfs/dune/persistent/users/abooth/nd-production/MicroProdN1p2/output/run-spill-build/MicroProdN1p2_NDLAr_1E18_RHC.spill.nu/EDEPSIM_SPILLS/0000000/0000000/

Total files processed: 20
Nonzero exit codes: 3
Average time per command: 78 seconds

Much better but still not great. It crashes for runs 15 (exit code 134), 17 (139), and 18 (134).

Unfortunately, the errors for 15 and 18 don't make it into the logs. Here's what I see on the screen. They are slightly different:
15: *** Error in `ConvertToTMSTree.exe': double free or corruption (!prev): 0x000000000f820f40 *** (exit code 134)
18: *** Error in `ConvertToTMSTree.exe': corrupted size vs. prev_size: 0x000000000ff1b5e0 ***

15 full error

Here's everything but the memory map. I added the first two lines from the log file

Processed 21/1969 (1.06653%)
timeout: the monitored command dumped core
*** Error in `ConvertToTMSTree.exe': double free or corruption (!prev): 0x000000000f820f40 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x81329)[0x7fbfbe497329]
/cvmfs/dune.opensciencegrid.org/products/dune/./edepsim/v3_2_0/Linux64bit+3.10-2.17-e20-prof/lib/libedepsim_io.so(_ZN13TG4TrajectoryD1Ev+0xf8)[0x7fbfc7013b18]
/cvmfs/dune.opensciencegrid.org/products/dune/./edepsim/v3_2_0/Linux64bit+3.10-2.17-e20-prof/lib/libedepsim_io.so(_ZN4ROOT6Detail20TCollectionProxyInfo8PushbackISt6vectorI13TG4TrajectorySaIS4_EEE6resizeEPvm+0x60)[0x7fbfc7022490]
/cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux64bit+3.10-2.17-e20-p392-prof/lib/libRIO.so(_ZN19TGenCollectionProxy8AllocateEjb+0x13b)[0x7fbfc3d7cf3b]
/cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux64bit+3.10-2.17-e20-p392-prof/lib/libTree.so(_ZN14TBranchElement20ReadLeavesCollectionER7TBuffer+0x1bb)[0x7fbfc1cbf84b]
/cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux64bit+3.10-2.17-e20-p392-prof/lib/libTree.so(_ZN7TBranch8GetEntryExi+0xe1)[0x7fbfc1cb39e1]
/cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux64bit+3.10-2.17-e20-p392-prof/lib/libTree.so(_ZN14TBranchElement8GetEntryExi+0x455)[0x7fbfc1cce8d5]
/cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux64bit+3.10-2.17-e20-p392-prof/lib/libTree.so(_ZN14TBranchElement8GetEntryExi+0x116)[0x7fbfc1cce596]
/cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux64bit+3.10-2.17-e20-p392-prof/lib/libTree.so(_ZN5TTree8GetEntryExi+0xbc)[0x7fbfc1d2a31c]
ConvertToTMSTree.exe(_Z16ConvertToTMSTreeNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES4_+0x7f7)[0x408c47]
ConvertToTMSTree.exe(main+0xc6)[0x4081b6]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fbfbe438555]
ConvertToTMSTree.exe[0x4082fd]
addr2line -e ConvertToTMSTree.exe 0x408c47
/exp/dune/app/users/kleykamp/tms_liam_lmao/dune-tms/app/ConvertToTMSTree.cpp:114

This maps to if (gRoo) again. So my guess is this is another TLorentzVector/TVector3-type error that we haven't solved.

18 full error

Event loop took 79.406s for 2056 entries (0.0386216 s/entries)
TMS_TreeWriter wrote output to MicroProdN1p2_NDLAr_1E18_RHC.spill.nu.0000018.EDEPSIM_SPILLS_TMS_RecoCandidates_Hough_Cluster1.root
timeout: the monitored command dumped core
*** Error in `ConvertToTMSTree.exe': corrupted size vs. prev_size: 0x000000000ff1b5e0 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x80c37)[0x7f74e7696c37]
/lib64/libc.so.6(+0x8120e)[0x7f74e769720e]
/cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux64bit+3.10-2.17-e20-p392-prof/lib/libCore.so(_ZN9TObjArray6DeleteEPKc+0x74)[0x7f74eda5da84]
/cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux64bit+3.10-2.17-e20-p392-prof/lib/libTree.so(_ZN7TBranchD2Ev+0xcf)[0x7f74eaeb621f]
/cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux64bit+3.10-2.17-e20-p392-prof/lib/libTree.so(_ZN7TBranchD0Ev+0x12)[0x7f74eaeb6472]
/cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux64bit+3.10-2.17-e20-p392-prof/lib/libCore.so(_ZN9TObjArray6DeleteEPKc+0x74)[0x7f74eda5da84]
/cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux64bit+3.10-2.17-e20-p392-prof/lib/libTree.so(_ZN5TTreeD2Ev+0x1f1)[0x7f74eaf2d2f1]
/cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux64bit+3.10-2.17-e20-p392-prof/lib/libTree.so(_ZN5TTreeD0Ev+0x12)[0x7f74eaf2d902]
/cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux64bit+3.10-2.17-e20-p392-prof/lib/libCore.so(_ZN5TList6DeleteEPKc+0xba8)[0x7f74eda57a48]
/cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux64bit+3.10-2.17-e20-p392-prof/lib/libCore.so(_ZN9THashList6DeleteEPKc+0x83)[0x7f74eda4d9f3]
/cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux64bit+3.10-2.17-e20-p392-prof/lib/libRIO.so(_ZN14TDirectoryFile5CloseEPKc+0xac)[0x7f74ecf3764c]
/cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux64bit+3.10-2.17-e20-p392-prof/lib/libRIO.so(_ZN5TFile5CloseEPKc+0x15b)[0x7f74ecf5490b]
ConvertToTMSTree.exe(_Z16ConvertToTMSTreeNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES4_+0x36c0)[0x40bb10]
ConvertToTMSTree.exe(main+0xc6)[0x4081b6]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f74e7638555] 
ConvertToTMSTree.exe[0x4082fd]

So the crash happens after the files are completely written.

addr2line -e ConvertToTMSTree.exe 0x40bb10
/exp/dune/app/users/kleykamp/tms_liam_lmao/dune-tms/app/../src/TMS_ReadoutTreeWriter.h:18
-> static TMS_ReadoutTreeWriter Instance;

Very confused. Running it through gdb instead. Yes, this makes more sense:

#18 TDirectoryFile::Close (this=this@entry=0x1603d270, option=<optimized out>, option@entry=0x415862 "")
    at /scratch/workspace/canvas-products-all/vedge-/SLF7/e20-prof/build/root/v6_22_08d/source/root-6.22.08/io/io/src/TDirectoryFile.cxx:547
#19 0x00007ffff455490b in TFile::Close (this=0x1603d270, option=0x415862 "")
    at /scratch/workspace/canvas-products-all/vedge-/SLF7/e20-prof/build/root/v6_22_08d/source/root-6.22.08/io/io/src/TFile.cxx:912
#20 0x000000000040bb10 in TMS_TreeWriter::Write (this=0xdd1e20 <TMS_TreeWriter::GetWriter()::Instance>) at ../src/TMS_TreeWriter.h:42
#21 ConvertToTMSTree (filename=..., output_filename=...) at ConvertToTMSTree.cpp:243
#22 0x00000000004081b6 in main (argc=<optimized out>, argv=<optimized out>) at ConvertToTMSTree.cpp:272

Not super clear. Added the ability to do make sanitize, which checks for address errors. But this may be unrelated to the issue above:

Processed 19/2056 (0.924125%)
Overlaying 19 events
Sliced event 19 into 8 slices
Using Hough for main track finding reconstruction
Using AStar to clean up HoughTrack? 1
Using DBSCAN for clustering after main track finding? 1
=================================================================
==6997==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x6210009700a0 at pc 0x7fb5f38def48 bp 0x7ffdfaba40c0 sp 0x7ffdfaba40b8
READ of size 4 at 0x6210009700a0 thread T0
    #0 0x7fb5f38def47 in TMS_TrackFinder::Accumulate(double, double) /exp/dune/app/users/kleykamp/tms_liam_lmao/dune-tms/src/TMS_Reco.cpp:3400
    #1 0x7fb5f38df38e in TMS_TrackFinder::GetHoughLine(std::vector<TMS_Hit, std::allocator<TMS_Hit> > const&, double&, double&) /exp/dune/app/users/kleykamp/tms_liam_lmao/dune-tms/src/TMS_Reco.cpp:3358
    #2 0x7fb5f3930775 in TMS_TrackFinder::RunHough(std::vector<TMS_Hit, std::allocator<TMS_Hit> > const&, char const&) /exp/dune/app/users/kleykamp/tms_liam_lmao/dune-tms/src/TMS_Reco.cpp:1923
    #3 0x7fb5f393619b in TMS_TrackFinder::HoughTransform(std::vector<TMS_Hit, std::allocator<TMS_Hit> > const&, char const&) /exp/dune/app/users/kleykamp/tms_liam_lmao/dune-tms/src/TMS_Reco.cpp:1566
    #4 0x7fb5f39472ed in TMS_TrackFinder::FindTracks(TMS_Event&) /exp/dune/app/users/kleykamp/tms_liam_lmao/dune-tms/src/TMS_Reco.cpp:419
    #5 0x40d79c in ConvertToTMSTree(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) /exp/dune/app/users/kleykamp/tms_liam_lmao/dune-tms/app/ConvertToTMSTree.cpp:226
    #6 0x4098b8 in main /exp/dune/app/users/kleykamp/tms_liam_lmao/dune-tms/app/ConvertToTMSTree.cpp:272
    #7 0x7fb5ea838554 in __libc_start_main (/lib64/libc.so.6+0x22554)
    #8 0x409d1b  (/exp/dune/app/users/kleykamp/tms_liam_lmao/dune-tms/bin/ConvertToTMSTree.exe+0x409d1b)

0x6210009700a0 is located 0 bytes to the right of 4000-byte region [0x62100096f100,0x6210009700a0)
allocated by thread T0 here:
    #0 0x7fb5f3f08d2f in operator new[](unsigned long) ../../.././libsanitizer/asan/asan_new_delete.cc:107
    #1 0x7fb5f38eb572 in TMS_TrackFinder::TMS_TrackFinder() /exp/dune/app/users/kleykamp/tms_liam_lmao/dune-tms/src/TMS_Reco.cpp:32

SUMMARY: AddressSanitizer: heap-buffer-overflow /exp/dune/app/users/kleykamp/tms_liam_lmao/dune-tms/src/TMS_Reco.cpp:3400 in TMS_TrackFinder::Accumulate(double, double)
Shadow bytes around the buggy address:
  0x0c4280125fc0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c4280125fd0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c4280125fe0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c4280125ff0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c4280126000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x0c4280126010: 00 00 00 00[fa]fa fa fa fa fa fa fa fa fa fa fa
  0x0c4280126020: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c4280126030: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c4280126040: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c4280126050: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c4280126060: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==6997==ABORTING

TMS_Reco:3400 is Accumulator[i][c_bin]++;. I'm pretty sure c_bin is either above nIntercept or < 0. The existing check isn't set up to print for values < 0, and it only acts on c_bin > nIntercept, not c_bin == nIntercept, which would also be an error.

c: 50000
m: -5
i: 0
cbin: 1000
From config:
    MinInter = -50.0E3
    MaxInter = 50.0E3
    NSlope = 1000
    NInter = 1000

Anyway, did if (i > 0 && c_bin > 0 && i < nSlope && c_bin < nIntercept) Accumulator[i][c_bin]++ and now I get other errors. It does show that cbin can be negative though:

c: -76519.7
m: 4.99
i: 999
cbin: -265

It seems that we're able to get crazy values for c. That may be because my change broke things or because we can get really large slopes or something. Not sure. Added another check to make sure c_bin is within bounds (i should be within bounds).
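
With the config values above, each intercept bin is (50.0E3 - (-50.0E3)) / 1000 = 100 wide, so c = 50000 lands on index 1000, one past the last valid index 999. That would match the ASan report of a read 0 bytes past a 4000-byte region, if each row holds 1000 4-byte counters. A minimal sketch of a bin lookup that rejects both ends, assuming uniform bins; InterceptBin and its parameters are hypothetical names, and the real binning in TMS_Reco.cpp may differ:

```cpp
#include <cassert>
#include <cmath>

// Maps an intercept c onto a bin index in [0, nBins).
// Returns false (and leaves bin untouched) when c falls outside [minC, maxC).
// Assumption: uniform bins between minC and maxC.
bool InterceptBin(double c, double minC, double maxC, int nBins, int& bin) {
  assert(nBins > 0 && maxC > minC);
  const double width = (maxC - minC) / nBins;
  // floor, not an int cast: a cast truncates toward zero, so values just
  // below minC would otherwise land in bin 0.
  const double raw = std::floor((c - minC) / width);
  // Written as a negated range test so that NaN is rejected too.
  // The upper bound is exclusive: raw == nBins is out of range.
  if (!(raw >= 0.0 && raw < nBins)) return false;
  bin = static_cast<int>(raw);
  return true;
}
```

The lower bound is inclusive (>= 0) because index 0 is a valid bin; a strict > 0 test would silently drop every entry in the first bin.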

I don't think the new crash is related:

==11981==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x61a00071b80c at pc 0x7f2a22e7bd43 bp 0x7ffe9c921af0 sp 0x7ffe9c921ae8
READ of size 4 at 0x61a00071b80c thread T0
    #0 0x7f2a22e7bd42 in TObject::TObject(TObject const&) /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux64bit+3.10-2.17-e20-p392-prof/include/TObject.h:261
    #1 0x7f2a22e7cbbe in TLorentzVector::TLorentzVector(TLorentzVector const&) /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux64bit+3.10-2.17-e20-p392-prof/include/TLorentzVector.h:609
    #2 0x7f2a22e5aac0 in TMS_Event::GetMuonTrueTrackLength() /exp/dune/app/users/kleykamp/tms_liam_lmao/dune-tms/src/TMS_Event.cpp:1038
    #3 0x7f2a22f9ce26 in TMS_TreeWriter::Fill(TMS_Event&) /exp/dune/app/users/kleykamp/tms_liam_lmao/dune-tms/src/TMS_TreeWriter.cpp:483
    #4 0x40d7e9 in ConvertToTMSTree(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) /exp/dune/app/users/kleykamp/tms_liam_lmao/dune-tms/app/ConvertToTMSTree.cpp:236
    #5 0x4098b8 in main /exp/dune/app/users/kleykamp/tms_liam_lmao/dune-tms/app/ConvertToTMSTree.cpp:272
    #6 0x7f2a19e38554 in __libc_start_main (/lib64/libc.so.6+0x22554)
    #7 0x409d1b  (/exp/dune/app/users/kleykamp/tms_liam_lmao/dune-tms/bin/ConvertToTMSTree.exe+0x409d1b)

0x61a00071b80c is located 12 bytes to the right of 1408-byte region [0x61a00071b280,0x61a00071b800)
allocated by thread T0 here:
    #0 0x7f2a23508b5f in operator new(unsigned long) ../../.././libsanitizer/asan/asan_new_delete.cc:104
    #1 0x7f2a22e59b27 in __gnu_cxx::new_allocator<TLorentzVector>::allocate(unsigned long, void const*) /cvmfs/larsoft.opensciencegrid.org/products/gcc/v9_3_0/Linux64bit+3.10-2.17/include/c++/9.3.0/ext/new_allocator.h:114
    #2 0x7f2a22e59b27 in std::allocator_traits<std::allocator<TLorentzVector> >::allocate(std::allocator<TLorentzVector>&, unsigned long) /cvmfs/larsoft.opensciencegrid.org/products/gcc/v9_3_0/Linux64bit+3.10-2.17/include/c++/9.3.0/bits/alloc_traits.h:444
    #3 0x7f2a22e59b27 in std::_Vector_base<TLorentzVector, std::allocator<TLorentzVector> >::_M_allocate(unsigned long) /cvmfs/larsoft.opensciencegrid.org/products/gcc/v9_3_0/Linux64bit+3.10-2.17/include/c++/9.3.0/bits/stl_vector.h:343
    #4 0x7f2a22e59b27 in std::_Vector_base<TLorentzVector, std::allocator<TLorentzVector> >::_M_create_storage(unsigned long) /cvmfs/larsoft.opensciencegrid.org/products/gcc/v9_3_0/Linux64bit+3.10-2.17/include/c++/9.3.0/bits/stl_vector.h:358
    #5 0x7f2a22e59b27 in std::_Vector_base<TLorentzVector, std::allocator<TLorentzVector> >::_Vector_base(unsigned long, std::allocator<TLorentzVector> const&) /cvmfs/larsoft.opensciencegrid.org/products/gcc/v9_3_0/Linux64bit+3.10-2.17/include/c++/9.3.0/bits/stl_vector.h:302
    #6 0x7f2a22e59b27 in std::vector<TLorentzVector, std::allocator<TLorentzVector> >::vector(std::vector<TLorentzVector, std::allocator<TLorentzVector> > const&) /cvmfs/larsoft.opensciencegrid.org/products/gcc/v9_3_0/Linux64bit+3.10-2.17/include/c++/9.3.0/bits/stl_vector.h:552
    #7 0x7f2a22e59b27 in TMS_Event::GetMuonTrueTrackLength() /exp/dune/app/users/kleykamp/tms_liam_lmao/dune-tms/src/TMS_Event.cpp:1035

SUMMARY: AddressSanitizer: heap-buffer-overflow /cvmfs/larsoft.opensciencegrid.org/products/root/v6_22_08d/Linux64bit+3.10-2.17-e20-p392-prof/include/TObject.h:261 in TObject::TObject(TObject const&)

It's another annoying TLorentzVector issue (burn them all)
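
The report above places the bad read 12 bytes past the end of a 1408-byte copy of a std::vector<TLorentzVector>, which would be consistent with indexing one element past the end (for example in a loop over consecutive point pairs), though the code at TMS_Event.cpp:1038 is not shown here. A minimal sketch of a pairwise loop with a safe bound, assuming the function sums distances between consecutive points; Point, At, and PathLength are hypothetical names, not dune-tms code:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a trajectory point.
struct Point {
  double x, y, z;
};

// Checked accessor: flags an out-of-range index in debug builds,
// even when the program is not built with a sanitizer.
const Point& At(const std::vector<Point>& points, std::size_t i) {
  assert(i < points.size());
  return points[i];
}

// Sums the straight-line distance between consecutive points.
// The bound is written as i + 1 < size() so that the last pair read is
// (size - 2, size - 1) and an empty vector does not underflow the
// unsigned size.
double PathLength(const std::vector<Point>& points) {
  double total = 0.0;
  for (std::size_t i = 0; i + 1 < points.size(); ++i) {
    const Point& a = At(points, i);
    const Point& b = At(points, i + 1);
    const double dx = b.x - a.x;
    const double dy = b.y - a.y;
    const double dz = b.z - a.z;
    total += std::sqrt(dx * dx + dy * dy + dz * dz);
  }
  return total;
}
```

A bound such as i < size() or i <= size() - 1 combined with an access to element i + 1 would read exactly one element past the end, which is the pattern worth checking for first.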

17 full error

Processed 19/1948 (0.975359%)
Overlaying 19 events
Sliced event 19 into 10 slices
Using Hough for main track finding reconstruction
Using AStar to clean up HoughTrack? 1
Using DBSCAN for clustering after main track finding? 1
 *** Break *** segmentation violation
#6  0x00007f2ff31ddcb8 in main_arena () from /lib64/libc.so.6
#7  0x00007f2ffbe6a457 in TMS_TrackFinder::FindTracks (this=this@entry=0x1d2ad80 <TMS_TrackFinder::GetFinder()::Instance>, event=...) at TMS_Reco.cpp:515
#8  0x0000000000409643 in ConvertToTMSTree (filename=..., output_filename=...) at ../src/TMS_Reco.h:194
#9  0x00000000004081b6 in main (argc=<optimized out>, argv=<optimized out>) at ConvertToTMSTree.cpp:272

There were 19 interactions that were overlaid into 1 spill. So the seg fault happens when it runs reco on the first slice of the first spill. Line 515 of TMS_Reco is where it reads houghline:

    std::pair<bool, TF1*> houghline = HoughLinesV[linenoV];
    double slope, intercept = 0;
    GetHoughLine(Lines, slope, intercept);
    if (fabs(houghline.second->GetParameter(0) - intercept) > 1E2 ||
              fabs(houghline.second->GetParameter(1) - slope) > 1E-2) {

My guess is the TF1* is not set, but I haven't checked. @AsaNehm, maybe you can fix this one.
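
If the unset-pointer guess is right, a guard before the dereference would turn the seg fault into a skipped comparison. A minimal sketch, assuming a null pointer is the failure mode (unconfirmed); Line is a hypothetical stand-in for TF1 and LineDiffers is an illustrative helper, not dune-tms code:

```cpp
#include <cassert>
#include <cmath>
#include <utility>

// Hypothetical stand-in for TF1: a straight line y = p0 + p1 * x.
struct Line {
  double p0;  // intercept
  double p1;  // slope
  double GetParameter(int i) const {
    assert(i == 0 || i == 1);  // only two parameters exist
    return i == 0 ? p0 : p1;
  }
};

// Returns true only when a fitted line exists and differs from the
// reference slope/intercept by more than the tolerances in the snippet above.
bool LineDiffers(const std::pair<bool, const Line*>& houghline,
                 double slope, double intercept) {
  if (houghline.second == nullptr) return false;  // nothing to compare against
  return std::fabs(houghline.second->GetParameter(0) - intercept) > 1E2 ||
         std::fabs(houghline.second->GetParameter(1) - slope) > 1E-2;
}
```

Separately, double slope, intercept = 0; initializes only intercept. slope stays uninitialized until GetHoughLine assigns it, which is harmless only if GetHoughLine always does so.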

Overall status

I've committed what I have. I didn't get around to adding ~TMS_Event modelled on ~TMS_TrueParticle. I think that's a better solution since it would preserve the neutrino truth info. I've added make sanitize which works really well in spotting errors related to array overflows and stuff. This also found a TLorentzVector error in GetMuonTrueTrackLength which I didn't address.

In TMS_Reco, the accumulator can definitely get values outside the range of the bins. I fixed it by forcing the values to be within the bounds, but we need to think about what the right behavior is. I'm not sure what would happen if we chose not to fill for values outside the bounds and then all hits were outside the bounds. In that case, would the accumulator be empty and then crash?
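
The alternative to clamping can be sketched as follows: skip out-of-range entries, count how many were actually filled, and make the caller check for an empty accumulator before asking for a peak. This is an illustration under stated assumptions, using a hypothetical 1D Accumulator class rather than the 2D array in TMS_Reco.cpp:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 1D accumulator: counts entries per bin and ignores
// out-of-range bins instead of clamping them to the nearest edge.
class Accumulator {
 public:
  explicit Accumulator(std::size_t nBins) : counts_(nBins, 0) {}

  // Returns true if the entry was counted, false if bin was out of range.
  bool Fill(long bin) {
    if (bin < 0 || static_cast<std::size_t>(bin) >= counts_.size()) return false;
    ++counts_[static_cast<std::size_t>(bin)];
    ++filled_;
    return true;
  }

  // True when no entry has been counted, e.g. all hits were out of range.
  bool Empty() const { return filled_ == 0; }

  // Index of the fullest bin. Callers must check Empty() first.
  std::size_t PeakBin() const {
    assert(!Empty());
    std::size_t best = 0;
    for (std::size_t i = 1; i < counts_.size(); ++i) {
      if (counts_[i] > counts_[best]) best = i;
    }
    return best;
  }

 private:
  std::vector<int> counts_;
  std::size_t filled_ = 0;
};
```

With this shape, the case where every hit is out of bounds shows up as Empty() returning true, so the caller can skip line finding for that slice instead of reading a peak from an all-zero accumulator.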

The TF1* is giving trouble in run 17 but I didn't look into it at all. Maybe sanitize might help there.

I've added ./RunOverSeveralFiles.sh which I've been using to run over several files and gather crash rates. I'm currently running on the liam_lmao branch and outputting here: /exp/dune/data/users/${USER}/dune-tms/2024-09-19_test_final

jdkio commented 3 weeks ago

make sanitize shows errors in GetMuonTrueTrackLength related to TLorentzVector, but it seems to run when compiled with regular make. I'm at 22 files processed so far without crashes.

However, I worry that the accumulator should not round to the nearest valid index and should instead just not add anything (with some extra checks for a completely empty accumulator). It's also cout'ing too much right now. I think this only happens for busy events, where it's checking slopes using hits between two tracks or something like that.

jdkio commented 2 weeks ago

This was fixed with PR #162. Closing issue