JeffersonLab / halld_recon

Reconstruction for the GlueX Detector
6 stars 8 forks source link

Crash in DCCALShower_factory #789

Closed aaust closed 6 months ago

aaust commented 6 months ago

A considerable number of jobs in the latest monitoring launch over PrimEx data show a crash in the DCCALShower_factory. The occurrence seems higher on Alma9 than on CentOS7. The following command can quickly reproduce the crash on ifarm9 with the current default halld_recon 4.44.0:

hd_root -PEVENTS_TO_SKIP=900000 -PPLUGINS=monitoring_hists /cache/halld/RunPeriod-2022-08/rawdata/Run111998/hd_rawdata_111998_120.evio

With debug symbols, the relevant part of the stack trace look like this:

===========================================================
#5  0x0000000000e5e60c in DCCALShower_factory::cell_hyc (this=this
entry=0x7fad2810a2b0, dx=dx
entry=-nan(0x8000000000000), dy=dy
entry=-nan(0x8000000000000)) at libraries/CCAL/DCCALShower_factory.cc:2347
#6  0x0000000000e5fc64 in DCCALShower_factory::tgamma_hyc (this=0x7fad2810a2b0, nadc=<optimized out>, ia=..., id=..., nzero=<optimized out>, iaz=std::vector of length 12, capacity 12 = {...}, chisq=
0x7fad41760dc8: 3696776.769230769, e1=
0x7fad41760bb8: 6.9357763992277411e-310, x1=
0x7fad41760bc0: 6.9357975039374477e-310, y1=
0x7fad41760bc8: 6.9358182941694794e-310, e2=
0x7fad41760de8: 0, x2=
0x7fad41760df0: 0, y2=
0x7fad41760df8: 0) at libraries/CCAL/DCCALShower_factory.cc:2485
#7  0x0000000000e6279f in DCCALShower_factory::gamma_hyc (this=0x7fad2810a2b0, nadc=3, ia=std::vector of length 3, capacity 3 = {...}, id=..., chisq=
0x7fad41760dc8: 3696776.769230769, e1=
0x7fad41760dd0: 0, x1=
0x7fad41760dd8: 0, y1=
0x7fad41760de0: 0, e2=
0x7fad41760de8: 0, x2=
0x7fad41760df0: 0, y2=
0x7fad41760df8: 0) at libraries/CCAL/DCCALShower_factory.cc:2019
#8  0x0000000000e674a2 in DCCALShower_factory::gams_hyc (this=0x7fad2810a2b0, nadc=3, ia=std::vector of length 3, capacity 3 = {...}, id=std::vector of length 3, capacity 3 = {...}, nadcgam=
0x7fad417617bc: 2, gammas=std::vector of length 1, capacity 1 = {...}) at libraries/CCAL/DCCALShower_factory.cc:1625
#9  0x0000000000e6900b in DCCALShower_factory::main_island (this=0x7fad2810a2b0, ia=std::vector of length 3, capacity 4 = {...}, id=std::vector of length 3, capacity 4 = {...}, gammas=std::vector of length 1, capacity 1 = {...}) at /usr/include/c++/11/bits/stl_vector.h:1043
#10 0x0000000000e6ad8c in DCCALShower_factory::evnt (this=0x7fad2810a2b0, locEventLoop=<optimized out>, eventnumber=383968645) at libraries/CCAL/DCCALShower_factory.cc:332
#11 0x00000000008ad7b2 in jana::JFactory<DCCALShower>::Get (this=this
entry=0x7fad2810a2b0, d=std::vector of length 0, capacity 0) at /w/halld-scshelf2101/halld2/home/aaustreg/Analysis/sw/Alma9_new/jana/jana_0.8.2^ccdb1611/Linux_Alma9-x86_64-gcc11.4.1/include/JANA/JFactory.h:295
#12 0x00000000008ae16d in jana::JEventLoop::GetFromFactory<DCCALShower> (this=this
entry=0x7fad28000b60, t=std::vector of length 0, capacity 0, tag=tag
entry=0x13f2214 "", data_source=
0x7fad417638a0: jana::JEventLoop::DATA_FROM_CACHE, allow_deftag=allow_deftag
entry=true) at /w/halld-scshelf2101/halld2/home/aaustreg/Analysis/sw/Alma9_new/jana/jana_0.8.2^ccdb1611/Linux_Alma9-x86_64-gcc11.4.1/include/JANA/JEventLoop.h:452
#13 0x00000000008ae475 in jana::JEventLoop::Get<DCCALShower> (this=this
entry=0x7fad28000b60, t=std::vector of length 0, capacity 0, tag=tag
entry=0x13f2214 "", allow_deftag=allow_deftag
entry=true) at /usr/include/c++/11/ext/new_allocator.h:89
#14 0x0000000000889362 in DHistogramAction_NumReconstructedObjects::Perform_Action (this=this
entry=0x337d8a0, locEventLoop=locEventLoop
entry=0x7fad28000b60, locParticleCombo=locParticleCombo
entry=0x0) at libraries/ANALYSIS/DHistogramActions_Independent.cc:3604
#15 0x00007fad4a258e11 in DAnalysis::DAnalysisAction::operator() (locEventLoop=0x7fad28000b60, this=0x337d8a0) at libraries/ANALYSIS/DAnalysisAction.h:125
#16 DEventProcessor_monitoring_hists::evnt (this=0x337bf40, locEventLoop=0x7fad28000b60, eventnumber=<optimized out>) at plugins/Analysis/monitoring_hists/DEventProcessor_monitoring_hists.cc:134
#17 0x0000000001394792 in jana::JEventLoop::OneEvent (this=0x7fad28000b60) at src/JANA/JEventLoop.cc:693
#18 0x0000000001394db4 in jana::JEventLoop::Loop (this=this
entry=0x7fad28000b60) at src/JANA/JEventLoop.cc:496
#19 0x000000000136a105 in LaunchThread (arg=0x7ffdb2289540) at src/JANA/JApplication.cc:1382
#20 0x00007fad5a89f802 in start_thread () from /lib64/libc.so.6
#21 0x00007fad5a83f450 in clone3 () from /lib64/libc.so.6

===========================================================

This example does not produce a crash on CentOS7, even though I have seen it in other occasions before.

zihlmann commented 6 months ago

at which event number does it crash? I am currently running it already passing another 1.2M events with no crash using a compiled hd_root with debug option 1 and OPTIMIZATION option 0. In case it does not crash I will change to a version with OPTIMIZATION option 3. Forgot: I am running on ifarm9 (ALMA9)

aaust commented 6 months ago

It crashes after 16.9k events were processed.

I am currently tracing the nan up the code. Here is a division by zero (e0=0 in this case): https://github.com/JeffersonLab/halld_recon/blob/1d98aa5c06198d83129c32a2ab27f23e6feb4857/src/libraries/CCAL/DCCALShower_factory.cc#L2446

zihlmann commented 6 months ago

I can confirm that with OPTIMIZATION=0 hd_root does not crash on ifarm9, but with OPTIMIZATION=3 hd_root does indeed crash after 16.9k events. However, while I do see a crash I do not see it at the same location I see it in the DCCALShower_factory::cell_hyc() method and the reason is that both input parameters to this method (dx and dy) are nan. see below:

Thread 7 "hd_root" received signal SIGSEGV, Segmentation fault. [Switching to Thread 0x7fffdcf33640 (LWP 3179702)] 0x0000000000edea9c in DCCALShower_factory::cell_hyc (this=this@entry=0x7fffcc109ca0, dx=dx@entry=-nan(0x8000000000000), dy=dy@entry=-nan(0x8000000000000)) at libraries/CCAL/DCCALShower_factory.cc:2347 2347 acell[i+1][j] wx (1.-wy) + (gdb) (gdb) print cell_hyc $1 = (gdb) print i $2 = (gdb) print dx $3 = -nan(0x8000000000000) (gdb) print dy $4 = -nan(0x8000000000000) (gdb) (gdb) up

2 0x0000000000ee3f09 in DCCALShower_factory::gamma_hyc (this=0x7fffcc109ca0, nadc=3, ia=std::vector of length 3, capacity 3 = {...}, id=..., chisq=@0x7fffdcf28d58: 32781391.264957264, e1=@0x7fffdcf28d60: 0,

x1=@0x7fffdcf28d68: 0, y1=@0x7fffdcf28d70: 0, e2=@0x7fffdcf28d78: 0, x2=@0x7fffdcf28d80: 0, y2=@0x7fffdcf28d88: 0) at libraries/CCAL/DCCALShower_factory.cc:2019

2019 tgamma_hyc( nadc, ia, id, nzero, iaz, chisq, ee, xx, yy, e2, x2, y2 ); (gdb) up

3 0x0000000000ee784c in DCCALShower_factory::gams_hyc (this=0x7fffcc109ca0, nadc=3, ia=std::vector of length 3, capacity 3 = {...}, id=std::vector of length 3, capacity 3 = {...}, nadcgam=@0x7fffdcf2974c: 2,

gammas=std::vector of length 1, capacity 1 = {...}) at libraries/CCAL/DCCALShower_factory.cc:1625

1625 gamma_hyc( leng, iwrk_a, iwrk_d, chisq, (gdb) list 1620 iy = ia[ic] - ix*100; 1621
1622 itype = peak_type( ix, iy ); 1623
1624 e2 = 0.; 1625 gamma_hyc( leng, iwrk_a, iwrk_d, chisq, 1626 e1, x1, y1, e2, x2, y2 ); 1627
1628 gamma_t gam1; 1629 gamma_t gam2; (gdb) print leng $15 = 3 (gdb) print iwrk_a $16 = std::vector of length 3, capacity 4 = {907, 1007, 1107} (gdb) print iwrk_d $17 = std::vector of length 3, capacity 4 = {0, 0, 0} (gdb) print chisq $18 = 32781391.264957264 (gdb) print e1 $19 = 0 (gdb) print x1 $20 = 0 (gdb) print y1 $21 = 0 (gdb) print e2 $22 = 0 (gdb) print x2 $23 = 0 (gdb) print y2 $24 = 0 (gdb) (gdb) up

4 0x0000000000ee9ffd in DCCALShower_factory::main_island (this=0x7fffcc109ca0, ia=std::vector of length 3, capacity 4 = {...}, id=std::vector of length 3, capacity 4 = {...}, gammas=std::vector of length 1, capacity 1 = {...})

at /usr/include/c++/11/bits/stl_vector.h:1043

1043 operator[](size_type __n) _GLIBCXX_NOEXCEPT (gdb) up

5 0x0000000000eebf4d in DCCALShower_factory::evnt (this=0x7fffcc109ca0, locEventLoop=, eventnumber=383968645) at libraries/CCAL/DCCALShower_factory.cc:332

332 main_island( ia, id, gammas ); (gdb) print eventnumber $25 = 383968645 (gdb)

zihlmann commented 6 months ago

it also crashes with OPTIMIZATION=2 at the same location but I have a little bit more information from gdb: again it is a nan issue causing for example j to be a ridiculously large negative number.

Thread 4 "hd_root" received signal SIGSEGV, Segmentation fault. [Switching to Thread 0x7fffdcf33640 (LWP 3186800)] 0x0000000000e5f3dc in DCCALShower_factory::cell_hyc (this=this@entry=0x7fffcc109fb0, dx=dx@entry=-nan(0x8000000000000), dy=dy@entry=-nan(0x8000000000000)) at libraries/CCAL/DCCALShower_factory.cc:2347 2347 acell[i+1][j] wx (1.-wy) + (gdb) list 2342
2343 wx = ax-static_cast(i); 2344 wy = ay-static_cast(j); 2345 //std::cout << "i " << i << " j " << j << " wx " << wx << " wy " << wy << std::endl; 2346 cell_hyc = acell[i][j] (1.-wx) (1.-wy) + 2347 acell[i+1][j] wx (1.-wy) + 2348 acell[i][j+1] (1.-wx) wy + 2349 acell[i+1][j+1] wx wy; 2350
2351 } else cell_hyc = 0.; (gdb) print i $1 = (gdb) print j $2 = -2147483648 (gdb) print wx $3 = nan(0x8000000000000) (gdb) print wy $4 = (gdb) print ax $5 = (gdb) print ay $6 = nan(0x8000000000000) (gdb) print dx $7 = -nan(0x8000000000000) (gdb) print dy $8 = -nan(0x8000000000000) (gdb)

zihlmann commented 6 months ago

FIX: if I change line 2341 in DCCAL_Shower_factgory.cc from this if( i < 499 && j < 499 && i >= 0 && j >= 0 ) { to this if( (i < 499) && (j < 499) && (i >= 0) && (j >= 0) ) {

there is no more crash.

zihlmann commented 6 months ago

PS I consider this: if( i < 499 && j < 499 && i >= 0 && j >= 0 ) { not good programming. what is the precedence between <, && and >=?? I recommend to program in such a way that such a question would never needed to be raised.

aaust commented 6 months ago

Yes, that is the same location, I just traced the bogus number back a few steps to the division by zero. I can easily guard against it, but it would be good if the authors of the code could confirm.

I also coincides with this annoying print out: https://github.com/JeffersonLab/halld_recon/blob/1d98aa5c06198d83129c32a2ab27f23e6feb4857/src/libraries/CCAL/DCCALShower_factory.cc#L2246

mashephe commented 6 months ago

https://en.cppreference.com/w/cpp/language/operator_precedence

I don't see why Beni's fix makes a difference as the relational operators should take precedence anyway. I wonder if there is a deeper bug (e.g., read/write of uninitialized memory) and Beni's change just reshuffles things a bit to avoid a crash.

Matt

On Mar 22, 2024, at 10:37 AM, zihlmann @.***> wrote:

PS I consider this: if( i < 499 && j < 499 && i >= 0 && j >= 0 ) { not good programming. what is the precedence between <, && and >=?? I recommend to program in such a way that such a question would never needed to be raised.

— Reply to this email directly, view it on GitHubhttps://github.com/JeffersonLab/halld_recon/issues/789#issuecomment-2015242838, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ADG5L2GXUTZNUIAV37HIASLYZQ63ZAVCNFSM6AAAAABFCN35IWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMJVGI2DEOBTHA. You are receiving this because you are subscribed to this thread.Message ID: @.***>

aaust commented 6 months ago

It's a comparison with an integer overflow, anything can happen ;)

zihlmann commented 6 months ago

the crash happens within the if-block but if the if conditions is formulated correctly an j=-2147483648 should cause this if statement to be false. But that is not what happened. also as A.A. points out: -2147483648 == 0x80000000

aaust commented 6 months ago

Fixed by #790