Closed aaust closed 6 months ago
At which event number does it crash? I am currently running it, and it has already passed 1.2M events with no crash, using an hd_root compiled with debug option 1 and OPTIMIZATION option 0. If it does not crash, I will switch to a version with OPTIMIZATION option 3. Forgot to mention: I am running on ifarm9 (ALMA9).
It crashes after 16.9k events were processed.
I am currently tracing the nan up the code. Here is a division by zero (e0=0 in this case): https://github.com/JeffersonLab/halld_recon/blob/1d98aa5c06198d83129c32a2ab27f23e6feb4857/src/libraries/CCAL/DCCALShower_factory.cc#L2446
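A guard against this kind of division by zero could look like the minimal sketch below. The helper name safe_ratio and the fallback value are hypothetical and are not taken from DCCALShower_factory.cc; only the idea of checking e0 before dividing comes from this thread, and the right fallback would have to be confirmed by the authors of the code.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper, not part of halld_recon: divide num by e0, but return
// a caller-supplied fallback when e0 is zero or not finite, so that no
// inf/nan propagates into later position calculations.
double safe_ratio(double num, double e0, double fallback = 0.0)
{
    if (e0 == 0.0 || !std::isfinite(e0)) {
        return fallback;
    }
    return num / e0;
}

// Small self-check illustrating the behaviour.
void check_safe_ratio()
{
    assert(safe_ratio(6.0, 3.0) == 2.0);
    assert(safe_ratio(1.0, 0.0) == 0.0);          // division by zero avoided
    assert(std::isfinite(safe_ratio(0.0, 0.0)));  // 0/0 would otherwise be nan
}
```

Whether returning a fallback or skipping the cluster entirely is the better choice depends on what the downstream code expects.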
I can confirm that with OPTIMIZATION=0 hd_root does not crash on ifarm9, but with OPTIMIZATION=3 hd_root does indeed crash after 16.9k events. However, I do not see the crash at the same location: I see it in the DCCALShower_factory::cell_hyc() method, and the reason is that both input parameters to this method (dx and dy) are nan. See below:
Thread 7 "hd_root" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffdcf33640 (LWP 3179702)]
0x0000000000edea9c in DCCALShower_factory::cell_hyc (this=this@entry=0x7fffcc109ca0, dx=dx@entry=-nan(0x8000000000000), dy=dy@entry=-nan(0x8000000000000)) at libraries/CCAL/DCCALShower_factory.cc:2347
2347 acell[i+1][j] * wx * (1.-wy) +
(gdb)
(gdb) print cell_hyc
$1 =
x1=@0x7fffdcf28d68: 0, y1=@0x7fffdcf28d70: 0, e2=@0x7fffdcf28d78: 0, x2=@0x7fffdcf28d80: 0, y2=@0x7fffdcf28d88: 0) at libraries/CCAL/DCCALShower_factory.cc:2019
2019 tgamma_hyc( nadc, ia, id, nzero, iaz, chisq, ee, xx, yy, e2, x2, y2 );
(gdb) up
gammas=std::vector of length 1, capacity 1 = {...}) at libraries/CCAL/DCCALShower_factory.cc:1625
1625 gamma_hyc( leng, iwrk_a, iwrk_d, chisq,
(gdb) list
1620 iy = ia[ic] - ix*100;
1621
1622 itype = peak_type( ix, iy );
1623
1624 e2 = 0.;
1625 gamma_hyc( leng, iwrk_a, iwrk_d, chisq,
1626 e1, x1, y1, e2, x2, y2 );
1627
1628 gamma_t gam1;
1629 gamma_t gam2;
(gdb) print leng
$15 = 3
(gdb) print iwrk_a
$16 = std::vector of length 3, capacity 4 = {907, 1007, 1107}
(gdb) print iwrk_d
$17 = std::vector of length 3, capacity 4 = {0, 0, 0}
(gdb) print chisq
$18 = 32781391.264957264
(gdb) print e1
$19 = 0
(gdb) print x1
$20 = 0
(gdb) print y1
$21 = 0
(gdb) print e2
$22 = 0
(gdb) print x2
$23 = 0
(gdb) print y2
$24 = 0
(gdb)
(gdb) up
at /usr/include/c++/11/bits/stl_vector.h:1043
1043 operator[](size_type __n) _GLIBCXX_NOEXCEPT
(gdb) up
332 main_island( ia, id, gammas );
(gdb) print eventnumber
$25 = 383968645
(gdb)
It also crashes with OPTIMIZATION=2 at the same location, but I get a little more information from gdb. Again it is a nan issue, causing, for example, j to be a ridiculously large negative number.
Thread 4 "hd_root" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffdcf33640 (LWP 3186800)]
0x0000000000e5f3dc in DCCALShower_factory::cell_hyc (this=this@entry=0x7fffcc109fb0, dx=dx@entry=-nan(0x8000000000000), dy=dy@entry=-nan(0x8000000000000)) at libraries/CCAL/DCCALShower_factory.cc:2347
2347 acell[i+1][j] * wx * (1.-wy) +
(gdb) list
2342
2343 wx = ax-static_cast
2351 } else cell_hyc = 0.;
(gdb) print i
$1 =
FIX: if I change line 2341 in DCCALShower_factory.cc from this
if( i < 499 && j < 499 && i >= 0 && j >= 0 ) {
to this
if( (i < 499) && (j < 499) && (i >= 0) && (j >= 0) ) {
there is no more crash.
PS: I consider this: if( i < 499 && j < 499 && i >= 0 && j >= 0 ) { not good programming. What is the precedence between <, && and >=? I recommend programming in such a way that such a question never needs to be raised.
Yes, that is the same location, I just traced the bogus number back a few steps to the division by zero. I can easily guard against it, but it would be good if the authors of the code could confirm.
It also coincides with this annoying printout: https://github.com/JeffersonLab/halld_recon/blob/1d98aa5c06198d83129c32a2ab27f23e6feb4857/src/libraries/CCAL/DCCALShower_factory.cc#L2246
https://en.cppreference.com/w/cpp/language/operator_precedence
I don't see why Beni's fix makes a difference as the relational operators should take precedence anyway. I wonder if there is a deeper bug (e.g., read/write of uninitialized memory) and Beni's change just reshuffles things a bit to avoid a crash.
Matt
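The equivalence of the two forms of the condition can be checked directly. The sketch below is only an illustration: the function names in_range_plain and in_range_paren are hypothetical, and the bound 499 is copied from the condition quoted in this thread. Because <, >= and the other relational operators bind more tightly than &&, both functions should agree for every pair of int values, including the most negative int.

```cpp
#include <cassert>
#include <limits>

// The condition as originally written in the thread.
constexpr bool in_range_plain(int i, int j)
{
    return i < 499 && j < 499 && i >= 0 && j >= 0;
}

// The same condition with explicit parentheses.
constexpr bool in_range_paren(int i, int j)
{
    return (i < 499) && (j < 499) && (i >= 0) && (j >= 0);
}

// Compile-time checks: both forms reject the most negative int.
static_assert(!in_range_plain(0, std::numeric_limits<int>::min()), "plain form");
static_assert(!in_range_paren(0, std::numeric_limits<int>::min()), "paren form");

// Run-time check over a grid that straddles both bounds.
void check_forms_agree()
{
    for (int i = -2; i <= 500; ++i) {
        for (int j = -2; j <= 500; ++j) {
            assert(in_range_plain(i, j) == in_range_paren(i, j));
        }
    }
}
```

If the two forms behave differently in the compiled hd_root, that would point to something other than operator precedence, which is consistent with the suspicion of a deeper bug.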
On Mar 22, 2024, at 10:37 AM, zihlmann @.***> wrote:
PS: I consider this: if( i < 499 && j < 499 && i >= 0 && j >= 0 ) { not good programming. What is the precedence between <, && and >=? I recommend programming in such a way that such a question never needs to be raised.
It's a comparison with an integer overflow, anything can happen ;)
The crash happens within the if-block, but if the if condition is formulated correctly, j=-2147483648 should cause this if statement to be false. But that is not what happened. Also, as A.A. points out: -2147483648 == 0x80000000.
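One way to make the range check independent of what happens during the conversion is to test the floating-point value before casting it. Converting a nan (or any out-of-range value) from double to int is undefined behaviour in C++, and on x86-64 the conversion instruction typically yields 0x80000000 in that case, which matches the value of j reported here. This may be why the integer check appears to be bypassed at higher optimization levels, although nothing in this thread confirms that. The helper to_index below is hypothetical and not part of DCCALShower_factory.cc; the bound 499 is taken from the condition quoted above.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper: convert a coordinate to a table index only when the
// value lies in [0, n_max). Returns false otherwise, so the caller can skip
// the table lookup instead of indexing with a bogus value.
bool to_index(double x, int n_max, int &index)
{
    // Check in floating point first. nan fails every comparison, so the
    // negated condition is true for nan and the value is rejected.
    if (!(x >= 0.0 && x < static_cast<double>(n_max))) {
        return false;
    }
    index = static_cast<int>(x);  // well-defined here: 0 <= x < n_max
    return true;
}

// Small self-check illustrating the behaviour.
void check_to_index()
{
    int idx = -1;
    assert(to_index(12.7, 499, idx) && idx == 12);
    assert(!to_index(std::numeric_limits<double>::quiet_NaN(), 499, idx));
    assert(!to_index(std::numeric_limits<double>::infinity(), 499, idx));
    assert(!to_index(499.0, 499, idx));
}
```

With a check of this kind the cast is never executed for nan input, so the result no longer depends on the optimization level.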
Fixed by #790
A considerable number of jobs in the latest monitoring launch over PrimEx data crash in the DCCALShower_factory. The crash rate seems higher on Alma9 than on CentOS7. The following command can quickly reproduce the crash on ifarm9 with the current default halld_recon 4.44.0:
With debug symbols, the relevant part of the stack trace looks like this:
This example does not produce a crash on CentOS7, even though I have seen it there on other occasions before.