Closed aaust closed 5 years ago
Simon, David, and I all realized that there is a long-standing problem with unintialized pointers in the code today. Simon will push a fix, and then some of these crash problems can be checked.
A belated update: this happens at least 50% of the time when running the offline monitoring jobs. The crash seems to happen when accessing the SC geometry info (stack trace snippet listed below). I tried looking for memory errors and bounds errors, but didn't see anything. So my guess is that there is something more subtle going on.
To reproduce this, use the attached JANA config file, and a command line like the following: hd_root --nthreads=[many] --config=jana_offmon.config /cache/halld/RunPeriod-2019-01/rawdata/Run051590/hd_rawdata_051590_003.evio
Stack trace (note that the line numbers in DParticleID might be a little wrong)
=========================================================== #6 TObject::TObject (this=0x2b286d602860, obj=...) at /u/group/halld/Software/builds/Linux_CentOS7-x86_64-gcc4.8.5/root/root-6.08.06/core/base/src/\ TObject.cxx:82 #7 0x00000000008e7c95 in TVector3 (p=..., this=0x2b286d602860) at /group/halld/Software/builds/Linux_CentOS7-x86_64-gcc4.8.5/root/root-6.08.06/incl\ ude/TVector3.h:259 #8 DParticleID::Distance_ToTrack (this=this entry=0x2b28b53d4330, extrapolations=std::vector of length 1, capacity 1 = {...}, locSCHit=locSCHit entry=0x2b28a37dd440, locInputStartTime=locInputStartTime entry=5.4392247200012207, locSCHitMatchParams=std::shared_ptr (empty) 0x0, locOutputProjPos=locOutputProjPos entry=0x2b286d6028f0, locOutputProjMom=locOutputProjMom entry=0x2b286d602920) at libraries/PID/DParticleID.cc:1368 #9 0x00000000008e820d in DParticleID::Cut_MatchDistance (this=0x2b28b53d4330, extrapolations=std::vector of length 1, capacity 1 = {...}, locSCHit=\ 0x2b28a37dd440, locInputStartTime=5.4392247200012207, locSCHitMatchParams=std::shared_ptr (empty) 0x0, locIsTimeBased=locIsTimeBased entry=true, locOutputProjPos=locOutputProjPos entry=0x2b286d602da0, locOutputProjMom=locOutputProjMom entry=0x2b286d602dd0) at libraries/PID/DParticleID.cc:1758 #10 0x00002b282a12e1dc in JEventProcessor_HLDetectorTiming::evnt (this=0x4177fc0, loop=0x2b28a00008c0, eventnumber=) at plugins/Calib\ ration/HLDetectorTiming/JEventProcessor_HLDetectorTiming.cc:975 #11 0x0000000000ff4e4a in jana::JEventLoop::OneEvent (this=this entry=0x2b28a00008c0) at src/JANA/JEventLoop.cc:693 #12 0x0000000000ff5de4 in jana::JEventLoop::Loop (this=this entry=0x2b28a00008c0) at src/JANA/JEventLoop.cc:496 #13 0x0000000000fcd09a in LaunchThread (arg=0x7ffd8d5e3d10) at src/JANA/JApplication.cc:1382 #14 0x00002b280eb44dc5 in start_thread () from /lib64/libpthread.so.0 #15 0x00002b280f87321d in clone () from /lib64/libc.so.6 ===========================================================
Line where it crashes:
// Start Counter geometry in hall coordinates, obtained from xml file double sc_pos_soss = sc_pos[sc_index][0].z(); // Start of straight section double sc_pos_eoss = sc_pos[sc_index][1].z(); // End of straight section <<---- USUALLY IT IS THIS ONE double sc_pos_eobs = sc_pos[sc_index][sc_pos[sc_index].size() - 2].z(); // End of bend section
analyzed all 232 files from run 41485 using the following plugins: -PPLUGINS=ST_Tresolution,ST_online_tracking,HLDetectorTiming,monitoring_hists with 8 threads per job. I did NOT find a single crash! I did not use the plugin ST_online_efficiency as this should be used only by experts doing an efficiency study.
Note this uses pretty much the latest halld_recon code!
analyzed all 263 files from run 41485 with the same list of plugins: ST_Tresolution,ST_online_tracking,HLDetectorTiming,monitoring_hists I did find 112 crashed jobs with the latest master!
The error message is the same as above. An example log file can be found here: /volatile/halld/home/gxproj5/ver99/log/041485/stderr.041485_259.err
run 41485 out of the 263 files 13 crashed using the same jana config file as Sean. however my code was compiled with -O 0 and -g to get all debug symbols and no optimization. The following list shows the files that crashed, all but two with the same error in HLD_DetectorTiming 41485_003 41485_037 41485_038 41485_041 41485_069 41485_080 41485_084 41485_126 41485_172 41485_197 41485_199 41485_213 41485_259
entry=true, locOutputProjPos=locOutputProjPos entry=0x2b3509e42da0, locOutputProjMom=locOutputProjMom entry=0x2b3509e42dd0) at libraries/PID/DParticleID.cc:1758
running hd_root in debug mode using gdb with 4 analysis threads on file hd_rawdata_041485_259.evio
a crash did occur but again it looks like at the end of the file. Since it was running very slow over the week end when coming in this morning typed into the debugger bt and after the "bt" output the debugger just died. see most of the relevant output below:
[zihlmann@ifarm1401 ~/HallD]$ gdb hd_root GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7 Copyright (C) 2013 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-redhat-linux-gnu". For bug reporting instructions, please see: http://www.gnu.org/software/gdb/bugs/... Reading symbols from /w/halld-scifs17exp/halld2/home/zihlmann/HallD/halld_recon/Linux_CentOS7-x86_64-gcc4.8.5/bin/hd_root...done. (gdb) run --nthreads=4 -PTHREAD_TIMEOUT=500 -PJANA:MAX_RELAUNCH_THREADS=10 -PPLUGINS=ST_Tresolution,ST_online_tracking,HLDetectorTiming,monitoring_hists /cache/halld/RunPeriod-2018-01/rawdata/Run041485/hd_rawdata_041485_259.evio Starting program: /work/halld2/home/zihlmann/HallD/halld_recon/Linux_CentOS7-x86_64-gcc4.8.5/bin/hd_root --nthreads=4 -PTHREAD_TIMEOUT=500 -PJANA:MAX_RELAUNCH_THREADS=10 -PPLUGINS=ST_Tresolution,ST_online_tracking,HLDetectorTiming,monitoring_hists /cache/halld/RunPeriod-2018-01/rawdata/Run041485/hd_rawdata_041485_259.evio [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib64/libthread_db.so.1". JANA >>OUTPUT_FILENAME: hd_root.root ...
No more eventsts processed (1462.9k events read) 26.0Hz (avg.: 16.9Hz)
[Thread 0x7fffc3fff700 (LWP 7952) exited]
[Thread 0x7fffc37fe700 (LWP 7953) exited]
JANA ERROR>> didn't sleep full 0.5 seconds!
[Thread 0x7fffd9a54700 (LWP 7944) exited]ts read) 16.0Hz (avg.: 16.9Hz)
JANA ERROR>> didn't sleep full 0.5 seconds!
JANA >>0k events processed (1463.0k events read) 8.0Hz (avg.: 16.9Hz)
JANA >>No more event sources
JANA >>Thread 0x7fffd9253700 completed gracefully: Sat Apr 27 21:50:37 2019
[Thread 0x7fffd9253700 (LWP 7947) exited]
JANA >>
JANA >>No more event sources
JANA >>Thread 0x7fffd3fff700 completed gracefully: Sat Apr 27 21:50:37 2019
[Thread 0x7fffd3fff700 (LWP 7949) exited]
JANA >>
JANA >>No more event sources
JANA >>Thread 0x7fffd8a52700 completed gracefully: Sat Apr 27 21:50:38 2019
JANA ERROR>> didn't sleep full 0.5 seconds!
JANA >>Merging thread 0 (0x7fffd9253700) ...read) 8.0Hz (avg.: 16.9Hz)
JANA >>Merging thread 1 (0x7fffd3fff700) ...
JANA >>Merging thread 2 (0x7fffd8a52700) ...
[Thread 0x7fffd8a52700 (LWP 7948) exited]
Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 0x7fffd37fe700 (LWP 7950)] TObject::TObject (this=0x7fffd37f6f50, obj=...) at /u/group/halld/Software/builds/Linux_CentOS7-x86_64-gcc4.8.5/root/root-6.08.06/core/base/src/TObject.cxx:82 82 fBits = obj.fBits; Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 expat-2.1.0-8.el7.x86_64 freetype-2.4.11-11.el7.x86_64 glibc-2.17-105.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.13.2-10.el7.x86_64 libcom_err-1.42.9-7.el7.x86_64 libgcc-4.8.5-4.el7.x86_64 libgfortran-4.8.5-4.el7.x86_64 libquadmath-4.8.5-4.el7.x86_64 libselinux-2.2.2-6.el7.x86_64 libstdc++-4.8.5-4.el7.x86_64 nss-softokn-freebl-3.16.2.3-13.el7_1.x86_64 openssl-libs-1.0.1e-42.el7.9.x86_64 pcre-8.32-15.el7.x86_64 xz-libs-5.1.2-12alpha.el7.x86_64 zlib-1.2.7-15.el7.x86_64
(gdb) (gdb) (gdb) bt
BFD: reopening /group/halld/Software/builds/Linux_CentOS7-x86_64-gcc4.8.5/root/root-6.08.06/lib/libPhysics.so: Too many open files in system
BFD: reopening /group/halld/Software/builds/Linux_CentOS7-x86_64-gcc4.8.5/root/root-6.08.06/lib/libPhysics.so: Too many open files in system
Python Exception <class 'gdb.error'> Can't read data for section '.debug_loc' in file '/group/halld/Software/builds/Linux_CentOS7-x86_64-gcc4.8.5/root/root-6.08.06/lib/libPhysics.so': p=..., this=0x7fffd37f6f50) at /group/halld/Software/builds/Linux_CentOS7-x86_64-gcc4.8.5/root/root-6.08.06/include/TVector3.h:259
locInputStartTime=locInputStartTime@entry=6.1386933326721191, locSCHitMatchParams=std::shared_ptr (empty) 0x0, locOutputProjPos=locOutputProjPos@entry=0x7fffd37f6fe0,
locOutputProjMom=locOutputProjMom@entry=0x7fffd37f7010) at libraries/PID/DParticleID.cc:1368
BFD: reopening /work/halld2/home/zihlmann/HallD/halld_recon/Linux_CentOS7-x86_64-gcc4.8.5/plugins/ST_Tresolution.so: Too many open files in system
BFD: reopening /work/halld2/home/zihlmann/HallD/halld_recon/Linux_CentOS7-x86_64-gcc4.8.5/plugins/ST_Tresolution.so: Too many open files in system
BFD: reopening /lib64/libdl.so.2: Too many open files in system
BFD: reopening /lib64/libdl.so.2: Too many open files in system
JANA ERROR>> dSegmentation fault
idn't sleep full 0.5 seconds!
cat: error while loading shared libraries: libc.so.6: cannot open shared object file: Error 23 16.9Hz)
1463.0k events processed (1463.0k events read) 0.0Hz (avg.: 16.9Hz)
[zihlmann@ifarm1401 ~/HallD]$
running in debug mode using Seans janaconfig file over the data file hd_rawdata_041485_003.evio also seems to crash but only at the end after all events are read. At least that is what gdb seems to tell me event number is 6578239 which seems to be about right for the 4th data file to be the last event.
Unknown function: fssed (1.5M events read) 50.0Hz (avg.: 34.7Hz)
Unknown function: fssed (1.5M events read) 38.0Hz (avg.: 34.7Hz)
Unknown function: fssed (1.5M events read) 50.0Hz (avg.: 34.8Hz)
Unknown function: fssed (1.5M events read) 24.0Hz (avg.: 34.8Hz)
Unknown function: fssed (1.6M events read) 40.0Hz (avg.: 34.5Hz)
Unknown function: fssed (1.6M events read) 36.0Hz (avg.: 34.5Hz)
Unknown function: fssed (1.6M events read) 44.0Hz (avg.: 33.2Hz)
Unknown function: fssed (1.6M events read) 64.0Hz (avg.: 32.9Hz)
Unknown function: fssed (1.6M events read) 46.0Hz (avg.: 31.8Hz)
Unknown function: fssed (1.6M events read) 56.0Hz (avg.: 31.6Hz)
Unknown function: fssed (1.6M events read) 30.0Hz (avg.: 30.6Hz)
Unknown function: fssed (1.6M events read) 30.0Hz (avg.: 28.7Hz)
Unknown function: fssed (1.6M events read) 26.0Hz (avg.: 28.7Hz)
No more eventsprocessed (1.6M events read) 40.0Hz (avg.: 28.8Hz)
[Thread 0x7fff837fe700 (LWP 29558) exited]
[Thread 0x7fff82ffd700 (LWP 29559) exited]
JANA ERROR>> didn't sleep full 0.5 seconds!
[Thread 0x7fffabb41700 (LWP 29548) exited]) 40.0Hz (avg.: 28.8Hz)
JANA >>
JANA >>No more event sources
JANA >>Thread 0x7fffa9b3d700 completed gracefully: Fri Apr 26 23:39:05 2019
JANA >>
JANA >>No more event sources
JANA >>Thread 0x7fffa8d1a700 completed gracefully: Fri Apr 26 23:39:05 2019
JANA ERROR>> didn't sleep full 0.5 seconds!
JANA >>Merging thread 0 (0x7fffa9b3d700) ... 40.0Hz (avg.: 28.8Hz)
JANA >>
JANA >>No more event sources
JANA >>Thread 0x7fffaab3f700 completed gracefully: Fri Apr 26 23:39:05 2019
JANA >>
JANA >>No more event sources
JANA >>Thread 0x7fffa3fff700 completed gracefully: Fri Apr 26 23:39:05 2019
JANA >>
JANA >>No more event sources
JANA >>Thread 0x7fffaa33e700 completed gracefully: Fri Apr 26 23:39:05 2019
JANA >>
JANA >>No more event sources
JANA >>Thread 0x7fffa37fe700 completed gracefully: Fri Apr 26 23:39:05 2019
JANA >>
JANA >>No more event sources
JANA >>Thread 0x7fffa2ffd700 completed gracefully: Fri Apr 26 23:39:05 2019
Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 0x7fffab340700 (LWP 29553)] DParticleID::Get_CorrectedHitEnergy (this=this@entry=0x7fff6bfec6e0, locSCHit=locSCHit@entry=0x7ffe3d65f7b0, locProjPos=...) at libraries/PID/DParticleID.cc:3563 3563 double sc_pos_eoss = sc_pos[sc_index][1].z(); // End of straight section Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 expat-2.1.0-8.el7.x86_64 freetype-2.4.11-11.el7.x86_64 glibc-2.17-105.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.13.2-10.el7.x86_64 libcom_err-1.42.9-7.el7.x86_64 libgcc-4.8.5-4.el7.x86_64 libgfortran-4.8.5-4.el7.x86_64 libquadmath-4.8.5-4.el7.x86_64 libselinux-2.2.2-6.el7.x86_64 libstdc++-4.8.5-4.el7.x86_64 nss-softokn-freebl-3.16.2.3-13.el7_1.x86_64 openssl-libs-1.0.1e-42.el7.9.x86_64 pcre-8.32-15.el7.x86_64 xz-libs-5.1.2-12alpha.el7.x86_64 zlib-1.2.7-15.el7.x86_64 (gdb) X11 connection rejected because of wrong authentication. X11 connection rejected because of wrong authentication. X11 connection rejected because of wrong authentication. X11 connection rejected because of wrong authentication. X11 connection rejected because of wrong authentication. X11 connection rejected because of wrong authentication. X11 connection rejected because of wrong authentication. X11 connection rejected because of wrong authentication. X11 connection rejected because of wrong authentication. X11 connection rejected because of wrong authentication. X11 connection rejected because of wrong authentication. X11 connection rejected because of wrong authentication. X11 connection rejected because of wrong authentication. X11 connection rejected because of wrong authentication. X11 connection rejected because of wrong authentication.
(gdb) (gdb) (gdb) bt
locInputStartTime=locInputStartTime@entry=5.3872246742248535, locSCHitMatchParams=std::shared_ptr (empty) 0x0, locOutputProjPos=locOutputProjPos@entry=0x7fffab3388f0, locOutputProjMom=locOutputProjMom@entry=0x7fffab338920)
at libraries/PID/DParticleID.cc:1338
locSCHitMatchParams=std::shared_ptr (empty) 0x0, locIsTimeBased=locIsTimeBased@entry=true, locOutputProjPos=locOutputProjPos@entry=0x7fffab338da0, locOutputProjMom=locOutputProjMom@entry=0x7fffab338dd0)
at libraries/PID/DParticleID.cc:1758
(gdb)
(gdb) list
3558 double DParticleID::Get_CorrectedHitEnergy(const DSCHit* locSCHit,
3559 const DVector3 &locProjPos) const {
3560 // Start Counter geometry in hall coordinates, obtained from xml file
3561 unsigned int sc_index = locSCHit->sector - 1;
3562 double sc_pos_soss = sc_pos[sc_index][0].z(); // Start of straight section
3563 double sc_pos_eoss = sc_pos[sc_index][1].z(); // End of straight section
3564 double sc_pos_eobs = sc_pos[sc_index][sc_pos[sc_index].size() - 2].z(); // End of bend section
3565
3566 // Grab the pulse integral
3567 double locCorrectedHitEnergy = locSCHit->dE;
(gdb) print sc_index
No symbol "sc_index" in current context.
(gdb)
(gdb) up
locInputStartTime=locInputStartTime@entry=5.3872246742248535, locSCHitMatchParams=std::shared_ptr (empty) 0x0, locOutputProjPos=locOutputProjPos@entry=0x7fffab3388f0, locOutputProjMom=locOutputProjMom@entry=0x7fffab338920)
at libraries/PID/DParticleID.cc:1338
1338 double locCorrectedHitEnergy=Get_CorrectedHitEnergy(locSCHit,locProjPos); (gdb) up
locSCHitMatchParams=std::shared_ptr (empty) 0x0, locIsTimeBased=locIsTimeBased@entry=true, locOutputProjPos=locOutputProjPos@entry=0x7fffab338da0, locOutputProjMom=locOutputProjMom@entry=0x7fffab338dd0)
at libraries/PID/DParticleID.cc:1758
1758 if(!Distance_ToTrack(extrapolations, locSCHit, locInputStartTime, locSCHitMatchParams, &locProjPos, &locProjMom))
(gdb) list
1753 }
1754
1755 bool DParticleID::Cut_MatchDistance(const vector
1761 if(locOutputProjMom != nullptr)
1762 {
(gdb) up
975 true, &IntersectionPoint, &IntersectionMomentum);
(gdb) list
970 // need to get the projected hit position at the SC in order to cut on it
971 DVector3 IntersectionPoint, IntersectionMomentum;
972 vector
693 proc->evnt(this, event_number); (gdb) print event_number $2 = 6578239 (gdb)
This is the same location in which I see the code crashing, so good to see that the problem is reproducible by others.
On Mon, Apr 29, 2019 at 8:27 AM zihlmann notifications@github.com wrote:
running in debug mode using Seans janaconfig file over the data file hd_rawdata_041485_003.evio also seems to crash but only at the end after all events are read. At least that is what gdb seems to tell me: [zihlmann@ifarm1801 TestPlugins]$ gdb hd_root GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7 Copyright (C) 2013 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-redhat-linux-gnu". For bug reporting instructions, please see: http://www.gnu.org/software/gdb/bugs/... Reading symbols from /w/halld-scifs17exp/halld2/home/zihlmann/HallD/halld_recon/Linux_CentOS7-x86_64-gcc4.8.5/bin/hd_root...done. (gdb) run --nthreads=8 --config=janaconf.conf /cache/halld/RunPeriod-2018-01/rawdata/Run041485/hd_rawdata_041485_003.evio Starting program: /work/halld2/home/zihlmann/HallD/halld_recon/Linux_CentOS7-x86_64-gcc4.8.5/bin/hd_root --nthreads=8 --config=janaconf.conf /cache/halld/RunPeriod-2018-01/rawdata/Run041485/hd_rawdata_041485_003.evio [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib64/libthread_db.so.1". JANA >>Reading configuration from "janaconf.conf" ... JANA >>OUTPUT_FILENAME: hd_root.root ..... Detaching after fork from child process 29540. Detaching after fork from child process 29542. Detaching after fork from child process 29544. Detaching after fork from child process 29546. Opened ROOT file "hd_root.root" ... INITIALIZE VARIABLES [New Thread 0x7fffabb41700 (LWP 29548)] [New Thread 0x7fffa8d1a700 (LWP 29549)] [New Thread 0x7fffa3fff700 (LWP 29550)] [New Thread 0x7fffa37fe700 (LWP 29551)] [New Thread 0x7fffa2ffd700 (LWP 29552)] [New Thread 0x7fffab340700 (LWP 29553)] [New Thread 0x7fffaab3f700 (LWP 29554)] [New Thread 0x7fffaa33e700 (LWP 29555)] [New Thread 0x7fffa9b3d700 (LWP 29556)] JANA >>Launching threads ........ JANA >>Opening source "/cache/halld/RunPeriod-2018-01/rawdata/Run041485/hd_rawdata_041485_003.evio" of type: EVIOpp - Reads EVIO formatted data from file or ET system [New Thread 0x7fff83fff700 (LWP 29557)] [New Thread 0x7fff837fe700 (LWP 29558)] [New Thread 0x7fff82ffd700 (LWP 29559)] .... JANA >>Electron beam current trip map for run 41485 loaded with 1994 boundaries (45 trips over 12785 sec) dBCALGeom->GetBCAL_center()=212.000000 Z_TARGET=65.204000 JANA >>vents processed (1.0 events read) 0.0Hz (avg.: 0.0Hz) JANA >> --- Configuration Parameters -- JANA >> EVIO:TREAT_TRUNCATED_AS_ERROR = 1 JANA >> JANA:RESOURCE_DEFAULT_PATH = JANA >> NTHREADS = 8 JANA >> PLUGINS = occupancy_online,highlevel_online,danarest,monitoring_hists,TAGH_online,TAGM_online,TAGM_clusters,TAGM_TW,BCAL_online,CDC_online,CDC_Efficiency,FCAL_online,FDC_online,FDC_Efficiency,ST_online_lowlevel,lowlevel_online,TOF_online,PS_online,PSC_online,PSPair_online,PS_flux,TPOL_online,TOF_TDC_shift,BCAL_Eff,BCAL_Hadronic_Eff,BCAL_TDC_Timing,FCAL_Hadronic_Eff,p2pi_hists,p3pi_hists,ppi0gamma_hists,HLDetectorTiming,BCAL_inv_mass,TRIG_online,CDC_drift,RF_online,CDC_expert_2,L1_online,FCAL_invmass,TrackingPulls,p4pi_hists,p2k_hists,CDC_TimeToDistance,TOF_calib,CDC_amp,BCAL_attenlength_gainratio,trackeff_missing,TPOL_tree,evio_writer,randomtrigger_skim JANA >> THREAD_TIMEOUT = 30 seconds JANA >> TRKFIT:HYPOTHESES = 2,3,8,9,11,12,14,15 JANA >> ------------------------------- .... libraries/HDGEOMETRY/DGeometry.cc:1706 Unable to retrieve ComptonEMcal position. JANA >>Error loading /CCAL/nonlinear_energy_correction ! JANA >> Enabling bz2 compression of output HDDM file stream.: 23.9Hz) JANA >> Enabling CRC data integrity check in output HDDM file stream Unknown function: fessed (38.2k events read) 28.0Hz (avg.: 35.9Hz) Unknown function: fessed (57.3k events read) 34.0Hz (avg.: 36.6Hz)
Unknown function: fssed (1.5M events read) 50.0Hz (avg.: 34.7Hz) Unknown function: fssed (1.5M events read) 38.0Hz (avg.: 34.7Hz) Unknown function: fssed (1.5M events read) 50.0Hz (avg.: 34.8Hz) Unknown function: fssed (1.5M events read) 24.0Hz (avg.: 34.8Hz) Unknown function: fssed (1.6M events read) 40.0Hz (avg.: 34.5Hz) Unknown function: fssed (1.6M events read) 36.0Hz (avg.: 34.5Hz) Unknown function: fssed (1.6M events read) 44.0Hz (avg.: 33.2Hz) Unknown function: fssed (1.6M events read) 64.0Hz (avg.: 32.9Hz) Unknown function: fssed (1.6M events read) 46.0Hz (avg.: 31.8Hz) Unknown function: fssed (1.6M events read) 56.0Hz (avg.: 31.6Hz) Unknown function: fssed (1.6M events read) 30.0Hz (avg.: 30.6Hz) Unknown function: fssed (1.6M events read) 30.0Hz (avg.: 28.7Hz) Unknown function: fssed (1.6M events read) 26.0Hz (avg.: 28.7Hz) No more eventsprocessed (1.6M events read) 40.0Hz (avg.: 28.8Hz) [Thread 0x7fff837fe700 (LWP 29558) exited] [Thread 0x7fff82ffd700 (LWP 29559) exited] JANA ERROR>> didn't sleep full 0.5 seconds! [Thread 0x7fffabb41700 (LWP 29548) exited]) 40.0Hz (avg.: 28.8Hz) JANA >> JANA >>No more event sources JANA >>Thread 0x7fffa9b3d700 completed gracefully: Fri Apr 26 23:39:05 2019 JANA >> JANA >>No more event sources JANA >>Thread 0x7fffa8d1a700 completed gracefully: Fri Apr 26 23:39:05 2019 JANA ERROR>> didn't sleep full 0.5 seconds! JANA >>Merging thread 0 (0x7fffa9b3d700) ... 40.0Hz (avg.: 28.8Hz) JANA >> JANA >>No more event sources JANA >>Thread 0x7fffaab3f700 completed gracefully: Fri Apr 26 23:39:05 2019 JANA >> JANA >>No more event sources JANA >>Thread 0x7fffa3fff700 completed gracefully: Fri Apr 26 23:39:05 2019 JANA >> JANA >>No more event sources JANA >>Thread 0x7fffaa33e700 completed gracefully: Fri Apr 26 23:39:05 2019 JANA >> JANA >>No more event sources JANA >>Thread 0x7fffa37fe700 completed gracefully: Fri Apr 26 23:39:05 2019 JANA >> JANA >>No more event sources JANA >>Thread 0x7fffa2ffd700 completed gracefully: Fri Apr 26 23:39:05 2019
Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 0x7fffab340700 (LWP 29553)] DParticleID::Get_CorrectedHitEnergy (this=this@entry=0x7fff6bfec6e0, locSCHit=locSCHit@entry=0x7ffe3d65f7b0, locProjPos=...) at libraries/PID/DParticleID.cc:3563 3563 double sc_pos_eoss = sc_pos[sc_index][1].z(); // End of straight section Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 expat-2.1.0-8.el7.x86_64 freetype-2.4.11-11.el7.x86_64 glibc-2.17-105.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.13.2-10.el7.x86_64 libcom_err-1.42.9-7.el7.x86_64 libgcc-4.8.5-4.el7.x86_64 libgfortran-4.8.5-4.el7.x86_64 libquadmath-4.8.5-4.el7.x86_64 libselinux-2.2.2-6.el7.x86_64 libstdc++-4.8.5-4.el7.x86_64 nss-softokn-freebl-3.16.2.3-13.el7_1.x86_64 openssl-libs-1.0.1e-42.el7.9.x86_64 pcre-8.32-15.el7.x86_64 xz-libs-5.1.2-12alpha.el7.x86_64 zlib-1.2.7-15.el7.x86_64 (gdb) X11 connection rejected because of wrong authentication. X11 connection rejected because of wrong authentication. X11 connection rejected because of wrong authentication. X11 connection rejected because of wrong authentication. X11 connection rejected because of wrong authentication. X11 connection rejected because of wrong authentication. X11 connection rejected because of wrong authentication. X11 connection rejected because of wrong authentication. X11 connection rejected because of wrong authentication. X11 connection rejected because of wrong authentication. X11 connection rejected because of wrong authentication. X11 connection rejected because of wrong authentication. X11 connection rejected because of wrong authentication. X11 connection rejected because of wrong authentication. X11 connection rejected because of wrong authentication.
(gdb) (gdb) (gdb) bt
0 DParticleID::Get_CorrectedHitEnergy (this=this@entry=0x7fff6bfec6e0,
locSCHit=locSCHit@entry=0x7ffe3d65f7b0, locProjPos=...) at libraries/PID/DParticleID.cc:3563
1 https://github.com/JeffersonLab/halld_recon/pull/1
0x00000000008e78ca in DParticleID::Distance_ToTrack (this=this@entry=0x7fff6bfec6e0, extrapolations=std::vector of length 1, capacity 1 = {...}, locSCHit=locSCHit@entry=0x7ffe3d65f7b0, locInputStartTime=locInputStartTime@entry=5.3872246742248535, locSCHitMatchParams=std::shared_ptr (empty) 0x0, locOutputProjPos=locOutputProjPos@entry=0x7fffab3388f0, locOutputProjMom=locOutputProjMom@entry=0x7fffab338920) at libraries/PID/DParticleID.cc:1338
2 https://github.com/JeffersonLab/halld_recon/pull/2
0x00000000008e805d in DParticleID::Cut_MatchDistance (this=0x7fff6bfec6e0, extrapolations=std::vector of length 1, capacity 1 = {...}, locSCHit=0x7ffe3d65f7b0, locInputStartTime=5.3872246742248535, locSCHitMatchParams=std::shared_ptr (empty) 0x0, locIsTimeBased=locIsTimeBased@entry=true, locOutputProjPos=locOutputProjPos@entry=0x7fffab338da0, locOutputProjMom=locOutputProjMom@entry=0x7fffab338dd0) at libraries/PID/DParticleID.cc:1758
3 https://github.com/JeffersonLab/halld_recon/pull/3
0x00007fffe89f71dc in JEventProcessor_HLDetectorTiming::evnt (this=0x1c1f460, loop=0x7fff900008c0, eventnumber=) at plugins/Calibration/HLDetectorTiming/JEventProcessor_HLDetectorTiming.cc:975
4 https://github.com/JeffersonLab/halld_recon/pull/4
0x0000000000ff441a in jana::JEventLoop::OneEvent (this=this@entry=0x7fff900008c0) at src/JANA/JEventLoop.cc:693
5 https://github.com/JeffersonLab/halld_recon/pull/5
0x0000000000ff53b4 in jana::JEventLoop::Loop (this=this@entry=0x7fff900008c0) at src/JANA/JEventLoop.cc:496
6 https://github.com/JeffersonLab/halld_recon/pull/6
0x0000000000fcc66a in LaunchThread (arg=0x7fffffffbc70) at src/JANA/JApplication.cc:1382
7 https://github.com/JeffersonLab/halld_recon/pull/7
0x00007ffff197adc5 in start_thread () from /lib64/libpthread.so.0
8 https://github.com/JeffersonLab/halld_recon/pull/8
0x00007ffff0c8421d in clone () from /lib64/libc.so.6 (gdb) (gdb) list 3558 double DParticleID::Get_CorrectedHitEnergy(const DSCHit* locSCHit, 3559 const DVector3 &locProjPos) const { 3560 // Start Counter geometry in hall coordinates, obtained from xml file 3561 unsigned int sc_index = locSCHit->sector - 1; 3562 double sc_pos_soss = sc_pos[sc_index][0].z(); // Start of straight section 3563 double sc_pos_eoss = sc_pos[sc_index][1].z(); // End of straight section 3564 double sc_pos_eobs = sc_pos[sc_index][sc_pos[sc_index].size() - 2].z(); // End of bend section 3565 3566 // Grab the pulse integral 3567 double locCorrectedHitEnergy = locSCHit->dE; (gdb) print sc_index No symbol "sc_index" in current context. (gdb) (gdb) up
1 https://github.com/JeffersonLab/halld_recon/pull/1
0x00000000008e78ca in DParticleID::Distance_ToTrack (this=this@entry=0x7fff6bfec6e0, extrapolations=std::vector of length 1, capacity 1 = {...}, locSCHit=locSCHit@entry=0x7ffe3d65f7b0, locInputStartTime=locInputStartTime@entry=5.3872246742248535, locSCHitMatchParams=std::shared_ptr (empty) 0x0, locOutputProjPos=locOutputProjPos@entry=0x7fffab3388f0, locOutputProjMom=locOutputProjMom@entry=0x7fffab338920) at libraries/PID/DParticleID.cc:1338 1338 double locCorrectedHitEnergy=Get_CorrectedHitEnergy(locSCHit,locProjPos); (gdb) up
2 https://github.com/JeffersonLab/halld_recon/pull/2
0x00000000008e805d in DParticleID::Cut_MatchDistance (this=0x7fff6bfec6e0, extrapolations=std::vector of length 1, capacity 1 = {...}, locSCHit=0x7ffe3d65f7b0, locInputStartTime=5.3872246742248535, locSCHitMatchParams=std::shared_ptr (empty) 0x0, locIsTimeBased=locIsTimeBased@entry=true, locOutputProjPos=locOutputProjPos@entry=0x7fffab338da0, locOutputProjMom=locOutputProjMom@entry=0x7fffab338dd0) at libraries/PID/DParticleID.cc:1758 1758 if(!Distance_ToTrack(extrapolations, locSCHit, locInputStartTime, locSCHitMatchParams, &locProjPos, &locProjMom)) (gdb) list 1753 } 1754 1755 bool DParticleID::Cut_MatchDistance(const vectorDTrackFitter::Extrapolation_t &extrapolations, const DSCHit locSCHit, double locInputStartTime,shared_ptr& locSCHitMatchParams, bool locIsTimeBased, DVector3 locOutputProjPos, DVector3 *locOutputProjMom) const 1756 { 1757 DVector3 locProjPos, locProjMom; 1758 if(!Distance_ToTrack(extrapolations, locSCHit, locInputStartTime, locSCHitMatchParams, &locProjPos, &locProjMom)) 1759 return false; 1760 1761 if(locOutputProjMom != nullptr) 1762 { (gdb) up
3 https://github.com/JeffersonLab/halld_recon/pull/3
0x00007fffe89f71dc in JEventProcessor_HLDetectorTiming::evnt (this=0x1c1f460, loop=0x7fff900008c0, eventnumber=) at plugins/Calibration/HLDetectorTiming/JEventProcessor_HLDetectorTiming.cc:975 975 true, &IntersectionPoint, &IntersectionMomentum); (gdb) list 970 // need to get the projected hit position at the SC in order to cut on it 971 DVector3 IntersectionPoint, IntersectionMomentum; 972 vectorDTrackFitter::Extrapolation_t extrapolations = locTrackTimeBased->extrapolations.at(SYS_START); 973 shared_ptr locSCHitMatchParams2; 974 bool sc_match_pid = dParticleID->Cut_MatchDistance(extrapolations, locSCHitMatchParams->dSCHit, locSCHitMatchParams->dSCHit->t, locSCHitMatchParams2, 975 true, &IntersectionPoint, &IntersectionMomentum); 976 double locSCzIntersection = IntersectionPoint.z(); 977 if( locSCzIntersection < 83. ) { 978 Fill1DHistogram("HLDetectorTiming", "SC_Target_RF_Compare", name, 979 flightTimeCorrectedSCTime - locShiftedTime, (gdb)
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/JeffersonLab/halld_recon/issues/71#issuecomment-487559826, or mute the thread https://github.com/notifications/unsubscribe-auth/AAJAS2TCCBPGL6MJ3ZJOAA3PS3SU3ANCNFSM4GLDR6OQ .
I got the below message from Beni. This suggests something is happening towards the end, when memory might be being freed. I am going to look into when destructors are being called / handled.
/w/halld-scifs17exp/halld2/home/zihlmann/HallD/work/sctest you find 2 evio files f003.evio f259.evio
the f003 crashes with --nthreads=8 -PPLUGINS=ST_Tresolution,ST_online_tracking,HLDetectorTiming,monitoring_hists f003.evio
the f259 crashes with --nthreads=8 --config=TestPlugins/janaconf.conf f259.evio
note that both will NOT crash with --nthreads=1 the crash is always the last event!
I could reproduce the first crash, not the second one.
Same here
On Wed, May 1, 2019 at 10:24 AM Alex Austregesilo notifications@github.com wrote:
I could reproduce the first crash, not the second one.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/JeffersonLab/halld_recon/issues/71#issuecomment-488296293, or mute the thread https://github.com/notifications/unsubscribe-auth/AAJAS2UIV5UU7QFVMNQPNBLPTGR27ANCNFSM4GLDR6OQ .
I get the error see blow:
[zihlmann@ifarm1402 sctest]$ gdb hd_root
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/...
Reading symbols from /w/halld-scifs17exp/halld2/home/zihlmann/HallD/halld_recon/Linux_CentOS7-x86_64-gcc4.8.5/bin/hd_root...done.
(gdb) run --nthreads=8 --config=TestPlugins/janaconf.conf f259.evio
...
JANA >>
JANA >>No more event sources
JANA >>Thread 0x7fff837fe700 completed gracefully: Wed May 1 11:29:03 2019
[Thread 0x7fffa27fc700 (LWP 8710) exited]
[Thread 0x7fff837fe700 (LWP 8715) exited]
JANA ERROR>> didn't sleep full 0.5 seconds!
JANA >>Merging thread 0 (0x7fff83fff700) ...) 34.0Hz (avg.: 29.7Hz)
JANA >>Merging thread 1 (0x7fffa1ffb700) ...
JANA >>Merging thread 2 (0x7fff82ffd700) ...
JANA >>Merging thread 3 (0x7fffa2ffd700) ...
JANA >>Merging thread 4 (0x7fffa27fc700) ...
JANA >>Merging thread 5 (0x7fff837fe700) ...
Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 0x7fffa17fa700 (LWP 8712)] TObject::TObject (this=0x7fffa17f2860, obj=...) at /u/group/halld/Software/builds/Linux_CentOS7-x86_64-gcc4.8.5/root/root-6.08.06/core/base/src/TObject.cxx:82 82 fBits = obj.fBits; Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 expat-2.1.0-8.el7.x86_64 freetype-2.4.11-11.el7.x86_64 glibc-2.17-105.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.13.2-10.el7.x86_64 libcom_err-1.42.9-7.el7.x86_64 libgcc-4.8.5-4.el7.x86_64 libgfortran-4.8.5-4.el7.x86_64 libquadmath-4.8.5-4.el7.x86_64 libselinux-2.2.2-6.el7.x86_64 libstdc++-4.8.5-4.el7.x86_64 nss-softokn-freebl-3.16.2.3-13.el7_1.x86_64 openssl-libs-1.0.1e-42.el7.9.x86_64 pcre-8.32-15.el7.x86_64 xz-libs-5.1.2-12alpha.el7.x86_64 zlib-1.2.7-15.el7.x86_64 (gdb) bt
locInputStartTime=locInputStartTime@entry=6.1386933326721191, locSCHitMatchParams=std::shared_ptr (empty) 0x0, locOutputProjPos=locOutputProjPos@entry=0x7fffa17f28f0,
locOutputProjMom=locOutputProjMom@entry=0x7fffa17f2920) at libraries/PID/DParticleID.cc:1368
locInputStartTime=6.1386933326721191, locSCHitMatchParams=std::shared_ptr (empty) 0x0, locIsTimeBased=locIsTimeBased@entry=true, locOutputProjPos=locOutputProjPos@entry=0x7fffa17f2da0,
locOutputProjMom=locOutputProjMom@entry=0x7fffa17f2dd0) at libraries/PID/DParticleID.cc:1758
at plugins/Calibration/HLDetectorTiming/JEventProcessor_HLDetectorTiming.cc:975
(gdb) (gdb) up
locInputStartTime=locInputStartTime@entry=6.1386933326721191, locSCHitMatchParams=std::shared_ptr (empty) 0x0, locOutputProjPos=locOutputProjPos@entry=0x7fffa17f28f0,
locOutputProjMom=locOutputProjMom@entry=0x7fffa17f2920) at libraries/PID/DParticleID.cc:1368
1368 DVector3 locPaddleNorm=sc_norm[sc_index][locSCPlane]; (gdb) up
locInputStartTime=6.1386933326721191, locSCHitMatchParams=std::shared_ptr (empty) 0x0, locIsTimeBased=locIsTimeBased@entry=true, locOutputProjPos=locOutputProjPos@entry=0x7fffa17f2da0,
locOutputProjMom=locOutputProjMom@entry=0x7fffa17f2dd0) at libraries/PID/DParticleID.cc:1758
1758 if(!Distance_ToTrack(extrapolations, locSCHit, locInputStartTime, locSCHitMatchParams, &locProjPos, &locProjMom)) (gdb) up
at plugins/Calibration/HLDetectorTiming/JEventProcessor_HLDetectorTiming.cc:975
975 true, &IntersectionPoint, &IntersectionMomentum); (gdb) up
693 proc->evnt(this, event_number); (gdb) print event_number $1 = 409121440 (gdb)
The current theory is that this error is coming from trying to access a DParticleID object after it has been destroyed. Obviously this is bad! Trying to track down why this is happening.
I have put in a pull request that should address this issue. At least it appears to fix the problem in the testing I did using Beni's f003.evio file.
The main issue was that a pointer to a DParticleID object was obtained in the JEventProcessor's brun method and then used in all subsequent calls to evnt. This meant that even though a different DParticleID object was created for each thread, only one was being used by all threads. At the end of the program as threads ended, they released their associated memory which included their DParticleID objects. Once the thread that owned the "one" DParticleID object actually used by the JEventProcessor deleted it, the pointer became invalid causing any remaining threads to crash. (Or at least crash once the memory pointed to by the invalid pointer was corrupted).
This pattern of grabbing a persistent "algorithm" object while in brun and remembering the pointer for later use in evnt works fine in factories. It has in fact been in use for the DTrackFitter and various geometry objects for quite some time. At some point it migrated over to the JEventProcessor objects in numerous plugins where it is susceptible to these types of crashes.
This fixes one such instance (actually 2 since the DRFBunch_factory has the same issue). Others will need to be addressed in subsequent pull requests. It looks like this pattern may exist in quite a few plugins so a full survey will be needed to completely cleanse the disease.
We did not observe any crashes of this kind in the monitoring launches after the relevent pull requests.
Several monitoring plugins crash hd_root:
ST_Tresolution,ST_online_tracking,ST_online_efficiency,HLDetectorTiming
The faults are not fully reproducible, but all 4 plugins together result in a failure rate of about 50%.
The lines below might hint at the cause of the crash. They were taken from 2018-01 monitoring launch ver20, run 41571 file 001.