JeffersonLab / halld_recon

Reconstruction for the GlueX Detector

danarest plugin crashes on os8 with memory error #635

Closed: nsjarvis closed this issue 1 year ago

nsjarvis commented 2 years ago

The files necessary to reproduce this are in /work/halld/njarvis/bug_danarest; use the script run.sh in that directory. It works on the ifarm (RHEL7) but not on jlabl5 (RHEL8).

The screen output from jlabl5 is in /work/halld/njarvis/bug_danarest/os8.out.

Key excerpts from the error messages are:

double free or corruption (fasttop)   
./run.sh: line 5: 627067 Aborted                 (core dumped) hd_root etapipi-100.hddm -PPLUGINS=danarest -PNTHREADS=8
The lines below might hint at the cause of the crash.
You may get help by asking at the ROOT forum https://root.cern.ch/forum
Only if you are really convinced it is a bug in ROOT then please submit a
report at https://root.cern.ch/bugs Please post the ENTIRE stack trace
from above as an attachment in addition to anything else
that might help us fixing this issue.
===========================================================
#5  DEventSourceHDDM::Extract_DMCReaction (this=0x7fc4dc001a40, record=0x7fc4dcb02290, factory=0x7fc4b84f2690, tag=..., loop=<optimized out>) at /u/group/halld/Software/builds/Linux_RHEL8-x86_64-gcc8.5.0/halld_recon/halld_recon-4.35.0/Linux_RHEL8-x86_64-gcc8.5.0/include/HDDM/hddm_s.hpp:9251
#6  0x0000000001083329 in DEventSourceHDDM::GetObjects (this=0x7fc4dc001a40, event=..., factory=0x7fc4b84f2690) at /usr/include/c++/8/bits/basic_string.h:940
#7  0x000000000081f49b in jana::JEvent::GetObjects<DMCReaction> (this=this@entry=0x7fc4b8000b68, t=std::vector of length 0, capacity 0, factory=factory@entry=0x7fc4b84f2690) at /u/group/halld/Software/builds/Linux_RHEL8-x86_64-gcc8.5.0/jana/jana_0.8.2^ccdb168/Linux_RHEL8-x86_64-gcc8.5.0/include/JANA/JEvent.h:84
#8  0x000000000081f899 in jana::JEventLoop::GetFromSource<DMCReaction> (factory=0x7fc4b84f2690, t=std::vector of length 0, capacity 0, this=0x7fc4b8000b60) at /u/group/halld/Software/builds/Linux_RHEL8-x86_64-gcc8.5.0/jana/jana_0.8.2^ccdb168/Linux_RHEL8-x86_64-gcc8.5.0/include/JANA/JEventLoop.h:463
#9  jana::JEventLoop::GetFromFactory<DMCReaction> (this=this@entry=0x7fc4b8000b60, t=std::vector of length 0, capacity 0, tag=tag@entry=0x1203c36 "", data_source=@0x7fc4bfff7280: 4257240736, allow_deftag=allow_deftag@entry=true) at /u/group/halld/Software/builds/Linux_RHEL8-x86_64-gcc8.5.0/jana/jana_0.8.2^ccdb168/Linux_RHEL8-x86_64-gcc8.5.0/include/JANA/JEventLoop.h:427
#10 0x000000000081fa6e in jana::JEventLoop::Get<DMCReaction> (this=this@entry=0x7fc4b8000b60, t=std::vector of length 0, capacity 0, tag=tag@entry=0x1203c36 "", allow_deftag=allow_deftag@entry=true) at /usr/include/c++/8/ext/new_allocator.h:86
#11 0x000000000116d5df in DEventWriterREST::Write_RESTEvent (this=this@entry=0x7fc4b857f300, locEventLoop=locEventLoop@entry=0x7fc4b8000b60, locOutputFileNameSubString="") at libraries/HDDM/DEventWriterREST.cc:70
#12 0x00007fc4e41887b8 in JEventProcessor_danarest::evnt (this=<optimized out>, locEventLoop=0x7fc4b8000b60, eventnumber=<optimized out>) at /usr/include/c++/8/ext/new_allocator.h:79
#13 0x00000000011a742f in jana::JEventLoop::OneEvent (this=0x7fc4b8000b60) at src/JANA/JEventLoop.cc:693
#14 0x00000000011a7a44 in jana::JEventLoop::Loop (this=this@entry=0x7fc4b8000b60) at src/JANA/JEventLoop.cc:496
#15 0x0000000001184bf9 in LaunchThread (arg=0x7ffc19aaa110) at src/JANA/JApplication.cc:1382
#16 0x00007fc4fd0c017a in start_thread () from /lib64/libpthread.so.0
#17 0x00007fc4fcdefdc3 in clone () from /lib64/libc.so.6
aaust commented 2 years ago

I can confirm your observation. I do not recommend running with 8 threads on jlabl5.

---- edit: found a mistake on my part

nsjarvis commented 2 years ago

Did it run correctly single-threaded? The idea seems familiar but I cannot remember for certain, it was so long ago.

aaust commented 1 year ago

Valgrind does not tell me much more:

==667513== 1 errors in context 4 of 127:
==667513== Invalid read of size 4
==667513==    at 0x1104721: DEventSourceHDDM::Extract_DMCReaction(hddm_s::HDDM*, jana::JFactory<DMCReaction>*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, jana::JEventLoop*) (in /home/aaustreg/Software/gluex_top/halld_recon/halld_recon-4.39.0/Linux_RHEL8-x86_64-gcc8.5.0/bin/hd_root)
==667513==    by 0x1105959: DEventSourceHDDM::GetObjects(jana::JEvent&, jana::JFactory_base*) (in /home/aaustreg/Software/gluex_top/halld_recon/halld_recon-4.39.0/Linux_RHEL8-x86_64-gcc8.5.0/bin/hd_root)
==667513==    by 0x8024BA: jerror_t jana::JEvent::GetObjects<DMCReaction>(std::vector<DMCReaction const*, std::allocator<DMCReaction const*> >&, jana::JFactory_base*) (in /home/aaustreg/Software/gluex_top/halld_recon/halld_recon-4.39.0/Linux_RHEL8-x86_64-gcc8.5.0/bin/hd_root)
==667513==    by 0x8028B8: jana::JFactory<DMCReaction>* jana::JEventLoop::GetFromFactory<DMCReaction>(std::vector<DMCReaction const*, std::allocator<DMCReaction const*> >&, char const*, jana::JEventLoop::data_source_t&, bool) (in /home/aaustreg/Software/gluex_top/halld_recon/halld_recon-4.39.0/Linux_RHEL8-x86_64-gcc8.5.0/bin/hd_root)
==667513==    by 0x802A88: jana::JFactory<DMCReaction>* jana::JEventLoop::Get<DMCReaction>(std::vector<DMCReaction const*, std::allocator<DMCReaction const*> >&, char const*, bool) (in /home/aaustreg/Software/gluex_top/halld_recon/halld_recon-4.39.0/Linux_RHEL8-x86_64-gcc8.5.0/bin/hd_root)
==667513==    by 0x1258702: DEventWriterREST::Write_RESTEvent(jana::JEventLoop*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) const (in /home/aaustreg/Software/gluex_top/halld_recon/halld_recon-4.39.0/Linux_RHEL8-x86_64-gcc8.5.0/bin/hd_root)
==667513==    by 0x27E36356: JEventProcessor_danarest::evnt(jana::JEventLoop*, unsigned long) (in /home/aaustreg/Software/gluex_top/halld_recon/halld_recon-4.39.0/Linux_RHEL8-x86_64-gcc8.5.0/plugins/danarest.so)
==667513==    by 0x128C70E: jana::JEventLoop::OneEvent() (JEventLoop.cc:693)
==667513==    by 0x128CD23: jana::JEventLoop::Loop() (JEventLoop.cc:496)
==667513==    by 0x1269A48: LaunchThread(void*) (JApplication.cc:1382)
==667513==    by 0xD1821C9: start_thread (in /usr/lib64/libpthread-2.28.so)
==667513==    by 0xDF06E72: clone (in /usr/lib64/libc-2.28.so)
==667513==  Address 0x1c is not stack'd, malloc'd or (recently) free'd
aaust commented 1 year ago

This is a hint: /include/HDDM/hddm_s.hpp:9251

This looks up the properties of the target:

inline Properties &Target::getProperties() {
   return m_properties_link.front();
}

but this element, defined in the template as <properties charge="int" mass="float"/>, is not present in the input file.
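
With <properties> missing, m_properties_link is empty and front() reads memory that was never allocated. A defensive variant of the accessor might look like the fragment below; this is a sketch of the idea, not actual hddm-cpp output, and mirrors the generated code quoted above (a real build would also need the <stdexcept> header):

// Hypothetical defensive accessor (sketch, not actual hddm-cpp output):
// refuse to dereference the link list when the input record carried no
// <properties> element, instead of reading unallocated memory.
inline Properties &Target::getProperties() {
   if (m_properties_link.empty())
      throw std::runtime_error("Target::getProperties: no <properties> element");
   return m_properties_link.front();
}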

aaust commented 1 year ago

And indeed, skipping over this line prevents the crash: https://github.com/JeffersonLab/halld_recon/blob/master/src/libraries/HDDM/DEventSourceHDDM.cc#L1159

aaust commented 1 year ago

Apparently, calling std::vector::front on an empty container causes undefined behavior. Since this is used everywhere in hddm-cpp, I want to hand this problem off to @rjones30.
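
To see the failure mode in isolation (plain standard C++, independent of hddm-cpp): front() performs no emptiness check, so the only safe idiom is to test before dereferencing, as in this minimal standalone example:

#include <iostream>
#include <vector>

int main() {
    std::vector<int> v;  // empty container, like the missing <properties> link list
    // v.front() here would be undefined behavior: no bounds check is done,
    // so it may crash outright or silently return garbage.
    if (!v.empty())
        std::cout << v.front() << '\n';
    else
        std::cout << "empty: nothing to read\n";
    return 0;
}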

rjones30 commented 1 year ago

Alex, yes, calling vector::front on an empty container causes unallocated memory access, which is a bug. Looking at it now. -Richard Jones


rjones30 commented 1 year ago

Hello Alex and all,

The <properties> tag under <target> is not optional in hddm_s. You don't have to specify a <target>, but if you do then you must give it <properties>. This is clear from the hddm template shown below: if a tag is optional, it carries a minOccurs="0" attribute to indicate that you can leave it out without breaking the model. Please mark this issue as resolved. I will check whether a corresponding issue needs to be opened for hdgeant or hdgeant4 to make sure their outputs are compliant. -Richard Jones

<target minOccurs="0" type="Particle_t">
    <momentum E="float" px="float" py="float" pz="float"/>
    <polarization minOccurs="0" Px="float" Py="float" Pz="float"/>
    <properties charge="int" mass="float"/>
</target>
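
For illustration, a compliant <target> record in an input file carries the required child elements; the values below are invented (a proton target at rest) and only show the required shape, with the optional <polarization> legitimately omitted:

<target type="Proton">
    <momentum E="0.938272" px="0.0" py="0.0" pz="0.0"/>
    <properties charge="1" mass="0.938272"/>
</target>
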
aaust commented 1 year ago

Ok, that makes sense. I can also confirm that the generator gen_amp and its derivatives fill the properties field for the target.

@nsjarvis Where did the input file (etapipi-100.hddm) come from? Can you please make sure it is recent?

rjones30 commented 1 year ago

Nevertheless, we can make our lives easier by making the reader more robust against inadvertent omissions like this one, which make little difference to the analysis. I have posted a PR named fix_hddm_read_crashes_rtj that I think should solve this issue. Please test it and, if it works, approve. -Richard

nsjarvis commented 1 year ago

Thank you for fixing this.

nsjarvis commented 1 year ago

PS: I checked with a real data file (hd_rawdata_073070_001.evio) and it runs over that with no problem.