LDMX-Software / Framework

Event-by-event processing framework using CERN's ROOT and C++17

Some Exceptions cause seg-faults #50

Closed tomeichlersmith closed 11 months ago

tomeichlersmith commented 2 years ago

Exceptions thrown from certain functions in a processor are either ignored (as in onNewRun or beforeNewRun, because of the try-catch tree around them) or cause seg-faults (as in configure or onProcessStart). I'm not really sure why these exceptions cause seg-faults. I suspect the seg-faults are caused by the generation of the stack trace; however, I don't have time to investigate further right now.

tomeichlersmith commented 1 year ago

I'm personally getting more convinced that including a stack trace in our exceptions is not helpful and does not conform to good C++ coding practices (where exceptions should only be thrown in truly exceptional circumstances).

http://groups.di.unipi.it/~nids/docs/i_want_my_pony_or_why_you_cannot_have_cpp_exceptions_with_a_stack_trace.html

I'm leaning more towards simply removing all stack traces from exceptions and putting the expectation on the exception writer to write a thoughtful message explaining what the user needs to do to avoid the program crash.
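
A minimal sketch of what such a message-only exception could look like. The names `UserError` and `requireInRange` are hypothetical and are not part of Framework; the point is only that the message states the problem and the remedy instead of carrying a stack trace.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical exception type (not Framework's): the message carries both
// what went wrong and what the user can do about it, with no stack trace.
class UserError : public std::runtime_error {
 public:
  UserError(const std::string& problem, const std::string& remedy)
      : std::runtime_error(problem + "\n  To fix: " + remedy) {}
};

// Hypothetical configuration check that throws a UserError on bad input.
void requireInRange(const std::string& name, int value, int lo, int hi) {
  if (value < lo || value > hi) {
    throw UserError(
        "Parameter '" + name + "' is " + std::to_string(value) +
            ", outside the allowed range [" + std::to_string(lo) + ", " +
            std::to_string(hi) + "].",
        "set '" + name + "' to a value inside that range in your config.");
  }
}
```

Because `UserError` derives from `std::runtime_error`, any handler that catches `std::exception` can print the full message through `what()`.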

tomeichlersmith commented 1 year ago

Generally, there wasn't strong support for the stack trace we generate when there is a program-ending exception; however, we do wish to maintain some sort of system that warns us if there is an unintended program crash. With this in mind, I am going to try out different methods on a situation I know will cause a program crash (e.g. a misconfiguration) and see how it is handled.

Situations to Inspect

Methods to Check

tomeichlersmith commented 11 months ago

It seems like the simulation getting an exception from the GDMLParser triggers a seg fault. Hopefully, I can reproduce this and then use it as a testing ground for how to update our errors.

tomeichlersmith commented 11 months ago

I was able to reproduce the seg fault from within the GDMLParser. Hooray!

mkdir works
cd works
cp path/to/ldmx-sw/ldmx-det-v14/* .
ln path/to/ldmx-sw/install/data/fieldmaps/Bmap* .
cd ..
mkdir breaks
cd breaks
cp ../works/detector.gdml .

Then, running a special config that uses a relative path allows me to test by simply moving between the works and breaks directories.

cd works
ldmx fire ../basic.py
# normal single-event simulation printout
cd ../breaks
ldmx fire ../basic.py
# seg fault similar to what Lauren saw

basic.py

from LDMX.Framework import ldmxcfg
p = ldmxcfg.Process( "test" )
p.maxEvents = 1
p.logFrequency = 1
p.termLogLevel = 0
p.run = 9001
p.outputFiles = [ '/dev/null' ]
from LDMX.SimCore import simulator as sim
import LDMX.Ecal.EcalGeometry
import LDMX.Hcal.HcalGeometry
mySim = sim.simulator( "mySim" )
# use relative path to test failure mode
mySim.detector = 'detector.gdml'
from LDMX.SimCore import generators as gen
mySim.generators.append( gen.single_4gev_e_upstream_tagger() )
mySim.description = 'Basic test Simulation'
p.sequence.append( mySim )

tomeichlersmith commented 11 months ago

Sad News :cry: Removing the stack trace code does not resolve the seg fault; instead, it simply changes the path the stack trace shows. I fear that this seg fault specifically is coming from the GDMLParser itself and is not something we can affect.

tomeichlersmith commented 11 months ago

Doing some testing in a testbench...

I found that ROOT's serialization complains when we throw and catch our own exceptions while writing to an output ROOT TTree.

eichl008@framework-testbench:~$ fire config/exceptions.py produce                                                  
---- LDMXSW: Loading configuration --------
ROOT signal handler disabled
---- LDMXSW: Configuration load complete  --------
---- LDMXSW: Starting event processing --------
 [ Process ] 1 : Processing 1 Run 1 Event 1  (2023-09-28 10:49:49.185664000-0500)
 [ fire ] 4 : [TEST] : produce
  at /export/scratch/users/eichl008/ldmx/framework-testbench/Bench/src/Bench/Exceptions.cxx:39 in produce
Stack trace: 
    0 framework::exception::Exception::Exception(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 202
    1 bench::Exceptions::produce(framework::Event&) + 317 
    2 framework::Process::process(int, framework::Event&) const + 703 
    3 framework::Process::run() + 2072 
    4 main + 736 addr2line: 'fire': No such file

    5 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fd89bb11d90]
Error in <TClass::LoadClassInfo>: no interpreter information for class TVirtualStreamerInfo is available even though it has a TClass initialization routine.
malloc(): unaligned tcache chunk detected
Aborted (core dumped)

The last part about TClass is not printed if I don't write anything to an output TTree. As you can see in the printout, this is not generated by ROOT's signal handler, so I am unsure where it is being produced. I fear ROOT has some library offload functions that attempt to gracefully exit upon program completion, but since our libraries are offloaded before ROOT's, the class info is not available anymore and we get this error during offload. It's really annoying because in an error situation we don't care about a graceful exit; we just want to show the error and leave.

tomeichlersmith commented 11 months ago

We definitely should patch Framework to catch and pretty-print STL errors, since it currently just shows the core dump printout, which is kind of difficult to parse, especially for new users.

eichl008@framework-testbench:~$ fire config/exceptions.py produce                                                      
---- LDMXSW: Loading configuration --------
ROOT signal handler disabled
---- LDMXSW: Configuration load complete  --------
---- LDMXSW: Starting event processing --------
 [ Process ] 1 : Processing 1 Run 1 Event 1  (2023-09-28 10:56:59.095024000-0500)
terminate called after throwing an instance of 'std::runtime_error'
  what():  produce
Aborted (core dumped)

Interestingly though, this printout does not trigger the TClass issue observed earlier.
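
A minimal sketch of the kind of top-level handler suggested at the start of this comment, assuming the event loop can be wrapped in a single callable. `runGuarded`, its return codes, and the message wording are illustrative assumptions, not Framework's actual implementation.

```cpp
#include <exception>
#include <functional>
#include <iostream>
#include <ostream>
#include <stdexcept>

// Hypothetical helper: runs `body` and turns any escaping exception into a
// short printed message plus a non-zero exit code, instead of letting
// std::terminate abort the program with a core dump.
int runGuarded(const std::function<void()>& body, std::ostream& err) {
  try {
    body();
    return 0;  // normal completion
  } catch (const std::exception& e) {
    // Covers std::runtime_error and everything else derived from std::exception.
    err << "Unrecognized Exception: " << e.what() << '\n';
    return 1;
  } catch (...) {
    // Anything thrown that does not derive from std::exception.
    err << "Unknown exception of non-standard type\n";
    return 2;
  }
}
```

A `main` could then return `runGuarded(...)` directly, so the user sees the `what()` text rather than the `terminate called after throwing` printout.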

tomeichlersmith commented 11 months ago

Using branch 50-no-stack-traces, where I catch std::exception as well as our custom exception, we now see the same TClass error for both situations, again after the program prints the exception and exits.

eichl008@framework-testbench:~$ fire config/exceptions.py produce ldmx                                             
---- LDMXSW: Loading configuration --------
ROOT signal handler disabled
---- LDMXSW: Configuration load complete  --------
---- LDMXSW: Starting event processing --------
 [ Process ] 1 : Processing 1 Run 1 Event 1  (2023-09-28 11:24:15.201842000-0500)
 [ fire ] 4 : [TEST] : produce
  at /export/scratch/users/eichl008/ldmx/framework-testbench/Bench/src/Bench/Exceptions.cxx:47 in produce
Error in <TClass::LoadClassInfo>: no interpreter information for class TVirtualStreamerInfo is available even though it has a TClass initialization routine.
malloc(): unaligned tcache chunk detected
Aborted (core dumped)
eichl008@framework-testbench:~$ fire config/exceptions.py produce stl 
---- LDMXSW: Loading configuration --------
ROOT signal handler disabled
---- LDMXSW: Configuration load complete  --------
---- LDMXSW: Starting event processing --------
 [ Process ] 1 : Processing 1 Run 1 Event 1  (2023-09-28 11:24:18.762839000-0500)
Unrecognized Exception: produce
Error in <TClass::LoadClassInfo>: no interpreter information for class TVirtualStreamerInfo is available even though it has a TClass initialization routine.
malloc(): unaligned tcache chunk detected
Aborted (core dumped)

tomeichlersmith commented 11 months ago

If we Close() the output file before program exit, then this is not an issue; however, the current implementation of EventFile::close also attempts to write the run header(s) to the output file, which is a problem. The two will need to be separated so that writing the run header(s) is only attempted in the successful ending state.
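
One way to separate the two, sketched under assumptions: closing happens in a destructor so it runs on every exit path, while writing the run headers is an explicit step that is only reached on success. `OutputFile`, `FileState`, and `process` are hypothetical stand-ins that record what happened; they are not Framework's EventFile or ROOT's TFile.

```cpp
#include <exception>
#include <iostream>
#include <stdexcept>

// Records which actions were performed on the (pretend) output file.
struct FileState {
  bool run_headers_written = false;
  bool closed = false;
};

// Hypothetical stand-in for an output file wrapper.
class OutputFile {
 public:
  explicit OutputFile(FileState& state) : state_{state} {}
  OutputFile(const OutputFile&) = delete;
  OutputFile& operator=(const OutputFile&) = delete;
  // Closing is tied to object lifetime, so it also runs during stack unwinding.
  ~OutputFile() { close(); }

  // Only called explicitly, after processing finished without an exception.
  void writeRunHeaders() { state_.run_headers_written = true; }
  void close() noexcept { state_.closed = true; }

 private:
  FileState& state_;
};

// Pretend event loop: throws when `fail` is true, as a processor might.
FileState process(bool fail) {
  FileState state;
  try {
    OutputFile file{state};
    if (fail) throw std::runtime_error("produce");
    file.writeRunHeaders();  // success path only
  } catch (const std::exception& e) {
    // `file` has already been destroyed (and closed) by the time we get here.
    std::cerr << "Unrecognized Exception: " << e.what() << '\n';
  }
  return state;
}
```

In both the failing and the successful case the file ends up closed, but the run headers are written only when no exception was thrown.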

The below is from putting EventFile::close into its destructor and removing the part of EventFile::close that writes the RunHeader.

eichl008@framework-testbench:~$ fire config/exceptions.py produce ldmx
---- LDMXSW: Loading configuration --------
ROOT signal handler disabled
---- LDMXSW: Configuration load complete  --------
---- LDMXSW: Starting event processing --------
 [ Process ] 1 : Processing 1 Run 1 Event 1  (2023-09-28 11:46:47.339125000-0500)
 [ fire ] 4 : [TEST] : produce
  at /export/scratch/users/eichl008/ldmx/framework-testbench/Bench/src/Bench/Exceptions.cxx:47 in produce
eichl008@framework-testbench:~$ fire config/exceptions.py produce stl 
---- LDMXSW: Loading configuration --------
ROOT signal handler disabled
---- LDMXSW: Configuration load complete  --------
---- LDMXSW: Starting event processing --------
 [ Process ] 1 : Processing 1 Run 1 Event 1  (2023-09-28 11:46:52.460289000-0500)
Unrecognized Exception: produce

tomeichlersmith commented 11 months ago

I'm closing this due to #78 - while it didn't resolve what I wanted it to resolve, it did seem to patch all of the exception issues I was able to reproduce from within the framework testbench. I think this still leaves open questions (below), but I'm going to leave these to future issues which can hopefully better document what was being observed and how to replicate it (shame on you past Tom :angry: )

Open Questions