LDMX-Software / Framework

Event-by-event processing framework using CERN's ROOT and C++17

Crash when processing a small number of events when passing a large list of inputFiles #57

Closed pbutti closed 12 months ago

pbutti commented 2 years ago

I've noticed that when running ldmx-sw in run mode, if one passes a small number of events but a large list of input files, this crash will be seen:

`#17 0x00007fd879ab58d2 in framework::Bus::everybodyOff (this=0x7ffc6e3eda58) at /Users/pbutti/sw/ldmx-sw/Framework/include/Framework/Bus.h:139`

`#18 0x00007fd879ab52e5 in framework::Event::onEndOfFile (this=0x7ffc6e3ed960) at /Users/pbutti/sw/ldmx-sw/Framework/src/Framework/Event.cxx:145`

`#19 0x00007fd879ae8938 in framework::Process::run (this=0x55714b991150) at /Users/pbutti/sw/ldmx-sw/Framework/src/Framework/Process.cxx:318`

`#20 0x000055714981d95f in main (argc=2, argv=0x7ffc6e3edde8) at /Users/pbutti/sw/ldmx-sw/Framework/src/Framework/fire.cxx:113`

The crash is reproducible, for example, by requesting 1000 events and passing a list of 10 files with 1000 events each. ldmx-sw keeps opening and closing all the other files in the list, and Bus calls clear on the passengers, which eventually crashes. I suppose the desired behaviour is to stop processing once the event limit is reached and to avoid opening and closing the rest of the files in the input list.

tomeichlersmith commented 2 years ago

In the Recon run mode branch of Process::run, there is a check for whether the processing has reached the event limit.

https://github.com/LDMX-Software/Framework/blob/638f0a1a010e85453c44a58d1de163933cb62959/src/Framework/Process.cxx#L262-L263

This check is only applied within each input file, though, and that is the bug that causes the remaining input files to be opened and immediately closed after the event limit is reached. What concerns me is that this eventually causes a seg-fault. The Bus::everybodyOff function clears the internal event object map, thereby deleting all event objects from memory, so I'm not sure how it is obtaining pointers that have already been deleted.

Adding a `break;` to the following blocks and moving these blocks to after the input/output files are closed should resolve this issue.

https://github.com/LDMX-Software/Framework/blob/638f0a1a010e85453c44a58d1de163933cb62959/src/Framework/Process.cxx#L294-L300
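To illustrate the difference between checking the limit only inside each file and also breaking out of the file loop, here is a minimal standalone sketch. It is not the Framework code: `RunSummary`, `runWithPerFileCheck`, and `runWithBreakAfterClose` are hypothetical names, input files are modelled as plain event counts, and treating a limit of zero or less as "no limit" is an assumption made for this example.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical summary of a run; not a Framework type.
struct RunSummary {
  std::size_t filesOpened = 0;
  int eventsProcessed = 0;
};

// Assumption for this sketch: a limit <= 0 means "no limit".
static bool limitReached(int eventsProcessed, int eventLimit) {
  return eventLimit > 0 && eventsProcessed >= eventLimit;
}

// Models the reported behaviour: the limit is checked only inside the
// per-file event loop, so every remaining file is still opened and closed.
RunSummary runWithPerFileCheck(const std::vector<int>& eventsPerFile,
                               int eventLimit) {
  RunSummary summary;
  for (int nEvents : eventsPerFile) {
    ++summary.filesOpened;  // stands in for opening the input file
    for (int i = 0; i < nEvents; ++i) {
      if (limitReached(summary.eventsProcessed, eventLimit)) break;
      ++summary.eventsProcessed;
    }
    // stands in for closing the file; the file loop then continues
  }
  return summary;
}

// Models the proposed change: after the file is closed, check the limit
// again and leave the file loop.
RunSummary runWithBreakAfterClose(const std::vector<int>& eventsPerFile,
                                  int eventLimit) {
  RunSummary summary;
  for (int nEvents : eventsPerFile) {
    ++summary.filesOpened;
    for (int i = 0; i < nEvents; ++i) {
      if (limitReached(summary.eventsProcessed, eventLimit)) break;
      ++summary.eventsProcessed;
    }
    // file is closed here, then the limit is re-checked
    if (limitReached(summary.eventsProcessed, eventLimit)) break;
  }
  return summary;
}
```

With ten files of 1000 events each and a limit of 1000, the first function touches all ten files while the second stops after the first one; both process 1000 events.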

I am working on writing a small testing area for this so I can validate that this works.

tomeichlersmith commented 2 years ago

I am not able to reproduce the crash; however, I am able to observe the open/close of the files after the event limit is reached. I will try implementing the fix and then see if that resolves the crashing.