art-framework-suite / art

The implementation of the art physics event processing framework

art exits with segmentation fault (sigsegv) while claiming to exit with signal 1 when xrootd fails on secondary input #112

Closed knoepfel closed 2 years ago

knoepfel commented 2 years ago

This issue has been migrated from https://cdcvs.fnal.gov/redmine/issues/26250 (FNAL account required) Originally created by @brownd1978 on 2021-09-09 16:04:26


Mu2e observes that art exits with sigsegv when there is a problem accessing a secondary input stream file via xrootd. While the xrootd problem has nothing to do with art, the erroneous return code makes it difficult to distinguish IO problems from program bugs.

One can reproduce the problem using the SimJob MDC2020k musing, running /mu2e/app/users/brownd/MDC2020j/debug2.fcl on the mu2ebuild01 machine. Since the secondary input file is specified to be accessed via xrootd, and mu2ebuild01 doesn't have xrootd access, the execution fails as listed below. However, while art claims to exit with status 1, the process is actually terminated by sigsegv.

While the setup of running an xrootd job on mu2ebuild01 is artificial, we believe we are seeing the same effect in grid jobs when access to the secondary file via xrootd fails during the job (presumably due to a network issue). The symptom is that some jobs return status sigsegv, but run to completion when resubmitted or run interactively. We also searched for memory errors in the job as an alternate explanation for the irreproducible behavior but didn't see any.

Note that this job is running in multi-threaded mode, with 2 threads. Note too that this was observed running art v03_09_03, but that version wasn't an option in the pulldown menu.

> mu2e -c debug2.fcl --nevts 1
...
%MSG-s ArtException:  PostEndJob 09-Sep-2021 10:18:20 CDT ModuleEndJob
---- EventProcessorFailure BEGIN
  EventProcessor: an exception occurred during current event processing
  ---- ScheduleExecutionFailure BEGIN
    Path: ProcessingStopped.
    ---- FileOpenError BEGIN
      ---- FatalRootError BEGIN
        Fatal Root Error: TNetXNGFile::Open
        [FATAL] Auth failed
        ROOT severity: 3000
      ---- FatalRootError END
      Unable to open specified secondary event stream file root://fndca1.fnal.gov:1094/pnfs/fnal.gov/usr/mu2e/scratch/datasets/phy-sim/sim/mu2e/MuBeamCat/MDC2020k/art/87/c4/sim.mu2e.MuBeamCat.MDC2020k.001201_00000000.art.
      The above exception was thrown while processing module ResamplingMixer/beamResampler run: 1201 subRun: 349 event: 2
    ---- FileOpenError END
    Exception going through path earlyFlashPath
  ---- ScheduleExecutionFailure END
---- EventProcessorFailure END
---- EventProcessorFailure BEGIN
  EventProcessor: an exception occurred during current event processing
  ---- ScheduleExecutionFailure BEGIN
    Path: ProcessingStopped.
    ---- BADINPUT BEGIN
      Mu2eG4MT::endSubRun() Error: inconsistent simStage: 1 vs 0
      The above exception was thrown while processing module Mu2eG4MT/g4run run: 1201 subRun: 349
    ---- BADINPUT END
    Exception going through path flashPath
  ---- ScheduleExecutionFailure END
---- EventProcessorFailure END
%MSG
Art has completed and will exit with status 1.
Segmentation fault

To set up the environment (additional information from Dave):

$ ssh youraccount@mu2ebuild01
$ source /cvmfs/mu2e.opensciencegrid.org/setupmu2e_art.sh
$ setup mu2e
$ setup muse
$ muse setup SimJob MDC2020k
$ mu2e -c /mu2e/app/users/brownd/MDC2020j/debug2.fcl
knoepfel commented 2 years ago

Comment by @knoepfel on 2021-09-13 21:28:44


Dave, can you provide setup instructions for a Mu2e environment in which I can use the debug2.fcl file (muse commands, etc.)?

knoepfel commented 2 years ago

Comment by @knoepfel on 2021-09-14 21:01:37


The problem has been reproduced. Initial analysis indicates that, in the context of an exception, the framework incorrectly schedules the call to Mu2eG4MT::endRun, which is responsible for cleaning up G4 thread-local static data. More investigation is required.

Dave, do you know where I can locate the Offline source code that was used to generate the libraries loaded when running the job above?

knoepfel commented 2 years ago

Comment by @knoepfel on 2021-09-15 20:28:42


I am able to reproduce this issue with the attached trimmed FHiCL file, which configures art to use only 1 schedule and 1 thread.

knoepfel commented 2 years ago

Comment by @knoepfel on 2021-09-15 21:31:07


Contrary to my initial analysis, further investigation indicates this is not a problem with how art schedules the call to endRun. The problem is that G4's thread-local storage is not cleaned up appropriately when zero events have been processed by the Mu2eG4MT module. I do not yet have a solution to this problem, but will continue to look for options.

Because this is no longer an art issue, I am recategorizing this as a Support issue.

knoepfel commented 2 years ago

Comment by @kutschke on 2021-09-23 20:34:42


Dave: when you did this, were your kerberos tickets and voms proxy both still alive? See https://mu2ewiki.fnal.gov/wiki/DataTransfer#xrootd

I can reproduce the issue if I don't have a voms proxy but it goes away if I get a proxy.

( auto correct insists that voms is spelled moms or vows - it took 3 tries to convince it otherwise .... )

knoepfel commented 2 years ago

Comment by @goodenou on 2021-09-23 21:05:45


Thank you Rob for noticing the file access using xrootd. That was indeed the issue that was producing the exception shown above.

However, after switching to NFS for file access, and running the non-MT job, I do sometimes get a seg fault. So, we have an intermittent problem on our hands.

I will try running with the MT module, but I assume that the issue will be present there as well.

knoepfel commented 2 years ago

Comment by @kutschke on 2021-09-27 20:36:43


Dave replied to this off the ticket. I am adding his reply here. The bottom line is that I completely misunderstood: the problem is not the initial error. The problem is that art's return code is confusing. Kyle, did you understand that? I missed it. Dave's reply:

Hi Rob, yes, the problem accessing the file goes away with a voms proxy. Running without a proxy was just a way of precipitating an error in the secondary input stream. As noted in the ticket, it is artificial, but it isn't the source of the issue raised in the ticket, i.e. that the process return code isn't the art return code. I'm sure there are other ways to induce that error.
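
One way to make the mismatch between the printed status and the process status visible in job bookkeeping is sketched below. It is only an illustration, not part of art or of any Mu2e tool: the function names `describe_outcome` and `run_job` are hypothetical, the regular expression assumes the final log line has the exact wording shown in the log above, and the convention that a negative `returncode` means "killed by signal" is that of Python's `subprocess` module on POSIX systems.

```python
import re
import signal
import subprocess
import sys

# Final line that art prints; the wording is copied from the log in this ticket.
STATUS_LINE = re.compile(r"Art has completed and will exit with status (\d+)\.")


def describe_outcome(returncode, log_text):
    """Compare the status printed in the log with the status the OS reported."""
    match = STATUS_LINE.search(log_text)
    reported = int(match.group(1)) if match else None
    if returncode < 0:
        # On POSIX, subprocess reports death by signal N as returncode -N.
        name = signal.Signals(-returncode).name
        return f"killed by {name}; art reported status {reported}"
    return f"exited with status {returncode}; art reported status {reported}"


def run_job(command):
    """Run a job (hypothetical wrapper), echo its output, and describe how it ended."""
    proc = subprocess.run(
        command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    sys.stdout.write(proc.stdout)
    return describe_outcome(proc.returncode, proc.stdout)
```

A wrapper of this kind would be called as, for example, `run_job(["mu2e", "-c", "debug2.fcl", "--nevts", "1"])`. When the job is launched from a bash script instead, the same situation shows up as exit status 139 (128 plus signal number 11) rather than as a negative return code.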

knoepfel commented 2 years ago

Comment by @knoepfel on 2021-09-27 21:07:03


Yes, I understand the issue--art reports completing with status 1 due to an exception throw, and a segfault then occurs, resulting in a "status" that is different than 1.

The problem is that the segmentation violation is occurring at static destruction time--i.e. 'int main()' has already completed with a return value of 1, but because of inadequate cleanup of thread-local statics in G4 (in particular, the case where the Mu2eG4MT module processes no events but still sets up the G4 MT infrastructure), you get the segfault after the program (proper) has completed. There's not a whole lot we can do there in terms of the return code--art has already completed by that time. The thread-local statics come from G4 code (through Mu2eG4MT), which is why I changed this from a bug report to a support request. Of course, this is already a broken workflow (due to the exception), and the segfault-vs-return-code issue is primarily a bookkeeping problem--annoying, yes, but fixing this will not turn a broken workflow into a functional one.

In principle, any error during static destruction time can always conflict with the return code of 'int main()'. This is an argument for not printing out the return code at the end of a job and just relying on the actual return code of the executable. That's a discussion with the stakeholders, though.
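
The effect described above can be reproduced in miniature without art or G4. The sketch below is an analogy under stated assumptions, not the actual failure path: a Python `atexit` handler stands in for a C++ static destructor, the child script and the name `run_child` are hypothetical, and sending SIGSEGV to oneself with `os.kill` assumes a POSIX system. The child prints the same final line as art, asks to exit with status 1, and then dies during teardown, so the parent sees a signal rather than status 1.

```python
import signal
import subprocess
import sys

# Hypothetical stand-in for the failing job. The exit handler plays the role of
# a static destructor that crashes after the program proper has finished.
CHILD = """
import atexit, os, signal, sys
atexit.register(lambda: os.kill(os.getpid(), signal.SIGSEGV))
print("Art has completed and will exit with status 1.", flush=True)
sys.exit(1)
"""


def run_child():
    """Run the stand-in job and return (returncode, stdout)."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD], capture_output=True, text=True
    )
    return proc.returncode, proc.stdout.strip()


if __name__ == "__main__":
    code, out = run_child()
    print(out)
    if code < 0:
        # Negative return code: the child was terminated by a signal.
        print("parent saw signal:", signal.Signals(-code).name)
    else:
        print("parent saw exit status:", code)
```

The printed status line and the status the parent observes disagree, which is the same bookkeeping conflict discussed in this comment.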

Bottom line: the segfault is causing the problem and it should be fixed. The Mu2eG4MT module and its G4 connections need to be made robust against situations where an exception throw can lead to the module processing zero events.

goodenou commented 2 years ago

This problem has now been resolved in the Mu2e Offline code. See PR #700 (https://github.com/Mu2e/Offline/pull/700) for full details. At a very basic level, the problem occurred when a failure shut the job down before the G4MT code had called produce for the first time. In that case the main thread, rather than the Master Run Manager thread, calls the destructors for all of the geometry objects. Each G4PVPlacement object has some instance data that is accessible only from the Master Run Manager thread, not from the main thread.
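
The thread-ownership point can be illustrated by analogy. The sketch below is not Geant4 code and does not show how PR #700 fixes the problem. It only assumes that the instance data behaves like thread-local storage, and uses Python's `threading.local` with hypothetical names (`per_thread`, `placement_data`, `demo`) to show that data set up on one thread is simply not there when a different thread looks for it.

```python
import threading

# Analogy only: threading.local stands in for per-thread instance data.
per_thread = threading.local()


def demo():
    """Set data on a worker thread, then check who can see it."""
    results = {}

    def worker():
        # Hypothetical setup step done on the worker thread
        # (the role played by the Master Run Manager thread).
        per_thread.placement_data = {"volume": "world"}
        results["worker"] = hasattr(per_thread, "placement_data")

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()

    # The main thread has its own, empty view of the thread-local object.
    results["main"] = hasattr(per_thread, "placement_data")
    return results


if __name__ == "__main__":
    print(demo())
```

In Python the main thread just finds nothing. In C++, cleanup code running on the wrong thread can instead dereference storage that was never initialized for that thread, which is consistent with the crash at shutdown reported here.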

knoepfel commented 2 years ago

Thanks, @goodenou. We will then close this issue.

goodenou commented 2 years ago

Great, thanks!

On Feb 8, 2022, at 12:48 PM, Kyle Knoepfel @.***> wrote:

Thanks, @goodenou https://github.com/goodenou. We will then close this issue.
