art-framework-suite / art

The implementation of the art physics event processing framework
Other
2 stars 7 forks source link

art exception in one path suppressed by successful execution of another path #113

Closed knoepfel closed 2 years ago

knoepfel commented 3 years ago

This issue has been migrated from https://cdcvs.fnal.gov/redmine/issues/26339 (FNAL account required) Originally created by @rlcee on 2021-10-12 19:57:14


When running a mu2e exe with many paths, one path threw an exception, but instead of the exe exiting, the path was ended for the event and the other paths continued. This seems to be the FailPath behavior, but we do not know how this came about. Our settings should default to always rethrow. Kyle has the recipe to reproduce the effect.


Additional details

From Ray:

source /cvmfs/mu2e.opensciencegrid.org/setupmu2e-art.sh
muse setup /mu2e/app/users/rlc/db
mu2e --trace -s /mu2e/app/users/rlc/db/dts.tester.CeEndpoint.Test.1.art -c /mu2e/app/users/rlc/failpath.fcl >& temp.log

then look for the following pattern. The DbEngine start-up throws an exception, and you can see we don't get a finished message for that module, the indent changes, I think showing it is on a different path, etc

++++++++ finished for event: CaloClusterFast
++++++++ module for event: TTmakeSH
DbEngine::beginJob start
DbEngine::beginJob exit early, purpose=EMPTY
++++++++++ module for event: cprLowPSeedDeMEventPrescale
knoepfel commented 3 years ago

Comment by @knoepfel on 2021-10-18 14:12:41


Ray, is this the exception you expect to see?

%MSG-s ArtException:  PostEndJob 18-Oct-2021 09:12:21 CDT ModuleEndJob
---- EventProcessorFailure BEGIN
  EventProcessor: an exception occurred during current event processing
  ---- ScheduleExecutionFailure BEGIN
    Path: ProcessingStopped.
    ---- OtherArt BEGIN
      ---- DBHANDLE_NO_TID BEGIN
        DbHandle could not get TID (Table ID) from DbEngine for TrkAlignTracker at first use The above exception was thrown while processing module StrawHitReco/TTmakeSH run: 501 subRun: 0 event: 1
      ---- DBHANDLE_NO_TID END
    ---- OtherArt END
    Exception going through path cprLowPSeedDeM_trigger
  ---- ScheduleExecutionFailure END
---- EventProcessorFailure END
%MSG
Art has completed and will exit with status 1.
knoepfel commented 3 years ago

Comment by @rlcee on 2021-10-18 14:40:44


Yes, that is the exception which was not being properly rethrown

knoepfel commented 3 years ago

Comment by @knoepfel on 2021-10-18 14:48:11


Thanks, Ray. The exception appears to be suppressed whenever a path successfully completes after the path with the exception executes. A minimal reproducer is:

physics: {
  producers: {
    intProducer: {
      module_type: IntProducer
      ivalue: 4
    }
    failingProducer: {
      module_type: FailingProducer
    }
  }
  tp1: [intProducer]
  tp2: [failingProducer]
}

where intProducer just produces a number, and failingProducer always throws an exception. Whenever tp2 completes first, the exception raised by tp1 is suppressed. Whenever tp1 is executed first, the exception from tp2propagates to the top, ending the job with a return value of 1. The latter is the expected behavior, which should happen regardless of the path-execution order.

Investigating the cause.

knoepfel commented 3 years ago

Comment by @knoepfel on 2021-10-19 16:49:54


The problem is understood--the exception system was respecting an exception only if it was generated by the last path to be processed in the event (which is slightly different than I reported above). The following commit ensures that the exception is "sticky"--i.e. an exception generated by any path is rethrown by the framework unless an exception-handling action has been configured for the job.

Please let us know when you require a new version of art with this fix.

Implemented via https://github.com/art-framework-suite/art/commit/367881770cdc7a8712be6111c8378addf4293211.

knoepfel commented 3 years ago

Comment by @kutschke on 2021-10-19 18:34:30


Thanks Kyle. We will discuss our needs for art upgrades at the weekly meeting tomorrow and get back to you after that. Here is what I know so far.

The main reason to upgrade is to incorporate our request for a new root feature that is now in the branch v6-24-00-patches. The PR is: https://github.com/root-project/root/pull/9049 . Is this in the right place in the root repo structure for you to make a new root for us and build a new art based on that.

When you are ready to go ahead with that we would like you to incorporate two additional fixes, this one and art-framework-suite/fhicl-cpp#8

Do you have other work that is ready to include in a new art version?

I will get back to you tomorrow if it is just these 3 things or if there are additional requests.

knoepfel commented 3 years ago

@kutschke, a few items:

Once we understand the timescale of the ROOT solution better, we'll be able to determine when to upgrade to a new ROOT release. Stay tuned.