SBNSoftware / sbndcode


Mismatch between `EventAuxiliary` and `sim::AuxDetHits` killing some jobs #462

Closed bear-is-asleep closed 2 months ago

bear-is-asleep commented 2 months ago

TL;DR

I've noticed some jobs are failing due to a length mismatch between two data products related to the CRT info from GEANT4. I believe this is an upstream issue in the CRT simulation, and any job that hits this error is killed. It happened in only one job out of a grid submission of 100 jobs x 20 events/job, so it's quite rare.

The error

%MSG-s ArtException: PostEndJob 26-Apr-2024 13:21:22 CDT ModuleEndJob
---- EventProcessorFailure BEGIN
  EventProcessor: an exception occurred during current event processing
  ---- EventProcessorFailure BEGIN
    EndPathExecutor: an exception occurred during current event processing
    ---- ScheduleExecutionFailure BEGIN
      Path: ProcessingStopped.
      ---- StdException BEGIN
        An exception was thrown while processing module LArSoftSuperaDriver/superampvmpr run: 1 subRun: 13 event: 9
        std::exception
      ---- StdException END
      Exception going through path end_path
    ---- ScheduleExecutionFailure END
  ---- EventProcessorFailure END
---- EventProcessorFailure END
---- FatalRootError BEGIN
  Fatal Root Error: TTree::SetEntries
  Tree branches have different numbers of entries, eg EventAuxiliary has 8 entries while sim::AuxDetHits_largeant_LArG4DetectorServicevolAuxDetSensitiveCRTStripBERN_G4. has 20 entries.
  ROOT severity: 2000
---- FatalRootError END
%MSG
Art has completed and will exit with status 1.

The file

/exp/sbnd/data/users/brindenc/ML/test_fcl/debug_aux/prodmpvmpr_sbnd_MPVMPR-20240424T172954_G4-20240424T173650_DetSim-20240424T174807_f88068e1-3e18-4b41-9d66-4568851d60d5.root
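
For reference, a quick way to confirm the mismatch is to dump the per-branch entry counts of the output file. Below is a minimal sketch of a ROOT macro, assuming the standard art `Events` tree; the macro name and usage line are just placeholders, not anything shipped with sbndcode:

```cpp
// check_entries.C -- minimal sketch for inspecting per-branch entry counts
// in an art output file (assumes the standard "Events" tree).
// Usage: root -l -b -q 'check_entries.C("yourfile.root")'
#include "TFile.h"
#include "TTree.h"
#include "TBranch.h"
#include <iostream>

void check_entries(const char* path)
{
  TFile f(path, "READ");
  auto* events = dynamic_cast<TTree*>(f.Get("Events"));
  if (!events) { std::cerr << "No Events tree in " << path << "\n"; return; }

  const Long64_t treeEntries = events->GetEntries();
  std::cout << "Events tree reports " << treeEntries << " entries\n";

  // Flag any branch whose own entry count disagrees with the tree-level
  // count -- the condition ROOT complains about in TTree::SetEntries.
  TObjArray* branches = events->GetListOfBranches();
  for (Int_t i = 0; i < branches->GetEntriesFast(); ++i) {
    auto* br = static_cast<TBranch*>(branches->At(i));
    if (br->GetEntries() != treeEntries) {
      std::cout << "MISMATCH: " << br->GetName() << " has "
                << br->GetEntries() << " entries\n";
    }
  }
}
```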
jzettle commented 2 months ago

I have been trying this with the most recent icaruscode version, to test things for running detector systematics, and I am seeing a failure rate of roughly 50% or greater with a similar error at the end of the message:

I am running neutrino-only events. For the gen stage the fcl is simulation_genie_icarus_bnb_volDetEnclosure.fcl, followed by the filter module filter_genie_active_icarus.fcl, so that GENIE runs in both ICARUS cryostats; for the g4 stage it is larg4_icarus_cosmics_sce_2d_drift.fcl. With this configuration I see the failure more frequently than @bear-is-asleep reported. I am also now trying nu+cosmics with 15 events/job as a comparison.

%MSG-s ArtException:  PostEndJob 27-Apr-2024 17:30:08 UTC ModuleEndJob
---- EventProcessorFailure BEGIN
  EventProcessor: an exception occurred during current event processing
  ---- EventProcessorFailure BEGIN
    EndPathExecutor: an exception occurred during current event processing
    ---- ScheduleExecutionFailure BEGIN
      Path: ProcessingStopped.
      ---- StdException BEGIN
        An exception was thrown while processing module LArSoftSuperaDriver/superaNu run: 1 subRun: 1 event: 14
        std::exception
      ---- StdException END
      Exception going through path end_path
    ---- ScheduleExecutionFailure END
  ---- EventProcessorFailure END
---- EventProcessorFailure END
---- FatalRootError BEGIN
  Fatal Root Error: TTree::SetEntries
  Tree branches have different numbers of entries, eg EventAuxiliary has 6 entries while recob::Hits_gaushit1dTPCWE__MCstage0. has 10 entries.
  ROOT severity: 2000
---- FatalRootError END
%MSG
Art has completed and will exit with status 1.
jzettle commented 2 months ago

I have also been communicating with @yeonjaej on this from the ICARUS ML side of things to try to understand it better.

jzettle commented 2 months ago

I am not an expert on Supera, but I think the length mismatch itself is a red herring. Different data products report the mismatch from job to job, and the counts seem to track how many events were processed successfully before Supera throws an error and the job exits, compared with how many events were in the file initially. I have given all the information I have to the ICARUS ML group, as they will know better than I do how to check this.
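
As a sanity check on that picture, the standalone sketch below (plain ROOT, nothing to do with the actual art/Supera code path) shows that simply filling two branches a different number of times and then asking ROOT to reconcile the counts produces the same "different numbers of entries" complaint, here as an ordinary ROOT error rather than art's FatalRootError. The branch names are invented for illustration:

```cpp
// mismatch_demo.C -- illustrative sketch only, not the art code path:
// filling two branches unevenly and reconciling with SetEntries(-1)
// triggers the same "different numbers of entries" message.
#include "TFile.h"
#include "TTree.h"
#include "TBranch.h"

void mismatch_demo()
{
  TFile f("demo.root", "RECREATE");
  TTree t("Events", "demo");
  int aux = 0, prod = 0;
  TBranch* bAux  = t.Branch("EventAuxiliaryLike", &aux,  "aux/I");
  TBranch* bProd = t.Branch("ProductLike",        &prod, "prod/I");

  // The "product" branch gets all 20 entries...
  for (int i = 0; i < 20; ++i) bProd->Fill();
  // ...but the "EventAuxiliary"-like branch stops after 8, as if the
  // job had aborted partway through writing its event records.
  for (int i = 0; i < 8; ++i) bAux->Fill();

  // Asking ROOT to reconcile the per-branch counts emits the error
  // quoted in the job logs above.
  t.SetEntries(-1);
  t.Write();
  f.Close();
}
```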

jzettle commented 2 months ago

Apologies for a bit of spam on this, but for more information I ran a larger-statistics neutrino-only test: out of 100 jobs with 50 events each, 87 failed with this issue, at what seem like various points in the running (i.e., not always on the first event, and 13 jobs completed all events). So this looks like something within Supera that doesn't like certain events, since Supera is the process that throws the error and exits. I also ran neutrino+cosmics and see the same issue there: with 15 events/file, 6 of 15 jobs exit with this error.

bear-is-asleep commented 2 months ago

Thanks @jzettle for helping dig in! Perhaps this originates in how we parse the data within larcv. I can take a look and try to identify the area where the error occurs.

bear-is-asleep commented 2 months ago

I'm going to close this since it's not related to the two data products in question.