JeffersonLab / hps-java

HPS reconstruction and analysis framework in Java

Catch "Matrix is singular" Runtime Exception #243

Closed (normangraf closed 6 years ago)

normangraf commented 7 years ago

The "Matrix is singular" runtime exception is causing the reconstruction to abort. It needs to be caught and handled per event, not per run partition file.
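A minimal sketch of what per-event handling could look like. The `EventProcessor` interface and `processAll` method are hypothetical stand-ins for the reconstruction chain, not hps-java or lcsim API. A `RuntimeException` thrown on one event is logged and recorded, and the loop carries on with the next event:

```java
import java.util.ArrayList;
import java.util.List;

public class PerEventCatch {

    /** Hypothetical stand-in for the per-event reconstruction step. */
    interface EventProcessor {
        void process(long eventNumber);
    }

    /**
     * Runs the processor over every event and returns the numbers of the
     * events that threw, instead of letting one failure abort the whole file.
     */
    static List<Long> processAll(long[] events, EventProcessor processor) {
        List<Long> failed = new ArrayList<>();
        for (long event : events) {
            try {
                processor.process(event);
            } catch (RuntimeException e) {
                // Log and move on; only this event is lost.
                System.err.println("Skipping event " + event + ": " + e.getMessage());
                failed.add(event);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        long[] events = {79267113L, 79267165L, 79267189L};
        // Simulated failure on the event reported in this issue.
        List<Long> failed = processAll(events, event -> {
            if (event == 79267165L) {
                throw new RuntimeException("Matrix is singular");
            }
        });
        System.out.println("Failed events: " + failed);
    }
}
```

Catching `RuntimeException` broadly is only for illustration; a real fix would probably catch the narrowest exception type that the fitter actually throws.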

normangraf commented 7 years ago

This runtime exception seems always to be preceded by a warning message from WTrack, namely:

WTrack: this track started to go backwards?! params [WTrack params [NaN, NaN, NaN, 0.02215501332857945, NaN, NaN, NaN, ]

We should simply kill this track after issuing the warning, and ultimately resolve the root cause. It is perhaps due to looser cuts on strategies picking up loopers.
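One way to detect such a track is to test its parameters for NaN before using them. The sketch below is an assumption about how that check might look: `hasInvalidParams` is a hypothetical helper, and the parameters are taken as a plain `double[]` rather than through the real WTrack class. The "good" values are made up for illustration.

```java
public class TrackParamCheck {

    /** Returns true if any parameter is NaN or infinite (hypothetical helper). */
    static boolean hasInvalidParams(double[] params) {
        for (double p : params) {
            if (Double.isNaN(p) || Double.isInfinite(p)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Pattern copied from the WTrack warning above: all NaN except one entry.
        double[] bad = {Double.NaN, Double.NaN, Double.NaN, 0.02215501332857945,
                        Double.NaN, Double.NaN, Double.NaN};
        // Illustrative finite values, not taken from any real track.
        double[] good = {0.1, 0.2, 0.3, 0.02215501332857945, 1.0, 2.0, 3.0};

        System.out.println("bad track invalid?  " + hasInvalidParams(bad));
        System.out.println("good track invalid? " + hasInvalidParams(good));
    }
}
```

A track flagged this way could be dropped right after the warning is issued, so it never reaches the matrix inversion.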

normangraf commented 7 years ago

Should check whether the WTrack warning ever leads to a successful fit to make sure we don't lose good tracks.

normangraf commented 7 years ago

I have found an event on which this Exception is thrown. I then skimmed this event and the Exception is NOT thrown on this single event. So I skimmed a few extra events both before and after. What is most curious is that the behavior of the reconstruction depends on how many events I process prior to getting to this event! A file containing 10 events can be found online at: http://www.lcsim.org/test/hps-java/problemFiles/matrixSingular_5772_10events.evio

Here is my command (running with the latest git master snapshot):

java -cp ~/git/hps-java/distribution/target/hps-distribution-4.0-SNAPSHOT-bin.jar org.hps.evio.EvioToLcio -r -x /org/hps/steering/recon/EngineeringRun2015FullRecon.lcsim -d HPS-EngRun2015-Nominal-v6-0-fieldmap -D outputFile=tmp matrixSingular_5772_10events.evio -e 1

This command results in the "Matrix is singular" Exception being thrown on event 79267165.

FullReco_skip0.txt

If I skip two events, viz.

java -cp ~/git/hps-java/distribution/target/hps-distribution-4.0-SNAPSHOT-bin.jar org.hps.evio.EvioToLcio -r -x /org/hps/steering/recon/EngineeringRun2015FullRecon.lcsim -d HPS-EngRun2015-Nominal-v6-0-fieldmap -D outputFile=tmp matrixSingular_5772_10events.evio -e 1 -s 2

The event is processed just fine and the command runs to completion.

FullReco_skip2.txt

I have no idea what is going on.

I would appreciate it if others could download the file and see if this is reproducible.

normangraf commented 7 years ago

I have modified GBLRefitterDriver and MakeGblTracks to simply skip tracks for which the refit would fail.

I have run this over the file mentioned above and it successfully processed event 79267165, viz:

[INFO] [org.lcsim.job.EventPrintLoopAdapter] event: 79267047; time: 1431858526254520128; seq: 0
[INFO] [org.lcsim.job.EventPrintLoopAdapter] event: 79267060; time: 1431858526255245688; seq: 1
[INFO] [org.lcsim.job.EventPrintLoopAdapter] event: 79267075; time: 1431858526256018160; seq: 2
[INFO] [org.lcsim.job.EventPrintLoopAdapter] event: 79267081; time: 1431858526256800012; seq: 3
[INFO] [org.lcsim.job.EventPrintLoopAdapter] event: 79267098; time: 1431858526257510888; seq: 4
[INFO] [org.lcsim.job.EventPrintLoopAdapter] event: 79267113; time: 1431858526258217696; seq: 5
[INFO] [org.lcsim.job.EventPrintLoopAdapter] event: 79267165; time: 1431858526261200552; seq: 6
WTrack: this track started to go backwards?! params [WTrack params [NaN, NaN, NaN, 0.023117064077144367, NaN, NaN, NaN, ] with corresponding HelicalTrackFit: HelicalTrackFit: d0= 106.85803180556117 phi0= -0.8142691936809715 curvature: -0.003113146091491068 z0= 1.0396211176748624 tanLambda= -0.020735453823318976 ]
Can't find track intercept; aborting Track refit
[INFO] [org.lcsim.job.EventPrintLoopAdapter] event: 79267189; time: 1431858526262687376; seq: 7
[INFO] [org.lcsim.job.EventPrintLoopAdapter] event: 79267220; time: 1431858526264155672; seq: 8
[INFO] [org.lcsim.job.EventPrintLoopAdapter] event: 79267297; time: 1431858526268618396; seq: 9
[INFO] [org.hps.evio] Last physics event time: 1431858526 - Sun May 17 03:28:46 PDT 2015
EventFlagFilter Summary: events processed = 10 events passed = 9 rejection = 0.9
[INFO] [org.hps.evio] Job finished successfully!

normangraf commented 7 years ago

I have successfully run the EngRun2015*ReconTest integration tests.

normangraf commented 7 years ago

I am running over the 48 unblinded evio partitions from run 5772. This may take a while to complete.

normangraf commented 7 years ago

Resolved and merged with pull request 244.

mdiamon commented 6 years ago

Current theory: the error is a result of a TrackUtils.getHelixPlaneIntercept() failure (https://github.com/JeffersonLab/hps-java/pull/244/files#diff-9a345604cc2a44f1bc20ebbd00f53ec7L250) -- more precisely, a failure in the WTrack method getHelixAndPlaneIntercept() https://github.com/JeffersonLab/hps-java/blob/master/tracking/src/main/java/org/hps/recon/tracking/WTrack.java#L269 which TrackUtils.getHelixPlaneIntercept() calls.

This TrackUtils.getHelixPlaneIntercept method is called in several places in the code. One is in MultipleScattering.java, where a cryptic little piece of code skips the call under certain conditions (presumably because the method would fail):

// TODO Catch special cases where the incidental iteration procedure seems to fail
if (Math.abs(helix.R()) < 2000 && Math.abs(helix.dca()) > 10.0)

https://github.com/JeffersonLab/hps-java/blob/master/tracking/src/main/java/org/hps/recon/tracking/MultipleScattering.java#L328

This guard doesn't exist at the other call sites, which is probably where the error happens.
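To make the guard reusable at every call site, it could be factored into a single predicate. This is only a sketch: `InterceptGuard` and `shouldSkipIntercept` are hypothetical names, the thresholds are copied from the MultipleScattering.java condition quoted above, and the helix is reduced to two plain doubles. The example feeds in the d0 and curvature printed in the log earlier in this thread, assuming that `dca()` corresponds to d0 and that `R()` is 1/curvature; both are assumptions, not something verified against the code.

```java
public class InterceptGuard {

    // Thresholds copied from the condition in MultipleScattering.java.
    static final double MAX_RADIUS = 2000.0;
    static final double MIN_DCA = 10.0;

    /**
     * Returns true when the helix falls in the region where the intercept
     * iteration is reported to fail, so the caller should skip the call.
     */
    static boolean shouldSkipIntercept(double radius, double dca) {
        return Math.abs(radius) < MAX_RADIUS && Math.abs(dca) > MIN_DCA;
    }

    public static void main(String[] args) {
        // Values from the HelicalTrackFit printed in the log above.
        double d0 = 106.85803180556117;
        double curvature = -0.003113146091491068;
        // Assumption: radius is the inverse of the curvature.
        double radius = 1.0 / curvature;

        System.out.println("skip intercept? " + shouldSkipIntercept(radius, d0));
    }
}
```

Under those assumptions the problem track from event 79267165 falls inside the guarded region, which fits the theory that the unguarded call sites are the ones that fail.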

Re-opening this issue to properly deal with the root cause. Options:

  1. (Preferable) Alter the TrackUtils.getHelixPlaneIntercept method to make it more robust, so that it doesn't fail in the first place.
  2. (Backup plan) Implement a self-consistent strategy, across all places where TrackUtils.getHelixPlaneIntercept is called, to catch the failures.
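
For option 2, one possible shape is a single wrapper that turns every kind of failure (an exception, a null result, or NaN coordinates) into an empty `Optional`, so each caller handles failure the same way. Everything here is hypothetical: `SafeIntercept` and `tryIntercept` are illustrative names, and a `Supplier<double[]>` stands in for the real TrackUtils.getHelixPlaneIntercept call.

```java
import java.util.Optional;
import java.util.function.Supplier;

public class SafeIntercept {

    /**
     * Evaluates the raw intercept computation and returns the point only if
     * it completed without throwing and every coordinate is finite.
     */
    static Optional<double[]> tryIntercept(Supplier<double[]> rawIntercept) {
        double[] point;
        try {
            point = rawIntercept.get();
        } catch (RuntimeException e) {
            return Optional.empty();
        }
        if (point == null) {
            return Optional.empty();
        }
        for (double c : point) {
            if (!Double.isFinite(c)) {
                return Optional.empty();
            }
        }
        return Optional.of(point);
    }

    public static void main(String[] args) {
        // Illustrative inputs only; no real track data.
        Optional<double[]> ok = tryIntercept(() -> new double[] {1.0, 2.0, 3.0});
        Optional<double[]> nan = tryIntercept(() -> new double[] {Double.NaN, 0.0, 0.0});
        Optional<double[]> thrown = tryIntercept(() -> {
            throw new RuntimeException("Matrix is singular");
        });

        System.out.println("ok present:     " + ok.isPresent());
        System.out.println("nan present:    " + nan.isPresent());
        System.out.println("thrown present: " + thrown.isPresent());
    }
}
```

Callers would then skip the track whenever the result is empty, rather than each site inventing its own check.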

JeremyMcCormick commented 6 years ago

Believe this is resolved in 4.0.1 milestone.

mdiamon commented 6 years ago

Re-opening, with branch iss243 to actually eliminate the source of the errors rather than just catching them.

JeremyMcCormick commented 6 years ago

Resolved by #268.