cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0
1.08k stars 4.3k forks source link

Various warning messages from `MkFitOutputConverter` #35798

Closed mmusich closed 2 years ago

mmusich commented 2 years ago

After integration of mkFit we observe a large number of warnings coming from MkFitOutputConverter of the type:

%MSG-w MkFitOutputConverter:   MkFitOutputConverter:detachedTripletStepTrackCandidates  23-Oct-2021 10:11:25 CEST Run: 1 Event: 10
Curvilinear error not pos-def
[         -nan        -nan        -nan        -nan        -nan
          -nan        -nan        -nan        -nan        -nan
          -nan        -nan        -nan        -nan        -nan
          -nan        -nan        -nan         nan         nan
          -nan        -nan        -nan         nan         nan ]
candidate ignored
%MSG

and:

%MSG-w MkFitOutputConverter:   MkFitOutputConverter:highPtTripletStepTrackCandidates  23-Oct-2021 10:11:25 CEST Run: 1 Event: 10
Backward fit of candidate 10 failed, ignoring the candidate
%MSG

examples:

wf. 11634.7 step3

wf 12434.0 step3

mmusich commented 2 years ago

assign reconstruction

cmsbuild commented 2 years ago

New categories assigned: reconstruction

@slava77,@jpata you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild commented 2 years ago

A new Issue was created by @mmusich Marco Musich.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

mmusich commented 2 years ago

@mmasciov @slava77 could you please have a look?

mmusich commented 2 years ago

@VinInn FYI

slava77 commented 2 years ago

Uhm, 136.874 looks even worse

I think that for now we should just change this to LogInfo on the CMSSW side.

jpata commented 2 years ago

According to Javi at T0 in hypernews: https://hypernews.cern.ch/HyperNews/CMS/get/tier0-Ops/2314/3/1/1/1/1/1/1/1/1/1/1/1.html this is an issue for T0 replays.

In the log file /afs/cern.ch/user/c/cmst0/public/PausedJobs/DQMseq/job_9738/tarball/job/WMTaskSpace/cmsRun1/cmsRun1-stdout.log, there's a lot of this

%MSG-w BasicTrajectoryState:  MkFitOutputConverter:pixelLessStepTrackCandidates  12-Nov-2021 01:30:07 CET Run: 317696 Event: 37435960
curv error not pos-def
[  0.000387528  0.00015181 0.000120542-2.72934e-05-9.24547e-07
    0.00015181 0.000101104 5.54339e-05-1.82003e-05-6.04576e-07
   0.000120542 5.54339e-05 4.47304e-05-9.97972e-06-3.35531e-07
  -2.72934e-05-1.82003e-05-9.97972e-06 4.07633e-06 9.49303e-08
  -9.24547e-07-6.04576e-07-3.35531e-07 9.49303e-08 -1.5703e-06 ]

Looks like it's coming from https://github.com/cms-sw/cmssw/blob/2a2ca2ee216a44c2743301dc2dc32b4c7c93f4ad/TrackingTools/TrajectoryState/src/BasicTrajectoryState.cc#L89

mmasciov commented 2 years ago

@jpata , the amount of warnings is expected to drop significantly after PR #36075 (backport PR). We have checked explicitly using the same configuration, and the number of warnings after such backport got reduced by ~15x. The source was understood and 'cured' by that PR.

mmusich commented 2 years ago

I just wanted to check if there is a plan for further reduction. By checking step3 of wf 136.874 there is still a bit of:

%MSG-w BasicTrajectoryState:  TrackProducer:pixelLessStepTracks  25-Nov-2021 06:22:51 CET Run: 319450 Event: 106497951
local error not pos-def
[   -0.0215424-1.64885e-05   0.0072324  0.00764519  -0.0222784
  -1.64885e-05  0.00100408 0.000317885   0.0213792-0.000340222
     0.0072324 0.000317885  0.00579943 -0.00439497  -0.0370556
    0.00764519   0.0213792 -0.00439497    0.476544    0.354667
    -0.0222784-0.000340222  -0.0370556    0.354667     6.33425 ]
pos/mom/mf  (-43.8802,-25.4802,208.054)   (-2.05221,-1.02749,8.73586)   (-0.0187514,-0.0108884,3.72672) 
%MSG

that I don't observe that much if I run the CKF counterpart:

cmsDriver.py step3  --conditions auto:run2_data -s RAW2DIGI,L1Reco,RECO,SKIM:ZElectron+SinglePhotonJetPlusHOFilter+EXOMONOPOLE,EI,PAT,ALCA:SiStripCalZeroBias+SiStripCalMinBias+SiStripCalSmallBiasScan+TkAlMinBias+EcalESAlign,DQM:@commonFakeHLT+@ecal+@egamma+@L1TEgamma --datatier RECO,MINIAOD,DQMIO --eventcontent RECO,MINIAOD,DQM --data  --process reRECO --scenario pp --era Run2_2018_noMkFit --customise Configuration/DataProcessing/RecoTLR.customisePostEra_Run2_2018 --hltProcess reHLT -n 100  --filein  file:step2.root  --fileout file:step3.root 

(for the record ~200 vs ~10)

mmasciov commented 2 years ago

@mmusich, yes. PR #36246 halves the remaining warnings already. We will try to reduce them further with minor future technical updates.

perrotta commented 2 years ago

@mmusich, yes. PR #36246 halves the remaining warnings already. We will try to reduce them further with minor future technical updates.

I take one PR randomly, i.e. 136.793, and those TrackProducer:pixelLessStepTracks warnings increase after #36246 from 120 to 159.

In 10224.0 they increase from 0 to 6

@mmasciov how did you conclude that "PR #36246 halves the remaining warnings already" ?

mmasciov commented 2 years ago

@perrotta From step 3 of 136.874, which was originally pointed out by @mmusich. Here, the number of such warnings goes from 170 to 88. We also observed a reduction of warnings when re-testing the data replay job mentioned here.

However, warnings are really track-specific. So, one tracking particle present in one case and not in another can produce an increase of warning messages in the former case. Anyways, as mentioned above we will try to reduce these warnings further with minor future technical updates.

slava77 commented 2 years ago

I had another look at the origins of the remaining non-pos-def warnings, but the examples appeared to be of a similar kind that I noticed shortly after this issue was opened.

The posDef() checks in the *TrajectoryError cover only a simple check for the diagonal elements. So, the checks done in the MkFitOutputConverter are incomplete and some track states after a propagation begin to fail the posDef() checks. (In a local test still) I have added a check for the matrix determinant to be non-negative and this apparently reduced the number of warnings roughly down to the rate seen in the noMkFit rerun of the workflows. I used 20 events in wf 11834 and 100 each in 136.{793,874}.

If the internal validation confirms that there is no loss in efficiency and the computation of the determinant is cheap, I think that we will add a check on the determinant sign in the MkFitOutputConverter soon after #36546 is merged. A better understanding of the origins (or if the bad states can be detected already inside the mkFit routines where the covariance matrix is still a 6x6 [and supposedly det=0] ) will be on a longer time scale.

If someone has some pointers to literature related to preservation of positive-definiteness of the covariance in Kalman updates, please share. This may help resolving the problem at the origins.

jpata commented 2 years ago

+reconstruction

cmsbuild commented 2 years ago

This issue is fully signed and ready to be closed.