cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0
1.07k stars 4.29k forks source link

NAN in run 3 PF candidate dz leading to NAN in run 3 DeepMET #44976

Closed steggema closed 4 months ago

steggema commented 4 months ago

Checking the DeepMET performance in run 3, we found a significant fraction of events for which DeepMET evaluates to NAN, up to 1% in a four-top sample. We've just traced this down to events where we have a PF candidate where dz() returns NAN, which we don't catch https://github.com/cms-sw/cmssw/blob/master/RecoMET/METPUSubtraction/plugins/DeepMETProducer.cc#L85

The fix will be easy enough (just checking for NAN), but I was wondering if dz() returning NAN for a PF candidate is expected or if it could impact something else.

Tagging co-developers (@yongbinfeng @mseidel42) and whoever i could easily find on github from JME (@alkaloge) PF (@swagata87) Reco (@mandrenguyen)

cmsbuild commented 4 months ago

cms-bot internal usage

cmsbuild commented 4 months ago

A new Issue was created by @steggema.

@rappoccio, @makortel, @antoniovilela, @smuzaffar, @sextonkennedy, @Dr15Jones can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

swagata87 commented 4 months ago

FYI @cms-sw/jetmet-pog-l2 @cms-sw/pf-l2 @stahlleiton @cms-sw/reconstruction-l2

swagata87 commented 4 months ago

assign reconstruction

cmsbuild commented 4 months ago

New categories assigned: reconstruction

@jfernan2,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks

swagata87 commented 4 months ago

type jetmet, pf

swagata87 commented 4 months ago

what are the ParticleTypes for which NaN is obtained? In which years of Run3 is this happening? and it will be useful to have a recipe where one can reproduce the issue and can debug.

kdlong commented 4 months ago

Are the full kinematics nan or just dz? We had some instances where problematic tracks led to this issue last year, but it's extremely difficult to debug to a general class of problems. I would start by excluding them from the deep met.

swagata87 commented 4 months ago

We had some instances where problematic tracks led to this issue last year,

right, thanks, it could be related or similar, here is the documentation of that issue which Kenneth had debugged https://github.com/cms-sw/cmssw/issues/39110

steggema commented 4 months ago

To answer the questions in one go:

To reproduce: A few events with the issue can be obtained with edmCopyPickMerge outputFile=pickevents.root eventsToProcess=1:4256897,1:4257026,1:4257054,1:4418151,1:4418165,1:4418588 inputFiles=/store/mc/Run3Summer22MiniAODv4/DYJetsToLL_M-50_TuneCP5_13p6TeV-madgraphMLM-pythia8/MINIAODSIM/ALCARECO_forPOG_130X_mcRun3_2022_realistic_v5_ext1-v2/2530000/3f52a439-eb16-46f8-b9ab-0b1dd30da09a.root

steggema commented 4 months ago

I looked a bit more at the "bestTrack" (when it exists), and they look really bizarre to me (just looking at the handful of events i mentioned above, i can look at more if people are curious):

swagata87 commented 4 months ago

It does indeed look related to the other issue in the sense that all kinematic values are zero. Would the fix also address this? Are there samples available with the fix applied? (e.g. would we expect to not see this issue in 2023 or 2024 MC)

can you check in 2023 or 2024 MC if the issue is still present or already addressed by Kenneth's fix?

steggema commented 4 months ago

It does indeed look related to the other issue in the sense that all kinematic values are zero. Would the fix also address this? Are there samples available with the fix applied? (e.g. would we expect to not see this issue in 2023 or 2024 MC)

can you check in 2023 or 2024 MC if the issue is still present or already addressed by Kenneth's fix?

The issue still appears to be there in 2023 and 2024 MC. The two samples I checked are /DYto2TautoMuTauh_M-50_TuneCP5_13p6TeV_madgraphMLM-pythia8/Run3Winter24NanoAOD-133X_mcRun3_2024_realistic_v9-v2/NANOAODSIM and /TTto2L2Nu-2Jets_TuneCP5_13p6TeV_amcatnloFXFX-pythia8/Run3Summer23NanoAODv12-130X_mcRun3_2023_realistic_v14-v2/NANOAODSIM - they seemed to be the most recent DY/TT samples I could find but let me know if there are any more recent samples I should check.

swagata87 commented 4 months ago

I looked a bit more at the "bestTrack" (when it exists), and they look really bizarre to me (just looking at the handful of events i mentioned above, i can look at more if people are curious):

* chi2 is 0, ndof in the 5 to 10 range, and they are high purity (so they go to PF)

* pT is around 1 GeV, |eta| around 2.7

* pT error is inf

* vx/vy and dxy are finite, vz and dz are inf

Do I understand correctly that these tracks are present in the general-track collections already and being propagated to PF? Let me also tag @cms-sw/tracking-pog-l2 as it can be related to track reconstruction.

steggema commented 4 months ago

I looked a bit more at the "bestTrack" (when it exists), and they look really bizarre to me (just looking at the handful of events i mentioned above, i can look at more if people are curious):

* chi2 is 0, ndof in the 5 to 10 range, and they are high purity (so they go to PF)

* pT is around 1 GeV, |eta| around 2.7

* pT error is inf

* vx/vy and dxy are finite, vz and dz are inf

Do I understand correctly that these tracks are present in the general-track collections already and being propagated to PF? Let me also tag @cms-sw/tracking-pog-l2 as it can be related to track reconstruction.

To be very clear, I should say that I only looked at MiniAOD. There the "bestTrack" method returns tracks with the features that I mentioned. I would assume that this implies that they should be in the general-tracks collections and be high-purity since they make it into PF, but it's of course conceivable that something else happens. I think the AOD sample that corresponds to the MINIAOD one above is available so it should be possible to get the events and look into the collections there. I'll see if I can easily do it and will report back in the thread.

steggema commented 4 months ago

Turns out that AODSIM isn't available... I actually can't easily find a run 3 sample that has both MiniAODSIM and AODSIM available - if someone can point me to such a sample, I'm happy to check.

swagata87 commented 4 months ago

Turns out that AODSIM isn't available... I actually can't easily find a run 3 sample that has both MiniAODSIM and AODSIM available - if someone can point me to such a sample, I'm happy to check.

@cms-sw/pdmv-l2 can you help finding a AODSIM for run3 for checking the above mentioned issue for which we need to access general track collection?

And @steggema just out of curiosity, we expect this issue to also show up in Run3 data, right? If any example of data event is found, kindly paste the event pointer here.

sunilUIET commented 4 months ago

Here are the AODSIM for the above two samples /DYto2TautoMuTauh_M-50_TuneCP5_13p6TeV_madgraphMLM-pythia8/Run3Winter24Reco-133X_mcRun3_2024_realistic_v9-v2/AODSIM /TTto2L2Nu-2Jets_TuneCP5_13p6TeV_amcatnloFXFX-pythia8/Run3Summer23DRPremix-130X_mcRun3_2023_realistic_v14-v2/AODSIM

In any case, AODSIM should be available by default.

steggema commented 4 months ago

Thanks, the DY->tau tau-> mu tauh sample indeed has AODSIM available on disk (the other doesn't), so I was able to run a quick check with it.

What I did is:

My main finding is that these tracks have finite dz, vz, and ptError values (whereas they are nan, nan, and inf for the bestTracks). I print some more information below (*) if it helps. I would conclude that it means that something is going wrong in how these tracks are being used within the PF algorithm. Please let me know if there's anything else that I can do to help.

(*) ######## Event 6 pt 1.00 eta -2.84 phi 0.53 dz 0.49 dxy -0.04 found 5 quality 2 True chi2 4.69 ndof 5.0 pt error 0.82 vz 0.22 algo duplicateMerge ######## Event 24 pt 1.09 eta -2.89 phi -2.33 dz 5.62 dxy 0.09 found 5 quality 2 True chi2 4.93 ndof 5.0 pt error 0.99 vz 5.79 algo highPtTripletStep ######## Event 72 pt 0.98 eta -2.66 phi 0.61 dz -1.64 dxy -0.04 found 5 quality 2 True chi2 2.88 ndof 5.0 pt error 0.94 vz -1.84 algo highPtTripletStep ######## Event 101 pt 1.02 eta -2.64 phi 0.90 dz -0.30 dxy 0.02 found 5 quality 2 True chi2 5.39 ndof 5.0 pt error 0.99 vz -0.40 algo highPtTripletStep ######## Event 103 pt 1.01 eta 2.64 phi -2.21 dz -1.46 dxy 0.07 found 5 quality 2 True chi2 7.45 ndof 5.0 pt error 0.85 vz -1.55 algo highPtTripletStep ######## Event 125 pt 1.16 eta 2.84 phi -0.23 dz -1.96 dxy -0.01 found 5 quality 2 True chi2 3.28 ndof 5.0 pt error 1.00 vz -1.55 algo highPtTripletStep ######## Event 136 pt 0.98 eta -2.73 phi -0.58 dz 0.75 dxy -0.01 found 5 quality 2 True chi2 3.43 ndof 5.0 pt error 0.98 vz 0.38 algo highPtTripletStep ######## Event 152 pt 0.97 eta -2.73 phi 0.07 dz 0.51 dxy -0.08 found 5 quality 2 True chi2 9.79 ndof 5.0 pt error 0.91 vz 0.17 algo highPtTripletStep ######## Event 164 pt 1.01 eta -2.71 phi -0.34 dz -1.15 dxy -0.01 found 5 quality 2 True chi2 3.26 ndof 5.0 pt error 0.99 vz -1.51 algo highPtTripletStep ######## Event 181 pt 0.99 eta 2.85 phi 2.04 dz -3.25 dxy 0.01 found 5 quality 2 True chi2 8.43 ndof 5.0 pt error 0.97 vz -3.56 algo duplicateMerge ######## Event 195 pt 0.97 eta 2.71 phi -2.37 dz -4.02 dxy 0.03 found 5 quality 2 True chi2 1.70 ndof 5.0 pt error 0.84 vz -4.18 algo highPtTripletStep

steggema commented 4 months ago

Sorry for the multiple posts, but I realised I could of course also simply look at the PFCandidates in AODSIM. Interestingly, they have finite dzError and finite dz of the bestTrack, but they already have all kinematic values at 0. I'm dumping some more information just below in case it helps for the first 20 or so candidates.

######## Event 0 Found PF candidate with zero pt: pt=0.00 best track pt=0.62 eta=0.00 phi=0.00 dz=3.37 dzError=0.24 dxy=-0.01 dxyError=0.17 pdgId=-211 vz=2.98 algo=duplicateMerge ######## Event 1 Found PF candidate with zero pt: pt=0.00 best track pt=0.57 eta=0.00 phi=0.00 dz=0.70 dzError=0.28 dxy=0.09 dxyError=0.21 pdgId=211 vz=0.99 algo=highPtTripletStep ######## Event 2 Found PF candidate with zero pt: pt=0.00 best track pt=0.59 eta=0.00 phi=0.00 dz=2.28 dzError=0.20 dxy=0.03 dxyError=0.15 pdgId=-211 vz=1.86 algo=highPtTripletStep ######## Event 3 Found PF candidate with zero pt: pt=0.00 best track pt=0.55 eta=0.00 phi=0.00 dz=9.69 dzError=1.05 dxy=0.11 dxyError=0.23 pdgId=-211 vz=9.91 algo=highPtTripletStep ######## Event 4 Found PF candidate with zero pt: pt=0.00 best track pt=0.58 eta=0.00 phi=0.00 dz=-2.69 dzError=0.15 dxy=-0.00 dxyError=0.24 pdgId=-211 vz=-2.35 algo=highPtTripletStep ######## Event 5 Found PF candidate with zero pt: pt=0.00 best track pt=0.85 eta=0.00 phi=0.00 dz=-1.44 dzError=0.10 dxy=0.00 dxyError=0.11 pdgId=-211 vz=-1.47 algo=highPtTripletStep ######## Event 6 Found PF candidate with zero pt: pt=0.00 best track pt=1.00 eta=0.00 phi=0.00 dz=0.49 dzError=2.15 dxy=-0.04 dxyError=0.28 pdgId=211 vz=0.22 algo=duplicateMerge ######## Event 7 Found PF candidate with zero pt: pt=0.00 best track pt=0.91 eta=0.00 phi=0.00 dz=-0.06 dzError=0.17 dxy=-0.03 dxyError=0.10 pdgId=-211 vz=-0.39 algo=highPtTripletStep ######## Event 8 Found PF candidate with zero pt: pt=0.00 best track pt=0.63 eta=0.00 phi=0.00 dz=5.82 dzError=0.15 dxy=-0.01 dxyError=0.17 pdgId=-211 vz=5.47 algo=highPtTripletStep ######## Event 9 Found PF candidate with zero pt: pt=0.00 best track pt=0.74 eta=0.00 phi=0.00 dz=0.71 dzError=1.37 dxy=0.01 dxyError=0.22 pdgId=-211 vz=0.97 algo=highPtTripletStep ######## Event 10 Found PF candidate with zero pt: pt=0.00 best track pt=0.80 eta=0.00 phi=0.00 dz=-0.14 dzError=0.49 dxy=-0.01 dxyError=0.16 pdgId=-211 vz=-0.10 algo=highPtTripletStep ######## Event 11 Found PF candidate with zero pt: pt=0.00 best track pt=0.45 eta=0.00 phi=0.00 dz=7.64 dzError=1.37 dxy=-0.03 dxyError=0.34 pdgId=-211 vz=7.80 algo=duplicateMerge ######## Event 12 Found PF candidate with zero pt: pt=0.00 best track pt=0.70 eta=0.00 phi=0.00 dz=-2.26 dzError=1.74 dxy=0.02 dxyError=0.22 pdgId=-211 vz=-2.53 algo=duplicateMerge ######## Event 13 Found PF candidate with zero pt: pt=0.00 best track pt=0.89 eta=0.00 phi=0.00 dz=-1.81 dzError=1.07 dxy=0.02 dxyError=0.17 pdgId=211 vz=-2.00 algo=highPtTripletStep ######## Event 14 Found PF candidate with zero pt: pt=0.00 best track pt=0.44 eta=0.00 phi=0.00 dz=-1.45 dzError=0.37 dxy=0.08 dxyError=0.45 pdgId=-211 vz=-1.47 algo=highPtTripletStep ######## Event 15 Found PF candidate with zero pt: pt=0.00 best track pt=0.78 eta=0.00 phi=0.00 dz=1.16 dzError=0.34 dxy=-0.05 dxyError=0.11 pdgId=-211 vz=1.55 algo=highPtTripletStep ######## Event 16 Found PF candidate with zero pt: pt=0.00 best track pt=0.71 eta=0.00 phi=0.00 dz=-2.75 dzError=0.15 dxy=0.02 dxyError=0.17 pdgId=211 vz=-2.48 algo=highPtTripletStep ######## Event 17 Found PF candidate with zero pt: pt=0.00 best track pt=0.85 eta=0.00 phi=0.00 dz=-2.15 dzError=1.41 dxy=-0.03 dxyError=0.18 pdgId=-211 vz=-2.35 algo=duplicateMerge ######## Event 18 Found PF candidate with zero pt: pt=0.00 best track pt=0.55 eta=0.00 phi=0.00 dz=-1.17 dzError=0.92 dxy=0.10 dxyError=0.21 pdgId=-211 vz=-0.85 algo=highPtTripletStep ######## Event 19 Found PF candidate with zero pt: pt=0.00 best track pt=0.53 eta=0.00 phi=0.00 dz=3.07 dzError=0.89 dxy=0.01 dxyError=0.23 pdgId=211 vz=3.07 algo=highPtTripletStep ######## Event 20 Found PF candidate with zero pt: pt=0.00 best track pt=0.79 eta=0.00 phi=0.00 dz=-8.82 dzError=0.47 dxy=0.05 dxyError=0.13 pdgId=-211

swagata87 commented 4 months ago

thanks, interesting.. looks like something went wrong in AOD->miniAOD step

steggema commented 4 months ago

Agreed. For what it's worth, it seems plausible to me that some of the packed candidate creation or methods rely on momenta not being zero, so maybe the right question to ask is what is the purpose of having a PF candidate with zero momentum in the first place...

kdlong commented 4 months ago

We had this discussion in the past. In principle zero momentum means that the kinematics aren't reliably estimated to be different than zero within uncertainties. But they can still play a role in candidate counting etc. I wouldn't say they are super useful, but it also wasn't obvious to me that we should kill them. Regardless we should avoid that zeros turn to nans, there's no world in which that is desirable.

makortel commented 4 months ago

FWIW, at least these lines in packVtx() https://github.com/cms-sw/cmssw/blob/d376eb350a3e9af9f0ce3618fbac6afc36ce56d8/DataFormats/PatCandidates/src/PackedCandidate.cc#L42-L43 will result NaN with zero momentum track.

stahlleiton commented 4 months ago

If I recall correctly, I think the track properties are not stored in the packed PF candidate produced by the PAT packed candidate producer if the pt < 0.95 GeV

AdrianoDee commented 4 months ago

This was lowered to 0.0 for Run3 MINIAOD (with low quality covariance matrix).

stahlleiton commented 4 months ago

This was lowered to 0.0 for Run3 MINIAOD (with low quality covariance matrix).

Ah yes, but only if it strictly larger than 0 GeV (so candidates with pT = 0 would then have no track properties)

swagata87 commented 4 months ago

FWIW, at least these lines in packVtx()

https://github.com/cms-sw/cmssw/blob/d376eb350a3e9af9f0ce3618fbac6afc36ce56d8/DataFormats/PatCandidates/src/PackedCandidate.cc#L42-L43

will result NaN with zero momentum track.

thanks Matti

this[1] change seems to help;

Examples from /eos/cms//store/mc/Run3Winter24Reco/DYto2TautoMuTauh_M-50_TuneCP5_13p6TeV_madgraphMLM-pythia8/AODSIM/133X_mcRun3_2024_realistic_v9-v2/2830000/00502d01-cccd-453a-9f6d-160572a6b2ae.root after running AOD->miniAOD step for 200 evets; and looping over packedPFCandidates:

Before change [1]

dz, dxy, eta, pt  nan -0.06972656399011612 0.0 0.0
dz, dxy, eta, pt  nan -0.014941406436264515 0.0 0.0

after change [1]

dz, dxy, eta, pt  5.226568222045898 -0.06972656399011612 0.0 0.0
dz, dxy, eta, pt  5.175044059753418 -0.014941406436264515 0.0 0.0

[1]

--- a/DataFormats/PatCandidates/src/PackedCandidate.cc
+++ b/DataFormats/PatCandidates/src/PackedCandidate.cc
@@ -40,7 +40,9 @@ void pat::PackedCandidate::packVtx(bool unpackAfterwards) {
   // float dl = dxPV * c + dyPV * s;
   // float xRec = - dxy_ * s + dl * c, yRec = dxy_ * c + dl * s;
   float pzpt = p4_.load()->Pz() / p4_.load()->Pt();
-  dz_ = vertex_.load()->Z() - pv.Z() - (dxPV * c + dyPV * s) * pzpt;
+  if (!std::isnan(pzpt)) {
+    dz_ = vertex_.load()->Z() - pv.Z() - (dxPV * c + dyPV * s) * pzpt;
+  }
makortel commented 4 months ago
--- a/DataFormats/PatCandidates/src/PackedCandidate.cc
+++ b/DataFormats/PatCandidates/src/PackedCandidate.cc
@@ -40,7 +40,9 @@ void pat::PackedCandidate::packVtx(bool unpackAfterwards) {
   // float dl = dxPV * c + dyPV * s;
   // float xRec = - dxy_ * s + dl * c, yRec = dxy_ * c + dl * s;
   float pzpt = p4_.load()->Pz() / p4_.load()->Pt();
-  dz_ = vertex_.load()->Z() - pv.Z() - (dxPV * c + dyPV * s) * pzpt;
+  if (!std::isnan(pzpt)) {
+    dz_ = vertex_.load()->Z() - pv.Z() - (dxPV * c + dyPV * s) * pzpt;
+  }

I'd suggest to check e.g. p4_.load()->Pt() != 0.f (or also Pz() != 0.f depending if Inf is acceptable or not) directly instead of allowing NaN to be computed. On a quick look I also didn't see any other place that would initialize dz_, so I'd suggest to add explicit initialization somewhere.

swagata87 commented 4 months ago

okay, how about this? with this change, the nan issue is solved, some before/after numbers are pasted below,

-  float pzpt = p4_.load()->Pz() / p4_.load()->Pt();
-  dz_ = vertex_.load()->Z() - pv.Z() - (dxPV * c + dyPV * s) * pzpt;
+  dz_ = 0;
+  if (p4_.load()->Pt() != 0.f) {
+    float pzpt = p4_.load()->Pz() / p4_.load()->Pt();
+    dz_ = vertex_.load()->Z() - pv.Z() - (dxPV * c + dyPV * s) * pzpt;
+  }

before the change

dz, dxy, eta, pt  nan -0.06972656399011612 0.0 0.0
dz, dxy, eta, pt  nan -0.014941406436264515 0.0 0.0
dz, dxy, eta, pt  nan 0.08945312350988388 0.0 0.0
dz, dxy, eta, pt  nan -0.0037353516090661287 0.0 0.0
dz, dxy, eta, pt  nan -0.022695312276482582 0.0 0.0
dz, dxy, eta, pt  nan 0.0036865235306322575 0.0 0.0
dz, dxy, eta, pt  nan 0.009384765289723873 0.0 0.0
dz, dxy, eta, pt  nan 0.07921875268220901 0.0 0.0
dz, dxy, eta, pt  nan -0.0006768798921257257 0.0 0.0

after the change

dz, dxy, eta, pt  5.226568222045898 -0.06972656399011612 0.0 0.0
dz, dxy, eta, pt  5.175044059753418 -0.014941406436264515 0.0 0.0
dz, dxy, eta, pt  2.9320716857910156 0.08945312350988388 0.0 0.0
dz, dxy, eta, pt  12.334485054016113 -0.0037353516090661287 0.0 0.0
dz, dxy, eta, pt  16.324251174926758 -0.022695312276482582 0.0 0.0
dz, dxy, eta, pt  -0.19352594017982483 0.0036865235306322575 0.0 0.0
dz, dxy, eta, pt  3.242164134979248 0.009384765289723873 0.0 0.0
dz, dxy, eta, pt  1.036512017250061 0.07921875268220901 0.0 0.0
dz, dxy, eta, pt  3.9245831966400146 -0.0006768798921257257 0.0 0.0
makortel commented 4 months ago

okay, how about this?

Looks reasonable to me, thanks.

swagata87 commented 4 months ago

okay, I'm doing some more sanity checks, then I'll open a PR

swagata87 commented 4 months ago

so its merged in 14_1 https://github.com/cms-sw/cmssw/pull/45009 do we need any backport? I guess a backport to 14_0 can be useful as that's the release for 2024 data-taking.

steggema commented 4 months ago

A backport to 14_0 sounds great to me. This should also only make it necessary to backport the DeepMET input check for possible ReNanoAOD campaigns.

jfernan2 commented 4 months ago

+1 Solved by https://github.com/cms-sw/cmssw/pull/45009

cmsbuild commented 4 months ago

This issue is fully signed and ready to be closed.

vlimant commented 4 months ago

Hi @swagata87 , all, something that is a little confusing in this comment https://github.com/cms-sw/cmssw/issues/44976#issuecomment-2119272110 is that you show dz=nan for particle with eta=0 pt=0 (according to the printout) which in turn should have dz=0 after the fix. What are we reading wrong ?

swagata87 commented 4 months ago

It looks like there are multiple places where dz gets updated after it is first calculated in the part of the code that is changed in the PR. Previously, when dz=NaN was calculated at first, it remained NaN after the subsequent updates. But now, we are getting dz=0 at first, in subsequent steps it gets changed to some other value.