A new Issue was created by @swagata87 Swagata Mukherjee.
@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
Thanks to @shdutta16 and @Prasant1993 for bringing this to my notice; this came up while training new photon IDs for Run 3.
FYI @cms-sw/pf-l2
@swagata87 do you see NaNs in the persisted data, or do they appear as part of running some algorithms (e.g. in or upstream of PFCandidatePrimaryVertexSorter)? Having a specific PF candidate with a nan would be more direct to trace.
assign reconstruction
New categories assigned: reconstruction
@jpata,@clacaputo,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks
Looping over particleFlow candidates (vector<reco::PFCandidate> "particleFlow" "" "RECO") in those problematic events, I see that the pt, eta, phi of one PF candidate is -nan in those events. @slava77 so I guess it's a general problem, not specific to the PFCandidatePrimaryVertexSorter algo.
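For reference, a minimal sketch of the kind of scan described here, assuming access to the particleFlow collection inside a CMSSW job (the helper name and surrounding module are hypothetical):

// Sketch only: loop over the particleFlow collection and report candidates
// with non-finite kinematics. Assumes it is called from code that has already
// retrieved std::vector<reco::PFCandidate> from the event.
#include <cmath>
#include <iostream>
#include <vector>
#include "DataFormats/ParticleFlowCandidate/interface/PFCandidate.h"

void scanForNanCandidates(const std::vector<reco::PFCandidate>& pfCands) {
  for (size_t i = 0; i < pfCands.size(); ++i) {
    const auto& c = pfCands[i];
    // std::isfinite is false for both nan and inf
    if (!std::isfinite(c.pt()) || !std::isfinite(c.eta()) || !std::isfinite(c.phi())) {
      std::cout << "PFCandidate " << i << " pdgId " << c.pdgId()
                << " has non-finite kinematics: pt=" << c.pt()
                << " eta=" << c.eta() << " phi=" << c.phi() << std::endl;
    }
  }
}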
considering that it was a pion, is there a nan in the generalTracks collection?
Yes, the -nan PF candidates in both events were pions. I do not see any nan in the general track pt/eta/phi in those events.
The issue seems to be in the PF charged hadron. It's also reflected in the fact that, while the photon's charged hadron isolation is -nan, the track isolation values make sense:
In one event, both photons have:
trkSumPtSolidCone=0 trkSumPtHollowCone=0
In the other event:
photon1: trkSumPtSolidCone=17.5619 trkSumPtHollowCone=17.5619
photon2: trkSumPtSolidCone=38.5868 trkSumPtHollowCone=38.5868
assign pf
New categories assigned: pf
@kdlong,@juska you have been requested to review this Pull request/Issue and eventually sign? Thanks
As an intermediate solution, what we can do is not consider such problematic PF candidates in the isolation sum of photons. This way we won't have nan isolation for photons anymore. This does not address the root cause of why such PF candidates are reconstructed in the first place, but it's better than nothing, and serves the purpose for egamma.
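A minimal sketch of what such a guard could look like in the isolation sum (a hypothetical helper for illustration, not the actual code of PR #39120):

// Sketch only: skip PF candidates with non-finite pt when accumulating the
// charged-hadron isolation sum, so a single bad candidate cannot turn the
// whole isolation into nan. Names and structure are illustrative.
#include <cmath>
#include <vector>
#include "DataFormats/ParticleFlowCandidate/interface/PFCandidate.h"

double chargedHadronIsoSum(const std::vector<const reco::PFCandidate*>& candsInCone) {
  double sum = 0.;
  for (const auto* cand : candsInCone) {
    if (!std::isfinite(cand->pt()))
      continue;  // ignore problematic candidates instead of propagating nan
    sum += cand->pt();
  }
  return sum;
}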
Right, so now I checked that the issue also happens (though rarely) in data (checked in 2022C). I don't quite like the photon's charged isolation being -nan, so I prepared a patch to fix the photon's isolation: https://github.com/cms-sw/cmssw/pull/39120
Okay, after a few more checks, the real issue seems to be here: https://github.com/cms-sw/cmssw/blob/6d2f66057131baacc2fcbdd203588c41c885b42c/RecoParticleFlow/PFProducer/src/PFCandConnector.cc#L159-L161
Before doing that division, we should check that the divisor is > 0.
with this patch, the issue seems to be solved:
diff --git a/RecoParticleFlow/PFProducer/src/PFCandConnector.cc b/RecoParticleFlow/PFProducer/src/PFCandConnector.cc
index 6d3d1fa9ec8..bd96361516e 100644
--- a/RecoParticleFlow/PFProducer/src/PFCandConnector.cc
+++ b/RecoParticleFlow/PFProducer/src/PFCandConnector.cc
@@ -156,9 +156,10 @@ void PFCandConnector::analyseNuclearWPrim(PFCandidateCollection& pfCand,
const math::XYZTLorentzVectorD& momentumPrim = primaryCand.p4();
- math::XYZTLorentzVectorD momentumSec;
+ math::XYZTLorentzVectorD momentumSec(0,0,0,0);
- momentumSec = momentumPrim / momentumPrim.E() * (primaryCand.ecalEnergy() + primaryCand.hcalEnergy());
+ if ( (momentumPrim.E() * (primaryCand.ecalEnergy() + primaryCand.hcalEnergy())) > 0.0 )
+ momentumSec = momentumPrim / momentumPrim.E() * (primaryCand.ecalEnergy() + primaryCand.hcalEnergy());
map<double, math::XYZTLorentzVectorD> candidatesWithTrackExcess;
map<double, math::XYZTLorentzVectorD> candidatesWithoutCalo;
I did not check the effect of the above patch on other objects (jet/met/tau etc)
@swagata87 wouldn't it be enough to check that momentumPrim.E() > 0?
Thanks a lot @swagata87. I don't really understand how the energy can be zero here, is the p4 also fully zero? Can you share the setup you're using for testing (just rerunning reco on the RAW event you shared above)?
wouldn't it be enough to check that momentumPrim.E() > 0?
Hello @perrotta, you are right; the divisor is just momentumPrim.E(), so checking that it is > 0 is sufficient to solve the issue.
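The simplified guard would then look roughly like this (a sketch of the reduced check discussed above, not necessarily the exact code that was merged):

// Sketch only: simplified guard in PFCandConnector::analyseNuclearWPrim,
// protecting the division by momentumPrim.E() (same context as the diff above).
math::XYZTLorentzVectorD momentumSec(0, 0, 0, 0);
if (momentumPrim.E() > 0.0)
  momentumSec = momentumPrim / momentumPrim.E() * (primaryCand.ecalEnergy() + primaryCand.hcalEnergy());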
I don't really understand how the energy can be zero here, is the p4 also fully zero?
Yeah, no idea why momentumPrim.E() is zero; actually px, py, pz are all zero in the 2 events where I am checking these things.
Can you share the setup you're using for testing (just rerunning reco on the RAW event you shared above)?
yes I am running the RECO step on RAW. Below is the setup I have:
1) Run3 DATA
CMSSW_12_4_6
Config file obtained with this cmsDriver command:
cmsDriver.py --filein /store/data/Run2022C/EGamma/RAW/v1/000/355/872/00000/3b478527-d206-4b8e-8004-08e9aff7758b.root --fileout file:AOD.root --data --eventcontent AOD --runUnscheduled --customise Configuration/DataProcessing/Utils.addMonitoring --datatier AOD --conditions 124X_dataRun3_Prompt_v4 --step RAW2DIGI,RECO --geometry DB:Extended --era Run3 --python_filename aod_cfg.py --beamspot Run3RoundOptics25ns13TeVLowSigmaZ --no_exec -n -1
Added eventsToProcess = cms.untracked.VEventRange('355872:546279379-355872:546279379') to just run on the problematic event.
2) RelVal MC
CMSSW_12_1_0_pre4
Config file obtained with this cmsDriver command:
cmsDriver.py --filein 'file:/eos/cms/store/relval/CMSSW_12_1_0_pre4/RelValTTbar_14TeV/GEN-SIM-DIGI-RAW/121X_mcRun3_2021_realistic_v10_HighStat-v2/10000/22c2fc3b-069f-4437-ab6d-edf0a9c0dfc7.root' --fileout file:AOD.root --mc --eventcontent AODSIM --runUnscheduled --customise Configuration/DataProcessing/Utils.addMonitoring --datatier AODSIM --conditions 121X_mcRun3_2021_realistic_v10 --step RAW2DIGI,RECO --geometry DB:Extended --era Run3 --python_filename aod_cfg.py --beamspot Run3RoundOptics25ns13TeVLowSigmaZ --no_exec -n -1
Added eventsToProcess = cms.untracked.VEventRange('1:36727-1:36727') to just run on the problematic event.
Even though #39120 will mitigate it for photons, I could imagine there's a high likelihood that a NaN PFCandidate will also mess up other things.
Therefore, @cms-sw/pf-l2 @laurenhay it might be useful to consider a fix in PF on a short timescale (or understanding if the issue comes from somewhere upstream in reco).
Hi @jpata, yes I am taking a look. We seem to have candidates with 0's for kinematics before the corrections, which turns them into nans. Trying to understand and fix this early in the chain; will have news later this week.
Hi all, sorry, it took me a bit longer to reproduce the error and figure out a suitable debug setup to chase down what's happening, but I think I have some useful info now.
The nan comes from the line identified by Swagata, in a step where the nuclear interactions are recombined into a single candidate. The original issue, however, is coming from a candidate with zero four momentum.
Looking at event 355872:546279379 from the file /store/data/Run2022C/EGamma/RAW/v1/000/355/872/00000/92d09274-5b85-48a0-9ed9-78c0b33ba560.root (it's mixed up above). The track associated to this candidate is displaced (hence the nuclear interaction recovery). It has valid kinematics, but a high qoverpError wrt qoverp, which causes its momentum to be rescaled for consistency with the total calo energy; this is where the zero comes in. The trackRef key is 418, it has track.qoverpError() = 0.03167 and track.qoverp() = -0.02654. I checked whether this is fixed by the mkFit pixelless revert in CMSSW_12_4_7, but it isn't, as the track isn't from a pixelless iteration: track.algoName() = "highPtTripletStep".
I believe what's happening in these lines is a rescaling of the kinematics of all charged hadron tracks simultaneously, such that the sum of the charged hadron tracks is consistent with the sum of the calo energy. Because the track error of this track is so high, it gets a negative value from the fit and is reset to zero here: https://github.com/cms-sw/cmssw/blob/master/RecoParticleFlow/PFProducer/src/PFAlgo.cc#L2510-L2511
In general I don't think zero is a reasonable fallback, as it gives a "ghost" candidate with no kinematics or mass. I guess it would be better to remove it at that point, or at the very least make sure the mass is taken into account to avoid problems downstream. At the moment I'm not sure how technically difficult it would be to remove a candidate at this point in the code; certainly one has to be careful. On the other hand, it might make sense to exclude these tracks from getting promoted to PF cands in the first place, but I need to think/investigate more for that.
I guess this issue is happening more often now if mkFit is giving a higher fraction of displaced tracks with high uncertainty. Is it known or expected? Maybe the mkFit experts want to provide some feedback on this particular track?
What is the direction of this track, and also the number of hits? mkFit tracks are on average shorter than previously with CKF; this can affect the resolution. This may also be a very forward 3-hit (actual 3-layer) track, at which point it probably doesn't matter whether it's mkFit or CKF.
Hi Slava, here is some more info on the track
In [9]: print(track.pt(), track.eta(), track.phi(), track.p())
5.456 -2.620 -1.916 37.67
In [12]: track.numberOfValidHits()
Out[12]: 5
In [13]: track.numberOfLostHits()
Out[13]: 0
Also, in your original post: "The track is displaced". How large is it in dxy(bs)? It's a bit odd to have a large displacement in the pixel-based iteration; the seeding region is limited to 0.02 cm.
One simple thing I don't understand, though, is why we let the rescaleMomentum() function here rescale the full four-vector. Don't we really want to rescale the 3-momentum instead? If we leave these ghost candidates we should at least leave them with a coherent mass.
https://github.com/cms-sw/cmssw/blob/5a3dd48730c2b88d67d1304788801e26e605cd36/RecoParticleFlow/PFTracking/src/PFTrackAlgoTools.cc#L259
is supposed to reject the track with dpt/pt >~ 10 * 0.06^2 ~ 0.036 ... or is this track from elsewhere (NuclearInteractionTrackImporter?)
In [12]: track.numberOfValidHits() Out[12]: 5
considering eta 2.6 and 5 hits, it would not be the shortness of the mkFit candidates. There is perhaps another feature: mkFit merges hits from overlapping seeds, assuming that the hit outlier rejection in the final fit would take care of it. In part this approach helps to increase the efficiency at high eta (compared to CKF). This may have an extra source of tails though.
is supposed to reject the track with dpt/pt >~ 10 * 0.06^2 ~ 0.036 ... or is this track from elsewhere (NuclearInteractionTrackImporter?)
You're right, it's coming in only at the NuclearInteractionTrackImporter, which is taking tracks from PFDisplacedVertexProducer. I have already been playing with the track quality cuts here with @laurenhay for another issue; we will consider this as well. I haven't quite tracked down what's going on, but when the track is accessed here, it refers to refitted kinematics. I'm not sure what the interplay of CKF (which I guess would be used for a refit) and mkFit is then.
Looking into things in more detail, I see that turning PFCands into ghosts is used in other places in PF (e.g. the post-cleaning), so probably the step that turns this into a 0-momentum candidate is fine. I still tend to think that it would be more correct to scale the 3-momentum instead of the four-momentum. I implemented a patch (https://github.com/kdlong/cmssw/commit/b84a8843e1aa00b4041533804967c92e70287734) to the PFCandidate rescaleMomentum function to do this, and it fixes the divide-by-zero problem here without introducing any technical issues that I see. It also seems to me that rescaleMomentum is hardly used anywhere else, so I think this is safe. Do others have an opinion on this, though? Maybe @bendavid @hatakeyamak?
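For illustration, here is a standalone sketch of the idea (not the actual patch): rescale only the spatial momentum components and recompute the energy from the preserved mass, so that a scale factor of zero yields a candidate at rest rather than an all-zero four-vector.

// Sketch only: rescale the 3-momentum by a factor 's' while keeping the mass
// fixed, instead of scaling the full four-vector. Plain doubles are used so
// the example is self-contained; in CMSSW this would act on the candidate's
// math::XYZTLorentzVectorD.
#include <algorithm>
#include <cmath>
#include <cstdio>

struct P4 {
  double px, py, pz, e;
};

P4 rescaleThreeMomentum(const P4& p, double s) {
  const double p2 = p.px * p.px + p.py * p.py + p.pz * p.pz;
  const double m2 = std::max(0.0, p.e * p.e - p2);  // preserve the mass
  return P4{s * p.px, s * p.py, s * p.pz, std::sqrt(m2 + s * s * p2)};
}

int main() {
  P4 pion{1.0, 2.0, 3.0, std::sqrt(14.0 + 0.1396 * 0.1396)};  // ~charged pion mass
  P4 ghost = rescaleThreeMomentum(pion, 0.0);
  // With s = 0 the energy collapses to the mass, not to zero,
  // so later divisions by E() stay well defined.
  std::printf("E = %.4f (mass preserved)\n", ghost.e);
  return 0;
}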
I'm not sure what the interplay of CKF (which I guess would be used for a refit) and mikfit is then.
there should be none: the final track fits are done with CKF in full tracking and in refits. mkFit is used only for the hit pattern recognition now.
But won't it depend on the hits that mkfit has decided are part of the track?
Perhaps I misunderstood the initial concern.
My response implying no interplay was referring to using tracks directly vs refitted tracks, since IIUC, the refitter does not change the hit content.
Before digging more into CKF vs mkFit, is it obvious that the issue is not e.g. in misalignment or some other sources of mis/displaced hits in this data?
Perhaps a check with --era Run3_noMkFit would help, although this is a bit random and may need more than just one event reprocessed.
Before digging more into CKF vs mkFit, is it obvious that the issue is not e.g. in misalignment or some other sources of mis/displaced hits in this data?
I see it happens also in MC (see description: https://github.com/cms-sw/cmssw/issues/39110#issue-1344317299)
root://xrootd-cms.infn.it//store/mc/Run3Winter22DR/GJet_Pt-10to40_DoubleEMEnriched_TuneCP5_13p6TeV_pythia8/AODSIM/FlatPU0to70_122X_mcRun3_2021_realistic_v9-v2/2430000/6cd37543-62ec-4f62-9fa9-23b7c66f9c20.root
where the concern of misplaced hits is somewhat attenuated. If we had raw data it would be relatively trivial to reprocess with ideal conditions to check if that's the source of the issue (but as far as I understand we don't have it)
Hi @slava77: I don't really mean to suggest this is an mkFit "problem." Clearly the situation that leads to a nan is a PF issue and should be fixed at the PF level. But there has been no protection for this condition since forever, and I assume people would have noticed nan PF candidates, so the logical assumption seems to be that the switch to mkFit is the source of the difference. I actually think the current treatment in PF to effectively zero out this cand in comparison to the calo energy is working well enough, we just need a protection for the divide by zero. But it would be interesting to understand why this is seemingly more common now.
@mmusich thanks, you're right, it's happening in MC. What do you mean about not having RAW data? The file I'm looking at is a RAW file.
Or do you mean we don't have the RAW data tier for MC?
What do you mean about not having RAW data? The file I'm looking at is a RAW file.
I mean that I understood from a question asked by @swagata87 in this Mattermost channel that we don't have the parent PDs for this dataset (including some data tier that we can re-reconstruct).
But there has been no protection for this condition since forever, and I assume people would have noticed nan PF candidates, so the logical assumption seems to be that the switch to mkfit is the source of the difference.
is the calorimeter response or calibration possibly something that can lead to a change in behavior (or did that not change since forever?)
The calo calibrations changed, but this is a pretty small effect, and the response has been hardcoded forever. The zeroing happens based on a sum of calo energy in a cluster compared to the sum of the track momentum in the cluster. The only thing that seems "special" to me is that the track momentum uncertainty is very high, so its best-fit value for rescaling is zero. Some output (coming from here):
Compare Calo Energy to total charged momentum sum p = 69.5542 +- 45.4575 sum ecal = 0 sum hcal = 19.504 => Calo Energy = 17.828
case 1: COMPATIBLE |Calo Energy- total charged momentum| = 51.7256 < 1.4988 x 46.5869
max DP/P = 1.19337 > 0.1: take weighted averages
track associated to hcal 0 P = 37.6768 +- 44.9625
track associated to hcal 1 P = 21.2718 +- 6.66727
track associated to hcal 2 P = 4.27024 +- 0.553705
track associated to hcal 3 P = 1.11607 +- 0.0169243
track associated to hcal 4 P = 1.32559 +- 0.0322203
track associated to hcal 5 P = 1.08917 +- 0.0104476
track associated to hcal 6 P = 0.381433 +- 0.00804561
track associated to hcal 7 P = 1.27163 +- 0.0220997
track associated to hcal 8 P = 0.449754 +- 0.00472474
track associated to hcal 9 P = 0.366177 +- 0.00426354
track associated to hcal 10 P = 0.33557 +- 0.00407122
old p 37.6768 new p -10.5045 rescale 0
old p 21.2718 new p 20.2124 rescale 0.950195
old p 4.27024 new p 4.26293 rescale 0.998289
What do you mean about not having RAW data? The file I'm looking at is a RAW file.
I mean that I understood from a question asked from @swagata87 in this mattermost channel that we don't have the parent PDs for this dataset (including some data-tier that we can re-reconstruct)
Yes, the parent dataset of the GJet 122X MC is not available.
There is a RelVal MC where the issue was seen; for that we have the RAW.
Not sure if it is useful, but this is the RAW file: /eos/cms/store/relval/CMSSW_12_1_0_pre4/RelValTTbar_14TeV/GEN-SIM-DIGI-RAW/121X_mcRun3_2021_realistic_v10_HighStat-v2/10000/22c2fc3b-069f-4437-ab6d-edf0a9c0dfc7.root
(https://github.com/cms-sw/cmssw/issues/39110#issuecomment-1223779784)
There is a relVal MC where the issue was seen, for that we have the RAW, not sure if it is useful, but this is the raw: /eos/cms/store/relval/CMSSW_12_1_0_pre4/RelValTTbar_14TeV/GEN-SIM-DIGI-RAW/121X_mcRun3_2021_realistic_v10_HighStat-v2/10000/22c2fc3b-069f-4437-ab6d-edf0a9c0dfc7.root
Thanks @swagata87, it is useful.
Using the setup at https://github.com/cms-sw/cmssw/issues/39110#issuecomment-1223779784 but changing the global tag to 121X_mcRun3_2021_design_v9 (which includes ideal alignment / calibrations), the warning
%MSG-w PrimaryVertexSorting: PFCandidatePrimaryVertexSorter:primaryVertexAssociationJME 05-Sep-2022 12:18:59 CEST Run: 1 Event: 36727
Scaling is NAN ignoring this candidate/track
%MSG
is gone. So it seems it's a feature of the calibrations used in the realistic GT. Adding @cms-sw/trk-dpg-l2 to the thread.
can someone of you prepare a list of the DetIds of the hits associated to the problematic track in the MC event?
can someone of you prepare a list of the DetIds of the hits associated to the problematic track in the MC event?
@mmusich was this for me or for the dpg conveners? I spent some time messing with it. The trackRef key = 3 and ID = 1463 in the AOD that I produce from the RAW, but tracks.recHits() hasn't been stored. Can you give me the keep statement to store all the tracks so this works?
was this for me or for the dpg conveners?
to anyone who can devote some time to this :)
I spent some time messing with it, The trackRef key = 3 and ID = 1463 in the AOD that I produce from the RAW
thanks, this is already helping.
Can you give me the keep statement to store all the tracks so this works?
I think that if you retain the RECO data tier instead of AOD, you will have all the info needed.
Thanks, good point. I'm still not 100% sure about the interface: if I do track.recHit(i) I get a SiPixelRecHit which always returns null from its det function. I can give you the collection, keys, and raw IDs from these objects, though:
for i in range(track.recHitsSize()):
    rh = track.recHit(i)
    print(rh.id().id(), rh.key(), rh.get().rawId())
gives
454 40 303063048
454 41 344241156
454 42 344242180
454 43 344503300
454 44 344504324
454 45 344765444
454 46 344766468
Thanks, good point. I'm still not 100% sure about the interface, if I do track.recHit(i) I get a SiPixelRecHit which always returns null from its det function.
can you try to print details from the reco job itself?
@slava77 not immediately sure where I should add the printouts, probably someone from the tracking POG would be able to do this quicker than me.
Addressed from the PF side in #39368
type pf
I noticed that one of the algorithms (that I was running) was quitting with the error that the charged hadron isolation had a NaN value, while running it over Run 3 GJet samples. This is actually how I noticed this issue. When I had run the same algorithm on Run 2 GJet samples earlier, this did not crop up. So, most likely Run 2 samples did not have this issue.
I had a further look at this:
454 40 303063048 454 41 344241156 454 42 344242180 454 43 344503300 454 44 344504324 454 45 344765444 454 46 344766468
these are all Pixel DetIds: namely, it's a track with 1 hit in BPix layer 1 and 6 hits on side 1 of FPix (suggesting it's very forward).
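A self-contained sketch of how these raw IDs decode into detector/subdetector, assuming the standard DetId bit layout (4 detector bits at offset 28, 3 subdetector bits at offset 25; Tracker = 1, PixelBarrel = 1, PixelEndcap = 2):

// Sketch only: decode the raw DetIds quoted above using the DetId bit layout.
// In CMSSW one would use DetId(rawId).det() and DetId(rawId).subdetId() instead.
#include <cstdint>
#include <cstdio>

int main() {
  const uint32_t rawIds[] = {303063048, 344241156, 344242180, 344503300,
                             344504324, 344765444, 344766468};
  for (uint32_t id : rawIds) {
    unsigned det = (id >> 28) & 0xF;     // 1 = Tracker
    unsigned subdet = (id >> 25) & 0x7;  // 1 = PixelBarrel (BPix), 2 = PixelEndcap (FPix)
    std::printf("%u: det=%u subdet=%s\n", (unsigned)id, det,
                subdet == 1 ? "BPix" : (subdet == 2 ? "FPix" : "other"));
  }
  return 0;
}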
is it obvious that the issue is not e.g. in misalignment or some other sources of mis/displaced hits in this data?
Comparing the rigid-body alignment parameters (the 3 cartesian coordinates of the sensor active-area center and the corresponding Euler angles) of these DetIds between the "realistic" misaligned geometry and the "ideal" one, I don't see exceedingly worrisome deviations, albeit there is one FPix DetId with a somewhat high global-z displacement (around 60 μm), considering that the RMS of the whole set is about 20 μm.
On the other hand I have seen that using this overridden set of conditions on top of the recipe at https://github.com/cms-sw/cmssw/issues/39110#issuecomment-1236821688
from Configuration.AlCa.GlobalTag import GlobalTag
process.GlobalTag = GlobalTag(process.GlobalTag, '121X_mcRun3_2021_realistic_v10', '')
process.GlobalTag.toGet = cms.VPSet(
cms.PSet(record = cms.string("TrackerSurfaceDeformationRcd"),
tag = cms.string("TrackerSurfaceDeformations_zero")
)
)
is sufficient to remove the warning "Scaling is NAN ignoring this candidate/track".
So it looks like it is actually caused by the non-rigid body degrees of freedom.
+1
@cms-sw/reconstruction-l2 : Is there something pending to close this issue?
This is to keep track of (and hopefully solve at some point) an issue where reco::Candidate pt is -nan in some rare cases. It seems that this problem can lead to an egamma object's PF isolation being -nan as well, if the problematic candidate ends up in the egamma object's isolation cone.
One such problematic event is in this AOD file:
root://xrootd-cms.infn.it//store/mc/Run3Winter22DR/GJet_Pt-10to40_DoubleEMEnriched_TuneCP5_13p6TeV_pythia8/AODSIM/FlatPU0to70_122X_mcRun3_2021_realistic_v9-v2/2430000/6cd37543-62ec-4f62-9fa9-23b7c66f9c20.root
The exact run:event number is: eventsToProcess = cms.untracked.VEventRange('1:78326956-1:78326956'),
If we run the AOD->MiniAOD step (I ran in CMSSW_12_2_1) for this event, we get the warning shown above ("Scaling is NAN ignoring this candidate/track") for the event, triggered from here: https://github.com/cms-sw/cmssw/blob/6d2f66057131baacc2fcbdd203588c41c885b42c/CommonTools/RecoAlgos/src/PrimaryVertexSorting.cc#L35-L36
I checked that, in that loop, c->pt() = -nan, and the pdgId is -211. In this event, we have 2 reconstructed gedPhotons, and both have chargedHadronIso() = -nan.
Another example of such a problematic event is in this GEN-SIM-DIGI-RAW file:
/eos/cms/store/relval/CMSSW_12_1_0_pre4/RelValTTbar_14TeV/GEN-SIM-DIGI-RAW/121X_mcRun3_2021_realistic_v10_HighStat-v2/10000/22c2fc3b-069f-4437-ab6d-edf0a9c0dfc7.root
The exact event is: eventsToProcess = cms.untracked.VEventRange('1:36727-1:36727')
Running the RECO step (in CMSSW_12_1_0_pre4) on this event, we get the same warning. In this case, reco::Candidate pt = -nan and its pdgId is 211, and the event has 2 reconstructed ged photons with a -nan value of charged hadron isolation.
Next I plan to check whether the issue of the photon's charged hadron isolation being -nan happens in data as well.