Bug in ZZ candidate leptons assignment

bonanomi commented 7 months ago

I believe the getLeptons function has a bug:

The four selected leptons differ depending on whether one defines leps = list(electrons) + list(muons) or leps = list(muons) + list(electrons), which is undesired behavior.
For events with more than four leptons in total, we sometimes end up selecting the wrong four.
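
A minimal sketch of this kind of order dependence, assuming the candidate indices refer to positions in the concatenated lepton list. This is not the actual getLeptons code; the lepton labels and the stored indices are made up for illustration.

```python
# Hypothetical per-event collections (labels stand in for lepton objects).
electrons = ["e0", "e1"]
muons = ["mu0", "mu1", "mu2"]

# The two concatenation orders mentioned above.
leps_em = list(electrons) + list(muons)   # electrons first
leps_me = list(muons) + list(electrons)   # muons first

# Illustrative candidate indices, written assuming the electrons-first order.
stored = [1, 0, 3, 4]

# Reading the same indices against each ordering gives different leptons.
picked_em = [leps_em[i] for i in stored]
picked_me = [leps_me[i] for i in stored]

print(picked_em)
print(picked_me)
```

Any code that reads the stored indices therefore has to build the combined list in the same order as the code that wrote them.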

Example (EventNumber==4057199 from /store/mc/Run3Summer22EEMiniAODv4/ZHto2Zto4L_M125_TuneCP5_13p6TeV_powheg2-minlo-HZJ-JHUGenV752-pythia8/MINIAODSIM/130X_mcRun3_2022_realistic_postEE_v6-v2/30000/6fdd2d9e-7213-4815-b37d-b3dfc6a5cdb5.root):

In miniAOD this is a 2e2mu event, with the following four leptons:

In nanoAOD, with the current implementation of getLeptons, we get the following four leptons:

with the initial collection of Electron and Muon being:

So we select the first two leptons (electrons) correctly (ZZCand_Z1l1Idx and ZZCand_Z1l2Idx are 1 and 0, respectively), but the third and fourth leptons (muons) are wrong (ZZCand_Z2l1Idx and ZZCand_Z2l2Idx are 3 and 4, respectively).

Is this a bug with getLeptons or with the assignment of the ZZCand_Z2l1Idx and ZZCand_Z2l2Idx indices?

@namapane do you have any suggestion?

bonanomi commented 7 months ago

The issue with this specific example was the different selection of the best ZZ candidate (by Dkin vs. Z2 pT). Now everything is in agreement. Only two events are still out of sync, and I need to understand why. Closing this issue for now.

namapane commented 7 months ago

Hi Matteo, BTW it would make sense to use the same selection for all analyses. For most published results we used Dkin; can we agree to stick to that? Regarding the remaining differences, at this level they can be due to the rounding of variables in nano, which can occasionally move a lepton outside the acceptance. For electrons there are also rare cases where the rounding moves an electron to a different bin of the BDT cuts. I have debugged many such differences in the past, and I can have a look at these two candidates if you wish.
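
To illustrate how rounding alone can flip a cut decision, here is a small sketch. It assumes the rounding behaves like keeping a limited number of float32 mantissa bits; the 10-bit precision, the 5 GeV threshold, and the helper name reduce_mantissa are illustrative assumptions, not the actual nano settings or analysis cuts.

```python
import struct

def reduce_mantissa(x, bits):
    """Round a value to float32 with only `bits` mantissa bits kept (hypothetical helper)."""
    if not 0 < bits < 23:
        raise ValueError("bits must be between 1 and 22")
    (i,) = struct.unpack("<I", struct.pack("<f", x))   # raw float32 bit pattern
    drop = 23 - bits                                   # low mantissa bits to discard
    i = (i + (1 << (drop - 1))) & ~((1 << drop) - 1)   # round to nearest, then mask
    i &= 0xFFFFFFFF
    (y,) = struct.unpack("<f", struct.pack("<I", i))
    return y

PT_CUT = 5.0          # illustrative threshold in GeV
pt_full = 5.001       # full-precision value, just above the cut
pt_rounded = reduce_mantissa(pt_full, 10)

passes_full = pt_full > PT_CUT
passes_rounded = pt_rounded > PT_CUT
print(pt_rounded, passes_full, passes_rounded)
```

A lepton sitting this close to a threshold passes in one format and fails in the other, which is enough to change the set of selected leptons in rare events.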

AlessandroTarabini commented 7 months ago

Hi Nicola, regarding the selection of the best ZZ candidate, we should stick to the "highest pT" criterion for fiducial analyses. The definition of the fiducial phase space should only include cut-based requirements (or at least selections easily reproducible by theoreticians). Since the aim of a fiducial analysis is to maximise the overlap between the fiducial phase space and the detector-level selection, we should use the same selections at both levels.

namapane commented 7 months ago

OK, but this means we have to keep separate productions, etc., with the resulting extra work and confusion. We could also switch back to highest pT for everything, but using Dkin was shown to handle the candidate choice better in associated production. We could recheck that...

AlessandroTarabini commented 7 months ago

Since everything in NanoAOD is based on indices, we could include two bestCandIdx indices, one for Dkin and one for higherPt. In that case we would have a single production and a single set of CJLST ntuples, and each analysis could choose which candidate to use.
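
A rough sketch of what computing the two indices per event could look like. The ZZCand fields, the function name best_indices, and the -1 convention for events without candidates are all hypothetical; the real criteria may involve further steps (for example the Z1 choice) that are not modelled here.

```python
from dataclasses import dataclass

@dataclass
class ZZCand:
    dkin: float        # hypothetical kinematic discriminant value
    z2_sum_pt: float   # hypothetical scalar pT sum of the Z2 leptons

def best_indices(cands):
    """Return (index of best candidate by Dkin, index of best by Z2 pT sum)."""
    if not cands:
        return -1, -1  # assumed convention for "no candidate"
    idx = range(len(cands))
    by_dkin = max(idx, key=lambda i: cands[i].dkin)
    by_pt = max(idx, key=lambda i: cands[i].z2_sum_pt)
    return by_dkin, by_pt

# Illustrative event where the two criteria disagree.
event = [ZZCand(dkin=0.8, z2_sum_pt=40.0), ZZCand(dkin=0.6, z2_sum_pt=55.0)]
best_dkin_idx, best_pt_idx = best_indices(event)
print(best_dkin_idx, best_pt_idx)
```

Storing both indices only helps if every candidate they can point to is also kept in the output.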

namapane commented 7 months ago

That is possible but requires some additional code because:

- We would need to duplicate not only the bestCandIdx but also the indices for the control regions.
- The default setting in production is to save only the best candidate in the SR and in each of the CRs, to save space and CPU (options to store all candidates are also implemented, for special studies such as cut optimization). It would have to be extended so that all interesting candidates are stored.

Still, I think that having two slightly different recipes is problematic, not least because it regularly creates confusion, even among experts (like in this issue...), and I guess even more among newcomers. So I think that rechecking the advantage of using Dkin for the non-fiducial analysis (or looking for a different criterion that would fit both analyses) would not be a bad idea. I'll open an issue to keep track of this.

namapane commented 7 months ago

BTW all recent RunIII productions have been done with the Dkin selection, right?

bonanomi commented 7 months ago

Hi, I would also avoid having two flags for two different selection criteria of the ZZ candidates, as it can create confusion in usage and in bookkeeping. Nicola, yes, now that you mention it, I believe that all the productions run so far have Dkin as the selection criterion. I have to check the 125 GeV signal samples, because I may have used the Z2 pT criterion in my private production.

namapane commented 7 months ago

For the time being I opened #238 to keep track of what adding the two sets of indices would imply. I also added a note about rechecking the Dkin criterion and/or finding a common one.

The problem of confusion is very real. As you point out, we may have an inconsistent set of samples, since different people may have set up their area in different ways, so in any case we would need to find a way to standardize productions.

CJLST / ZZAnalysis

Bug in ZZ candidate leptons assignment #236