HEP-KBFI / TallinnNtupleProducer

code, python scripts and config files for producing "plain" Tallinn Ntuples

Propagate necessary weights for estimating MC closure #27

Open ktht opened 1 year ago

ktht commented 1 year ago

At the moment we include the FR weight in the nominal event weight. However, when estimating the MC contribution in fake closure regions, where leptons of one flavor are kept tight while leptons of the other flavor are relaxed to fakeable but not tight, we have to apply the FR weights only to those events that conform to the latter case.

I believe the easiest way to implement this is to factor out, at the analysis level, the FR weights that correspond to the flavor of the tight leptons. In other words, we need two additional branches: FR weights for electrons (e.g. frWeight_e) and FR weights for muons (frWeight_m). When we estimate the MC closure contribution for electrons and muons, we just divide the nominal event weight by the FR weight of muons and electrons, respectively, and fill the histograms.
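A minimal sketch of the proposed bookkeeping, assuming the nominal event weight already contains the product of both per-flavor FR weights. The function name closure_weight and its arguments are hypothetical; only the branch names frWeight_e and frWeight_m come from the proposal above.

```python
def closure_weight(evt_weight, fr_weight_e, fr_weight_m, closure_flavor):
    """Undo the FR weight of the flavor that is kept tight.

    Assumption: evt_weight already includes fr_weight_e * fr_weight_m,
    where the FR weight of a tight lepton is 1.
    """
    if closure_flavor == "e":
        # Electron closure: muons stay tight, so divide out the muon FR weight.
        return evt_weight / fr_weight_m
    if closure_flavor == "m":
        # Muon closure: electrons stay tight, so divide out the electron FR weight.
        return evt_weight / fr_weight_e
    raise ValueError(f"unknown closure flavor: {closure_flavor!r}")


# Hypothetical numbers: base weight 0.8, frWeight_e = 0.25, frWeight_m = 0.5.
nominal = 0.8 * 0.25 * 0.5
weight_for_clos_e = closure_weight(nominal, 0.25, 0.5, "e")
weight_for_clos_m = closure_weight(nominal, 0.25, 0.5, "m")
```

The division only makes sense if the stored FR weights are non-zero, which is an additional assumption of this sketch.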

saswatinandan commented 1 year ago

The Fake_Rate weight is considered in the evtweight calculation here (https://github.com/HEP-KBFI/TallinnNtupleProducer/blob/main/EvtWeightTools/src/EvtWeightRecorder.cc#L61), and here (https://github.com/HEP-KBFI/TallinnNtupleProducer/blob/main/EvtWeightTools/src/EvtWeightRecorder.cc#L1149-L1176) all leptons are looped over: the fake weight is estimated only for those leptons that fail the tight selection, and for those passing the tight selection it is 1. So I don't think we need any additional branch for the Fake_Rate weight.

ktht commented 1 year ago

This is sufficient when estimating fakes but not for MC closure, where we want to apply the FR weights based on the lepton flavor. This distinction is currently not handled.

saswatinandan commented 1 year ago

Suppose we are in the electron-muon channel and want to consider the electron closure shape correction. Then we will consider only those events that have a fakeable but not tight electron and a tight muon. We can check this from the isTight and isFakeable branches, and the lepton flavour can be checked from pdgId. The fake weights are obtained from here (https://github.com/HEP-KBFI/TallinnNtupleProducer/blob/main/EvtWeightTools/src/EvtWeightRecorder.cc#L57), where the fake weight is already considered. Isn't that sufficient, or is something missing?

ktht commented 1 year ago

OK so it could work as you described, but the event selection string grows exponentially with the multiplicity of leptons. I find it easier to understand and simpler to implement if we just require all particles to pass the fakeable selection and undo the FR for a given lepton flavor.

However, since the current approach (of not adding any new branches to the Ntuple) is more favorable in terms of file size (which is quite a problem for us -- although there's been zero effort to solve it), I'm open to the idea that we implement the complicated event selection string that takes all possible permutations of lepton flavors and tightness conditions into account. In that case we need some piece of code that generates the selection string (when creating the cfg files for analysis jobs), given the lepton multiplicity and the lepton flavor for which we want to derive the MC closure.

veelken commented 1 year ago

Hi,

I think Saswati's idea to handle the MC closure via custom event selection strings is a good idea. I think we actually need custom event selection strings anyway, Karl.

If I recall correctly, we compute the fake MC closure systematics separately for electrons (Clos_e) and muons (Clos_m). When we compute the Clos_e systematic, we relax the electron selection to fakeable, while keeping the muon selection tight, and apply the FR weights to the fakeable electron. So, I believe we need a custom event selection string anyway, in order to relax the lepton selection from tight to fakeable only for electrons and not for muons. The FR weights are already correctly included in the evtWeight (except that we need to switch from FR measured in data to FR obtained in MC). A similar reasoning applies for computing the Clos_m systematic.

Or am I missing something ?

ktht commented 1 year ago

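AFAICT you're both correct. I did not consider the case where the event has both fake electrons and muons but neither of which are tight -- such events end up in fake AR but not in MC closure -- so the selection string has to differ wrt the fake AR. If that weren't the case, then it'd have been easier to just save the extra weights imo.

Thus, we need a python function that generates:

  • "(lep1_isTight && lep2_isTight && ... && lepN_isTight)" for the SR;
  • "(lep1_isFake && lep2_isFake && ... && lepN_isFake) && ! (lep1_isTight && lep2_isTight && ... && lepN_isTight)" for the fake AR;
  • permutations of "(lepA_isFake && ! lepA_isTight && (lepA_pdgId == 11 || lepA_pdgId == -11))" and "lepB_isTight && (lepB_pdgId == 13 || lepB_pdgId == -13)" to get MC closure for electrons (and swap 11 and 13 to get MC closure for muons);

when creating cfg files for analysis jobs and apply the nominal event weight in all cases. It could take the number of leptons and the type of analysis region (SR, fake AR, MC closure for e/mu) as input and return one of those strings given above. For taus we can have a separate function and ignore the PDG ID info. I think it can be implemented in our job distribution framework (right?). Or what do you think?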
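A rough sketch of such a generator, under stated assumptions. The branch names lepN_isTight, lepN_isFake and lepN_pdgId are the ones used in this thread; the function names, the region labels ("SR", "fakeAR", "closure_e", "closure_m") and the reading of "permutations" as "every assignment in which at least one lepton is a relaxed lepton of the target flavor and all others are tight leptons of the other flavor" are illustrative choices, not the framework's actual interface.

```python
from itertools import product

PDG_ID = {"e": 11, "m": 13}


def _all(flag, n_leptons):
    """'(lep1_<flag> && ... && lepN_<flag>)'"""
    return "(" + " && ".join(f"lep{i}_{flag}" for i in range(1, n_leptons + 1)) + ")"


def _is_flavor(i, flavor):
    pdg = PDG_ID[flavor]
    return f"(lep{i}_pdgId == {pdg} || lep{i}_pdgId == -{pdg})"


def closure_terms(n_leptons, flavor):
    """One term per assignment with at least one relaxed lepton of `flavor`."""
    other = "m" if flavor == "e" else "e"
    terms = []
    for relaxed in product((True, False), repeat=n_leptons):
        if not any(relaxed):
            continue  # all leptons tight: not a closure-region event
        parts = []
        for i, is_relaxed in enumerate(relaxed, start=1):
            if is_relaxed:
                parts.append(f"lep{i}_isFake && ! lep{i}_isTight && {_is_flavor(i, flavor)}")
            else:
                parts.append(f"lep{i}_isTight && {_is_flavor(i, other)}")
        terms.append("(" + " && ".join(parts) + ")")
    return terms


def selection_string(n_leptons, region):
    if region == "SR":
        return _all("isTight", n_leptons)
    if region == "fakeAR":
        return f"{_all('isFake', n_leptons)} && ! {_all('isTight', n_leptons)}"
    if region in ("closure_e", "closure_m"):
        return "(" + " || ".join(closure_terms(n_leptons, region[-1])) + ")"
    raise ValueError(f"unknown region: {region!r}")
```

With this reading the closure string has 2^N - 1 terms for N leptons, which illustrates the exponential growth mentioned earlier in the thread.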


veelken commented 1 year ago

Hi,

I have an alternative proposal. We could add 2 flags:

  • isClos_e = isTight || (is_electron && isFakeable)
  • isClos_mu = isTight || (is_muon && isFakeable)

to the RecoLepton class and 2 corresponding branches to the "plain" Ntuple (this would be an addition to the C++ code, which would work in the same way for all channels, as far as I can see).

These extra branches would allow us to simply

  • replace "lep1_isTight" -> "lep1_isClos_e" and "lep2_isTight" -> "lep2_isClos_e"
  • replace "ntightlep = 2" -> "ntightlep <= 1"

in the event selection string when running the analysis code to compute the Clos_e systematics, and to make a similar replacement when running on the Clos_m systematic.

I think this would work (and would require the least amount of coding to replace the event selection string).

What do you think ?
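A small sketch of how this proposal could look, assuming the flag definitions isClos_e = isTight || (is_electron && isFakeable) and isClos_mu = isTight || (is_muon && isFakeable). It is written in Python for illustration only (the actual flags would live in the C++ RecoLepton class), and the helper names is_clos and to_closure_selection are hypothetical. Only the isTight replacement is shown; the ntightlep cut would need separate handling.

```python
PDG_ID = {"e": 11, "mu": 13}


def is_clos(is_tight, is_fakeable, pdg_id, flavor):
    """Per-lepton closure flag: tight, or fakeable and of the relaxed flavor."""
    return is_tight or (abs(pdg_id) == PDG_ID[flavor] and is_fakeable)


def to_closure_selection(selection, flavor):
    """Swap every '<lepton>_isTight' cut for '<lepton>_isClos_<flavor>'."""
    return selection.replace("_isTight", f"_isClos_{flavor}")


# A fakeable (not tight) electron passes isClos_e, a fakeable muon does not.
electron_ok = is_clos(False, True, -11, "e")
muon_ok = is_clos(False, True, 13, "e")
clos_e_selection = to_closure_selection("lep1_isTight && lep2_isTight", "e")
```

The appeal of this design is that the selection string keeps the same length as in the SR; the cost is two extra branches per lepton in the Ntuple.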

saswatinandan commented 1 year ago

AFAICT you're both correct. I did not consider the case where the event has both fake electrons and muons but neither of which are tight -- such events end up in fake AR but not in MC closure -- so the selection string has to differ wrt the fake AR. If that weren't the case, then it'd have been easier to just save the extra weights imo.

Thus, we need a python function that generates:

  • "(lep1_isTight && lep2_isTight && ... && lepN_isTight)" for the SR;
  • "(lep1_isFake && lep2_isFake && ... && lepN_isFake) && ! (lep1_isTight && lep2_isTight && ... && lepN_isTight)" for the fake AR;
  • permutations of "(lepA_isFake && ! lepA_isTight && (lepA_pdgId == 11 || lepA_pdgId == -11))" and "lepB_isTight && (lepB_pdgId == 13 || lepB_pdgId == -13)" to get MC closure for electrons (and swap 11 and 13 to get MC closure for muons);

when creating cfg files for analysis jobs and apply the nominal event weight in all cases. It could take the number of leptons and the type of analysis region (SR, fake AR, MC closure for e/mu) as input and return one of those strings given above. For taus we can have a separate function and ignore the PDG ID info. I think it can be implemented in our job distribution framework (right?). Or what do you think?

I guess for the application region at least one lepton should be fakeable and not tight, and at most all leptons can be fakeable but not tight, i.e.

ktht commented 1 year ago

It's the same thing.

saswatinandan commented 1 year ago

Right. Then we actually don't need "(lep1_isFake && lep2_isFake && ... && lepN_isFake)", as all leptons are fakeable (https://github.com/HEP-KBFI/TallinnNtupleProducer/blob/main/Writers/plugins/RecoLeptonWriter.cc#L132), so "! (lep1_isTight && lep2_isTight && ... && lepN_isTight)" alone is sufficient for the AR.

ktht commented 1 year ago

Hi,

I have an alternative proposal. We could add 2 flags:

  • isClos_e = isTight || (is_electron && isFakeable)
  • isClos_mu = isTight || (is_muon && isFakeable)

to the RecoLepton class and 2 corresponding branches to the "plain" Ntuple (this would be an addition to the C++ code, which would work in the same way for all channels, as far as I can see).

These extra branches would allow to simply

  • replace "lep1_isTight" -> "lep1_isClos_e" and "lep2_isTight" -> "lep2_isClos_e"
  • replace "ntightlep = 2" -> "ntightlep <= 1"

in the event selection string when running the analysis code to compute the Clos_e systematics. And to make a similar replacement when running on the Clos_m systematic.

I think this would work (and would require the most minimal amount of coding to replace the event selection string).

What do you think ?

Hi Christian, shouldn't it be

?

In any case, the code that produces the selection string would be the same at the end of the day, i.e. it doesn't matter if we concatenate "isTight" flags with "&&" delimiters to obtain events in the SR, or "isClos_e" / "isTight || (is_electron && isFakeable && ! isTight)" flags to get events in the Clos_e region. It basically comes down to whether we want to keep the number of branches to an absolute minimum or make the selection string slightly more readable. I have no preference here.

Saswati, I see no point in being so nitpicky. If we want to relax the lepton selection for the purpose of training an ML model for example, then we would probably need to save loose leptons instead of fakeable ones.

veelken commented 1 year ago

Hi Karl,

I think

is wrong. My understanding is that at least one of the leptons must not pass the tight lepton selection (to avoid overlap with the signal region). It is wrong to demand that all leptons fail the tight selection. Think about an event with 2 electrons in the 2lss channel. For this event, the isClos_m systematic requires that both electrons pass the tight selection, while the isClos_e systematic allows for either 2 fakeable or 1 fakeable + 1 tight electron, mimicking the selection that we apply in the fake background control region for the real data.

I realized that I made a mistake too: It is wrong to replace "ntightlep = 2" -> "ntightlep <= 1" for Clos_e as well as for Clos_m. Instead, we need to:

in order to reproduce the behaviour in https://github.com/HEP-KBFI/tth-htt/blob/master/bin/analyze_2lss.cc#L1737-L1739

ktht commented 1 year ago

So I now had more time to focus on this topic. Sorry for the very long post but I don't see any other way than to explicitly spell it all out.

I almost agree with your proposal, Christian, if it weren't for this one caveat that we haven't discussed at all: the gen-matching status of the selected leptons and taus. The second line that you highlighted in your latest reply says that a given event is vetoed in the MC closure region for electrons/muons if all selected objects are tight but at least one of those objects is a non-prompt electron/muon. Note that the functions countFakeElectrons and countFakeMuons determine the number of electrons and muons that are matched to generator-level jets. Please also note that we only select those events in the MC closure regions that have at least one non-prompt lepton, which is accomplished by considering only those histograms that have the "_fake" suffix in their name.

It follows that the proposed selection fails to consider 2lss events as coming from MC closure for electrons if a prompt electron and a non-prompt muon in the event both pass the tight cuts. In the past I've tried to argue for removal of such contributions from the MC closure region (see this thread and the PDF file linked within), but unfortunately it introduced too large a discrepancy between MC closure and fakes MC. Thus, if we want to replicate the same prescription as before, we need to replace the proposed "(ntightlep <= 1 || nmuons = 0)" and "(ntightlep <= 1 || nelectrons = 0)" cuts with other criteria (which I'll show below). Alternatively, we could measure fakeable-to-tight identification efficiencies and then figure out how to reconcile those with the fake factor method, or we could abandon the fake factor method altogether in favor of the full matrix approach that includes both fake rates and identification efficiencies (in low-multiplicity channels where inverting such a matrix is mathematically valid).

For the sake of completeness I'll discuss all possible use-cases we might encounter in our analysis as to how the phase space can be partitioned based on the promptness and tightness criteria of selected leptons and taus. I'll focus only on the leptons in the following to keep things simple (mostly because taus have different gen-matching codes and no pdgId attributes). At the very end you'll find some tables that demonstrate how the partitioning is supposed to work.

First, we have data events, which do not have any generator-level information available for obvious reasons:

In MC we have several ways of dividing the phase space:

In my opinion, the most efficient way to implement it all would still be at the analysis level. In this case we can drop the redundant isFake, isFlip or isClos_e/m branches, since there is no clear-cut way around creating the long event selection string regardless. Plus, it gives more flexibility for specifying the gen-matching conditions and reduces the Ntuple size. I think the piece of code that does all that can be moved to the framework that generates the config files. It can be generalized such that all the details above are hidden from the end-user.

Another issue is that we have used different (i.e. QCD-driven) fake rates (FR) in MC closure regions compared to FMC or data fakes, where we use data-driven FR. If we want to support this feature (which I think is the case because it allows us to encapsulate potential differences in flavor composition between the measurement and application regions), then the easiest way to accomplish it is to:


edit: forgot to explain what each letter in the following tables stands for:

Single lepton channel:

e    F     T
N    MCe   FMC, MCm
P    PMC   SR

m    F     T
N    MCm   FMC, MCe
P    PMC   SR

Dilepton channel:

ee   FF    FT    TF    TT
NN   MCe   MCe   MCe   FMC, MCm
NP   MCe   MCe   MCe   FMC, MCm
PN   MCe   MCe   MCe   FMC, MCm
PP   PMC   PMC   PMC   SR

em   FF    FT    TF    TT
NN   -     MCe   MCm   FMC
NP   -     MCe   MCm   FMC, MCm
PN   -     MCe   MCm   FMC, MCe
PP   PMC   PMC   PMC   SR

mm   FF    FT    TF    TT
NN   MCm   MCm   MCm   FMC, MCe
NP   MCm   MCm   MCm   FMC, MCe
PN   MCm   MCm   MCm   FMC, MCe
PP   PMC   PMC   PMC   SR

2l+1tau channel (assuming that taus are required to be gen-matched in the SR):

eet  FFF   FFT   FTF   TFF   FTT   TFT   TTF   TTT
NNN  -     MCe   -     -     MCe   MCe   MCt   FMC, MCm
NNP  -     MCe   -     -     MCe   MCe   MCt   FMC, MCm, MCt
NPN  -     MCe   -     -     MCe   MCe   MCt   FMC, MCm
PNN  -     MCe   -     -     MCe   MCe   MCt   FMC, MCm
NPP  -     MCe   -     -     MCe   MCe   MCt   FMC, MCm, MCt
PNP  -     MCe   -     -     MCe   MCe   MCt   FMC, MCm, MCt
PPN  -     MCe   -     -     MCe   MCe   MCt   FMC, MCm, MCe
PPP  PMC   PMC   PMC   PMC   PMC   PMC   PMC   SR

emt  FFF   FFT   FTF   TFF   FTT   TFT   TTF   TTT
NNN  -     -     -     -     MCe   MCm   MCt   FMC
NNP  -     -     -     -     MCe   MCm   MCt   FMC, MCt
NPN  -     -     -     -     MCe   MCm   MCt   FMC, MCm
PNN  -     -     -     -     MCe   MCm   MCt   FMC, MCe
NPP  -     -     -     -     MCe   MCm   MCt   FMC, MCm, MCt
PNP  -     -     -     -     MCe   MCm   MCt   FMC, MCe, MCt
PPN  -     -     -     -     MCe   MCm   MCt   FMC, MCe, MCm
PPP  PMC   PMC   PMC   PMC   PMC   PMC   PMC   SR

mmt  FFF   FFT   FTF   TFF   FTT   TFT   TTF   TTT
NNN  -     MCm   -     -     MCm   MCm   MCt   FMC, MCe
NNP  -     MCm   -     -     MCm   MCm   MCt   FMC, MCe, MCt
NPN  -     MCm   -     -     MCm   MCm   MCt   FMC, MCe
PNN  -     MCm   -     -     MCm   MCm   MCt   FMC, MCe
NPP  -     MCm   -     -     MCm   MCm   MCt   FMC, MCe, MCt
PNP  -     MCm   -     -     MCm   MCm   MCt   FMC, MCe, MCt
PPN  -     MCm   -     -     MCm   MCm   MCt   FMC, MCe, MCm
PPP  PMC   PMC   PMC   PMC   PMC   PMC   PMC   SR

Of course, if taus are not required to be gen-matched (and data/MC SF is applied instead, as was the case with 2lss+1tau and 3l+1tau channels in ttH multilepton analysis), then the phase space partitioning is identical to the dilepton case.
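One way to read the tables above is as a small classification rule. The sketch below is an interpretation, not code from the framework. It assumes that the column labels F/T mean "fakeable but not tight" / "tight" per object, that the row labels N/P mean "non-prompt" / "prompt" per object, and that the cell entries (SR, PMC, FMC, MCe, MCm, MCt) are region labels. The function name classify and its arguments are hypothetical.

```python
def classify(flavors, tight, prompt, closure_flavors=("e", "m")):
    """Return the set of region labels for one MC event.

    flavors: string such as "em" or "eet" (one character per object)
    tight, prompt: one bool per object, in the same order as flavors
    closure_flavors: flavors for which the channel defines an MC closure region
    """
    if all(prompt):
        # Every object is prompt: signal region if all tight, otherwise PMC.
        return {"SR"} if all(tight) else {"PMC"}

    regions = set()
    if all(tight):
        regions.add("FMC")
        for x in closure_flavors:
            # Veto closure for flavor x if a non-prompt object of that flavor is present.
            if not any(f == x and not p for f, p in zip(flavors, prompt)):
                regions.add("MC" + x)
    else:
        # Closure applies only if all relaxed objects share a single flavor.
        relaxed = {f for f, t in zip(flavors, tight) if not t}
        if len(relaxed) == 1:
            regions.add("MC" + relaxed.pop())
    return regions


# Dilepton "em" table, row PN, column TT.
em_tt_pn = classify("em", [True, True], [True, False])
# 2l+1tau "eet" table, row PPN, column TTT.
eet_ttt_ppn = classify("eet", [True, True, True], [True, True, False], ("e", "m", "t"))
```

Under these assumptions the rule gives an empty set for cells such as "em" with both leptons relaxed and at least one non-prompt, consistent with the earlier remark that such events end up in the fake AR but not in MC closure.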