LLRCMS / KLUBAnalysis

Generic KLUB analyzer

Possible bug in the handling of the normalization in the skimmers #273

Open francescobrivio opened 1 year ago

francescobrivio commented 1 year ago

While re-checking the sample normalization applied in the skimmers for the EFT studies of the non-resonant Run 2 analysis, I noticed a possible bug in the skimmers. More precisely, the weight topPtReweight is missing from this line (which is used in the numerator when the samples are normalized): https://github.com/LLRCMS/KLUBAnalysis/blob/a5742e528486bc1f5a172872755948b3e84165fb/test/skimNtuple2016_HHbtag.cpp#L2263 The same weight is instead added to the denominator in these lines: https://github.com/LLRCMS/KLUBAnalysis/blob/a5742e528486bc1f5a172872755948b3e84165fb/test/skimNtuple2016_HHbtag.cpp#L1494-L1497

The topPtReweight is computed in the gen-level section of the skimmers and differs from 1 only for TT samples. Even though the branches containing topPtReweight are never used later in the analysis, as shown above this weight is included in the denominator (totalEvents) but not in the numerator.

I had a quick look at the TTtopPtreweight_up branch of the skims (which is actually filled with topPtReweight, as shown here) for a skim file of the TT_semiLep sample, and I can confirm the value is != 1:

TTtopPtreweight_up_TTsemiLep_2016

@portalesHEP @dzuolo @jonamotta this is most probably a bug that we missed in the past, but further checks (e.g. re-skimming the TT samples with this weight added to the numerator) are in order.

portalesHEP commented 1 year ago

Thanks for the heads up! I'm not entirely clear on why this weight is needed, and I found that its value is hardcoded in the skimmer: https://github.com/LLRCMS/KLUBAnalysis/blob/a5742e528486bc1f5a172872755948b3e84165fb/test/skimNtuple2016_HHbtag.cpp#L73-L74 https://github.com/LLRCMS/KLUBAnalysis/blob/a5742e528486bc1f5a172872755948b3e84165fb/test/skimNtuple2016_HHbtag.cpp#L1139-L1141 Do you know where these values come from, and whether we should simply remove them from the sum of weights or rather add them back to the event weight for TT events?

dzuolo commented 1 year ago

Hi @portalesHEP! If we understood the twiki https://twiki.cern.ch/twiki/bin/viewauth/CMS/TopPtReweighting correctly, we fall under "Case 3.1: Analyses with SM tt as background (not in signal)" and, specifically: "In a control region which is signal-depleted and tt-enriched, one should check the data-MC agreement of the main distributions of the analysis, together with the top pT. If the agreement between the data and MC is within the available uncertainties (syst. and stat.) then the effect of top pT mismodeling can be considered covered by the existing uncertainties and no additional correction or uncertainty is needed." So we should remove this weight from the sum of weights and check the data/MC agreement in the inverted resolved2b0j category.

jonamotta commented 1 year ago

To my knowledge, the reweighting is done according to what is written on this page. Although I admit that, looking at it now, the hardcoded numbers coincide but the method itself does not...

portalesHEP commented 1 year ago

Ok, thanks for the pointers. Then I'd tend to agree that we should remove the weights before the next skimming round. I suppose this should not have had any critical impact on the non-resonant result, though, since a dedicated TT correction was extracted that should have absorbed any issue introduced by this bug.

jonamotta commented 1 year ago

I am not sure I agree with either of the points.

  1. I would say that the weights should be correctly re-introduced for the next skimming, after we have confirmed that it was an error on our side not to include it in the numerator (and after testing the difference in a non-resonant TT production).
  2. Even if the data-MC agreement was indeed good, I am not sure I agree that the effect of this scale factor is actually absorbed by our custom ttSF. The ttSF is a normalization factor computed on the mHH distribution, whereas this is a "shape weight" as a function of pT. Maybe we have been lucky, but this does not appear evident to me at first sight.

dzuolo commented 1 year ago

The instructions on the twiki seem to indicate that we should first check the data/MC agreement in a tt-dominated control region and only then, if needed, compute a correction. So I would suggest doing this first.

portalesHEP commented 1 year ago

for (1): I think @dzuolo misquoted the twiki and we are in fact in the second-bullet situation: "In case significant discrepancies are observed, a dedicated top pT reweighting function should be derived from this control region and applied across the analysis while monitoring the agreement of other distributions as a result of this reweighting." Our 'fault' here would then be that the custom correction we derived was indeed not pT-dependent, but that does not change the conclusion that the correction provided on the twiki is not to be used.

jonamotta commented 1 year ago

Having now read the TWiki more carefully, I agree with @portalesHEP that the weight should be removed completely, according to bullets 2 and 4 of case 3.1.

What we could do is move to a pT-dependent computation of the ttSFs for the resonant analysis (or at least test it and see if that is in any way better than the normalization one we already have).

portalesHEP commented 1 year ago

What we could do is move to a pT-dependent computation of the ttSFs for the resonant analysis (or at least test it and see if that is in any way better than the normalization one we already have).

Agreed, but as @dzuolo said, before anything else we should check whether such weights are still needed.

francescobrivio commented 1 year ago

I think the way to move forward for the resonant analysis is:

For the EFT results based on the non-resonant analysis, I will check with the HH conveners to see whether we want to update this (which would require a significant effort) or whether we want to base the EFT results on exactly the same HIG-20-010 analysis.

dzuolo commented 1 year ago

@portalesHEP @bfonta @kramerto @riga We need to decide how to proceed with this issue in the new ntuples: I would propose to remove topPtReweight from EvtW in the skimmers and then check the data/MC agreement in the inverted resolved2b0j. What do you think?

portalesHEP commented 1 year ago

Hi, I think we should keep it in the skimmer (but correct it) and just remove the -t option in the submission script (that would set the weight to its default value of 1, as for any non-top sample). That way, if we realise later on that we have some reason to put it back, it'll be easier.

dzuolo commented 1 year ago

I am not sure I understand your point, Louis. The weight is already stored in the skims, with the latest value suggested by the POG, in the m_TTtopPtreweight_up branch. What I am saying is that we should remove it from the computation of the EvtW, which is the denominator of the normalization. I believe it should not be there if we do not also apply the weight in the numerator.

portalesHEP commented 1 year ago

I'm saying that instead of removing it from the denominator, we should add it to the numerator, but ensure that for now the weight is set to 1 (which should be achieved by removing the -t flag in the skim submission, iirc).

dzuolo commented 1 year ago

Ah ok! Now I understand, this seems ok to me!

kramerto commented 1 year ago

That is also how we submitted the 2017 skims for now: without the -t option, so no top-pT reweighting and no tt stitching (which shouldn't be needed anyway, since the samples we use have no overlap).