Added the functionality of reading the branches that hold the number of electrons and muons passing loose, fakeable and tight selection, and the number of hadronic taus that pass any WP of MVAv2 dR03, MVAv2 dR05 and DeepTau (vs jet) ID.
The mechanics are the following: a lower bound on the lepton multiplicity is applied using the looser of the electron and muon selections. This ensures that we don't cut away events from the MC closure region, where one lepton selection is kept tight while the other is loosened. The upper bound on the number of leptons is obtained by requiring the number of tight leptons to be no larger than what is expected for the final state (e.g. no more than 3 tight leptons in the 3l+Ntau channels). The number of electrons is counted from the cleaned electron collection, so placing an upper bound on the multiplicity of tight leptons won't cut away any extra events. This, however, is not true for hadronic taus, because when counting them the pT cut was lowered from the nominal 20 GeV to 18 GeV. Therefore, only a lower bound on the number of hadronic taus can really be applied.
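To make the logic concrete, here is a minimal plain-Python sketch of the bounds described above (not the actual framework code; the variable names are made up):

```python
# Minimal sketch of the multiplicity cuts described above; the counts are
# assumed to have already been read from the (hypothetical) multiplicity branches.

def passes_multiplicity_cuts(n_lep_loose_sel,   # e+mu count at the *looser* of the two lepton selections
                             n_lep_tight,       # e+mu count at the tight selection
                             n_tau_loosest_wp,  # tau count at the loosest WP considered
                             n_lep_expected,    # leptons expected in the final state (e.g. 3 for 3l+Ntau)
                             n_tau_expected):   # hadronic taus expected in the final state
    # Lower bound on leptons, taken at the looser selection so that the MC
    # closure regions (one lepton kept tight, the other loosened) are not cut away
    if n_lep_loose_sel < n_lep_expected:
        return False
    # Upper bound: no more tight leptons than the final state expects
    if n_lep_tight > n_lep_expected:
        return False
    # Hadronic taus: only a lower bound can be applied, because the counts were
    # filled with an 18 GeV pT cut instead of the nominal 20 GeV
    if n_tau_loosest_wp < n_tau_expected:
        return False
    return True
```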
Implemented in all analyses concerning event categories (including the HH analyses) but not in the auxiliary analyses. Tested in a handful of channels and regions (2lss SR, 2lss+1tau SR, 1l+2tau SR, 1l+2tau fake AR, 1l+2tau MC closure of electrons). I still need to test the solution in all channels, though, by producing new sync Ntuples and comparing the event yields to those from the previous sync Ntuples. If there are no changes, this confirms that the current solution is compatible with the event selection. Also, since the 2017 Ntuples don't have the multiplicity branches (yet), I disabled the cut by default.
Unfortunately, cursory testing indicates that applying cuts on the multiplicity variables improves the performance very little, if at all. In some cases I saw a 10-15% speedup, while in other cases there was no speedup (or even a slowdown). So, the multiplicity branches are not a silver bullet for skimming. I'll still make the cut the default, though, once I double-confirm that the event selection hasn't changed due to the new cuts.
I studied the 2018 MC Ntuples a bit to see how many events we could cut away with these basic cuts: https://github.com/HEP-KBFI/tth-htt/blob/532e73f08ea4d8416cbf3f6d821ccac2376a6f04/test/tthProdNtuple.py#L157-L171 These cuts should be safe enough to study any analysis channel. So I counted the events in the recently added multiplicity branches to determine how many events we would select if we required each event to have at least two fakeable electrons, muons or hadronic taus passing the VLoose WP of the 2017v2 MVA or DeepTau ID discriminant, and got the following results (full results are here):

| Sample set | Selected (MVAv2 VLoose) | Selected (DeepTau VLoose) | Total events |
| --- | --- | --- | --- |
| TOTAL | 11.9% | 12.8% | 2519062049 |
| TOTAL (whitelist) | 13.2% | 14.1% | 1000631750 |
The first row shows that if we were to use all samples, we would cut away about 88.1% (87.2%) of 2.5B MC events if the hadronic tau is required to pass VLoose WP of 2017v2 MVA (DeepTau) discriminant. However, we really don't need to use all those samples in the analysis as some are reserved for the BDT training (where we have to actually relax the cuts to increase the acceptance) and some are relevant only in HH analysis. So, only keeping the common samples (amounting to 1B events), the reduction is about 86.8% (85.9%) when requiring the hadronic taus to pass VLoose WP of 2017v2 MVA (DeepTau) ID.
Assuming that there is a significant overlap between the hadronic taus that pass the VLoose WP of both discriminants, I estimate that applying these basic event selection requirements will reduce the number of events by a factor of about 5. I'm currently checking how many events we would cut away in data; cursory results indicate that the reduction is in the same ballpark.
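For concreteness, the counting above corresponds to a preselection cut of roughly this form (a plain-Python sketch with made-up variable names, not the actual skim configuration):

```python
def passes_basic_preselection(n_fakeable_e, n_fakeable_mu, n_tau_vloose):
    """Keep the event if it has at least two fakeable electrons, fakeable muons
    or hadronic taus passing the VLoose WP of the chosen tau ID discriminant."""
    return (n_fakeable_e + n_fakeable_mu + n_tau_vloose) >= 2
```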
What I'm trying to convey here is that it's probably not a good idea to implement a special TTree for each event category, as the final Ntuples would explode in size (each event takes about 4 kB of storage, totaling ~18 TB worth of Ntuples per era!) and it would increase the complexity of an already complex workflow. One simple skim should be enough.
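As a rough sanity check of that size estimate (using the event counts quoted further down): 4 kB/event × ~4.9×10⁹ events for 2017 ≈ 19-20 TB, which is consistent with the unskimmed 2017 Ntuple volume reported below (19.7 TB); duplicating events into per-category trees would only multiply that.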
Two updates:

| Sample | Selected (MVAv2 VLoose) | Selected (DeepTau VLoose) | Total events |
| --- | --- | --- | --- |
| DoubleMuon_Run2018A_17Sep2018_v2 | 17.7% | 17.9% | 70934237 |
| DoubleMuon_Run2018B_17Sep2018_v1 | 19.9% | 20.1% | 32173507 |
| DoubleMuon_Run2018C_17Sep2018_v1 | 18.6% | 18.7% | 33794567 |
| DoubleMuon_Run2018D_PromptReco_v2 | 17.9% | 18.1% | 162909995 |
| SingleMuon_Run2018A_17Sep2018_v2 | 6.2% | 6.5% | 227489240 |
| SingleMuon_Run2018B_17Sep2018_v1 | 6.4% | 6.7% | 110446445 |
| SingleMuon_Run2018C_17Sep2018_v1 | 6.4% | 6.8% | 107972995 |
| SingleMuon_Run2018D_PromptReco_v2 | 6.4% | 6.7% | 492637435 |
| Tau_Run2018A_17Sep2018_v1 | 4.6% | 7.2% | 59503851 |
| Tau_Run2018B_17Sep2018_v1 | 5.0% | 7.4% | 29788612 |
| Tau_Run2018C_17Sep2018_v1 | 5.3% | 7.6% | 31338906 |
| Tau_Run2018D_PromptReco_v2 | 5.1% | 7.5% | 162352731 |
| MuonEG_Run2018A_17Sep2018_v1 | 2.6% | 2.9% | 31249788 |
| MuonEG_Run2018B_17Sep2018_v1 | 2.7% | 3.1% | 14454733 |
| MuonEG_Run2018C_17Sep2018_v1 | 2.6% | 3.0% | 15363987 |
| MuonEG_Run2018D_PromptReco_v2 | 2.6% | 2.9% | 70006284 |
| EGamma_Run2018A_17Sep2018_v2 | 2.4% | 2.6% | 308652991 |
| EGamma_Run2018B_17Sep2018_v1 | 2.8% | 3.0% | 139144140 |
| EGamma_Run2018C_17Sep2018_v1 | 2.6% | 2.9% | 143781609 |
| EGamma_Run2018D_PromptReco_v2 | 2.4% | 2.6% | 708703289 |
| TOTAL | 5.5% | 6.0% | 2952699342 |
So, we end up with about 320M events in total (both data and MC) after the skimming, or ~1B events over all three eras. Channels like 0l+2tau and 1l+1tau may require binned DY and W+jets samples, so the number of skimmed events may be a bit higher.
I ran the event preselection overnight on the 2017 and 2018 Ntuples (2016 is still incomplete) and obtained the actual reduction of events for each sample and for the complete era. Here and here are the details. The total fraction of events kept is:

- 2017: 9.76% (4852138581 or 4.85B -> 473341078 or 473.3M)
- 2018: 8.60% (5595318063 or 5.60B -> 481092625 or 481.1M)

And if we consider only the samples that are actually used in the analysis (modulo the binned DY and W+jets samples, which are needed in the 0l+2tau and 1l+1tau channels but are disabled by default):

- 2017: 9.68% (3139591593 or 3.14B -> 303830145 or 303.8M)
- 2018: 7.66% (3997796995 or 4.00B -> 306205196 or 306.2M)
So the gain is ~10x compared to the initial event counts. These skimmed Ntuples can be used in any analysis that doesn't require loose leptons, so analyses producing Ntuples for BDT training are out of the question, as are the charge flip and lepton fake rate measurements.
The disk space consumed by the Ntuples is also reduced by almost 10x: from 19.7 TB to 2.5 TB in 2017 and from 17.7 TB to 2.1 TB in 2018.
I could also skim the HH multilepton samples, because those channels are compatible with the event preselection cuts applied in this (ttH) analysis, but there are only 45M HH events in total (over all eras). However, for the HH bbWW analysis we would need to adopt different preselection cuts because of channels like bb+1l, where the event is required to contain a single lepton (as opposed to at least two electrons, muons or hadronic taus). It would also imply a complementary preselection of the ttH samples. Since there are only 88M events in the bbWW analysis in total, I decided not to skim any HH events right now.
The event yields in each analysis channel and region remained unchanged after switching to skimmed signal samples. For this reason, I now made the skimmed samples the default in all analysis channels in ttH and HH multilepton repositories. The preselected samples are also used in charge flip measurement, because the selected leptons are required to pass tight lepton selection.
Yet another complementary approach to combat the issue described in https://github.com/HEP-KBFI/tth-htt/issues/59 is to go a bit more specific with the event preselection. One way is to introduce integer variables to the Ntuples that hold the following information:

- the number of electrons and muons passing the loose, fakeable and tight selections;
- the number of hadronic taus passing each WP of the MVAv2 dR03, MVAv2 dR05 and DeepTau (vs jet) IDs.
The argument for condensing the multiplicity of the objects passing certain criteria into integer variables is that we can then decide to reject an event early, without reading and constructing all those massive `Reco*` objects. The counterargument to this approach is that the jobs are already I/O-bound and the potential gain is a bit unclear. I also think that these branches should be added in the post-processing step and not in the skimming step (otherwise we wouldn't be able to use Ntuples without those branches in our analysis), but the post-production of the 2017 Ntuples is almost done and it would be very expensive to run those jobs again.
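To illustrate the early-rejection idea behind the first approach, here is a minimal PyROOT sketch (not the framework's actual code; the file name and the branch names `nFakeableElectron`/`nFakeableMuon` are made up). Only the small integer counters are read in a first pass, and the heavy per-object branches are touched only for events that survive:

```python
import ROOT

fin = ROOT.TFile.Open("ntuple.root")  # placeholder file name
tree = fin.Get("Events")

# First pass: activate only the cheap integer counter branches
tree.SetBranchStatus("*", 0)
tree.SetBranchStatus("nFakeableElectron", 1)  # assumed branch name
tree.SetBranchStatus("nFakeableMuon", 1)      # assumed branch name

selected_entries = []
for i in range(tree.GetEntries()):
    tree.GetEntry(i)
    if tree.nFakeableElectron + tree.nFakeableMuon >= 2:
        selected_entries.append(i)

# Second pass: re-enable all branches and process only the surviving entries,
# i.e. construct the full event (the expensive Reco* objects) only here
tree.SetBranchStatus("*", 1)
for i in selected_entries:
    tree.GetEntry(i)
    # ... full object selection and analysis would go here
```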
The second way to go about this is to create preskimmed TTrees in the Ntuples, one TTree corresponding to the multiplicity signature of each channel. For instance, a TTree named `Events_2lss` would contain all events that have two (or more) fakeable leptons (a rough sketch of producing such a tree is given below). The downside is that the Ntuple files are going to be huge compared to the first approach, which probably affects I/O negatively.

I'll give this a lower priority than https://github.com/HEP-KBFI/tth-htt/issues/60 because the gain is not well quantified (plus it varies wildly from channel to channel).
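For reference, producing such a per-channel preskimmed tree could look roughly like this with plain ROOT (again a sketch with made-up file and branch names, not the actual skimming code):

```python
import ROOT

fin = ROOT.TFile.Open("ntuple.root")            # input Ntuple (placeholder name)
events = fin.Get("Events")

fout = ROOT.TFile("ntuple_preskimmed.root", "RECREATE")
# Keep only events with at least two fakeable leptons (the 2lss signature);
# nFakeableElectron and nFakeableMuon are assumed counter branches
events_2lss = events.CopyTree("nFakeableElectron + nFakeableMuon >= 2")
events_2lss.SetName("Events_2lss")
events_2lss.Write()
fout.Close()
fin.Close()
```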