Migrate to v5 skims, plus some clean-ups

sam-may commented 3 years ago

Includes relevant changes to run on v5 skims, plus some cleaning up of the code, including:

Load photons directly from pre-calculated diphoton pairs, selectedPhotons. For some reason, awkward does not automatically load this as a record so that we can do e.g. selectedPhotons.pt. So, I added a function in photon_selections.py to do this manually. Related, selectedPhotons seemingly needs to be loaded separately from each file -- otherwise, awkward creates 2d arrays of other branches like gg_mass. I don't entirely understand why this is needed, but it works.
Merge selection of events and calculation of relevant event-level variables. For example, photon and diphoton variables are now computed inside diphoton_selections.py : diphoton_preselection (since this should be the same for every H->gg analysis) and lepton/tau/jet variables are computed inside respective functions analysis_selections.py (since this can differ between analyses)
Add saving of jet-related variables for HHggTauTau_InclusivePresel -- these should be relevant for rejecting tt+X and ttH in BDT training.
Add trigger requirements explicitly for data: this cuts an additional 1% of data events (i.e. 1% of data events that pass diphoton preselection do not pass trigger). We should check whether this is expected or if it implies some minor missing cut in diphoton preselection.
Add HH->ggZZ sample (still need to add HH->ggWW, as well as a few other backgrounds like QCD, Diboson, etc)
Add relevant code for running ttH preselection

The HHggTauTau_InclusivePresel can be run with the following command: /bin/nice -n 19 python loop.py --selections "HHggTauTau_InclusivePresel" --nCores 24 --debug 1 --options "data/HH_ggTauTau_default.json" --output_tag <your_output_tag>

If you run on just data and signal (by adding --select_samples "Data,HH_ggTauTau"), it should take less than 10 minutes: [LoopHelper] Total time to run 77 jobs on 24 cores: 7.85 minutes

And the ttH Leptonic preselection can be run with: /bin/nice -n 19 python loop.py --selections "ttH_LeptonicPresel" --nCores 24 --debug 1 --options "data/ttH_Leptonic.json" --samples "data/samples_and_scale1fb_ttH.json" --output_tag <your_output_tag>

mhl0116 commented 3 years ago

For some reason, awkward does not automatically load this as a record so that we can do e.g. selectedPhotons.pt. So, I added a function in photon_selections.py to do this manually. Related, selectedPhotons seemingly needs to be loaded separately from each file -- otherwise, awkward creates 2d arrays of other branches like gg_mass. I don't entirely understand why this is needed, but it works.

can you give examples of the failure cases? this worries me a little, as how do we make sure we can always catch this small caveat when writing the code, one careless mistake would make the downstream behavior (bug) difficult to catch.

sam-may commented 3 years ago

If I load the branches all together (i.e. don't load photons separately), I get File "/home/users/smay/Hgg/HggAnalysisDev/Preselection/selections/selection_utils.py", line 15, in add_cuts n_events_cut = len(self.events[cut]) File "/home/users/smay/Hgg/HggAnalysisDev/env/lib64/python3.6/site-packages/awkward/highlevel.py", line 1005, in __getitem__ return ak._util.wrap(self._layout[where], self._behavior) MemoryError: std::bad_alloc

From debugging, I found that this was due to awkward creating 2d arrays of gg_* variables. If I load photons separately, this doesn't happen.

For your concern "how do we make sure we always catch this?", aside from the fact that the code crashes if we don't do things this way, this is implemented directly in photon_selections.py and in the load_file function. This is shared for all analyses (so far just ttH and HH->ggTauTau), so someone would have to explicitly change this to reintroduce the bug.

Given that Leonardo reproduced exact event counts with the new code, I would say this is "weird" but not concerning.

mhl0116 commented 3 years ago

no, I was not worried about this particuar feature, I mean more about other small details like this related to columnar tools, maybe it's a matter of getting used to it. if something is crashing then it's great to catch problem, the harder one is that related to number doesn't make sense but one won't notice without x-check on related numbers

but anyway, just a small note/concern that doesn't affect the validity of this PR

sam-may commented 3 years ago

Right, this is a good point. I think one solution for this would be to define a unit test, as you suggested a while ago.

I would think that once we sync the ttH yields a bit more closely, we could use the ttH Leptonic preselection as the unit test.

cmstas / HggAnalysisDev

Migrate to v5 skims, plus some clean-ups #7