HEP-KBFI / TallinnNtupleProducer

code, python scripts and config files for producing "plain" Tallinn Ntuples

Read GenPart collection directly #12

Closed ktht closed 2 years ago

ktht commented 2 years ago

This PR is a major refactoring of the previous code, where we read different gen particle collections for various tasks.

I'll add a longer commentary once I'm done with all the checks.

ktht commented 2 years ago

When we first migrated to NanoAOD, I implemented the splicing in nanoAOD-tools for the following two reasons:

The last point is now moot, because we're not transcribing full collections to the output Ntuple anymore.

I think we can do better if we close the gap between vanilla NanoAOD and the post-processed NanoAOD Ntuple that we need as input to this FW. The post-processing module that splices the GenPart collection has a couple of major drawbacks:

For this reason I took this weekend off and decided to modify the code such that it only knows about the full GenPart, GenJet and GenVisTau collections that are available by default in NanoAOD and nothing else. This way we can get rid of the genParticleProducer module. Here is the list of changes that it took to get me there:

A few problems cropped up while debugging and testing the code:

All genMatch branches written to the output Ntuple are identical to what we had before. After testing my changes on the full HH signal sample, which contains 400k events, I confirm that they have a negligible impact on memory consumption (991MB -> 1.05GB), but the runtime increased by nearly 50% (1729s -> 2687s). After optimizing the code a little bit, I was able to reduce the runtime to about 2500s (I've lost the log file already).

The main culprit behind the runtime increase is these lines: https://github.com/HEP-KBFI/TallinnNtupleProducer/blob/73df512997d079a2f01d7cfb8ae4c8375bf5843e/Readers/src/EventReader.cc#L583-L588 where we match reco jets to gen leptons, taus and jets. If I disable these lines, the code actually runs faster than before (~1600s). I don't see a good physics reason for matching reco jets to anything, because we're not using this information in any shape or form. We could still retain the functionality by putting it behind a configurable boolean flag that is disabled by default. What do you think?
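For concreteness, the reco-to-gen matching in question boils down to a nested ΔR loop that runs once per jet per systematic shift. Here is a minimal self-contained sketch of that pattern (the struct and function names are made up for illustration and are not the actual EventReader code):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Minimal four-vector stub with just the quantities needed for angular matching.
struct P4 { double eta, phi; };

// Wrap the azimuthal difference into [-pi, +pi].
double deltaPhi(double a, double b) {
  const double pi = std::acos(-1.0);
  double d = a - b;
  while (d >  pi) d -= 2 * pi;
  while (d < -pi) d += 2 * pi;
  return d;
}

double deltaR(const P4 & a, const P4 & b) {
  const double deta = a.eta - b.eta;
  const double dphi = deltaPhi(a.phi, b.phi);
  return std::sqrt(deta * deta + dphi * dphi);
}

// For one reco jet, find the nearest gen object within dRmax;
// returns an index into genObjs, or -1 if nothing is close enough.
// Calling this for every jet and every gen collection, once per
// systematic shift, is what makes the cost scale as
// N_syst * N_jets * N_gen.
int matchToGen(const P4 & jet, const std::vector<P4> & genObjs, double dRmax = 0.4) {
  int best = -1;
  double bestDR = dRmax;
  for (std::size_t i = 0; i < genObjs.size(); ++i) {
    const double dr = deltaR(jet, genObjs[i]);
    if (dr < bestDR) { bestDR = dr; best = static_cast<int>(i); }
  }
  return best;
}
```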

If this PR is merged, I think we might be one inch closer to phasing out nanoAOD-tools. The only remaining modules are jet-related: b-tagging SF and its uncertainties, and JES/JER corrections plus uncertainties on the jet and MET objects. It should be feasible, though.

veelken commented 2 years ago

Hi Karl,

I agree that we can put the code https://github.com/HEP-KBFI/TallinnNtupleProducer/blob/73df512997d079a2f01d7cfb8ae4c8375bf5843e/Readers/src/EventReader.cc#L583-L588 under some boolean flag that's configurable in python (such that the code that matches reconstructed jets to generator-level electrons, muons, and taus is not executed by default).
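On the C++ side, such a gate could look roughly like the sketch below. The class and field names here are hypothetical, not the actual EventReader interface; in practice the flag would be set from the python config and default to false:

```cpp
#include <cassert>

// Hypothetical config struct standing in for the python-side
// parameter set; the real FW reads its settings via edm::ParameterSet.
struct EventReaderCfg {
  bool jetGenMatching_enabled = false;  // off by default, as proposed
};

// Sketch of a reader that only runs the expensive reco-jet <-> gen-object
// matching when the flag is set.
struct EventReader {
  explicit EventReader(const EventReaderCfg & cfg)
    : jetGenMatching_enabled_(cfg.jetGenMatching_enabled) {}

  // Returns the number of matching passes executed (0 or 1 here),
  // just to make the gating observable in this toy example.
  int read() {
    int nMatchingPasses = 0;
    if (jetGenMatching_enabled_) {
      // ... reco-jet to gen-lepton/tau/jet matching would go here ...
      ++nMatchingPasses;
    }
    return nMatchingPasses;
  }

  bool jetGenMatching_enabled_;
};
```

The point of the design is that the default path pays no cost at all: the matching loops are simply never entered unless a user opts in via the config.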

Just for clarification: if I understand correctly, splitting the GenParticle collection in the TallinnNtupleProducer takes about 800 seconds per 400k events, while skipping the jet-matching code saves about 1000 seconds per 400k events, so the code runs faster than before. But the two changes are unrelated, i.e. the jet-matching code just happens to take about the same time as splitting the GenParticle collection, right? I don't see how the jet matching would be related to splitting the GenParticle collection.

ktht commented 2 years ago

Hi Christian, the elapsed times I quoted are for running produceNtuple before and after the changes of this PR. The input Ntuple is the same: it contains both spliced GenPart collections as well as the GenPart collection itself. (Remember, in the old FW we dropped the GenPart collection because it was not possible to transcribe it without exceeding the memory consumption.)

We don't actually "split" the gen particle collection but keep a single collection per event (event_.genParticles_), from which we read the particles needed for determining gen Higgs decay modes or applying top pT reweighting. In practice, this has a negligible impact on the performance.
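To illustrate the point, reading from a single per-event collection can be reduced to on-demand filtering, along these lines (assumed, simplified types; this is not the actual GenParticle class of the FW):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Toy stand-in for a gen particle record: just the PDG ID and status code.
struct GenParticle { int pdgId; int status; };

// Collect indices of all particles with the given |pdgId| from the one
// full collection kept per event -- e.g. |pdgId| == 25 to find Higgs
// bosons for the gen decay-mode determination, or |pdgId| == 6 to find
// top quarks for the pT reweighting.
std::vector<std::size_t> selectByPdgId(const std::vector<GenParticle> & genParticles,
                                       int absPdgId) {
  std::vector<std::size_t> selected;
  for (std::size_t i = 0; i < genParticles.size(); ++i)
    if (std::abs(genParticles[i].pdgId) == absPdgId)
      selected.push_back(i);
  return selected;
}
```

Since each such scan is a single linear pass over the event's gen particles, doing a handful of them per event is cheap compared to transcribing or duplicating spliced collections.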

So it really costs about 1000s per 400k events to run gen matching on reco jets. I think this is because there are many jets, gen particles and jet systematics. I've now made jet gen matching optional and disabled it by default. The code runs a few percent faster than before.