HEP-KBFI / TallinnNtupleProducer

code, python scripts and config files for producing "plain" Tallinn Ntuples

Read GenPart collection directly #12

Closed ktht closed 2 years ago

ktht commented 2 years ago

This PR is a major refactoring of the previous code, where we read different gen particle collections for various tasks.

I'll add a longer commentary once I'm done with all the checks.

ktht commented 2 years ago

When we first migrated to NanoAOD, I implemented the splicing in nanoAOD-tools for the following two reasons:

The last point is now moot, because we're not transcribing full collections to the output Ntuple anymore.

I think we can do better if we close the gap between vanilla NanoAOD and the post-processed NanoAOD Ntuple that we need as input to this FW. The post-processing module that splices the GenPart collection has a couple of major drawbacks:

For this reason I took this weekend off and decided to modify the code such that it only knows about the full GenPart, GenJet and GenVisTau collections that are available by default in NanoAOD and nothing else. This way we can get rid of the genParticleProducer module. Here is the list of changes that it took to get me there:

A few problems cropped up while debugging and testing the code:

All genMatch branches written to the output Ntuple are identical to what we had before. After testing my changes on the full HH signal sample, which contains 400k events, I confirm that they have a negligible impact on memory consumption (991MB -> 1.05GB), but the runtime increased by nearly 50% (1729s -> 2687s). After optimizing the code a little bit, I was able to reduce the runtime to about 2500s (I've lost the log file already).

The main culprit behind the runtime increase is these lines: https://github.com/HEP-KBFI/TallinnNtupleProducer/blob/73df512997d079a2f01d7cfb8ae4c8375bf5843e/Readers/src/EventReader.cc#L583-L588 where we match reco jets to gen leptons, taus and jets. If I disable these lines, the code actually runs faster than before (~1600s). I don't see a good physics reason for matching reco jets to anything, because we're not using this information in any shape or form. We could still retain the functionality by putting it behind a configurable boolean flag that is disabled by default. What do you think?
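For concreteness, the reco-to-gen matching in question boils down to a nested ΔR loop that runs once per jet per systematic shift. Here is a minimal self-contained sketch of that pattern (the struct and function names are made up for illustration and are not the actual EventReader code):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Minimal four-vector stub with just the quantities needed for angular matching.
struct P4 { double eta, phi; };

// Wrap the azimuthal difference into [-pi, +pi].
double deltaPhi(double a, double b) {
  const double pi = std::acos(-1.0);
  double d = a - b;
  while (d >  pi) d -= 2 * pi;
  while (d < -pi) d += 2 * pi;
  return d;
}

double deltaR(const P4 & a, const P4 & b) {
  const double deta = a.eta - b.eta;
  const double dphi = deltaPhi(a.phi, b.phi);
  return std::sqrt(deta * deta + dphi * dphi);
}

// For one reco jet, find the nearest gen object within dRmax;
// returns an index into genObjs, or -1 if nothing is close enough.
// Calling this for every jet and every gen collection, once per
// systematic shift, is what makes the cost scale as
// N_syst * N_jets * N_gen.
int matchToGen(const P4 & jet, const std::vector<P4> & genObjs, double dRmax = 0.4) {
  int best = -1;
  double bestDR = dRmax;
  for (std::size_t i = 0; i < genObjs.size(); ++i) {
    const double dr = deltaR(jet, genObjs[i]);
    if (dr < bestDR) { bestDR = dr; best = static_cast<int>(i); }
  }
  return best;
}
```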

If this PR is merged, I think we might be one inch closer to phasing out nanoAOD-tools. The only remaining modules are jet-related: b-tagging SF and its uncertainties, and JES/JER corrections plus uncertainties on the jet and MET objects. It should be feasible, though.

veelken commented 2 years ago

Hi Karl,

I agree that we can put the code https://github.com/HEP-KBFI/TallinnNtupleProducer/blob/73df512997d079a2f01d7cfb8ae4c8375bf5843e/Readers/src/EventReader.cc#L583-L588 under some boolean flag that's configurable in python (such that the code that matches reconstructed jets to generator-level electrons, muons, and taus is not executed by default).
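On the C++ side, such a gate could look roughly like the sketch below. The class and field names here are hypothetical, not the actual EventReader interface; in practice the flag would be set from the python config and default to false:

```cpp
#include <cassert>

// Hypothetical config struct standing in for the python-side
// parameter set; the real FW reads its settings via edm::ParameterSet.
struct EventReaderCfg {
  bool jetGenMatching_enabled = false;  // off by default, as proposed
};

// Sketch of a reader that only runs the expensive reco-jet <-> gen-object
// matching when the flag is set.
struct EventReader {
  explicit EventReader(const EventReaderCfg & cfg)
    : jetGenMatching_enabled_(cfg.jetGenMatching_enabled) {}

  // Returns the number of matching passes executed (0 or 1 here),
  // just to make the gating observable in this toy example.
  int read() {
    int nMatchingPasses = 0;
    if (jetGenMatching_enabled_) {
      // ... reco-jet to gen-lepton/tau/jet matching would go here ...
      ++nMatchingPasses;
    }
    return nMatchingPasses;
  }

  bool jetGenMatching_enabled_;
};
```

The point of the design is that the default path pays no cost at all: the matching loops are simply never entered unless a user opts in via the config.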

Just for clarification: if I understand correctly, splitting the GenParticle collection in the TallinnNtupleProducer takes about 800 seconds per 400k events, while skipping the jet-matching code saves about 1000 seconds per 400k events, so the code runs faster than before. But the two changes are unrelated, i.e. the jet-matching code just happens to take about the same time as splitting the GenParticle collection, right? I don't see how the jet matching would be related to splitting the GenParticle collection.

ktht commented 2 years ago

Hi Christian, the elapsed times I quoted are for running produceNtuple before and after the changes of this PR. The input Ntuple is the same: it contains both spliced GenPart collections as well as the GenPart collection itself. (Remember, in the old FW we dropped the GenPart collection because it was not possible to transcribe it without exceeding the memory consumption.)

We don't actually "split" the gen particle collection but keep a single collection per event (event_.genParticles_), from which we read the particles needed for determining gen Higgs decay modes or applying top pT reweighting. In practice, this has a negligible impact on the performance.
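To illustrate the point, reading from a single per-event collection can be reduced to on-demand filtering, along these lines (assumed, simplified types; this is not the actual GenParticle class of the FW):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Toy stand-in for a gen particle record: just the PDG ID and status code.
struct GenParticle { int pdgId; int status; };

// Collect indices of all particles with the given |pdgId| from the one
// full collection kept per event -- e.g. |pdgId| == 25 to find Higgs
// bosons for the gen decay-mode determination, or |pdgId| == 6 to find
// top quarks for the pT reweighting.
std::vector<std::size_t> selectByPdgId(const std::vector<GenParticle> & genParticles,
                                       int absPdgId) {
  std::vector<std::size_t> selected;
  for (std::size_t i = 0; i < genParticles.size(); ++i)
    if (std::abs(genParticles[i].pdgId) == absPdgId)
      selected.push_back(i);
  return selected;
}
```

Since each such scan is a single linear pass over the event's gen particles, doing a handful of them per event is cheap compared to transcribing or duplicating spliced collections.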

So it really costs about 1000s per 400k events to run gen matching on reco jets. I think this is because there are many jets, gen particles and jet systematics. I've now made jet gen matching optional and disabled it by default. The code runs a few percent faster than before.