TopEFT / topeft

Datacards, but faster! #287

Closed · Andrew42 closed this 2 years ago

Andrew42 commented 2 years ago

This PR introduces a new datacard maker rebuilt from scratch, meant to replace and speed up the datacard making step. I used fairly recent .root and .txt files produced by the old datacard maker as reference for the outputs. The bulk of the execution can be broken down into 3 main function calls: read(), get_selected_wcs(), and analyze().

In read() we feed the code the location of a .pkl file we would like to have converted into datacard output files. It opens the pkl file and loads it entirely into memory; this is also where we remove/group/scale/rename many of the sparse axis category names. In particular, the (de-)correlation of systematics over the different run years is handled at this time via the sub-function call to correlate_years().
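
In code, the loading part of this step amounts to something like the following minimal sketch (the real read() in this PR also does the axis regrouping and the correlate_years() call; the function body here is illustrative only):

```python
import gzip
import pickle

def read(pkl_path):
    """Load the full histogram dictionary from a gzipped pickle file.

    Illustrative sketch only: the real read() additionally removes,
    groups, scales, and renames sparse axis categories and calls
    correlate_years() to handle year (de-)correlation of systematics.
    """
    with gzip.open(pkl_path, "rb") as f:
        hists = pickle.load(f)  # the entire file is loaded into memory
    return hists
```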

Next is a small intermediary step where we pick out only those WCs which have a significant yield contribution via at least 1 EFT quadratic term in at least 1 bin of the analysis. This selection is done per signal process and the results are passed along to the analyze() call to determine which WCs are kept during the EFT decomposition step. It should be noted that this step differs somewhat from how the old datacard maker functioned. Specifically, it performs the check of whether or not to keep a WC before any EFT decomposition has been done, which results in a more lenient inclusion of spurious WCs compared to the old datacard maker (at the same tolerance level). As a result, I bumped self.tolerance up by a factor of 10 (though I think a factor of 2 or 3 would also have been sufficient) in order to get the same selected WCs as in my reference.
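
One plausible formulation of that selection, with hypothetical data structures (the real get_selected_wcs() works on the sparse histogram axes directly and its exact criterion may differ):

```python
def get_selected_wcs(quad_terms, ref_yields, tolerance):
    """Keep a WC if at least one of its EFT quadratic terms gives a
    significant contribution in at least one analysis bin.

    quad_terms: dict mapping WC name -> per-bin quadratic-term yields
                for one signal process (hypothetical layout)
    ref_yields: per-bin reference (e.g. SM) yields to compare against
    """
    selected = set()
    for wc, term in quad_terms.items():
        # A WC survives if any bin's quadratic contribution exceeds the
        # tolerance relative to the reference yield in that bin
        if any(abs(t) > tolerance * abs(r) for t, r in zip(term, ref_yields)):
            selected.add(wc)
    return selected
```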

The final step is the analyze() call, which is the meat of the code execution. It takes a kinematic distribution and a channel category choice and produces 1 set of datacard output files, i.e. a .root and a .txt file. We can then combine cards (via the combinecards.py script) from different analyze() calls together in the exact same way as is done currently.

The average runtimes I was getting on my personal laptop were roughly:

read(): 40-50 seconds
analyze(): 30-40 seconds # per call

With all other function calls having negligible timings. So, for 43 categories I would expect the total runtime to be roughly 25-30 minutes. However, early on in my testing I found that running the exact same code on earth appeared to be 3-4x slower. That was with a very early version of the code, though, and I have not yet retried running on earth to see if I still get the same sort of slowdown. The good news is that I designed the analyze() call to be trivially parallelizable, meaning it should be a pretty simple task to tack on an async decorator and let it run multi-threaded (with max core/thread count limits!) on earth, hopefully getting a nice linear scaling speed-up. The overwhelming majority of the time taken by the analyze() call is spent writing the histograms out to the .root file, so I would hope our filesystem can handle multiple concurrent writes to a user's area. The other option would be to add some sort of wrapper script that creates condor jobs, each running an executable that makes 1 (or maybe N) analyze() calls per job. This would avoid concerns about hammering the login node, but would also require running the read() command once per condor job, which takes longer than an individual analyze() call. There are of course other options beyond these two, but these would certainly be the simplest and most straightforward to implement.
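
As a sketch of the first option (nothing in this PR yet; the names and the worker cap are illustrative), assuming analyze() is safe to call concurrently:

```python
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 8  # explicit cap so we don't hammer the shared machine

def run_all_channels(analyze, channels, km_dist="lj0pt"):
    """Run one analyze() call per channel concurrently.

    analyze: the (hypothetical) per-channel card-making callable
    """
    # Each call is independent; since the bulk of analyze() is spent
    # writing histograms to the .root file, this mostly overlaps file I/O.
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        futures = [pool.submit(analyze, km_dist, ch) for ch in channels]
        for f in futures:
            f.result()  # propagate any exception raised in a worker
```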

There's still a number of important updates needed before we can fully switch over to the new datacard maker, all related to making datacards for the njets distribution. While I've implemented the histogram re-binning for the other kinematic distributions, I have not yet done it for njets. The other update is to include the missing_parton and jet_scale uncertainties as shape systematics (i.e. convert them into an njets histogram to save alongside all the other shape systematics in the .root file).
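
For reference, one simple way to convert a rate uncertainty into njets shape templates would be flat up/down variations of the nominal histogram (a sketch under that assumption; the eventual missing_parton/jet_scale treatment may be bin-dependent):

```python
import numpy as np

def rate_to_shape(nominal, rate_unc):
    """Build flat up/down shape templates from a rate uncertainty.

    nominal:  per-bin njets yields (numpy array)
    rate_unc: fractional uncertainty, e.g. 0.08 for an 8% rate effect
    """
    up = nominal * (1.0 + rate_unc)
    down = nominal * (1.0 - rate_unc)
    return up, down
```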

Andrew42 commented 2 years ago

I've re-organized some things. I renamed datacard_maker_v2.py to datacard_tools.py and moved it to the modules directory. I then created a new script called make_cards.py which implements the command line options we want to configure the datacard maker with. I also implemented most of the options that were in the old datacard maker and added some new ones. The missing options are:

  • --year I still need to check/figure out whether removing other years will cause issues for the cross-year decorrelation step
  • --job Not sure what a good analogue for this would be
  • --asimov Didn't include this since we already have the unblind option
  • --do-sm Not sure what this option is supposed to do?

Everything else should be fully implemented, though I haven't rigorously checked all the different possible option combinations.

codecov[bot] commented 2 years ago

Codecov Report

Merging #287 (bd6a421) into master (431f09e) will decrease coverage by 4.81%. The diff coverage is 69.59%.

@@            Coverage Diff             @@
##           master     #287      +/-   ##
==========================================
- Coverage   29.37%   24.55%   -4.82%     
==========================================
  Files          46       46              
  Lines        7150     7034     -116     
==========================================
- Hits         2100     1727     -373     
- Misses       5050     5307     +257     
Flag Coverage Δ
unittests 24.55% <69.59%> (-4.82%) :arrow_down:

Impacted Files Coverage Δ
analysis/topEFT/make_cards.py 60.13% <60.13%> (ø)
topcoffea/modules/datacard_tools.py 71.79% <71.79%> (ø)
topcoffea/modules/get_renormfact_envelope.py 0.00% <0.00%> (-75.95%) :arrow_down:
topcoffea/modules/corrections.py 0.00% <0.00%> (-35.47%) :arrow_down:
topcoffea/modules/objects.py 0.00% <0.00%> (-25.28%) :arrow_down:
topcoffea/modules/WCPoint.py 21.87% <0.00%> (-25.00%) :arrow_down:
topcoffea/modules/selection.py 0.00% <0.00%> (-6.96%) :arrow_down:
topcoffea/modules/remote_environment.py 0.00% <0.00%> (-6.82%) :arrow_down:
topcoffea/modules/GetValuesFromJsons.py 32.14% <0.00%> (-3.58%) :arrow_down:
topcoffea/modules/QuadFitTools.py 0.00% <0.00%> (-3.08%) :arrow_down:
... and 4 more

bryates commented 2 years ago

> I've re-organized some things. I renamed datacard_maker_v2.py to datacard_tools.py and moved it to the modules directory. I then created a new script called make_cards.py which implements the command line options we want to configure the datacard maker with. I also implemented most of the options that were in the old datacard maker and added some new ones. The missing options are:
>
>   • --year I still need to check/figure out whether removing other years will cause issues for the cross-year decorrelation step
>   • --job Not sure what a good analogue for this would be
>   • --asimov Didn't include this since we already have the unblind option
>   • --do-sm Not sure what this option is supposed to do?
>
> Everything else should be fully implemented, though I haven't rigorously checked all the different possible option combinations.

Thanks! The --do-sm flag would just save the SM part and not the EFT components (lin, quad, and mix) but I don't think this option is super important. For --job, does this new version run on condor?

klannon commented 2 years ago

@Andrew42, can you run a head-to-head timing test of your new version against the existing one on the same hardware so that we have an apples-to-apples comparison? It doesn't have to be the whole analysis; you can test on a more manageable subset.

Andrew42 commented 2 years ago

@klannon I ran some timing tests on earth for a couple different scenarios and below are the results. It should go w/o saying, but all tests were performed using the same exact input file and I tried to make sure the settings used in both versions lined up as closely as possible.

The sections marked OLD REMOTE and NEW REMOTE are the most relevant, as they refer to the results of running the old vs. new datacard maker on earth; the section marked NEW LOCAL is what I get on my laptop and is more just for reference. In total I did 3 tests (well, 2 "test" tests and 1 real test), running over just one channel and producing histograms only for lj0pt. For the new card maker I was able to include some additional timing checkpoints, but I didn't try adding corresponding printouts to the old card maker (sorry).

TL;DR: The new datacard maker is orders of magnitude faster at processing a single channel, using the same pkl file and the same run settings. So much faster, in fact, that the new datacard maker could have realistically processed all of the channels used in the analysis with enough time left over to catch this week's episode of whichever Netflix series is currently the rage (EDIT: update just b/c I had to: instead of watching a Netflix show, you could alternatively watch the first Pirates of the Caribbean, though you might be cutting it a bit close here)! The exact timings were 5 minutes for the new datacard maker (50% of which was just the initial loading of the file) and 4hrs 6mins for the old datacard maker.

Timing checkpoint reference:
Pkl Open Time    -> The time to just unzip+load the histograms from the pkl file into memory

Total Read+Prune -> The time to load the pkl file, plus the time spent on manipulating the histograms
                    in preparation for the channel-by-channel processing. The "pruning" stuff includes:
                    dropping samples that aren't part of the actual signal fit (DY, single top, etc.);
                    scaling the yields by intg. luminosity (year-by-year); renaming+regrouping samples
                    (so like merging together WZ+WW+ZZ into "Dibosons"); merging the samples year-by-year,
                    making sure to (de-)correlate year-dependent systematics as necessary.

WC Select Time   -> The time it took to calculate which WCs to keep. Note, this is done once using
                    all signal channels and then enforced for all channels.

File Write Time  -> The time spent writing stuff to disk. This includes both writing the histograms to
                    the root file, as well as the time to write the text datacard part (though basically
                    99% of that time was taken up by generating the root file). It should be noted that
                    in a multithreaded execution this is the time we would be able to save from
                    parallelization. I also printed out the total number of root histograms that got
                    written to file for reference.

Total Time       -> The time the entire process took beginning to end.

The first test (results below) restricted the card maker to 1 WC and all nuisance parameters. Here the old datacard maker didn't actually do too bad. It was roughly 2x slower, which isn't great, but not terrible either.

----------------------------------------------------------------------------------------------------
- One POI, all nuisances, one channel (2lss_p_4j)
----------------------------------------------------------------------------------------------------
OLD COMMAND: python analysis/topEFT/datacard_maker.py analysis/topEFT/histos/may18_fullRun2_withSys_anatest08_np.pkl.gz --var-lst lj0pt --job 0 --POI ctG --do-nuisance
NEW COMMAND: python make_cards.py histos/may18_fullRun2_withSys_anatest08_np.pkl.gz --var-lst lj0pt --ch-lst 2lss_p_4j --POI ctG --do-nuisance

--- OLD REMOTE ---
Total Time: 306.62 s

--- NEW REMOTE ---
Pkl Open Time: 22.04 s
Total Read+Prune Time: 131.15 s
WC Selection Time: 0.35 s
File Write Time: 2.65 s
Total Hists Written: 1084
Total Time: 134.39 s

--- NEW LOCAL ---
Pkl Open Time: 7.52 s
Total Read+Prune Time: 34.79 s
WC Selection Time: 0.10 s
File Write Time: 0.79 s
Total Hists Written: 1084
Total Time: 35.76 s

The 2nd test ran over all POIs (so it did the full decomposition), but didn't include any systematics. I was sort of surprised by this one, as the old datacard maker took roughly the same amount of time.

The new datacard maker is a good deal faster b/c I'm able to drop all the systematic bins from the systematic sparse axis right from the start, which makes all the "pruning" manipulations much quicker. As a side note and an interesting coincidence, the total number of histograms that get written is roughly the same in both tests, but it should be noted that the "types" of histograms in each output correspond to very different things!

----------------------------------------------------------------------------------------------------
- All POIs, no nuisances, one channel (2lss_p_4j)
----------------------------------------------------------------------------------------------------
OLD COMMAND: nohup python analysis/topEFT/datacard_maker.py analysis/topEFT/histos/may18_fullRun2_withSys_anatest08_np.pkl.gz --var-lst lj0pt --job 0 >& out.log &
NEW COMMAND: python make_cards.py histos/may18_fullRun2_withSys_anatest08_np.pkl.gz --var-lst lj0pt --ch-lst 2lss_p_4j

--- OLD REMOTE ---
Total Time: 327.04 s

--- NEW REMOTE ---
Pkl Open Time: 23.60 s
Total Read+Prune Time: 65.82 s
WC Selection Time: 0.23 s
File Write Time: 2.72 s
Total Hists Written: 1009
Total Time: 68.77 s

--- NEW LOCAL ---
Pkl Open Time: 7.56 s
Total Read+Prune Time: 16.61 s
WC Selection Time: 0.05 s
File Write Time: 0.70 s
Total Hists Written: 1009
Total Time: 17.36 s

The above 2 tests were mostly just to make sure I was able to run the old datacard maker and get my bearings set. Below are the results for the real test. Here I included all POIs and all nuisance parameters. This corresponds to the individual unit of work that the old datacard maker split things into before sending each unit to condor.

The new datacard maker clocked in at just under 5 minutes, with more than half of that time spent just opening/manipulating the histograms (which only really needs to be done once). This means running everything in series on earth for 43 categories would take a bit over 1.5 hours, though @kmohrman tells me that we typically have to process 43 (for lj0pt) + 11 (from njets) + 8 (from ptz) = 62 categories, in which case the expected runtime would be more like 2-2.5 hrs (assuming no parallelization/cluster computing).

All that said, this still beats out the runtime for the old datacard maker running over a single category, which took just over 4 hours! @kmohrman tells me that when she makes the cards, the longest-running jobs typically take around 3 hours, so it's possible that the performance here is an outlier (I only ran the test once, but earth was pretty quiet), or it could just be that earth isn't as good at running this sort of thing for some reason.

----------------------------------------------------------------------------------------------------
- All POIs, all nuisances, one channel (2lss_p_4j)
----------------------------------------------------------------------------------------------------
OLD COMMAND: nohup python analysis/topEFT/datacard_maker.py analysis/topEFT/histos/may18_fullRun2_withSys_anatest08_np.pkl.gz --var-lst lj0pt --job 0 --do-nuisance >& out.log &
NEW COMMAND: python make_cards.py histos/may18_fullRun2_withSys_anatest08_np.pkl.gz --var-lst lj0pt --ch-lst 2lss_p_4j --do-nuisance

--- OLD REMOTE ---
Total Time: 14735.83 s (4hrs 6mins)

--- NEW REMOTE ---
Pkl Open Time: 31.00 s
Total Read+Prune Time: 149.02 s
WC Selection Time: 0.54 s
File Write Time: 131.09 s
Total Hists Written: 53395
Total Time: 280.94 s

--- NEW LOCAL ---
Pkl Open Time: 7.53 s
Total Read+Prune Time: 35.52 s
WC Selection Time: 0.13 s
File Write Time: 34.12 s
Total Hists Written: 53395
Total Time: 69.84 s

klannon commented 2 years ago

@Andrew42: That's super interesting. So it seems like you've got about a factor of 50 speed up comparing earth to earth for new vs old, and if I'm reading it right, the remaining time is dominated by I/O. Your laptop is another factor of 4 faster compared to earth. When you're running on earth, are you reading from and/or writing to AFS? Have you checked to see what happens if you use /scratch365 or /tmpscratch? On your laptop, do you have an SSD?

Andrew42 commented 2 years ago

Running on earth, everything was done from my AFS area. As for the large difference between earth and my laptop I chalked it up to SSD vs. HDD (at least I assume the AFS area uses HDDs?).

It wouldn't be too much work to have the new datacard maker output the results to /scratch365 or /tmpscratch (it actually reminded me that I wanted to add a configurable output directory option at some point).

Andrew42 commented 2 years ago

Ok, so I tried running (also reading the input) from both /tmpscratch and /scratch365 and unfortunately didn't really see any significant improvements.

@klannon, did you expect to see improved performance using these storage areas, and if so, does this indicate that the bottleneck might be somewhere else? Would the main benefit of writing to /scratch365 be that we wouldn't see as much slowdown due to many simultaneous writes, or something like that?

--- Using tmpscratch ---

COMMAND: 
python make_cards.py /tmpscratch/users/awightma/datacard_testing/may18_fullRun2_withSys_anatest08_np.pkl.gz \
    -d /tmpscratch/users/awightma/datacard_testing --var-lst lj0pt --ch-lst 2lss_p_4j --do-nuisance

Pkl Open Time: 22.66 s
Total Read+Prune Time: 139.20 s
WC Selection Time: 0.53 s
File Write Time: 122.80 s
Total Hists Written: 53395
Total Time: 262.81 s

--- Using scratch365 ---

COMMAND:
python make_cards.py /scratch365/awightma/datacard_testing/may18_fullRun2_withSys_anatest08_np.pkl.gz \
    -d /scratch365/awightma/datacard_testing --var-lst lj0pt --ch-lst 2lss_p_4j --do-nuisance

Pkl Open Time: 22.82 s
Total Read+Prune Time: 136.12 s
WC Selection Time: 0.52 s
File Write Time: 125.52 s
Total Hists Written: 53395
Total Time: 262.43 s

bryates commented 2 years ago

Thanks for all the careful testing @Andrew42. In general, I always saw better performance on earth when using condor vs running locally, but your new version is still much faster!

Andrew42 commented 2 years ago

@bryates thanks for approving the PR, but I think we should still hold off for a bit. There's still at least one major part missing, which is the creation of the shape variants for the Diboson jet_scale and tllq/tHq missing_parton uncertainties when producing datacards for njets.

I'd also like to do a more complete comparison between the old and new datacard makers. I did some fairly thorough consistency checking when writing the code, e.g. checking that the exact same set of histograms appears in the root file, that the summed bin contents of every histogram match, and checking by hand that the bin contents of randomly chosen histograms match exactly. However, the text datacard part isn't as easy to compare, since the column ordering differs between the two outputs, and trying to diff files that are as "wide" as the text datacards is problematic.
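
For the .root side of the comparison, the checks described above can be scripted along these lines (a sketch using uproot; the file names are placeholders):

```python
import uproot
import numpy as np

old = uproot.open("old_maker_card.root")
new = uproot.open("new_maker_card.root")

# Check that the exact same set of histograms appears in both files
old_keys = {k.split(";")[0] for k in old.keys()}
new_keys = {k.split(";")[0] for k in new.keys()}
assert old_keys == new_keys, old_keys ^ new_keys

# Check the bin contents histogram by histogram
for name in sorted(old_keys):
    if not np.allclose(old[name].values(), new[name].values()):
        print(f"Mismatch in {name}")
```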

I suppose a potentially more straightforward way to compare the outputs would be to run combine on outputs produced by each of the datacard makers and see if there's any noticeable difference in the results.

Finally, my time is going to be very limited this week and next week, so I don't think I'll be able to make very much progress on these topics on a timescale that would be useful unfortunately.

Note: While the shape variants of the above 2 systematics are missing, they are properly included as their rate version when producing cards for any distribution other than njets.

Andrew42 commented 2 years ago

I managed to get the condor submission implemented into the make_cards.py script.

From the user side of things, all you should have to do is add the command line option --condor and it will automatically switch over. This also means that the batch_make_cards.sh script should no longer be needed, and I vote that we delete it from the repo.

An important thing to note about the condor jobs is that, for right now, they assume that whatever output directory you specify is mounted and available on the remote machine running the condor job. This means your output directory should be limited to either scratch365 (recommended) or somewhere in your AFS space that the condor jobs have write permissions to.

Another thing to note is that you don't need to specify --do-nuisance when running condor submissions, as the condor jobs are currently forced to always run with the systematics included.

There's also another new option, --chunks, which takes an integer and makes the condor jobs each process that many channels. The default is 1, and since only 50 or so jobs currently get created, you probably won't need this option unless you start to see a significant fraction of the jobs sitting in the IDLE state.
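
The chunking itself is just a split of the channel list, something like the following (sketch; the real make_cards.py also has to build the corresponding condor submit files, and the channel names below are illustrative):

```python
def chunk_channels(channels, chunk_size=1):
    """Group the channel list so each condor job processes chunk_size channels."""
    return [channels[i:i + chunk_size] for i in range(0, len(channels), chunk_size)]

# e.g. --chunks 2 would turn ~50 single-channel jobs into ~25 two-channel jobs
all_channels = ["2lss_p_4j", "2lss_p_5j", "2lss_p_6j", "2lss_p_7j"]  # illustrative
jobs = chunk_channels(all_channels, chunk_size=2)
```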

Finally, based on my test runs, the full set of datacards gets generated in 5-10 minutes. A significant speed-up from the 3 hours or so when we started this rewrite!

klannon commented 2 years ago

This is really phenomenal. Just for educational purposes, could you briefly list the leading factors contributing to this speedup? I know it was a complete rewrite and so many things changed, but presumably only some of those changes impacted performance. If you know what the biggest impacts were, that would be helpful to know. If you don't know, then don't worry about it. I'm just curious!

Andrew42 commented 2 years ago

> This is really phenomenal. Just for educational purposes, could you briefly list the leading factors contributing to this speedup? I know it was a complete rewrite and so many things changed, but presumably only some of those changes impacted performance. If you know what the biggest impacts were, that would be helpful to know. If you don't know, then don't worry about it. I'm just curious!

@klannon, it's hard to say exactly where all the speed-up came from, but it is safe to say the biggest factors were that the old datacard maker did a lot of unnecessary looping over the same data many, many times (e.g. here) and also wrote (and subsequently read) intermediate results to files, which meant a lot of excessive file I/O. Even now, the biggest time sink in the new datacard maker is file I/O, but we just make sure to read and write things exactly once.

Other than that, I took extra care to identify where the heavy loops are and make sure we do them only once. I also made an effort to do array manipulations with numpy (e.g. here and here). Finally, I implemented some pre-processing to cut down on the total number of sparse keys (e.g. summing over years, merging together di/triboson samples, etc.) in order to reduce the amount of work needed later on.
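
As a toy illustration of the numpy point (not the actual code): replacing an element-by-element Python loop with one vectorized operation, e.g. for a luminosity scaling:

```python
import numpy as np

yields = np.random.rand(100_000)  # stand-in for flattened bin contents
lumi_sf = 1.21                    # illustrative scale factor

# Slow: a Python-level loop touching one element at a time
scaled_loop = np.empty_like(yields)
for i in range(len(yields)):
    scaled_loop[i] = yields[i] * lumi_sf

# Fast: a single vectorized numpy operation over the whole array
scaled_vec = yields * lumi_sf
assert np.allclose(scaled_loop, scaled_vec)
```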