Wish-List: Allowing sub-categories in datacard production

rdewanje commented 3 years ago

Currently none of the HH Multi-lepton channels can incorporate sub-categories in their datacards. The following are some of the options that can be pursued to do the same:

run one workflow per sub-category;
devise a "datacard mode" which saves only the histograms for datacard production and nothing else.

acarvalh commented 3 years ago

That is already solved in bbww and tth analysis. There we run the workflow once and have histograms / category. Take a look if you would like to follow:

1) booking histograms -- follow how this function is used on the correspoondent analysis code. Specially here to define the names to give to the categories and here to construct the name for each evennt.

2) an inverse "datacard mode" flag flag is here, is used if we want additional data/MC plots. if it is false it will do the additional histograms booked on the continuation of the above-mentioned function also by category

veelken commented 3 years ago

Hi Xandra,

this code is deprecated. It is still kept in order not to disrupt the ongoing Run 2 HH->bbWW and HH->multilepton analyses. The new code to implement categories is here:

https://github.com/HEP-KBFI/hh-multilepton/blob/master/src/DatacardHistManager_hh.cc
https://github.com/HEP-KBFI/hh-bbww/blob/master/src/EventCategory_hh_bb2l_BDT.cc
https://github.com/HEP-KBFI/hh-bbww/blob/master/python/configs/analyzeConfig_hh_bb2l.py#L528-L552 (the 3rd link is to split the ROOT files produced by the analysis code into multiple parts, one part per category, in order to avoid that the hadd jobs become slow and consume a lot of memory)

ktht commented 3 years ago

Have we actually ever had subcategories in any of HH->multilepton analyses, at all? The subcategories are listed from the python side of things in 2lss, 2l+2tau, 3l and 3l+1tau analyses, but they are never propagated to the analysis jobs, nor do the jobs themselves construct these subcategories. The analysis code seems to be completely oblivious to subcategories.

Do we have any reason to believe that splitting by subcategories in any of the aforementioned channels gains in sensitivity? What's the minimal set of subcategories then?

We have to be a bit more concrete with our wish-list because we cannot afford (both human time and computing-wise) to implement something that's just a "nice-to-have" feature. The analysis jobs have to adhere certain limits set by our computing infrastructure. It makes no sense to populate histograms for as many histograms as was previously defined in eg 2lss analysis: https://github.com/HEP-KBFI/hh-multilepton/blob/1d9a9a58b8e5da687942601ac575c7c6db2004af/python/configs/analyzeConfig_hh_2lss.py#L146-L151 But it is worth considering to split 2lss by eg lepton flavor into 3 subcategories if there are arguments that support it.

ktht commented 3 years ago

Ah now I see, the subcategories were created at the analysis level, but the code was dropped after migrating to the new datacard manager (see this commit for example: 4464246a4c8132df211e95d2533f3a344423ae6c).

We have seen in the past that adding a single subcategory basically asks as much memory in addition to what the job already consumes. The most demanding jobs are central jobs that also take into account the acceptance uncertainties. Here's the memory consumption of these jobs on a HH->2V2T signal sample, and an optimistic estimate for the max number of subcategories when running with full systematics:

2lss: ca 500MB -> 3 + inclusive;
2l+2tau: ca 400MB -> 4 + inclusive;
3l: ca 350MB -> 4 + inclusive;
3l+1tau: ca 350MB -> 4 + inclusive.

That's by including other histograms that are not particularly necessary for datacard production. So I guess we could drop those if really needed. The other option of creating one job per subcategory is not going to fly, because it generates too many analysis and hadd jobs, and significantly increases the turnaround time.

HEP-KBFI / hh-multilepton

Wish-List: Allowing sub-categories in datacard production #28