cms-analysis / CombineHarvester

CMSSW package for the creation, editing and analysis of combine datacards and workspaces
cms-analysis.github.io/CombineHarvester/
15 stars 180 forks source link

Merge processes at runtime #267

Open pkausw opened 2 years ago

pkausw commented 2 years ago

Suppose you have a datacard with a process class with some sub-processes. For example, in our analysis we have a data-driven control region (CR) that is derived as the difference between data_CR and some other processes proc1_CR and proc2_CR. Since the complete process in the CR is estimated during the fit, the processes need to be considered individually in the datacard for a correct parameterization. However, when interpreting the results and plotting everything, we're only interested in the total process in the CR and would like to correctly account for the post-fit correlations between the individual *_CR processes in order to derive sensible uncertainty bands.

This PR introduces an option to PostFitShapesFromWorkspace to merge processes at runtime. The processes can be defined with the syntax NAME='expr', where expr is the regular expression that is parsed to the process filter of the harvester instance.

Additionally, the PR introduces an option for additional printout during the calculation of the post-fit uncertainty bands. If activated, the option will print out the difference between the bin content of the nominal post-fit template and each toy used for the construction of the uncertainty band, for each bin respectively. This functionality can be used to understand abnormally large fluctuations in the uncertainty band and is intended for debugging purposes.

pkausw commented 1 year ago

Hi, are there any updates regarding this PR? I think it might be related to issue #291

kcormi commented 1 year ago

Thanks for this!

I had a few inline comments to the code, another couple things here:

Firstly, perhaps it is possible to simplify the code by packaging the loop over processes (or merged processes) into a function which is reused. Since now we have two loops over prefit (processes, then merged processes) and two loops over postfit (same as before). They are handling slightly different objects and with slightly different treatment for pre- and postfit-, so maybe its not so simple; but much of the code is replicated and it might be nice to compactify this.

Secondly, I think we would also want to add the merged processes into the loop over processes when the postfit/prefit yields are printed, right? https://github.com/cms-analysis/CombineHarvester/pull/267/files#diff-850d8a5178aa5366c05bacce2ec563fa72bfa4ce6ef5509f33b134d0dae89e5eL442

ajgilbert commented 1 year ago

Hi, I remembered I made some related changes a while ago, and never merged them. See #295 which I just opened. That adds the option to merge channels (not processes), so is in principle complementary.

I tend to agree with @kcormi that the changes to GetShapeWithUncertainty take some care - especially not to run any excess code if verbose is not true. I see the motivation to have access to the variations though for debugging. To make it more flexible, we could see if instead of building a big string internally with some arbitrary thresholds involved, we could instead add a way to return the full set of varied TH1s, saved to the output, which the user would be free to inspect with their own code.

pkausw commented 1 year ago

Hi, thanks for the review and the comments! Indeed, I did most/some of these changes in the meanwhile in my local setup - I'm in the process of getting my local repository in shape to update this PR asap.

Regarding the additional debugging information: Yes, totally agree that this should be completely optional (which it is in my local version of the Harvester at this point). @ajgilbert , your comment about saving the toys themselves instead of a very long string is excellent I think - I'll have a look at this as soon as this PR is updated.

pkausw commented 1 year ago

Hi! I am finally done with the update for this PR. In particular, I

Unfortunately, the PR is now a little messy since I changed quite a few things in one go (sorry). Let me know if you'd like me to change something more, sorry for the mess!

ajgilbert commented 1 year ago

Thanks, this now has a lot of changes :-) In general I think the restructuring is a good idea, I'll try to review it (may be delayed because of Moriond season)

pkausw commented 1 year ago

Hi @anigamova , as discussed earlier, I've updated this PR. I have cherry-picked the updates from #295 and included the latest changes from main. Therefore, this PR should now enable to merge analysis bins and/or processes at runtime :) Let me know if you have any comments!

anigamova commented 9 months ago

Hi @pkausw,

It would be great to do what @ajgilbert suggested in PR #295, i.e. try to call GetShapeWithUncertainty method once, so that we don't generate 500 toys for each histogram but rather save the list of all of the shapes we want to produce and pass it to the GetShapeWithUncertainty function at the end and loop over the shapes for each toy once. I will also leave few comments where I think it would make sense to implement these changes with your PR.