cms-analysis / HiggsAnalysis-CombinedLimit

CMS Higgs Combination toolkit.
https://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/latest
Apache License 2.0
75 stars 380 forks source link

Remove OrderedDicts etc except where empirically necessary #940

Closed kcormi closed 4 months ago

kcormi commented 4 months ago

It would be good to do a few more checks on this and take a closer look to make sure its what we want before merging, but this seems to preserve the creation order of all workspace objects, as discussed in #936.

nucleosynthesis commented 4 months ago

@kcormi - a very concrete check would be to run text2workspace.py 125.5/comb_hbb.txt -m 125.5 (from the cms Higgs observation area) twice with -v 3 and check the output diff. If they are the same with this PR, then I think you've narrowed down the changes that were really crucial and we can merge this one

kcormi commented 4 months ago

@nucleosynthesis, yes good suggestion. I tested this now, and there seems to be a difference in a RooProdPdf print out which I don't understand. I seem to get a same/similar one with the current main branch though. So I'm not sure if its actually related, or if I've just made a silly mistake in trying to test it.

I'll need to investigate a little later, but I'll try to wrap this up soon so we can merge it.

kcormi commented 4 months ago

It seems in my tests that I still get a few lines of differences, even when using the fully ordered collections.

It's always just a few lines with the RooProdPdfs, and the difference doesn't seem to be in the pdf itself, but that in one of the two lines printing the pdf out, there is also an error of the style

[#0] ERROR:InputArguments -- RooArgSet::checkForDup: ERROR argument with name CMS_vhbb_stats_TT_ZmmLoose7TeV is already in this set

I'm not sure if it's really a problem, I suspect its not, but I'll try to see if I can track it down regardless.

guitargeek commented 4 months ago

Here is the stack trace for where the ERROR comes from:

Thread 1 "python" hit Breakpoint 1, RooArgSet::checkForDup (this=0x5555690d37e8, var=..., silent=<optimized out>)
    at /home/rembserj/code/root/roofit/roofitcore/src/RooArgSet.cxx:285
285    coutE(InputArguments) << "RooArgSet::checkForDup: ERROR argument with name " << var.GetName() << " is already in this set" << endl;
(gdb) bt
#0  RooArgSet::checkForDup (this=0x5555690d37e8, var=..., silent=<optimized out>)
    at /home/rembserj/code/root/roofit/roofitcore/src/RooArgSet.cxx:285
#1  0x00007fffdda81b4c in RooArgSet::canBeAdded (this=<optimized out>, arg=..., silent=<optimized out>)
    at /home/rembserj/code/root/roofit/roofitcore/inc/RooArgSet.h:190
#2  0x00007fffddac9102 in RooAbsCollection::add (this=this@entry=0x5555690d37e8, var=...,
    silent=silent@entry=false) at /home/rembserj/code/root/roofit/roofitcore/src/RooAbsCollection.cxx:436
#3  0x00007fffd780ada4 in RooCollectionProxy<RooArgSet>::add (silent=<optimized out>,
    shapeServer=<optimized out>, valueServer=<optimized out>, var=..., this=<optimized out>)
    at /home/rembserj/spaces/master/install/include/root/RooCollectionProxy.h:192
#4  RooCollectionProxy<RooArgSet>::add (silent=<optimized out>, var=..., this=<optimized out>)
    at /home/rembserj/spaces/master/install/include/root/RooCollectionProxy.h:114
#5  SimpleCacheSentry::addVars (this=this@entry=0x5555690d34f0, vars=...)
    at /home/rembserj/HiggsAnalysis/CombinedLimit/src/SimpleCacheSentry.cc:43
#6  0x00007fffd782afce in FastVerticalInterpHistPdf2Base::initBase (this=0x5555690d3020)
    at /home/rembserj/HiggsAnalysis/CombinedLimit/src/VerticalInterpHistPdf.cc:826
#7  FastVerticalInterpHistPdf2Base::initBase (this=0x5555690d3020)
    at /home/rembserj/HiggsAnalysis/CombinedLimit/src/VerticalInterpHistPdf.cc:813
#8  0x00007fffd782b415 in FastVerticalInterpHistPdf2::evaluate (this=0x5555690d3020)
    at /home/rembserj/HiggsAnalysis/CombinedLimit/src/VerticalInterpHistPdf.cc:895

Indeed, the SimpleCacheSentry (whatever that does) uses a RooArgSet as a container, and is therefore not able to deal with different RooAbsArg instances that have the same name: https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/blob/main/src/SimpleCacheSentry.cc#L32

There is a check if (_deps.containsInstance(*a)) continue; to make sure the same instance is not added twice, but it seems this FastVerticalInterpHistPdf2 uses a list of coefficient, where some coefficients have the same name but are different RooAbsArgs. This might or might not be a bigger problem: RooFit is inconsistent in treating different instances with the same name. For example in a RooWorkspace, there can't be different variables with the same name, everything is unique. But when objects are then copied when doing the NLL fit, there might be independent copies with the same name, depending on what the framework does. This situation should be avoided, because in the worst case these independent instances might get out of sync in value, even though they were mathematically intended to be the same.

This is the reason why RooFit models are often magically fixed by writing the workspace to a file and reading it back: like this, all variables with the same name are de-duplicated, which is usually the intent of the model builder.

So in this case, the ERROR is probably not an issue because you are importing the model to workspace anyway. While doing that, all variables are deduplicated.

guitargeek commented 4 months ago

Confirmed, the errors were harmless: we just tried to fill a collection with the same set of objects twice, and the second time you get an error instead of adding more elements. Can be avoided: https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/pull/952

kcormi commented 4 months ago

Thanks @guitargeek -- I will merge this one. I need to double check #952 a little before merging it, but looks fine.