HistEft to scikit hist - Githubissues

btovar commented 1 year ago

Main changes are in the histEFT.py file. API changes:

Axes need to follow the syntax from scikit-hep.
For scikit addition to work, axes should be in the same order, have the same name, and have the same label. Thus, HistEFT enforces that there should be exactly one dense axis specified, and that it should be the last one listed in init.
Removed most code related to sumw2. There is still some in YieldTools which I didn't know how to modify.
Added a small module (axes.py) which has a dictionary with the description of the axes for topeft.py. It lists the regular or variable edges as needed. Currently only variable edges work, as regular edges involve a rebinning that can't be done right now.
Since values() returns the bin counts in hist.Hist, added a new function call h.eval(values) that evaluates the histogram.
h.eval has one required argument. If evaluating at the SM point, use None or {}. It accepts either a dict of coefficients or an array.
Removed the set_wc_coefficients* functions and the internal self_wcs. Instead:
The h.eval(wcs) method returns what h.set_wc_coefficients(wcs); h.values() used to return.
h.eval has a parameter to include the under/overflow bins. This seems to match what values() did, but it didn't make much sense to me. Should eval only return proper bins?
Regarding overflow, hist.Hist has only the option to include it or not (boolean). There is not a special option for NaN. NaN gets added to overflow bins.
The new HistEFT cannot be used as a regular hist.Hist, as exactly one dense axis needs to be defined. If no wc names are specified, it is assumed that the wc coefficients for filling the histogram are 1.
The previous hist iterated on the keys of h._sumw. Added a function h.spare_keys() that generates all the possible combinations of keys.
Added some functions for compatibility, such as group() and integrate().
eft_helper.calc_eft_weights seems to depend strongly on the structure of these particular histograms, so it may be a good idea to move it to histEFT.HistEFT.
Some pickled data (e.g. data/triggerSF/*) depend on the old HistEFT module and need to be regenerated before HistEFT can be removed.
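
To make the h.eval(wcs) behaviour described above concrete, here is a minimal NumPy sketch. It is not the HistEFT implementation: it assumes each bin stores the coefficients of a quadratic polynomial in the Wilson coefficients, ordered as all products v_i * v_j (i <= j) of the vector v = (1, c1, ..., cn). The names quad_terms and eval_bins, and the coefficient layout, are hypothetical.

```python
import numpy as np

def quad_terms(wc_values):
    """Monomials of a quadratic in the WCs: all v_i * v_j with i <= j, v = (1, c1, ..., cn)."""
    v = np.concatenate(([1.0], np.asarray(wc_values, dtype=float)))
    i, j = np.triu_indices(len(v))
    return v[i] * v[j]

def eval_bins(coeffs, wc_names, wcs=None):
    """Evaluate every bin at a WC point.

    coeffs:   array of shape (n_bins, n_terms), one quadratic per bin (assumed layout).
    wc_names: ordered WC names matching the coefficient layout.
    wcs:      None or {} for the SM point, or a dict name -> value (missing names count as 0).
    """
    wcs = wcs or {}
    point = [wcs.get(name, 0.0) for name in wc_names]
    return np.asarray(coeffs, dtype=float) @ quad_terms(point)

# Two bins, one WC: terms are (1, c, c^2).
coeffs = np.array([[10.0, 2.0, 0.5],
                   [4.0, -1.0, 0.25]])
sm_yields = eval_bins(coeffs, ["ctG"], None)           # only the constant terms survive
eft_yields = eval_bins(coeffs, ["ctG"], {"ctG": 2.0})  # full quadratic at ctG = 2
```

Passing the WC point as an argument, instead of storing it on the histogram, keeps evaluation free of hidden state, which is the motivation given above for dropping set_wc_coefficients.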

All but two unit tests are working:

tests/test_HistEFT_add.py::test_split_by_terms. h.split_by_terms needs to be rewritten.
tests/test_yields.py::test_compare_yields_after_processor. The values generated are 0, so I think that there is an error in which bins are being looked at (e.g. the underflow bins) in YieldTools or datacard related files. I didn't understand the code enough to fix this one.

If you have any comments I can fix those, but the two failing tests probably need someone who understands the related physics.

btovar commented 1 year ago

Regarding rebinning, currently hist does not support it with variable edges. Thus, there is no easy way to update this line:

https://github.com/TopEFT/topcoffea/blob/501fd5d655783c7862d09bd42f40e2b841acbd0c/topcoffea/modules/datacard_tools.py#L480C1-L480C81

    edge_arr = self.BINNING[km_dist] + [h.axis(km_dist).edges()[-1]]
    h = h.rebin(km_dist,Bin(km_dist,h.axis(km_dist).label,edge_arr))

As @kmohrman pointed out, it should be possible to create the histogram with the correct edges to begin with. I see the edges are created here:

https://github.com/TopEFT/topcoffea/blob/501fd5d655783c7862d09bd42f40e2b841acbd0c/analysis/topEFT/topeft.py#L58

Could I just create those axes with the edge values from the datacard_tools.py module?

klannon commented 1 year ago

@bryates @Andrew42 @sscruz Please see the comment from @btovar above. For some context, it is critically important that we migrate away from coffea histograms to the new scikit-hep based hist class. Unfortunately, this means at least temporarily migrating away from variable-sized rebinning. I think it's critical that one or more actual users look at this and come up with a solution. The solution proposed by @btovar and @kmohrman above would certainly work and would have the added advantage of reducing memory usage while filling the histograms, which is often the computational bottleneck. If you agree that this is a viable solution, can you either help @btovar implement this change in the analysis code or identify someone else who could help?

bryates commented 1 year ago

@Andrew42 since we eventually hand the histogram off to uproot to make the root files, could we use another container that supports variable binning? Maybe we could do this by having the datacard maker convert to np.histogram first, and then do the rebinning??
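
One way the NumPy route could look, as a sketch only: once the bin contents and edges are plain arrays, bins can be merged by summing adjacent counts, provided every new edge coincides with an existing edge. The function name rebin_counts is hypothetical, and the sketch assumes the dense axis is the last array dimension and that flow bins have already been stripped.

```python
import numpy as np

def rebin_counts(counts, old_edges, new_edges):
    """Merge adjacent bins along the last axis; new_edges must be a subset of old_edges."""
    counts = np.asarray(counts)
    old_edges = np.asarray(old_edges, dtype=float)
    new_edges = np.asarray(new_edges, dtype=float)

    # Position of each new edge within the old edges.
    idx = np.searchsorted(old_edges, new_edges)
    if np.any(idx >= len(old_edges)) or not np.allclose(old_edges[idx], new_edges):
        raise ValueError("every new edge must coincide with an old edge")
    if np.any(np.diff(idx) <= 0):
        raise ValueError("new edges must be strictly increasing")

    # Drop bins beyond the last new edge, then sum each group of old bins.
    trimmed = counts[..., :idx[-1]]
    return np.add.reduceat(trimmed, idx[:-1], axis=-1)

old_edges = [0, 100, 200, 300, 400, 500]
counts = np.array([[1, 2, 3, 4, 5],
                   [10, 20, 30, 40, 50]])  # e.g. one row per sparse key
merged = rebin_counts(counts, old_edges, [0, 200, 500])
```

Because only whole bins are summed, this works for variable-width targets as long as the fine binning contains all of the target edges.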

btovar commented 1 year ago

See #384.

TopEFT / topeft

HistEft to scikit hist #371