arogozhnikov / hep_ml

Machine Learning for High Energy Physics.
https://arogozhnikov.github.io/hep_ml/

Exporting/saving/reusing the reweighting formula #33

Open bifani opened 8 years ago

bifani commented 8 years ago

Sometimes one would like to use a control sample, e.g. because it is more abundant, to determine MC weights that are then applied to other, rarer samples.

For this reason it would be very useful if hep_ml.reweight could export the "reweighting formula" in some format, e.g. ROOT, so that it can also be reused from different programming languages.

Thanks

arogozhnikov commented 8 years ago

This is a frequent question (or rather a family of questions) from physicists who are interested in applying a trained reweighting to another data sample. Below I give solutions for different situations.

Working from the same script

A frequently applicable solution, though for some reason ignored by physicists (ROOT influence?), is to read the other sample inside the same script/notebook and apply the reweighter there.

You can store the weights column using the recipe from this issue.
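For illustration, the "same script" workflow can be sketched end to end. Since hep_ml may not be installed everywhere, the sketch below uses plain classifier-based reweighting (weights given by the odds ratio p/(1-p) of a classifier separating data from MC) as a stand-in for hep_ml's GBReweighter; the sample arrays and file name are invented for the example.

```python
import os
import tempfile
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.RandomState(0)
# toy stand-ins for the control (data) sample and the MC sample, one feature each
data_sample = rng.normal(loc=0.5, size=(1000, 1))
mc_sample = rng.normal(loc=0.0, size=(1000, 1))

# a classifier separating MC from data plays the role of the reweighter here
clf = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=0)
X = np.vstack([mc_sample, data_sample])
y = np.concatenate([np.zeros(len(mc_sample)), np.ones(len(data_sample))])
clf.fit(X, y)

# per-event weights: odds that an MC event looks like data
p = clf.predict_proba(mc_sample)[:, 1]
weights = p / (1 - p)

# store the weights column next to the sample, to be merged back later
out_path = os.path.join(tempfile.mkdtemp(), 'mc_weights.txt')
np.savetxt(out_path, weights)
```

With hep_ml itself, `GBReweighter.fit` / `predict_weights` replace the classifier-and-odds step, but the "compute in the same script, save only the weights column" pattern is the same.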

When you need to store the formula

Possible reasons:

You can use pickle (cPickle in Python 2). It works as follows:

import pickle  # Python 2: import cPickle as pickle

# saving the formula (pickle requires binary mode)
with open('reweighter.pkl', 'wb') as f:
    pickle.dump(reweighter, f)

# loading the formula
with open('reweighter.pkl', 'rb') as f:
    reweighter = pickle.load(f)
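After reloading, it is worth checking that the restored object reproduces the original predictions exactly. A minimal sketch of that round-trip check, using a sklearn regressor as a stand-in for the reweighter (pickling works identically) and `pickle.dumps`/`loads` so no file is needed:

```python
import pickle
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] + rng.normal(scale=0.1, size=200)

# any fitted estimator (a sklearn regressor here, standing in for a reweighter)
model = GradientBoostingRegressor(n_estimators=20, random_state=0).fit(X, y)

# round-trip through pickle in memory
restored = pickle.loads(pickle.dumps(model))

# the reloaded object must reproduce the original predictions exactly
assert np.array_equal(model.predict(X), restored.predict(X))
```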

Exporting to TMVA

(needed when you have to apply the formula inside some production script / experiment framework)

When applying the formula, a reweighter is not much different from plain gradient boosting / random forest (see how predict_weights works).

hep_ml uses its own BDT implementation, but it is easily converted from/to sklearn's.

There are solutions that convert sklearn's trees to TMVA format: koza4ok and sklearn-pmml.

Warning: I haven't tried either of those, since I am not using TMVA, so I expect caveats along the way. If someone has tried and succeeded with exporting to TMVA, let me know.
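Those converters rely on sklearn's internal layout, where a fitted gradient-boosting model stores its trees in `estimators_`, a 2-D array of shape `(n_estimators, K)` of `DecisionTreeRegressor` objects (K = 1 for binary classification). A quick way to inspect that layout on a toy model (the data here is invented for the example):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] > 0).astype(int)

clf = GradientBoostingClassifier(n_estimators=10, random_state=0).fit(X, y)

# one DecisionTreeRegressor per boosting stage per class (K = 1 here)
print(clf.estimators_.shape)  # (10, 1)
assert isinstance(clf.estimators_[0, 0], DecisionTreeRegressor)
```

This is the structure a converted hep_ml model has to mimic, which is what the snippet in the next comment does.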

bifani commented 8 years ago

Hi Alex,

thanks a lot for the quick feedback! cPickle looks like what I need, I'll give this a go

Regards, s.

kpedro88 commented 6 years ago

I have a question about converting from hep_ml BDTs to sklearn BDTs. I am trying to use the "exporting to TMVA" method via koza4ok, and it works with a few tweaks:

import numpy as np

# expose hep_ml's loss and trees under the attribute names sklearn expects
classifiers['uGBFL'].loss_ = classifiers['uGBFL'].loss
classifiers['uGBFL'].loss_.K = 1
# np.object is deprecated in recent numpy; plain `object` works there
classifiers['uGBFL'].estimators_ = np.empty((classifiers['uGBFL'].n_estimators, classifiers['uGBFL'].loss_.K), dtype=np.object)
for i, est in enumerate(classifiers['uGBFL'].estimators):
    classifiers['uGBFL'].estimators_[i] = est[0]

However, I am not sure the last line gives the correct output. In UGradientBoostingClassifier, the estimators member is a list of (tree, leaf_values) pairs. The leaf_values first come from the tree, but then get updated: https://github.com/arogozhnikov/hep_ml/blob/41e97d598e621ce323a92a607625213ef9d45a36/hep_ml/gradientboosting.py#L136-L144

In the end, get_leaf_values() returns a different array from the leaf_values stored in the estimators list:

>>> print classifiers['uGBFL'].estimators[0][0].get_leaf_values()
[ 0.01252273 -1.72148748 -2.77744433 -1.07583091  0.29113487  0.16071584
  0.05392691  1.75249969  2.29887652]
>>> print classifiers['uGBFL'].estimators[0][1]                  
[ 0.          0.         -2.6523975  -1.15883605  0.          0.
  0.08844491  1.44762732  2.12097526]

Should I export the array from get_leaf_values(), or use the leaf_values from the list?

arogozhnikov commented 6 years ago

Hi @kpedro88, your analysis is correct: only the leaf id predicted by the tree is important, not the leaf values inside it; the leaf values stored separately in the (tree, leaf_values) pair are the ones actually used. So the leaf values stored inside the tree are ignored completely.

For conversion, almost surely you'll need to do something like the following (not tested, may need corrections):

import copy

new_estimators = []
for tree, leaf_values in estimators:
    new_tree = copy.deepcopy(tree)
    # one value per node; overwrite them with the externally stored leaf values
    assert new_tree.tree_.value.shape == (len(leaf_values), 1, 1)
    new_tree.tree_.value[:, 0, 0] = leaf_values
    new_estimators.append(new_tree)  # save the new tree to the ensemble

Don't forget to verify you get the same predictions before / after conversion
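The mechanism behind this overwrite can be sanity-checked on a plain sklearn tree: write known values into `tree_.value` and confirm that `predict` returns exactly those values for whichever leaf each event lands in. The toy data below is invented for the check.

```python
import copy
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 2))
y = X[:, 0]

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

# overwrite every node's stored value with its own node index
new_tree = copy.deepcopy(tree)
n_nodes = new_tree.tree_.node_count
assert new_tree.tree_.value.shape == (n_nodes, 1, 1)
new_tree.tree_.value[:, 0, 0] = np.arange(n_nodes, dtype=float)

# predict now returns the index of the leaf each event lands in,
# confirming that predictions are read straight from tree_.value
leaf_ids = new_tree.apply(X)
assert np.array_equal(new_tree.predict(X), leaf_ids.astype(float))
```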