bifani opened this issue 8 years ago
This is a frequent question (or family of questions) from physicists who are interested in applying reweighting to one more data sample. Below I give solutions for different situations.
A frequently applicable solution, for some reason ignored by physicists (ROOT influence?), is to read this file inside the same script/notebook and apply the reweighter there, as sketched below.
You can store the resulting weights column using the recipe from this issue.
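For concreteness, a minimal sketch of that workflow (the names are placeholders: original, target and other_sample are assumed to be numpy arrays or pandas DataFrames of the chosen variables):

from hep_ml.reweight import GBReweighter

# train the reweighting rule on the original -> target pair
reweighter = GBReweighter(n_estimators=50, max_depth=3)
reweighter.fit(original, target)

# apply the same rule to another sample within the same script/notebook
weights = reweighter.predict_weights(other_sample)
# `weights` is a 1D numpy array that can be saved as an extra column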
Another situation: you want to save the reweighting rule and load it again later in Python. You can use cPickle, which works as follows:
import cPickle as pickle

# saving the trained reweighter (binary mode is safest for pickle)
with open('reweighter.pkl', 'wb') as f:
    pickle.dump(reweighter, f)

# loading it back later
with open('reweighter.pkl', 'rb') as f:
    reweighter = pickle.load(f)
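If you are on Python 3, the module is simply called pickle; a version-agnostic import looks like this:

try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle  # Python 3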
Applying the formula in C++ / other languages is needed when you have to build the reweighting into some production script / experiment framework. When applying the formula, a reweighter is not much different from simple gradient boosting / a random forest (see how predict_weights works). hep_ml uses its own BDT, but it is easily converted from/to sklearn.
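For intuition, a hedged sketch of what the formula reduces to (the attribute name gb and the exact weight normalization are assumptions about hep_ml internals and may differ between versions):

import numpy as np

# weights from the public API
w_api = reweighter.predict_weights(sample)
# the same quantity reconstructed from the underlying boosting score
w_manual = np.exp(reweighter.gb.decision_function(sample))
# should agree up to an overall normalization of the weights
print(np.allclose(w_api / w_api.mean(), w_manual / w_manual.mean()))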
There are solutions that convert sklearn's trees to TMVA format: koza4ok and sklearn-pmml.
Warning: I haven't tried any of those, since I am not using TMVA, so I expect many caveats along the way. If someone has tried and succeeded with exporting to TMVA, let me know.
Hi Alex,
thanks a lot for the quick feedback! cPickle looks like what I need; I'll give it a go.
Regards, s.
I have a question about converting from hep_ml BDTs to sklearn BDTs. I am trying to use the "exporting to TMVA" method via koza4ok, and it works with a few tweaks:
import numpy as np

classifiers['uGBFL'].loss_ = classifiers['uGBFL'].loss
classifiers['uGBFL'].loss_.K = 1  # single-output loss, as in a binary sklearn GBDT
# sklearn expects a 2D array of trees with shape (n_estimators, K)
classifiers['uGBFL'].estimators_ = np.empty((classifiers['uGBFL'].n_estimators, classifiers['uGBFL'].loss_.K), dtype=np.object)
for i, est in enumerate(classifiers['uGBFL'].estimators):
    classifiers['uGBFL'].estimators_[i] = est[0]  # keep the tree, drop the separate leaf values
However, I am not sure the last line gives the correct output. In UGradientBoostingClassifier, the estimators member is a list of [tree, leaf_values] pairs. The leaf_values first come from the tree, but then get updated: https://github.com/arogozhnikov/hep_ml/blob/41e97d598e621ce323a92a607625213ef9d45a36/hep_ml/gradientboosting.py#L136-L144
At the end, get_leaf_values() returns a different array than the leaf_values stored in the estimators list:
>>> print classifiers['uGBFL'].estimators[0][0].get_leaf_values()
[ 0.01252273 -1.72148748 -2.77744433 -1.07583091 0.29113487 0.16071584
0.05392691 1.75249969 2.29887652]
>>> print classifiers['uGBFL'].estimators[0][1]
[ 0. 0. -2.6523975 -1.15883605 0. 0.
0.08844491 1.44762732 2.12097526]
Should I export the array from get_leaf_values(), or use the leaf_values from the list?
Hi @kpedro88
Your analysis is correct: only the leaf id predicted by the tree is important, not the leaf values inside it; the leaf values stored separately in the (tree, leaf_values) pair are the ones that actually get used. So the leaf values stored inside the tree are ignored completely.
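A hedged sketch of how one boosting stage is evaluated under this scheme (X is a hypothetical feature matrix, and I assume the stored tree exposes sklearn's apply()):

# per-event leaf index from the tree structure
leaf_ids = tree.apply(X)
# per-event contribution comes from the separately stored array
stage_score = leaf_values[leaf_ids]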
For the conversion, almost surely you'll need to do the following (not tested, maybe needs corrections):

import copy

for tree, leaf_values in estimators:
    new_tree = copy.deepcopy(tree)
    # overwrite the values stored inside the tree with the external leaf values
    assert new_tree.tree_.value.shape == (len(leaf_values), 1, 1)
    new_tree.tree_.value[:, 0, 0] = leaf_values
    # <save new tree to the ensemble>
Don't forget to verify that you get the same predictions before / after the conversion.
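A hedged way to do that check (sklearn_model is a hypothetical name for the converted ensemble, and I assume both objects expose decision_function):

import numpy as np

before = classifiers['uGBFL'].decision_function(X)  # hep_ml scores
after = sklearn_model.decision_function(X)  # scores after conversion
assert np.allclose(before, after)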
Sometimes one would like to use a control sample, e.g. because it is more abundant, to determine MC weights that are then applied to other, e.g. rarer, samples.
For this reason it would be very useful if hep_ml.reweight could export the "reweighting formula" in some format, e.g. ROOT, so that it can also be reused from different programming languages.
Thanks