arogozhnikov / hep_ml

Machine Learning for High Energy Physics.
https://arogozhnikov.github.io/hep_ml/

Negative sWeights #51

Open marthaisabelhilton opened 6 years ago

marthaisabelhilton commented 6 years ago

Hi,

I am trying to use the BoostingToUniformity notebook, in particular the uBoost classifier. I am getting the error message 'the weights should be non-negative'. I have tried removing this check from the source code and running uBoost without it. When I use the 'predict' function I get an array of all zeros, and when I try to plot the ROC curve I get NaNs as the output. I am wondering if there is a way of dealing with negative weights?

Many thanks,

Martha

arogozhnikov commented 6 years ago

Hi Martha, negative weights aren't friendly to ML because they turn training into a non-convex, unbounded optimization problem, so you should not expect them to work well for ML models (although sometimes they do).

@tlikhomanenko prepared an overview of strategies for dealing with negative weights some time ago, but the first thing to try is simply removing samples with negative weights from training (but not from testing; that's important).
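A minimal sketch of that suggestion (a plain sklearn classifier stands in for uBoost here, and the arrays below are toy stand-ins for real features, labels and sWeights):

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# hypothetical toy data standing in for real features, labels and per-event sWeights
rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 3))
y = rng.randint(0, 2, size=1000)
sweights = rng.normal(loc=1.0, scale=0.7, size=1000)   # some entries are negative

train_mask = sweights >= 0                              # drop negative weights from training only
clf = GradientBoostingClassifier()
clf.fit(X[train_mask], y[train_mask], sample_weight=sweights[train_mask])
# evaluation / ROC curves are then done on a test set that keeps its original sWeights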

tlikhomanenko commented 6 years ago

Hi Martha,

Please have a look at this notebook, prepared for a summer school: https://github.com/yandexdataschool/mlhep2015/blob/master/day2/advanced_seminars/sPlot.ipynb. There is a section called "Training on sPlot data" where you can find several approaches for training a classifier on data with both negative and positive weights. I hope you'll find them useful.

alexpearce commented 6 years ago

For classifiers that only compute statistics on ensembles of events whilst fitting, like decision trees, I would hope that an implementation would accept negative weights, rather than doing assert (weights < 0).sum() == 0.

Where it should fail is when the sum of weights in the ensemble currently under study is negative.
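Illustratively, the idea would be to replace the all-weights-non-negative assertion with a check on the total weight of the ensemble; a minimal sketch, not existing hep_ml code:

import numpy as np

def check_ensemble_weights(sample_weight):
    sample_weight = np.asarray(sample_weight, dtype=float)
    # fail only when the total weight of the ensemble under study is negative,
    # not when individual weights happen to be negative
    if sample_weight.sum() < 0:
        raise ValueError("sum of sample weights in the ensemble is negative")

check_ensemble_weights([-0.3, 1.0, 0.5])   # passes: individual negatives, but positive sum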

marthaisabelhilton commented 6 years ago

Thanks for your responses. I have tried removing the negative weights from my training sample and classifier.predict(X_train) is giving me an array of all 1's. Do you know why this is happening?

I am using a method similar to the 'Add events two times in training' section in the notebook above.
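(For reference, a rough illustration of how such a doubled training set could be assembled. This is only one reading of the idea; the linked notebook is the authoritative description, and all names below are hypothetical.)

import numpy as np

# hypothetical stand-ins: features X plus per-event signal/background sWeights
rng = np.random.RandomState(1)
X = rng.normal(size=(500, 3))
w_sig = rng.normal(loc=0.5, scale=0.3, size=500)
w_bkg = 1.0 - w_sig

# every event enters training twice: once labelled signal, once labelled background,
# each copy carrying the corresponding sWeight
X_doubled = np.vstack([X, X])
y_doubled = np.concatenate([np.ones(len(X)), np.zeros(len(X))])
w_doubled = np.concatenate([w_sig, w_bkg])
# clf.fit(X_doubled, y_doubled, sample_weight=w_doubled)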

arogozhnikov commented 6 years ago

@alexpearce

Hey Alex, I don't think it is so different for trees. Things can go arbitrarily wrong even in very simple situations:

import numpy
from sklearn.ensemble import GradientBoostingRegressor

reg = GradientBoostingRegressor(n_estimators=100, max_depth=1).fit(numpy.arange(2)[:, None], numpy.arange(2), sample_weight=[-0.9999999999, 1])
reg.predict(numpy.arange(2)[:, None])
# outputs: array([9.99999917e+09, 9.99999917e+09])

@marthaisabelhilton

No idea offhand, but try clf.predict_proba to see whether the predicted probabilities provide meaningful separation.
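A quick way to check that, sketched with hypothetical placeholders clf, X_test, y_test for a fitted classifier and a held-out test set:

from sklearn.metrics import roc_auc_score

proba = clf.predict_proba(X_test)[:, 1]   # signal probability instead of hard labels
print(proba.min(), proba.max())           # a (near-)constant output means no separation
print(roc_auc_score(y_test, proba))       # unweighted AUC as a first sanity check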

alexpearce commented 6 years ago

Yes, negative weights certainly can make things go bad, but with very low sample sizes sWeights don't make much sense either; they only give 'reasonable' results for 'large' ensembles (all poorly defined terms, of course). That's why I was suggesting that algorithms not check for negative weights immediately, but only when actually computing the quantities used in the fitting.

arogozhnikov commented 6 years ago

@alexpearce Well, in that case you should check the sum in each particular leaf of the tree (since we aggregate over the samples in a leaf).

I can see potential complaints like "it just worked with two trees, what's the problem with the third one?" (in a huge ensemble like uBoost this check will almost surely be triggered), but I don't mind if anyone decides to PR such checks.
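A sketch of what such a per-leaf check could look like on a fitted sklearn tree; this is an assumption about one possible implementation, not existing hep_ml code:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def leaf_weight_sums_positive(fitted_tree):
    t = fitted_tree.tree_
    leaves = t.children_left == -1               # -1 marks a leaf node
    # weighted_n_node_samples stores the sum of sample weights per node
    return bool(np.all(t.weighted_n_node_samples[leaves] > 0))

# toy usage with one negative weight
# (note: whether fit() accepts negative weights may depend on the sklearn version)
X = np.arange(10, dtype=float)[:, None]
y = (X.ravel() > 4).astype(float)
w = np.ones(10)
w[0] = -0.5
tree = DecisionTreeRegressor(max_depth=2).fit(X, y, sample_weight=w)
print(leaf_weight_sums_positive(tree))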

alexpearce commented 6 years ago

Well, in that case you should check the sum in each particular leaf of the tree (since we aggregate over the samples in a leaf).

Yes, exactly. The check should be made at that point, rather than when the training data is first fed into the tree.

And you're right, I should just open a PR if I think this is useful behaviour. I'll look into it.

(You're also right, for the third time, that I might be underestimating how often an ensemble containing negative weights will have a negative sum, but I would leave that problem to the users, who can tune the hyperparameters.)