arogozhnikov / hep_ml

Machine Learning for High Energy Physics.
https://arogozhnikov.github.io/hep_ml/

Large variations in signal/background distributions #73

Open TommyDESY opened 2 years ago

TommyDESY commented 2 years ago

Hi everyone,

I'm currently using uBoost for my Belle II analysis. For context, I'm trying to separate B -> Xu l nu signal from B -> Xc l nu background.

As I'm investigating the best target efficiency for my case, I plotted the signal and background counts after uBoost classification for 100 target efficiencies. I noticed some kinks at various points in the distributions. Please see the attached plots for more clarity.

[Attached plots: bkg_count_ex, sig_count_ex, uboost_megaplot]

Even though these variations don't look large visually, they actually correspond to ~10% fluctuations and eventually impact other quantities such as the significance. Again, see the attached plots for an example.

I'm wondering if this is a known feature of uBoost or if this behaviour is caused by my sample or choice of variables/parameters.

Thank you for your answers !

If you need more info or context for my question, please tell me. I realise my explanation is quite shallow for now.

Cheers, Tommy

arogozhnikov commented 2 years ago

Hi Tommy, maybe I misunderstand what you are doing, but it looks like you are looking at the individual predictions of the uBoostBDTs, each for a specific efficiency. If that's true, I'm a bit surprised that the result is so smooth... Part of the uBoost thinking is to run multiple BDTs with different efficiencies to smooth out the intrinsic variability of this process.

When estimating the properties of the final classifier, I would not recommend looking at individual components. A BDT's behavior is poorly explained by its individual trees, and similarly uBoost's behavior is poorly explained by looking at the individual efficiency-targeted BDTs in it.

Instead, analyze the predictions of uBoost as a whole: make a sweep of thresholds and plot the signal efficiency and the background efficiency. I expect your plots to be smoother (at the very least, both should be monotonically increasing).
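The threshold sweep suggested above could be sketched roughly as follows. This is a minimal illustration in plain Python, not code from hep_ml: the function name `efficiency_curves` and the Gaussian toy scores are assumptions made up for the example, and in practice the scores would come from the full classifier's output on a test sample.

```python
import random

def efficiency_curves(sig_scores, bkg_scores, n_thresholds=101):
    """Fraction of signal and of background passing `score >= cut` for a sweep of cuts.

    Hypothetical helper for illustration; not part of hep_ml.
    """
    lo = min(min(sig_scores), min(bkg_scores))
    hi = max(max(sig_scores), max(bkg_scores))
    # Cuts run from the highest score down to the lowest,
    # so both efficiencies can only grow along the sweep.
    cuts = [hi - (hi - lo) * i / (n_thresholds - 1) for i in range(n_thresholds)]
    sig_eff = [sum(s >= c for s in sig_scores) / len(sig_scores) for c in cuts]
    bkg_eff = [sum(b >= c for b in bkg_scores) / len(bkg_scores) for c in cuts]
    return cuts, sig_eff, bkg_eff

# Toy stand-in for the classifier output (assumed values, not real data).
rng = random.Random(0)
sig_scores = [rng.gauss(0.62, 0.03) for _ in range(2000)]
bkg_scores = [rng.gauss(0.56, 0.03) for _ in range(2000)]

cuts, sig_eff, bkg_eff = efficiency_curves(sig_scores, bkg_scores)
```

Plotting `sig_eff` and `bkg_eff` against the position in the sweep (or against each other) gives curves that are monotonic by construction, because loosening a cut can only let more events through.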

Cheers

TommyDESY commented 2 years ago

Hi Alex,

Thank you for your answer.

I understand the conceptual workflow of uBoost better now. However, I'm struggling on the technical side. I guess you mean I should use the uBoostClassifier class and not uBoostBDT individually? I can make the latter work but not the former. I reckon uBoostClassifier runs a given number of uBoostBDTs with different target efficiencies, correct? With uBoostClassifier, all the events in my dataset are always classified as signal, no matter how I tune the parameters.

I'm obviously doing something wrong but I can't quite understand what :/

arogozhnikov commented 2 years ago

Yeah, the idea is that you use the full model (that is, uBoost, which is an ensemble of ensembles). It is notoriously slow, but that's how it was designed.

> all the events in my dataset are always classified as signal

That's surprising. What about the area under the ROC curve? It may be that all predictions are shifted (e.g. all probabilities are > 0.5), but the resulting classifier still has OK properties in terms of signal vs. background separation and flatness.
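To illustrate the point about shifted predictions: the ROC AUC depends only on how events are ranked, so squeezing all scores into a narrow band above 0.5 leaves it unchanged. The sketch below uses toy Gaussian scores and a made-up monotonic `squash` function; neither is taken from hep_ml.

```python
import math
import random

def roc_auc(sig_scores, bkg_scores):
    """AUC as the chance that a random signal event outscores a random background event."""
    wins = 0.0
    for s in sig_scores:
        for b in bkg_scores:
            if s > b:
                wins += 1.0
            elif s == b:
                wins += 0.5  # ties count as half a win
    return wins / (len(sig_scores) * len(bkg_scores))

def squash(x):
    # Hypothetical monotonic squashing onto (0.5, 0.7); NOT the hep_ml function.
    return 0.5 + 0.2 / (1.0 + math.exp(-x))

# Toy raw scores (assumed values, not real data).
rng = random.Random(1)
sig_raw = [rng.gauss(1.0, 1.0) for _ in range(500)]
bkg_raw = [rng.gauss(-1.0, 1.0) for _ in range(500)]

sig_squashed = [squash(x) for x in sig_raw]
bkg_squashed = [squash(x) for x in bkg_raw]

auc_raw = roc_auc(sig_raw, bkg_raw)
auc_squashed = roc_auc(sig_squashed, bkg_squashed)
```

Here every squashed score is above 0.5, so a fixed 0.5 cut would label everything as signal, yet `auc_squashed` matches `auc_raw` because the ordering of events is preserved.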

TommyDESY commented 2 years ago

All the probabilities for the signal are between 0.5 and 0.7. I have a 100% signal and background efficiency in this case. I looked at the probabilities at every stage (and therefore at different target efficiencies) with the function staged_predict_proba(), and the probabilities always look the same. The ROC curve and flatness don't make sense here.

[Attached plot: proba_uboostclassifier]

I noticed this behaviour quite some time ago, which is why I stopped using the full classifier. uBoostBDT works as you can see from the plots I showed in my first post.

arogozhnikov commented 2 years ago

I see, the reason is this squashing function: https://github.com/arogozhnikov/hep_ml/blob/master/hep_ml/uboost.py#L540. It's not necessary, and you can remove it if you don't like the range (just return score / self.efficiency_steps).

An important comment, though: don't interpret uBoost outputs as probabilities (I know the name of the function says so, but in reality you'll need additional steps to calibrate them to probabilities). The proper way to think about the outputs is as a new discriminating variable that is more useful than the existing ones.

(So, that's not cool that uBoost returns output in such a narrow range, but that's not a problem either - users shouldn't expect it to behave like probs, and select thresholds according to their needs)

As for the .predict method that is part of the sklearn interface: there are practically no cases in HEP when you should use it. Better to just forget about its existence :)
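One way to select a threshold according to your needs, instead of relying on .predict, is to pick the cut from the signal score distribution for a chosen signal efficiency. The sketch below is a plain-Python illustration under that assumption; `cut_for_signal_efficiency` and the toy scores are hypothetical and not part of hep_ml.

```python
import random

def cut_for_signal_efficiency(sig_scores, target_efficiency):
    """Score cut such that roughly `target_efficiency` of signal has score >= cut.

    Hypothetical helper for illustration; not part of hep_ml.
    """
    ordered = sorted(sig_scores, reverse=True)
    n_keep = min(len(ordered), max(1, round(target_efficiency * len(ordered))))
    return ordered[n_keep - 1]

# Toy stand-in for the classifier output (assumed values, not real data).
rng = random.Random(2)
sig_scores = [rng.gauss(0.62, 0.03) for _ in range(1000)]
bkg_scores = [rng.gauss(0.56, 0.03) for _ in range(1000)]

cut = cut_for_signal_efficiency(sig_scores, 0.7)
sig_eff = sum(s >= cut for s in sig_scores) / len(sig_scores)
bkg_eff = sum(b >= cut for b in bkg_scores) / len(bkg_scores)
```

The absolute range of the scores does not matter here: the cut is defined relative to the score distribution itself, so it works equally well for outputs confined to a narrow band.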

TommyDESY commented 2 years ago

Thank you for all your answers ! I managed to make things work now.

I would still have one more question. I cannot see the parameters learning_rate and uniforming_rate in the uBoostClassifier class but they do exist in uBoostBDT. As they could be particularly important, I'm wondering why they are not included. Is there any particular reason ? I couldn't find any answer in the documentation.

arogozhnikov commented 2 years ago

> As they could be particularly important, I'm wondering why they are not included. Is there any particular reason ?

Not really, they can be exposed.

At the time, the idea was simply to follow the original uBoost paper (which uses a modification of 'vanilla' AdaBoost that does not have a learning rate as a parameter). From a practical perspective, I think a learning rate would be very helpful.