arogozhnikov / hep_ml

Machine Learning for High Energy Physics.
https://arogozhnikov.github.io/hep_ml/

fix: original weights wrongly normalized #69

Closed jonas-eschle closed 3 years ago

jonas-eschle commented 3 years ago

When predicting the weights, the original weights should not be normalized (as also written in the docs); I assume this is a bug?

Furthermore, I think it should also not normalize in the training method, right? Because if we do (or pass the argument through), there is no way of having different normalizations.

In fact, shouldn't the right way be to remember the normalization?
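The "remember the normalization" idea could be sketched roughly like this. This is a hypothetical class, not part of the hep_ml API; the trivial multipliers and all names are illustrative only:

```python
import numpy as np

class RememberNormReweighter:
    """Sketch of the suggestion: store the training normalization in
    fit() so predict_weights() can apply it consistently later.
    Hypothetical class, not part of hep_ml."""

    def fit(self, original, original_weight):
        w = np.asarray(original_weight, dtype=float)
        self._norm = w.mean()  # remember the training normalization
        # ... actual training on `original` with w / self._norm would go here ...
        return self

    def predict_weights(self, original, original_weight):
        w = np.asarray(original_weight, dtype=float)
        multipliers = np.ones(len(w))  # placeholder response function
        # apply multipliers to the *unnormalized* user weights, reusing
        # the normalization remembered from training
        return w * multipliers / self._norm
```

With such a scheme the prediction no longer depends on how the input sample is batched, since the normalization constant is fixed at training time.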

arogozhnikov commented 3 years ago

From a theoretical perspective you're right in this patch;

however, the normalization done here forces users to make proper corrections themselves, not relying on the output weights; I don't see much of a problem in its existence (though maybe you have an opposite case? If so, please share!).

Furthermore, I think it should also not normalize in the training method, right? Because if we do (or pass the argument through), there is no way of having different normalizations.

It's much better to normalize in training, just from the point of view of regularization and having more stable computations.

In fact, shouldn't the right way be to remember the normalization?

I don't think so: the total weight present in training would be defined more by 'how many simulation events did I have' or 'how many events should be expected in this other channel'; I certainly don't want the user to do this "normalization planning" ahead of time on other data.

Normalization should be an explicit step with conscious choices after reweighting; I think the current code matches this vision.

jonas-eschle commented 3 years ago

however, the normalization done here forces users to make proper corrections themselves, not relying on the output weights; I don't see much of a problem in its existence (though maybe you have an opposite case? If so, please share!).

That may be true, but the problem here is not with the weights generated by the algorithm but with the original weights. These are provided by the user and are fine; they should not be touched. In fact, they are not strictly needed, since a user could simply multiply the weights themselves.

Most importantly, the current behavior is not the documented one: the docs say the result is computed as original_weight * reweighter_multipliers, while in reality it is original_weight * reweighter_multipliers / mean(original_weight). This can in fact be a major bug that goes unnoticed if the initial weights are normalized to roughly one (even though that normalization carries meaning). I would therefore even suggest something like a warning for the changed behavior, as it could have had a strong impact on analyses.
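The discrepancy can be illustrated with plain NumPy; the multipliers and weights below are illustrative stand-ins, not hep_ml calls, and the normalization is written as division by the mean so that weights averaging to one are left unchanged:

```python
import numpy as np

# Hypothetical per-event multipliers from a trained reweighter,
# and user-supplied original weights (illustrative values).
multipliers = np.array([1.0, 1.0, 1.0])
original_weight = np.array([0.5, 1.5, 2.0])

# Documented behavior: result = original_weight * reweighter_multipliers
documented = original_weight * multipliers

# Reported actual behavior: the original weights are first normalized
# (divided by their mean) before the multipliers are applied.
actual = original_weight / original_weight.mean() * multipliers

print(documented)  # [0.5 1.5 2. ]
print(actual)      # [0.375 1.125 1.5  ]
```

The two results only coincide when the user's original weights already average to one, which is exactly why the bug can go unnoticed.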

A simple use case that can also create bugs rests on the assumption that predicted events should be independent of each other (if the reweighting function is fixed, this is the case; if it were not, then a whole different discussion about what reweighting actually means would be needed). So splitting the array into two arrays, predicting, and concatenating currently does not yield the same event weights as predicting all events at once (if original_weights are given). That can be a common use case when a file is too large or is split into regions.

As an example, consider x1 and x2 for which the reweighter response function is 1 (a trivial example), with original weights w1 = 0.5 and w2 = 1.5. If given simultaneously, e.g. as one array-like, the returned weights are correct: 0.5 and 1.5. If given separately, both weights will be 1.
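That example can be reproduced with a small stand-in for the reported behavior; predict_weights_buggy below is a hypothetical helper mimicking the bug (identity response function plus the unwanted normalization), not a hep_ml function:

```python
import numpy as np

def predict_weights_buggy(original_weight):
    """Mimics the reported bug: the reweighter's multipliers are
    identically 1, but the user-supplied original weights are first
    normalized to unit mean (hypothetical helper, not hep_ml)."""
    w = np.asarray(original_weight, dtype=float)
    multipliers = np.ones_like(w)  # trivial response function
    w = w / w.mean()               # the unwanted normalization
    return w * multipliers

w = np.array([0.5, 1.5])

# Predicting all events at once: mean(w) == 1, so weights come out right.
joint = predict_weights_buggy(w)                        # [0.5, 1.5]

# Splitting, predicting, and concatenating: each single-event array is
# normalized to unit mean, so both weights collapse to 1.
split = np.concatenate([predict_weights_buggy(w[:1]),
                        predict_weights_buggy(w[1:])])  # [1., 1.]

print(joint, split)
```

This is exactly the splitting scenario described above: the result of prediction depends on how the sample is batched, which it should not.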

It's much better to normalize in training, just from the point of view of regularization and having more stable computations.

That is true for the ML part, for sure. For the conceptual part of the reweighting, the ratio can play a role, and one can argue to leave that to the user. However, this means that the reweighter does not contain all the information, and additional information needs to be tracked externally.

It is, though, I agree, not per se a problem; at least I have not come across one so far.

I don't think so: the total weight present in training would be defined more by 'how many simulation events did I have' or 'how many events should be expected in this other channel'; I certainly don't want the user to do this "normalization planning" ahead of time on other data.

I rather meant the ratio of the class weights, as this can be an inherent characteristic of the two datasets (but as mentioned above, I can't directly think of a use case). Furthermore, if it were supported, an explicit "ratio" argument may be better suited than implicitly taking the ratio of the weight sums. But, as said, it is perhaps not a needed feature and can well be done manually.

arogozhnikov commented 3 years ago

Does not align with docs -> ok, that should be fixed.

The example with splitting makes sense. Need to add an explicit reminder in the docs that the total normalization after any reweighting may be changed.

jonas-eschle commented 3 years ago

Many thanks for this and for the library; it has some neat tools, and it is good to see that it is so actively maintained!

And I agree, that reminder could be good; physicists tend to use algorithms in a black-box style :)