arogozhnikov / hep_ml

Machine Learning for High Energy Physics.
https://arogozhnikov.github.io/hep_ml/

Error propagation from weights #65

Closed · acampove closed this issue 4 months ago

acampove commented 3 years ago

Hello @arogozhnikov ,

I am using the GBReweighter. Before the weights are applied, I can assume that my dataset is described by

{x_1, x_2, ..., x_n}

after the weights are applied the dataset is:

{(x_i, w_i) : i in [1, n]}

i.e. it depends on the weights. Therefore, if I initially had a function f(x_i), that function is now f(x_i, w_i). The weights w_i depend on our knowledge of the data (target) and simulation (original) distributions. However, we only have finite samples of both, so the weights should carry an uncertainty that can be propagated to f(x_i, w_i). Is there a way to estimate the error on these w_i weights? And how are these errors correlated? Those correlations would be needed to estimate the propagated error on f(x_i, w_i).
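
For concreteness, f could be something as simple as a weighted mean (this particular choice of f is only an illustration):

```python
import numpy as np

# Toy illustration: f({(x_i, w_i)}) as a weighted mean.
# Propagating the error of f requires the variances of the w_i
# and, in general, their correlations.
def weighted_mean(x, w):
    return np.sum(w * x) / np.sum(w)

rng = np.random.default_rng(0)
x = rng.normal(size=1000)             # toy observable
w = rng.uniform(0.5, 1.5, size=1000)  # toy per-event weights
print(weighted_mean(x, w))
```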

Cheers.

arogozhnikov commented 3 years ago

Estimating correlations is a dead end for that many numbers. Better to just model that correlation by training multiple models (e.g. on different cross-validation splits), and estimate the error of the downstream processing by having multiple replicas:

```
val1 = f(x_i, w_1i)
val2 = f(x_i, w_2i)
val3 = f(x_i, w_3i)
```

etc.
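
A minimal sketch of that with GBReweighter, using bootstrap resampling as one possible way to create the splits (it assumes `original` and `target` are numpy arrays of shape `(n_events, n_features)`; all names here are illustrative):

```python
import numpy as np
from hep_ml.reweight import GBReweighter

def replica_weights(original, target, n_replicas=10, seed=0):
    """Train one GBReweighter per resampled training set and
    return the predicted weights for the full original sample."""
    rng = np.random.default_rng(seed)
    all_weights = []
    for _ in range(n_replicas):
        # resample the training sets (one way to create the splits)
        idx_o = rng.integers(0, len(original), len(original))
        idx_t = rng.integers(0, len(target), len(target))
        rw = GBReweighter(n_estimators=50, max_depth=3)
        rw.fit(original[idx_o], target[idx_t])
        # predict on the *full* original sample each time
        all_weights.append(rw.predict_weights(original))
    return np.asarray(all_weights)

# downstream: one value of f per replica, spread = propagated error
# vals = [f(x, w_k) for w_k in replica_weights(original, target)]
# err  = np.std(vals)
```
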
acampove commented 3 years ago

That approach would probably mean doing ~200 trainings of the MVA. The real data is usually background-subtracted using the sPlot technique. Apparently (from studies done by other people) we cannot bootstrap the sWeighted sample; instead we have to:

  1. Bootstrap the unweighted data + simulation.
  2. Obtain the sWeights by redoing the fit.
  3. Train the MVA on the bootstrapped data and simulation.

Many times, which seems computationally very challenging (sketched below). In the case of 2D or even 3D reweighting, I think this just means that we should not use hep_ml, given that obtaining uncertainties is highly non-trivial and a number without uncertainties is pretty useless. hep_ml would only become an alternative once you start thinking about reweighting in higher dimensions.
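
Schematically, the loop would look as follows. `compute_sweights` stands in for the analysis-specific sPlot fit and is only a stub here, and the MVA in step 3 is taken to be a GBReweighter:

```python
import numpy as np
from hep_ml.reweight import GBReweighter

def compute_sweights(data):
    """Stub for step 2: a real analysis would redo the mass fit
    on the bootstrapped data and return the per-event sWeights."""
    return np.ones(len(data))

def bootstrap_trainings(data, simulation, n_boot=200, seed=0):
    rng = np.random.default_rng(seed)
    weights_per_replica = []
    for _ in range(n_boot):
        # 1. bootstrap the unweighted data + simulation
        d = data[rng.integers(0, len(data), len(data))]
        s = simulation[rng.integers(0, len(simulation), len(simulation))]
        # 2. obtain the sWeights by redoing the fit
        sw = compute_sweights(d)
        # 3. train the MVA on the bootstrapped samples
        rw = GBReweighter(n_estimators=50, max_depth=3)
        rw.fit(original=s, target=d, target_weight=sw)
        weights_per_replica.append(rw.predict_weights(simulation))
    return np.asarray(weights_per_replica)
```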

acampove commented 3 years ago

By the way, the bootstrapping argument also applies to k-folding.

arogozhnikov commented 3 years ago

You surely can subsample sWeighted samples. The opposite would mean your fit is unstable (and hence probably wrong).
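
I.e. resample event indices and carry each event's sWeight along; a minimal sketch with toy inputs:

```python
import numpy as np

def bootstrap_sweighted(x, sweights, rng):
    """Resample events with replacement, keeping each event's sWeight."""
    idx = rng.integers(0, len(x), len(x))
    return x[idx], sweights[idx]

rng = np.random.default_rng(42)
x  = rng.normal(size=1000)            # toy observable
sw = rng.normal(1.0, 0.3, size=1000)  # toy sWeights (can be negative)
x_boot, sw_boot = bootstrap_sweighted(x, sw, rng)
```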

jonas-eschle commented 4 months ago

@acampove do you maybe have sources for

> (from studies done by other people)

I would, AFAIU, second @arogozhnikov's argument that, in general, you can do that.

acampove commented 4 months ago

> @acampove do you maybe have sources for
>
> > (from studies done by other people)
>
> I would, AFAIU, second @arogozhnikov's argument that, in general, you can do that.

Hi Jonas,

This is what I remember Christoph Langenbruch saying once in a meeting about the measurement of the Bs -> phi mumu branching ratio. I would check their note or just talk to him; maybe you know him better than I do :)

Cheers.