cta-observatory / ctapipe

Low-level data processing pipeline software for CTAO or similar arrays of Imaging Atmospheric Cherenkov Telescopes
https://ctapipe.readthedocs.org
BSD 3-Clause "New" or "Revised" License

Weighting events when training sklearn models #2280

Open gschwefer opened 1 year ago

gschwefer commented 1 year ago

Please describe the use case that requires this feature. When training any of the currently implemented sklearn models, I want to (re-)weight events, e.g. to a different energy spectrum, so that the same simulation set can be used for multiple models.

Describe the solution you'd like: I think it would be nice to have a weight as a trait of the training tools, in combination with a FeatureGenerator, so that you can use e.g. true_energy**-1 as the weight. The weight would then be passed to the corresponding fit() function of the model class.
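To illustrate what such a weight does inside a fit, here is a minimal sketch in plain Python. It does not use sklearn or ctapipe: `weighted_mean_fit` is a hypothetical toy stand-in for a model's fit(), and the event values are made up. It only shows that per-event weights computed from true_energy**-1 shift the fitted result towards low-energy events.

```python
def weighted_mean_fit(targets, sample_weight):
    """Toy stand-in for fit(): the constant that minimises the weighted squared error."""
    total = sum(sample_weight)
    return sum(w * t for w, t in zip(sample_weight, targets)) / total


# Hypothetical events: true energy (TeV) and some quantity to regress.
true_energy = [0.1, 1.0, 10.0]
target = [1.0, 2.0, 3.0]

# Weight expression from the proposal: true_energy**-1
weights = [e ** -1 for e in true_energy]

unweighted = weighted_mean_fit(target, [1.0] * len(target))
weighted = weighted_mean_fit(target, weights)
```

With these numbers the weighted result is pulled towards the target value of the lowest-energy event, because that event carries the largest weight.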

Describe alternatives you've considered: I'm open to all suggestions and advice on how this could be done.

kosack commented 1 year ago

This is absolutely needed, as we are not guaranteed to have uniform distributions as input. (With the current sims we used E^-2, which is not bad as it is flat in energy density, but in the future we may have a more complex distribution.) In protopipe we always passed a sample_weight option to the fit(X, y, sample_weight) function of each regressor. We could allow a sample_weight function in the config that the user can specify, similar to what you say above.

maxnoe commented 1 year ago

The corresponding weighting functions are already in pyirf, so maybe it's time to make pyirf a dependency of ctapipe and use those event reweighting functions for this.
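For reference, spectral reweighting boils down to the ratio of the target spectrum to the simulated spectrum, evaluated at each event's true energy. The sketch below is plain Python with hypothetical helper names (`power_law`, `spectral_weights`); it is not pyirf's API, and the spectral indices are illustrative only.

```python
def power_law(energy, index, norm=1.0, e_ref=1.0):
    """Power-law flux norm * (energy / e_ref)**index (arbitrary units)."""
    return norm * (energy / e_ref) ** index


def spectral_weights(true_energy, simulated_index, target_index):
    """Per-event weight = target spectrum / simulated spectrum at the true energy."""
    return [
        power_law(e, target_index) / power_law(e, simulated_index)
        for e in true_energy
    ]


# Hypothetical events in TeV, simulated with E^-2 and reweighted to E^-2.6.
true_energy = [0.1, 1.0, 10.0]
weights = spectral_weights(true_energy, simulated_index=-2.0, target_index=-2.6)
```

Because the target spectrum is softer than the simulated one, low-energy events get weights above one and high-energy events get weights below one.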

gschwefer commented 1 year ago

I assume you are referring to everything in pyirf/spectral.py? Is there generally a need to be more general than that and allow reweighting based on parameters other than the energy spectrum? A hypothetical scenario could be when the core positions of events are not simulated homogeneously, to get more events close to the telescopes, which you would then correct through weights on core position. But I don't know how relevant that is...

maxnoe commented 1 year ago

> A hypothetical scenario could be when the core positions of events are not simulated homogeneously to get more events close to the telescopes which you would then correct through weights on core position. But I don't know how relevant that is...

Extremely relevant as soon as this approach is used in actual MC productions for CTA. The possibility is already there in sim_telarray, see e.g. https://github.com/cta-observatory/ctapipe/issues/1577

gschwefer commented 1 year ago

Interesting. Would you treat these weights separately from the spectral weight because there is one "right" distribution? Because otherwise it wouldn't make sense to use the pyirf functions that are only for spectral weights.

maxnoe commented 1 year ago

Yes, these weights are independent (at least at the moment). You need to apply both multiplicatively to arrive at any meaningful physical flux model.
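Applying both multiplicatively just means an element-wise product of the two per-event weight arrays. A minimal sketch, with a hypothetical helper name and made-up weight values:

```python
def combined_weights(spectral, core):
    """Element-wise product of two independent per-event weights."""
    if len(spectral) != len(core):
        raise ValueError("weight lists must have one entry per event")
    return [s * c for s, c in zip(spectral, core)]


# Hypothetical per-event weights for three events.
spectral = [4.0, 1.0, 0.25]  # from the energy spectrum
core = [0.5, 1.0, 2.0]       # from the core-position distribution
total = combined_weights(spectral, core)
```

The resulting list is what would be handed to a model's fit() as the sample weight.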

kosack commented 1 year ago

Also note that we have the /simulation/service/shower_distribution in the output files, which is a 2D histogram of energy and core distance. That can be used to re-weight events to different offset and energy distributions. In current simulations, the core_dist dimension is a flat distribution, but that is not required.
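As a rough sketch of how such a histogram could be used: look up each event's (energy, core distance) bin and take the ratio of the desired content to the simulated content in that bin. The bin edges, histogram contents, and function names below are hypothetical; the actual layout of the shower_distribution table is not assumed here.

```python
from bisect import bisect_right


def bin_index(edges, value):
    """Index of the bin containing value, or None if it lies outside the edges."""
    if value < edges[0] or value >= edges[-1]:
        return None
    return bisect_right(edges, value) - 1


def histogram_weights(energy, core_dist, e_edges, d_edges, simulated, target):
    """Per-event weight = target / simulated content of the event's 2D bin."""
    weights = []
    for e, d in zip(energy, core_dist):
        i, j = bin_index(e_edges, e), bin_index(d_edges, d)
        if i is None or j is None or simulated[i][j] == 0:
            weights.append(0.0)  # outside the histogram or nothing simulated there
        else:
            weights.append(target[i][j] / simulated[i][j])
    return weights


# Hypothetical 2x2 histograms: rows are energy bins, columns are core-distance bins.
e_edges = [0.1, 1.0, 10.0]      # TeV
d_edges = [0.0, 500.0, 1000.0]  # m
simulated = [[100, 100], [10, 10]]
target = [[50, 150], [10, 10]]

weights = histogram_weights(
    [0.5, 0.5, 5.0], [100.0, 700.0, 100.0], e_edges, d_edges, simulated, target
)
```

Events that fall outside the histogram, or in a bin with no simulated showers, get weight zero in this sketch; a real implementation would need to decide how to treat them.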