Need for adaptive percentile cut calculation and evaluation of binned cut

chaimain commented 2 years ago

With the current new LST-1 MC simulations pointing at different direction nodes in sky, the energy range of the MC files also vary. For example, with the current MC nodes with LST-1, we have a minimum energy of 5-50 GeV and a maximum energy of 75-200 TeV. If one uses different energy ranges for creating IRFs for respective nodes, then we run into problems with IRF interpolation with inconsistent energy bins, whereas if we just use a superset of all necessary energy ranges (5 GeV to 200 TeV), we include bins with no MC events, but a cut value stored as the fill_value.

One suggestion I can think of is to have an extra column with boolean information, to state if that particular energy bin has zero events, or non-zero events. This will also help with not including real events, when creating DL3 files, in those energy bins where we do not have MC events.

Another usage of such adaptive function can be to investigate non-uniform distribution of gammaness (or theta), in some energy bins, which may lead to strange evaluation of percentile cuts. I see some blob of events in the high energy bins in gammaness distribution, like, new_mc_node_32_gh_dist_zoomed I am assuming it is due to some strange high-energy threshold gamma-hadron separation issue, and, so, maybe one can have a safety check for such huge discontinuous set of events, and make the percentile cut calculation a bit more adaptive.

@maxnoe @RuneDominik Let me know if I am missing something on this issue, or of there is a better alternative.

maxnoe commented 2 years ago

Can you describe a concrete issue in the data analysis you are facing? I don't see any here... these seem to be concerns from looking at plots. Did you actually encounter something going wrong when applying this to the IRFs / the resulting IRFs to data?

Yes, the solution is to use the same energy binning for the IRFs (or at least only additional bins at the lower / higher energy end and then use the overlapping part for interpolation).

kosack commented 2 years ago

One suggestion I can think of is to have an extra column with boolean information, to state if that particular energy bin has zero events, or non-zero events. This will also help with not including real events, when creating DL3 files, in those energy bins where we do not have MC events.

Seems there are two issues here:

energy binning (which as @maxnoe said should simply be fixed to a large range and should not change per node) - the energy ranges can be either computed from the A_eff for each bin, or stored in the headers as Gammapy does now.
missing/underfilled bins: This is a more general issue: what to do with IRF bins with low or zero stats. For CTA/DPPS, we plan on either including error bars at each point that provide this information or a boolean mask of "usable" bins, and/or applying smoothing/interpolation/extrapolation functions such that all bins at least have a value (likely a combination of both, since it can be dangerous to simply extrapolate/interpolate when stats are very low). In the example above, however, it looks like smoothing would work fine to fill in the hole.

Also, it's true you should be careful using percentiles for things like gammaness, which is not guaranteed (or even expected) to be normally distributed.

chaimain commented 2 years ago

@maxnoe We have some issues with the MC produced with zenith 40+ deg, due to the limitation on the training dataset used for the currently used RF model, but still, trying to interpolate IRF for a data (Run 2967), with the IRF nodes [zen, az] used for interpolation - [32.059, 102.217], [43.197, 87.604] and [10.0, 102.199] gives the following gammaness cuts distributions (along with other nodes). gh_cuts_dist_diff_nodes_w_data I used a similar interpolation function as #180 for interpolating the GH_CUTS table (similar to RAD_MAX table). Here you can see that for the interpolated IRF, the cut is applied at energy ranges, where the events in data also does not exist (Max energy for data ~40 TeV).

chaimain commented 2 years ago

@kosack Thanks for the suggestion. I will try using smoothing to check on the percentile cut for gammanes.

chaimain commented 2 years ago

I have not tested further how it affects the full data analyses, but I think this is a roadblock to first address - having (interpolated) IRF binned in energies, where we do not have events in real data.

maxnoe commented 2 years ago

roadblock to first address - having (interpolated) IRF binned in energies, where we do not have events in real data.

I don't understand why this should be a problem at all, let a lone a road blocker.

Having IRF defined in regions where you didn't observe data (probably purely because of low observation times to record enough high energy showers) shouldn't create any problems.

chaimain commented 2 years ago

roadblock to first address - having (interpolated) IRF binned in energies, where we do not have events in real data.

I don't understand why this should be a problem at all, let a lone a road blocker.

Having IRF defined in regions where you didn't observe data (probably purely because of low observation times to record enough high energy showers) shouldn't create any problems.

No problems in data analysis as far as I understand, should arrive by this, but it seemed to me that we should have a "boolean mask" as @kosack also suggested, so that we can have statistically significant matching of the energy binning of ("generally good" GTI observation) data and IRFs.

But it seems, that I am thinking in the wrong direction for it. I will check further in my analysis to see anything that can relate to this issue.

moralejo commented 2 years ago

Hi @chaimain, I do not see any big problem here.

Clearly, the interpolation will be using pointings with different energy ranges, and we will only be able to obtain reliable interpolated values at energies for which all interpolated points have enough statistics (also after trigger & analysis).

Eventually, we may find out that the produced stats are too low / the sky grid step is too coarse, or the production energy ranges can be improved, but I don't think we are yet at that point.

I do not see why you would need additional info in the MC files, everything you need to know where the MC was generated, and what stats you have in Ereco bins, is available in the MC DL2 files.

As for non-uniform distributions of gammaness, it is a completely unrelated topic, right? First of all, showers are used more than once, with different telescope positions, so fluctuations may look more significant than they really are (and certainly one should decide how to treat that e.g. when you are determining cut efficiencies). It might as well be a genuine effect, and then the approach should be to find what those events are, checking their reconstruction in detail, to understand whether this is an analysis issue or perhaps some problem with the simulation.

cta-observatory / pyirf

Need for adaptive percentile cut calculation and evaluation of binned cut #181