IBM / unitxt

🦄 Unitxt: a python library for getting data fired up and set for training and evaluation
https://www.unitxt.ai
Apache License 2.0

Increase n_resamples for GlobalMetric in testing so confidence intervals are not NaN #382

Open sam-data-guy-iam opened 9 months ago

sam-data-guy-iam commented 9 months ago

@eladven @matanor In test_metric in test_utils/metrics.py, for a GlobalMetric we have:

    if isinstance(metric, GlobalMetric) and metric.n_resamples:
        metric.n_resamples = 3  # Use a low number of resamples in testing for GlobalMetric, to save runtime

While this may be a good way to save runtime, it appears to cause a problem in the testing: it can make the confidence intervals (which by default use the bias-corrected and accelerated (BCa) method) undefined. In my case, the resampled scores (the input theta_hat_b to _bca_interval), which consist of only 3 numbers (the n_resamples set above), are all above the single observed value theta_hat. This makes percentile equal to 0, so z0_hat, which is used to compute the rest of the interval, becomes -Inf, and the remaining computations produce NaN.

Basically, we need theta_hat to fall somewhere within the range of the theta_hat_b values, so that percentile lies in (0, 1) rather than being exactly 0 or 1 (see the formulas at http://users.stat.umn.edu/~helwig/notes/bootci-Notes.pdf, slides 34-35). The likelihood of this increases if we allow a few more resamples; perhaps 15 or 20 is enough for testing purposes.

It seems that some other GlobalMetric objects, like NDCG and Squad, are tested as MetricPipeline rather than GlobalMetric, so this is not an issue for them. I'm not sure whether it affects other tests.
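To make the failure mode concrete, here is a minimal, self-contained sketch (not the unitxt test code; it calls scipy.stats.bootstrap directly, and the sample data is made up) showing how a BCa interval computed from very few resamples can come out as NaN:

    import numpy as np
    from scipy.stats import bootstrap

    rng = np.random.default_rng(0)
    sample = np.array([0.0, 0.0, 1.0, 1.0, 1.0])  # toy scores, for illustration only

    # With only 3 resamples, all resampled means can easily land on one side of
    # the observed mean (percentile = 0 or 1), so the BCa bias-correction term
    # z0_hat becomes +/-Inf and the interval degenerates to NaN.
    low = bootstrap((sample,), np.mean, n_resamples=3, method="BCa", random_state=rng)
    print(low.confidence_interval)   # may print ConfidenceInterval(low=nan, high=nan)

    # With a few more resamples (15-20), the observed mean almost always falls
    # inside the range of the resampled means, and the interval is finite.
    high = bootstrap((sample,), np.mean, n_resamples=20, method="BCa", random_state=rng)
    print(high.confidence_interval)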

yoavkatz commented 9 months ago

@matanor - Please advise. I saw these warnings too.

matanor commented 9 months ago

@sam-data-guy-iam I think we can increase it; the only disadvantage is test runtime. Note that changing the value here would require updating the expected confidence intervals of other GlobalMetric-derived objects. This can be done by changing the value, running the tests, and manually updating the relevant expected values.
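For illustration only (this dictionary is hypothetical, not copied from the unitxt test suite), the kind of hard-coded expected result that would need refreshing looks roughly like this; after raising n_resamples, rerun the tests and copy the newly printed confidence-interval bounds back into the expected values:

    # Hypothetical expected global result in a metric test; the CI bounds are
    # placeholders that must be re-copied from the test output after changing
    # n_resamples.
    expected_global_result = {
        "f1_macro": 0.5,
        "score_ci_low": 0.0,   # placeholder bound
        "score_ci_high": 1.0,  # placeholder bound
    }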

sam-data-guy-iam commented 9 months ago

@matanor Okay, as I said, I think this is a simple fix: increase it to, say, 10-15 just for testing. Of course, we could leave it as is and just accept NaNs in the confidence interval; it just looks a little odd in the test output. It also seems unpredictable in advance whether theta_hat ends up outside the range of the resampled scores, but a few more resamples would make that very unlikely. Let me know what you decide either way.
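A minimal sketch of what the adjustment in test_utils/metrics.py might look like (the exact value is an assumption, not a confirmed patch; the thread discusses anything from 10 to 20):

    if isinstance(metric, GlobalMetric) and metric.n_resamples:
        # Still far fewer resamples than the production default, but enough that
        # theta_hat is very unlikely to fall outside the range of the resampled
        # scores, so the BCa interval stays defined.
        metric.n_resamples = 15  # assumed value; 10-20 was discussed in the thread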