aangelopoulos / ppi_py

A package for statistically rigorous scientific discovery using machine learning. Implements prediction-powered inference.
MIT License
205 stars, 15 forks

ppi_distribution_label_shift_ci Exception? #16

Open kitkhai opened 3 months ago

kitkhai commented 3 months ago

Hi, I was playing around with the ppi_distribution_label_shift_ci function, supplying dummy values, when I encountered an exception. I'm not sure I defined the nu vector correctly, since I don't know what it is for or how to define it. I would appreciate it if you could clarify that as well. Thank you!

import numpy as np
from ppi_py import ppi_distribution_label_shift_ci

# True labels
Y = np.array([0, 1, 0, 1, 0])

# Predicted labels for labeled data
Yhat = np.array([0, 1, 1, 1, 0])

# Predicted labels for unlabeled data
Yhat_unlabeled = np.array([0, 0, 1, 1, 1, 0, 1])

# Number of classes
K = 2

nu = np.array([0, 1])

# Calling the function
result = ppi_distribution_label_shift_ci(Y, Yhat, Yhat_unlabeled, K, nu)
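
On the `nu` question, here is a minimal sketch of one reading, not confirmed in this thread: it assumes `nu` is a weight vector and that the interval targets the linear functional $\nu^\top f$, where $f$ is the class distribution. Under that assumption, a one-hot `nu = [0, 1]` selects the probability of class 1, which matches the "class 1 probability" wording in the print statement. The helper `linear_functional` and the numbers in `f` are hypothetical and are not part of ppi_py.

```python
# Assumption: nu weights the class probabilities, so the target is sum_k nu[k] * f[k].
def linear_functional(nu, f):
    """Return nu^T f for a weight vector nu and a class distribution f."""
    if len(nu) != len(f):
        raise ValueError("nu and f must have one entry per class")
    return sum(w * p for w, p in zip(nu, f))

# Hypothetical class distribution over K = 2 classes (illustrative numbers only).
f = [0.25, 0.75]

p_class_1 = linear_functional([0, 1], f)  # one-hot nu picks out P(Y = 1)
p_class_0 = linear_functional([1, 0], f)  # one-hot nu picks out P(Y = 0)
```
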

ValueError                                Traceback (most recent call last)
in <cell line: 19>()
     17
     18 # Calling the function
---> 19 result = ppi_distribution_label_shift_ci(Y, Yhat, Yhat_unlabeled, K, nu)
     20 print("Confidence Interval for class 1 probability:", result)

4 frames
/usr/local/lib/python3.10/dist-packages/ppi_py/ppi.py in ppi_distribution_label_shift_ci(Y, Yhat, Yhat_unlabeled, K, nu, alpha, delta, return_counts)
   1206     budget_split = 0.999999
   1207     epsilon1 = max(
-> 1208         [
   1209             linfty_binom(C.sum(axis=0)[k], K, budget_split * delta, Ahat[:, k])
   1210             for k in range(K)

/usr/local/lib/python3.10/dist-packages/ppi_py/ppi.py in (.0)
   1207     epsilon1 = max(
   1208         [
-> 1209             linfty_binom(C.sum(axis=0)[k], K, budget_split * delta, Ahat[:, k])
   1210             for k in range(K)
   1211         ]

/usr/local/lib/python3.10/dist-packages/ppi_py/utils/statistics_utils.py in linfty_binom(N, K, alpha, qhat)
    111     epsilon = 0
    112     for k in np.arange(K):
--> 113         bci = binomial_iid(N, alpha / K, qhat[k])
    114         epsilon = np.maximum(epsilon, np.abs(bci - qhat[k]).max())
    115     return epsilon

/usr/local/lib/python3.10/dist-packages/ppi_py/utils/statistics_utils.py in binomial_iid(N, alpha, muhat)
     99         return binom.cdf(N * muhat, N, mu) - (1 - alpha / 2)
    100
--> 101     u = brentq(invert_upper_tail, 0, 1)
    102     l = brentq(invert_lower_tail, 0, 1)
    103     return np.array([l, u])

/usr/local/lib/python3.10/dist-packages/scipy/optimize/_zeros_py.py in brentq(f, a, b, args, xtol, rtol, maxiter, full_output, disp)
    804         raise ValueError(f"rtol too small ({rtol:g} < {_rtol:g})")
    805     f = _wrap_nan_raise(f)
--> 806     r = _zeros._brentq(f, a, b, xtol, rtol, maxiter, args, full_output, disp)
    807     return results_c(full_output, r, "brentq")
    808

ValueError: f(a) and f(b) must have different signs
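
The message comes from `brentq`, which needs the function to take opposite signs at the two ends of the bracket, here 0 and 1. The sketch below shows one way that condition can fail; it is an illustration under stated assumptions, not the library's code. It assumes `invert_upper_tail` mirrors the lower-tail line visible in the traceback, with `alpha / 2` in place of `1 - alpha / 2`. The names `binom_cdf`, `upper_tail_gap` and `bracket_has_sign_change` are hypothetical, and the CDF is a pure-Python stand-in for `binom.cdf`. When the empirical proportion `muhat` is exactly 1, the CDF at `N * muhat = N` equals 1 for every `mu`, so the function is positive at both endpoints regardless of `N`.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); direct summation, fine for small n."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def upper_tail_gap(mu, n, muhat, alpha):
    # Hypothetical mirror of the lower-tail function shown in the traceback,
    # with alpha / 2 in place of 1 - alpha / 2.
    return binom_cdf(round(n * muhat), n, mu) - alpha / 2

def bracket_has_sign_change(n, muhat, alpha=0.05):
    """True if the gap function changes sign between mu = 0 and mu = 1."""
    f_a = upper_tail_gap(0.0, n, muhat, alpha)
    f_b = upper_tail_gap(1.0, n, muhat, alpha)
    return f_a * f_b < 0

interior = bracket_has_sign_change(n=10, muhat=0.5)            # proportion inside (0, 1)
boundary = bracket_has_sign_change(n=10, muhat=1.0)            # proportion exactly 1
boundary_large_n = bracket_has_sign_change(n=200, muhat=1.0)   # larger n, still exactly 1
```

If this reading is right, a root finder bracketed on [0, 1] would accept the `interior` case and reject both `boundary` cases, so a proportion of exactly 1 triggers the error however large the sample is.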

aangelopoulos commented 3 months ago

This is because $n$ is so small that it's causing a numerical exception in the solver. Can you try it with much larger $n$?

kitkhai commented 3 months ago

I tried a much larger n, but it still throws the same error:

import numpy as np
from ppi_py import ppi_distribution_label_shift_ci

# True labels
Y = np.array([1]*100000+[0]*100000)

# Predicted labels for labeled data
Yhat = np.array([1]*120000+[0]*80000)

# Predicted labels for unlabeled data
Yhat_unlabeled = np.array([1]*170000+[0]*30000)

# Number of classes
K = 2

nu = np.array([0, 1])

# Calling the function
result = ppi_distribution_label_shift_ci(Y, Yhat, Yhat_unlabeled, K, nu)
print("Confidence Interval for class 1 probability:", result)
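
One thing both datasets in this thread share is that every sample whose true label is 1 is also predicted as 1. The sketch below tabulates, for each true class, the fraction of samples assigned to each predicted class. The helper `prediction_rates_by_true_class` is hypothetical, and whether the library's `Ahat` is laid out this way is an assumption based only on the `Ahat[:, k]` and `C.sum(axis=0)[k]` expressions in the traceback. If it is, the rate of exactly 1.0 is what reaches `binomial_iid` as `muhat`, whatever the sample size.

```python
def prediction_rates_by_true_class(Y, Yhat, K):
    """Return rates[j][i] = fraction of true-class-j samples predicted as class i.

    Assumes every class in range(K) appears at least once in Y.
    """
    counts = [[0] * K for _ in range(K)]
    for y, yhat in zip(Y, Yhat):
        counts[y][yhat] += 1
    return [[c / sum(row) for c in row] for row in counts]

# Large dataset from the second attempt in this thread.
Y = [1] * 100000 + [0] * 100000
Yhat = [1] * 120000 + [0] * 80000
rates = prediction_rates_by_true_class(Y, Yhat, K=2)

# Five-sample dataset from the first attempt.
small_rates = prediction_rates_by_true_class([0, 1, 0, 1, 0], [0, 1, 1, 1, 0], K=2)
```

In both cases the row for true class 1 contains a rate of exactly 1.0. Dummy data in which each class is sometimes misclassified would avoid that boundary value and make a more informative test.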