aangelopoulos / ppi_py

A package for statistically rigorous scientific discovery using machine learning. Implements prediction-powered inference.
MIT License

possible issue with data splitting in all notebook examples #18

Closed. listar2000 closed this issue 1 month ago.

listar2000 commented 1 month ago

Hi, I have a question about how data splitting is done in the example notebooks. The issue is pretty visible, so I don't know whether it is a bug or intentional. Take galaxies.ipynb as an example (I believe this snippet is reused everywhere):

# Run prediction-powered inference and classical inference for many values of n
results = []
for i in tqdm(range(ns.shape[0])):
    for j in range(num_trials):
        # Prediction-Powered Inference
        n = ns[i]
        rand_idx = np.random.permutation(n_total)
        _Yhat = Yhat_total[rand_idx[:n]]
        _Y = Y_total[rand_idx[:n]]
        #  _Yhat_unlabeled = Yhat_total[rand_idx[n:]] --> possible correction
        _Yhat_unlabeled = Yhat_total[n:] # --> this is not using the permuted indices?

        ppi_ci = ppi_mean_ci(_Y, _Yhat, _Yhat_unlabeled, alpha=alpha)

It seems that the _Yhat_unlabeled vector is not constructed from the randomly permuted indices in each trial, while the gold-standard _Y and _Yhat vectors are. As a result, no matter how many repetitions are run, the unlabeled vector stays exactly the same.
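For concreteness, here is a minimal self-contained sketch of the proposed fix, using small synthetic stand-ins for the notebook's arrays (the array names mirror the snippet above, but the data and sizes are made up for illustration). The key point is that both splits are drawn from the same permutation, so they are disjoint and re-randomized on every trial:

```python
import numpy as np

# Synthetic stand-ins for the notebook's arrays (illustrative only)
rng = np.random.default_rng(0)
n_total = 1000
Y_total = rng.normal(size=n_total)
Yhat_total = Y_total + rng.normal(scale=0.1, size=n_total)

n = 100
rand_idx = rng.permutation(n_total)

# Labeled split: the first n permuted indices
_Yhat = Yhat_total[rand_idx[:n]]
_Y = Y_total[rand_idx[:n]]
# Unlabeled split: the REMAINING permuted indices (the proposed fix),
# rather than the fixed tail Yhat_total[n:]
_Yhat_unlabeled = Yhat_total[rand_idx[n:]]

# The two splits are disjoint and together cover every index exactly once
assert len(np.intersect1d(rand_idx[:n], rand_idx[n:])) == 0
assert len(_Yhat) + len(_Yhat_unlabeled) == n_total
```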

aangelopoulos commented 1 month ago

You're totally right. My guess as to why this didn't cause an error is that $N$ is very large, so

`Yhat_total[n:].mean()` $\approx$ `Yhat_total[rand_idx[n:]].mean()`
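A quick numpy sanity check (on synthetic data, not the notebook's) illustrates why the bug went unnoticed: when $N - n$ is large, the mean over the fixed tail and the mean over a random subset of the same size are nearly identical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_total, n = 100_000, 100
Yhat_total = rng.normal(size=n_total)
rand_idx = rng.permutation(n_total)

fixed_tail = Yhat_total[n:].mean()               # what the notebooks used
permuted_tail = Yhat_total[rand_idx[n:]].mean()  # the corrected version

# With N - n = 99,900 points, both means sit within a few thousandths
# of the population mean, so the downstream intervals barely move.
print(abs(fixed_tail - permuted_tail))
```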

This is a great find. Would you be willing to open a PR that

  1. Changes this in all notebooks
  2. Re-runs all notebooks, from top to bottom, with the new code

Thank you so much for your help, @listar2000!

listar2000 commented 1 month ago

Thanks for the quick response @aangelopoulos, and I'm happy to do that. Will send a PR soon.

aangelopoulos commented 1 month ago

Addressed by #19. Thank you @listar2000!