aangelopoulos / ppi_py

A package for statistically rigorous scientific discovery using machine learning. Implements prediction-powered inference.
MIT License

Can we have an example on how to use PPBoot to calculate mean ci? #13

Closed kitkhai closed 3 months ago

kitkhai commented 3 months ago

Hi, I don't really understand how to use the PPBoot function to calculate a confidence interval for the mean.

import numpy as np
from ppi_py import ppboot  # assumes ppboot is exported at the package top level

Y = np.random.normal(0, 1, 100)
Yhat = Y + 2
Yhat_unlabeled = np.ones(10000) * 2

num_trials = 2
results = []
for j in range(num_trials):
    # Prediction-Powered Inference
    ppi_ci = ppboot(lambda y: y, Y, Yhat, Yhat_unlabeled)
    # ValueError: operands could not be broadcast together with shapes (10000,) (100,)

P.S. I used lambda y: y as the estimator because I would rather not rerun my machine learning/LLM model to get the labels again. Or must I pass the model so that it can do inference?

aangelopoulos commented 3 months ago

The estimator argument is the function that computes your estimate from a sample. If you're hoping to estimate the population mean, the standard estimator would be the sample mean.

So the estimator should be lambda y: y.mean(), which returns a single number rather than the whole array.
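A minimal NumPy-only sketch of why the estimator needs to return a scalar. It does not call ppboot. The helper names (identity, sample_mean) and the way the three estimates are combined are illustrative assumptions, not the library's internals; the combination is the usual prediction-powered form for a mean, i.e. the estimate on the unlabeled predictions plus a correction from the labeled pairs.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(0, 1, 100)             # gold-standard labels
Yhat = Y + 2                          # predictions on the labeled data
Yhat_unlabeled = np.full(10000, 2.0)  # predictions on the unlabeled data


def identity(y):
    # Like `lambda y: y`: returns the whole array, not an estimate.
    return y


def sample_mean(y):
    # Like `lambda y: y.mean()`: returns a single number.
    return y.mean()


# With the identity "estimator", arrays of length 10000 and 100 get combined,
# which NumPy cannot broadcast.
try:
    identity(Yhat_unlabeled) + (identity(Y) - identity(Yhat))
    broadcast_failed = False
except ValueError:
    broadcast_failed = True

# With the sample mean, every piece is a scalar, so the combination is fine.
pp_mean = sample_mean(Yhat_unlabeled) + (sample_mean(Y) - sample_mean(Yhat))
print(broadcast_failed, pp_mean)
```

Note that the estimator only ever sees arrays of labels or predictions, so the model itself does not need to be passed in or rerun.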

kitkhai commented 3 months ago

That works! Thanks!

However, looking at the output, there are some things I don't really understand... The confidence interval of the mean is (-0.0032521653800733795, 0.0038037445287558255) which does not include the mean of my Yhat_unlabeled which is 2.

  1. What's the intuition behind this? I always expect the mean of my sample to be within the confidence interval, so I thought something had gone wrong when I saw the output. I have a feeling it's related to the poor predictions on the labelled data (i.e., the difference between Y and Yhat)?
  2. As an extension, should I ever expect my confidence interval not to include the mean of the labelled data?

aangelopoulos commented 3 months ago

It should contain the true (population) mean of Y, which is 0 here, not the mean of Yhat. Yhat holds the synthetic (ML-generated) data, and it doesn't have the "right" mean. Y is the small gold-standard dataset. So:

  1. The interval won't always contain the mean of Yhat. It should contain the true mean of Y with high probability (1 - alpha).
  2. With probability alpha this will fail to happen, but alpha is usually set to be small (e.g. 0.1).
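To make the intuition concrete, here is a rough sketch of a percentile-bootstrap interval for a prediction-powered mean, written with NumPy only. It is not the ppboot implementation: the pp_mean helper, the fixed correction weight of 1, the small prediction noise, and the 1000 resamples are all assumptions made for illustration. The predictions are biased by about 2, as in the question, yet the interval lands near the true mean of Y (0) rather than near the mean of Yhat_unlabeled (about 2).

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, alpha = 100, 10_000, 0.1

# Gold-standard labels, plus predictions that are off by about 2.
# The extra noise term is an assumption so the interval is not degenerate.
Y = rng.normal(0, 1, n)
Yhat = Y + 2 + rng.normal(0, 0.1, n)
Yhat_unlabeled = rng.normal(0, 1, N) + 2 + rng.normal(0, 0.1, N)


def pp_mean(Y, Yhat, Yhat_unlabeled):
    # Mean of the predictions, corrected by the average error on labeled data.
    return Yhat_unlabeled.mean() + (Y - Yhat).mean()


n_resamples = 1000
boot = np.empty(n_resamples)
for b in range(n_resamples):
    i = rng.integers(0, n, n)  # resample labeled (Y, Yhat) pairs together
    j = rng.integers(0, N, N)  # resample unlabeled predictions
    boot[b] = pp_mean(Y[i], Yhat[i], Yhat_unlabeled[j])

# Percentile interval at level 1 - alpha.
lo, hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
print("interval:", (lo, hi))
print("mean of Yhat_unlabeled:", Yhat_unlabeled.mean())
```

The correction term (Y - Yhat).mean() is what removes the bias of the predictions, which is why the mean of Yhat_unlabeled can sit far outside the interval without anything being wrong.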