aangelopoulos / ppi_py

A package for statistically rigorous scientific discovery using machine learning. Implements prediction-powered inference.
MIT License

What is the motivation behind doing multiple trials with different size of the labelled data? #12

Closed kitkhai closed 3 months ago

kitkhai commented 3 months ago

Hi

With reference to your example notebooks (e.g. https://github.com/aangelopoulos/ppi_py/blob/main/examples/ballots.ipynb): why not just use the full labelled dataset directly? Instead, the notebook computes the PPI mean for various sizes of labelled data. Wouldn't the full labelled set be the most reliable, so that only the PPI mean on the full set needs to be computed?

Specifically, I'm referring to this part of the notebook:

# Run prediction-powered inference and classical inference for many values of n
results = []
for i in tqdm(range(ns.shape[0])):
    for j in range(num_trials):
        # Prediction-Powered Inference
        n = ns[i]
        rand_idx = np.random.permutation(n_max)
        _Yhat = Yhat[rand_idx[:n]]
        _Y = Y[rand_idx[:n]]
aangelopoulos commented 3 months ago

Ah, we’re doing that just to test what the effect of changing n is. Yes, the one with the largest n is the most reliable!! :)
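For readers skimming this thread, the estimator behind that sweep can be sketched in plain NumPy. The data below is synthetic and the variable names are illustrative, not the notebook's actual arrays; the key point is that the "rectifier" term is averaged over the n labelled points, so its noise shrinks as n grows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the ballots example: `truth` are true labels,
# `preds` are model predictions; the first n_max points are "labelled".
N_unlabeled, n_max = 10_000, 1_000
truth = rng.normal(0.5, 0.1, size=N_unlabeled + n_max)
preds = truth + rng.normal(0, 0.05, size=truth.size)
Y, Yhat = truth[:n_max], preds[:n_max]
Yhat_unlabeled = preds[n_max:]


def ppi_mean(Y, Yhat, Yhat_unlabeled):
    # PPI point estimate of the mean: the average prediction on the
    # unlabeled data, plus a rectifier that corrects the predictions'
    # bias, estimated from the labelled pairs (Y, Yhat).
    return Yhat_unlabeled.mean() + (Y - Yhat).mean()


# The rectifier's standard error shrinks like 1/sqrt(n), so larger n
# gives a more reliable estimate -- this is what the notebook's sweep
# over ns and num_trials visualizes.
for n in [50, 200, n_max]:
    idx = rng.permutation(n_max)[:n]
    print(n, ppi_mean(Y[idx], Yhat[idx], Yhat_unlabeled))
```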


kitkhai commented 3 months ago

Thinking about it, would PPI still work if I only have a small labelled dataset? Do you think bootstrapping the labelled data would give a better estimate, given that the labelled dataset is very small?

aangelopoulos commented 3 months ago

It still works, yeah! But if you want a bootstrap variant, we have one in the ppboot function :)

Check out Tijana's prediction-powered bootstrap paper.
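For intuition only, the bootstrap idea can be sketched with a generic percentile bootstrap of the PPI mean estimator. This is an illustration of the concept on synthetic data, not the signature or algorithm of the library's ppboot function:

```python
import numpy as np

rng = np.random.default_rng(1)


def ppi_mean(Y, Yhat, Yhat_unlabeled):
    # PPI point estimate of the mean (see above): unlabeled-prediction
    # average plus a rectifier estimated on the labelled pairs.
    return Yhat_unlabeled.mean() + (Y - Yhat).mean()


# Synthetic data: a small labelled set (n = 30) and many unlabeled
# predictions, mimicking the setting asked about in this thread.
truth = rng.normal(0.5, 0.1, size=5_030)
preds = truth + rng.normal(0, 0.05, size=truth.size)
Y, Yhat = truth[:30], preds[:30]
Yhat_unlabeled = preds[30:]

# Generic percentile bootstrap: resample the labelled pairs and the
# unlabeled predictions with replacement, recompute the estimate each
# time, and read off empirical quantiles as a confidence interval.
B = 2_000
n, N = Y.size, Yhat_unlabeled.size
boot = np.empty(B)
for b in range(B):
    i = rng.integers(0, n, size=n)
    j = rng.integers(0, N, size=N)
    boot[b] = ppi_mean(Y[i], Yhat[i], Yhat_unlabeled[j])

lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"95% bootstrap CI: [{lo:.3f}, {hi:.3f}]")
```

Because the labelled pairs are resampled jointly, the bootstrap preserves the correlation between Y and Yhat that the rectifier relies on; resampling them independently would break the estimator.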