Closed kitkhai closed 3 months ago
Ah, we’re doing that just to test what the effect of changing n is. Yes, the one with the largest n is the most reliable!! :)
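To make the effect of n concrete, here is a minimal sketch on synthetic data, not the notebook's ballots data. It assumes the basic (untuned) PPI mean interval, i.e. the mean of the unlabeled predictions plus the mean labeled residual, with a normal-approximation interval. `basic_ppi_mean_ci` is a hypothetical helper written for this illustration, not the ppi_py implementation, and the data-generating numbers are made up.

```python
import numpy as np
from statistics import NormalDist


def basic_ppi_mean_ci(Y, Yhat, Yhat_unlabeled, alpha=0.1):
    """Hypothetical helper: untuned PPI confidence interval for a mean."""
    n, N = len(Y), len(Yhat_unlabeled)
    rectifier = Y - Yhat  # prediction error measured on the labeled data
    point = Yhat_unlabeled.mean() + rectifier.mean()
    # Variance has one term from the unlabeled predictions, one from the rectifier
    se = np.sqrt(Yhat_unlabeled.var(ddof=1) / N + rectifier.var(ddof=1) / n)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return point - z * se, point + z * se


rng = np.random.default_rng(0)
n_max, N = 2000, 20000

# Synthetic binary labels with noisy predictions (illustrative assumption)
Y_all = rng.binomial(1, 0.6, size=n_max + N).astype(float)
Yhat_all = np.clip(0.8 * Y_all + 0.1 + rng.normal(0, 0.1, size=n_max + N), 0, 1)
Y, Yhat = Y_all[:n_max], Yhat_all[:n_max]
Yhat_unlabeled = Yhat_all[n_max:]

ns = np.array([100, 500, 2000])
num_trials = 50
avg_widths = []
for n in ns:
    widths = []
    for _ in range(num_trials):
        # Same subsampling pattern as the notebook excerpt: random n of n_max labels
        rand_idx = rng.permutation(n_max)
        lo, hi = basic_ppi_mean_ci(Y[rand_idx[:n]], Yhat[rand_idx[:n]], Yhat_unlabeled)
        widths.append(hi - lo)
    avg_widths.append(float(np.mean(widths)))

for n, w in zip(ns, avg_widths):
    print(f"n={n}: average interval width {w:.4f}")
```

The average width shrinks as n grows, so the sweep over n only shows how the interval behaves with fewer labels. In a real analysis you would use all the labeled data you have.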
On Tue, Jul 23, 2024 at 3:41 AM kitkhai @.***> wrote:
Hi
With reference to your example notebooks (e.g. https://github.com/aangelopoulos/ppi_py/blob/main/examples/ballots.ipynb), why do you not just use the full labelled data directly? Instead, you calculate the PPI mean for various sizes of labelled data. Wouldn't the full set of labelled data be the most reliable, so that only the PPI mean with the full set needs to be calculated?
Specifically, I'm referring to this part of the notebook:
    # Run prediction-powered inference and classical inference for many values of n
    results = []
    for i in tqdm(range(ns.shape[0])):
        for j in range(num_trials):
            # Prediction-Powered Inference
            n = ns[i]
            rand_idx = np.random.permutation(n_max)
            _Yhat = Yhat[rand_idx[:n]]
            _Y = Y[rand_idx[:n]]
Thinking about it, would the calculation of PPI work if I only have a small labelled dataset? Do you think bootstrapping the labelled data would help provide a better estimate, since the size of the labelled data is very small?
It still works, yeah! But if you want a bootstrap variant, we have one in the ppboot function :)
Check out Tijana's prediction powered bootstrap paper.
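For readers who want a feel for the bootstrap idea with a small labelled set, here is a rough, hand-rolled percentile-bootstrap sketch. It is not the `ppboot` implementation, whose signature and options are not shown in this thread. It assumes the simplest untuned form of the estimator: the statistic on the unlabeled predictions, plus the statistic on the labeled Y, minus the statistic on the labeled predictions. `pp_bootstrap_ci` and the synthetic data are hypothetical and for illustration only.

```python
import numpy as np


def pp_bootstrap_ci(estimator, Y, Yhat, Yhat_unlabeled,
                    n_resamples=1000, alpha=0.1, seed=0):
    """Hypothetical helper: percentile bootstrap for a prediction-powered estimate."""
    rng = np.random.default_rng(seed)
    n, N = len(Y), len(Yhat_unlabeled)
    stats = np.empty(n_resamples)
    for b in range(n_resamples):
        # Resample labeled (Y, Yhat) pairs together, and unlabeled predictions separately
        idx = rng.integers(0, n, size=n)
        idx_u = rng.integers(0, N, size=N)
        stats[b] = (estimator(Yhat_unlabeled[idx_u])
                    + estimator(Y[idx]) - estimator(Yhat[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


# Synthetic example with a very small labeled set (illustrative assumption)
rng = np.random.default_rng(1)
n, N = 30, 5000
Y_all = rng.normal(1.0, 1.0, size=n + N)
Yhat_all = Y_all + rng.normal(0.0, 0.3, size=n + N)
Y, Yhat, Yhat_unlabeled = Y_all[:n], Yhat_all[:n], Yhat_all[n:]

lo, hi = pp_bootstrap_ci(np.median, Y, Yhat, Yhat_unlabeled)
print(f"90% bootstrap interval for the median: ({lo:.3f}, {hi:.3f})")
```

Resampling the labeled pairs together keeps each Y matched with its own prediction, which is what lets the correction term cancel most of the prediction error. For actual use, prefer the library's `ppboot` and check its documentation for the supported arguments.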