aangelopoulos / ppi_py

A package for statistically rigorous scientific discovery using machine learning. Implements prediction-powered inference.
MIT License
205 stars 15 forks source link

sorted_highlow=True in ppi_distribution_label_shift_ci -> form_discrete_distribution #14

Open justinkay opened 3 months ago

justinkay commented 3 months ago

Hi @aangelopoulos -- have finally been playing around with PPI!

I found a potential bug here: https://github.com/aangelopoulos/ppi_py/blob/ac99d9a9504d0189370e96a4e068717b18eeee2c/ppi_py/ppi.py#L1762

sorted_highlow will return the distribution ordered by class frequency, which may not be the same order as the classes in the confusion matrix. So the following line: https://github.com/aangelopoulos/ppi_py/blob/ac99d9a9504d0189370e96a4e068717b18eeee2c/ppi_py/ppi.py#L1765

will break unless the classes happen to already be ordered by frequency. In the plankton example this does not cause problems, since there are more data points with label 0 than 1. However if one were to flip this distribution, i.e.

Y = ~Y
Yhat = ~Yhat
Y_unlabeled = ~Y_unlabeled
Yhat_unlabeled = ~Yhat_unlabeled

The PPI estimate breaks.

Removing sorted_highlow=True from this call seems to fix the issue.

Happy to submit a quick PR for this if my understanding is correct here.

aangelopoulos commented 3 months ago

Looks right to me! Thanks Justin!! Please go ahead and open the PR and I'll merge 😁

Best, Anastasios Angelopoulos https://people.eecs.berkeley.edu/~angelopoulos/

On Wed, Jul 24, 2024 at 7:22 PM Justin Kay @.***> wrote:

Hi @aangelopoulos https://github.com/aangelopoulos -- have finally been playing around with PPI!

I found a potential bug here: https://github.com/aangelopoulos/ppi_py/blob/ac99d9a9504d0189370e96a4e068717b18eeee2c/ppi_py/ppi.py#L1762

sorted_highlow will return the distribution ordered by class frequency, which may not be the same order as the classes in the confusion matrix. So the following line: https://github.com/aangelopoulos/ppi_py/blob/ac99d9a9504d0189370e96a4e068717b18eeee2c/ppi_py/ppi.py#L1765

will break unless the classes happen to already be ordered by frequency. In the plankton example https://github.com/aangelopoulos/ppi_py/blob/ac99d9a9504d0189370e96a4e068717b18eeee2c/examples/plankton.ipynb this does not cause problems, since there are more data points with label 0 than 1. However if one were to flip this distribution, i.e.

Y = ~Y Yhat = ~Yhat Y_unlabeled = ~Y_unlabeled Yhat_unlabeled = ~Yhat_unlabeled

The PPI estimate breaks.

Removing sorted_highlow=True from this call seems to fix the issue.

Happy to submit a quick PR for this if my understanding is correct here.

— Reply to this email directly, view it on GitHub https://github.com/aangelopoulos/ppi_py/issues/14, or unsubscribe https://github.com/notifications/unsubscribe-auth/AGBYOUS7IM6XSVRPHL6DUNLZN7PGLAVCNFSM6AAAAABLM5VFPCVHI2DSMVQWIX3LMV43ASLTON2WKOZSGQZDQMJQGY2DEMQ . You are receiving this because you were mentioned.Message ID: @.***>