Open justinkay opened 3 months ago
Looks right to me! Thanks Justin!! Please go ahead and open the PR and I'll merge 😁
Best, Anastasios Angelopoulos https://people.eecs.berkeley.edu/~angelopoulos/
On Wed, Jul 24, 2024 at 7:22 PM Justin Kay @.***> wrote:
Hi @aangelopoulos https://github.com/aangelopoulos -- have finally been playing around with PPI!
I found a potential bug here: https://github.com/aangelopoulos/ppi_py/blob/ac99d9a9504d0189370e96a4e068717b18eeee2c/ppi_py/ppi.py#L1762
sorted_highlow will return the distribution ordered by class frequency, which may not be the same order as the classes in the confusion matrix. So the following line: https://github.com/aangelopoulos/ppi_py/blob/ac99d9a9504d0189370e96a4e068717b18eeee2c/ppi_py/ppi.py#L1765
will break unless the classes happen to already be ordered by frequency. In the plankton example https://github.com/aangelopoulos/ppi_py/blob/ac99d9a9504d0189370e96a4e068717b18eeee2c/examples/plankton.ipynb this does not cause problems, since there are more data points with label 0 than 1. However if one were to flip this distribution, i.e.
Y = ~Y Yhat = ~Yhat Y_unlabeled = ~Y_unlabeled Yhat_unlabeled = ~Yhat_unlabeled
The PPI estimate breaks.
Removing sorted_highlow=True from this call seems to fix the issue.
Happy to submit a quick PR for this if my understanding is correct here.
— Reply to this email directly, view it on GitHub https://github.com/aangelopoulos/ppi_py/issues/14, or unsubscribe https://github.com/notifications/unsubscribe-auth/AGBYOUS7IM6XSVRPHL6DUNLZN7PGLAVCNFSM6AAAAABLM5VFPCVHI2DSMVQWIX3LMV43ASLTON2WKOZSGQZDQMJQGY2DEMQ . You are receiving this because you were mentioned.Message ID: @.***>
Hi @aangelopoulos -- have finally been playing around with PPI!
I found a potential bug here: https://github.com/aangelopoulos/ppi_py/blob/ac99d9a9504d0189370e96a4e068717b18eeee2c/ppi_py/ppi.py#L1762
sorted_highlow
will return the distribution ordered by class frequency, which may not be the same order as the classes in the confusion matrix. So the following line: https://github.com/aangelopoulos/ppi_py/blob/ac99d9a9504d0189370e96a4e068717b18eeee2c/ppi_py/ppi.py#L1765will break unless the classes happen to already be ordered by frequency. In the plankton example this does not cause problems, since there are more data points with label 0 than 1. However if one were to flip this distribution, i.e.
The PPI estimate breaks.
Removing
sorted_highlow=True
from this call seems to fix the issue.Happy to submit a quick PR for this if my understanding is correct here.