TRAIS-Lab / dattri

`dattri` is a PyTorch library for developing, benchmarking, and deploying efficient data attribution algorithms.
https://trais-lab.github.io/dattri/
MIT License

API Design Proposal for AUC Function Implementation #10

Closed. sx-liu closed this issue 5 months ago.

sx-liu commented 5 months ago

Background

The benchmark needs a method to evaluate the effectiveness or precision of data attribution methods. One way to do so is to calculate the AUC (area under the curve). The AUC measures the probability that the score of a randomly selected mislabeled sample is greater than that of a randomly selected clean sample. To estimate the AUC, we can first manually introduce noise into our dataset by flipping labels.
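
To make this definition concrete, the estimate can be written as a pairwise comparison between mislabeled and clean samples. The helper below only illustrates this probabilistic definition; the name pairwise_auc_estimate is hypothetical and not part of the proposed API:

import numpy as np

def pairwise_auc_estimate(scores, is_noisy):
    # Illustration only: AUC as the fraction of (mislabeled, clean) pairs in
    # which the mislabeled sample receives the higher score; ties count as 0.5.
    scores = np.asarray(scores, dtype=float)
    is_noisy = np.asarray(is_noisy, dtype=bool)
    noisy, clean = scores[is_noisy], scores[~is_noisy]
    greater = (noisy[:, None] > clean[None, :]).sum()
    ties = (noisy[:, None] == clean[None, :]).sum()
    return (greater + 0.5 * ties) / (len(noisy) * len(clean))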

API Design

from typing import Optional, Tuple, Union

import numpy as np
import torch


def flip_label(label: Union[np.ndarray, torch.Tensor],
               label_space: Optional[Union[list, np.ndarray, torch.Tensor]] = None,
               p: float = 0.1) -> Tuple[Union[np.ndarray, torch.Tensor], list]:
    """
    Randomly flip the labels with the given probability.

    If the labels are binary, a selected label is simply flipped to its contrary.
    If the labels are multi-class, a random false label is picked to
    replace the original label. This is a way to manually introduce noise into
    the dataset.

    :param label: Labels of shape (N,) from the input data.
    :param label_space: A list of unique values indicating the valid values
           for labels; can be omitted.
    :param p: Noise ratio, i.e., the fraction of labels to flip.
    :return: A tuple of the noisy labels with certain entries flipped
        and a list of indices where the labels were flipped.
    """


def noise_detection_auc(scores: Union[list, np.ndarray],
                        noise_index: Union[list, np.ndarray]) -> Tuple[float, list]:
    """
    Given the list of attribution scores and the indices of noisy data,
    calculate the AUC using a sorting-based algorithm.

    :param scores: The self-attribution scores of shape (N,), generated by
        the method we want to evaluate.
    :param noise_index: A list of indices where the labels were manually
        flipped (noise added).
    :return: A tuple of a float value indicating the AUC, calculated as a
        probability, and a list of detection rates at different threshold
        values.
    """

Demonstration

With PyTorch datasets, one can randomly flip the labels like this:

import torchvision

transform = torchvision.transforms.ToTensor()  # example transform
mnist_dataset = torchvision.datasets.MNIST(root='./data', train=True, transform=transform, download=True)
labels_list = mnist_dataset.targets
flipped_labels, noise_index = flip_label(labels_list)
mnist_dataset.targets = flipped_labels

With the calculated attribution scores and the noise index, one can evaluate the AUC like this:

import matplotlib.pyplot as plt
import numpy as np

auc, detection_rates = noise_detection_auc(scores, noise_index)
n_train = len(detection_rates)
plt.plot(100 * np.arange(n_train) / n_train, detection_rates)

TODO

TheaperDeng commented 5 months ago

For flip_label:

  1. The function is overall OK. Please implement it under dattri/datasets/utils.py.
  2. Please make label's specification clearer (e.g., what the shape of the tensor should be).
  3. Please change label_range to label_space and state the format of this parameter more clearly.

TheaperDeng commented 5 months ago

For evaluate_auc:

  1. Call it noise_detection_auc.
  2. Change IF_scores to score; there are many data attribution methods other than IF.
  3. State that this score should be a self-attribution score.

TheaperDeng commented 5 months ago

Please revise the issue again and @ me for review before you implement anything.

sx-liu commented 5 months ago

@TheaperDeng Please review the updated issue. Thanks!

TheaperDeng commented 5 months ago

What exactly does noise_detection_auc return? Maybe it would be better to make it a tuple.

"""
Return (Tuple[float, Tuple[float, ...]]):
A tuple with 2 items. The first is the AUROC (or generally speaking, the AUC),
the second is a Tuple with `fpr, tpr, thresholds` just like
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html.
"""

Actually, when you implement this function, do refer to the source code of sklearn (we'd better not depend on sklearn). https://github.com/scikit-learn/scikit-learn/blob/f07e0138b/sklearn/metrics/_ranking.py#L1016-L1154
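
For illustration, a minimal sketch of a sorting-based computation without a sklearn dependency, following the return format suggested above (the name roc_auc_sketch is hypothetical, and tie handling is simplified compared with sklearn's roc_curve):

import numpy as np

def roc_auc_sketch(scores, noise_index):
    # Sketch only: compute (auc, (fpr, tpr, thresholds)) by sorting scores
    # in descending order and accumulating true/false positives.
    scores = np.asarray(scores, dtype=float)
    labels = np.zeros(len(scores), dtype=bool)
    labels[np.asarray(noise_index)] = True  # mislabeled samples are the positive class
    order = np.argsort(-scores, kind="stable")
    sorted_labels = labels[order]
    tps = np.cumsum(sorted_labels)
    fps = np.cumsum(~sorted_labels)
    tpr = tps / tps[-1]
    fpr = fps / fps[-1]
    thresholds = scores[order]
    auc = np.trapz(tpr, fpr)  # trapezoidal area under the ROC curve
    return auc, (fpr, tpr, thresholds)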

Others LGTM

jiaqima commented 5 months ago

Minor suggestion: noise_detection_auc -> mislabel_detection_auc?