OPTML-Group / Unlearn-Sparse

[NeurIPS23 (Spotlight)] "Model Sparsity Can Simplify Machine Unlearning" by Jinghan Jia*, Jiancheng Liu*, Parikshit Ram, Yuguang Yao, Gaowen Liu, Yang Liu, Pranay Sharma, Sijia Liu
MIT License

Challenges in Replicating Model Pruning and Unlearning Results #5

Closed Davidjinzb closed 4 months ago

Davidjinzb commented 9 months ago

Hello,

I've been working on replicating some results from your paper using the provided commands and code modifications in the README. However, I am encountering some discrepancies in the results, particularly with the MIA-Efficacy values. Below, I detail the steps taken and the issues encountered.

Steps and Code Used:

  1. Initial pruning of the model was done using the command:

    python -u main_imp.py --data ./data --dataset cifar10 --arch resnet18 --prune_type rewind_lt --rewind_epoch 8 --save_dir omp --rate 0.95 --pruning_times 2 --num_workers 8
  2. I modified arg_parser.py with the following additions:

    parser.add_argument(
        "--num_indexes_to_replace",
        type=int,
        default=None,
        help="Number of data to forget",
    )
    parser.add_argument(
        "--class_to_replace", type=int, default=None, help="Specific class to forget"
    )
    parser.add_argument(
        "--indexes_to_replace",
        # note: argparse's type=list splits a CLI string into characters;
        # this flag is only ever set programmatically (see below), so it works here
        type=list,
        default=None,
        help="Specific index data to forget",
    )

When class_to_replace is set to None, a random selection of indexes equal to the number specified by num_indexes_to_replace will be chosen for the unlearning process.

    import numpy as np

    args = arg_parser.parse_args()
    if args.seed:
        utils.setup_seed(args.seed)
    if args.class_to_replace is None:
        if args.dataset == "cifar10":
            num_indexes_to_replace = args.num_indexes_to_replace
            if args.indexes_to_replace is None:
                # sample forget-set indexes from the 45,000-image training split
                args.indexes_to_replace = np.random.choice(
                    45000, num_indexes_to_replace, replace=False
                )
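For reference, a minimal self-contained sketch of the random forget-set selection above (I use `np.random.seed` as a stand-in for the repo's `utils.setup_seed`, which also seeds torch/cuda; that substitution is an assumption):

```python
import numpy as np

def select_forget_indexes(num_indexes_to_replace, train_size=45000, seed=2):
    # Stand-in for utils.setup_seed(args.seed); the repo helper also seeds
    # torch and cuda, which is omitted in this sketch.
    np.random.seed(seed)
    # Sample unique training indexes to forget, mirroring the snippet above.
    return np.random.choice(train_size, num_indexes_to_replace, replace=False)

idx = select_forget_indexes(4500, seed=2)
print(len(idx), int(idx.min()), int(idx.max()))
```

With a fixed seed the same forget set is drawn on every run, which matters when comparing unlearning methods against the same retrained baseline.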
  3. I ran unlearning with the following command:
    python -u main_forget.py --save_dir omp --mask omp/1model_SA_best.pth.tar --unlearn retrain --num_indexes_to_replace 4500 --unlearn_epochs 160 --unlearn_lr 0.1

As shown in Figure 1, the results obtained under the 95%-sparse model are: UA=6.78, RA=99.99, TA=92.77.

[Figure 1]

I would like to inquire which value in SVC_MIA_forget_efficacy represents MIA-Efficacy. Is it the confidence value that closely matches the one mentioned in the paper, or is it the average of these values?
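To make the question concrete, here is an illustrative sketch of a confidence-based MIA-Efficacy computation on synthetic confidence scores. It is not the repo's `SVC_MIA` implementation: a midpoint threshold stands in for the SVC classifier, and all numbers are synthetic.

```python
import numpy as np

def confidence_mia_efficacy(conf_retain, conf_test, conf_forget):
    # Toy stand-in for an SVC trained on member (retain) vs. non-member
    # (test) confidences: use the midpoint of the two class means as the
    # decision threshold.
    thr = 0.5 * (conf_retain.mean() + conf_test.mean())
    # MIA-Efficacy: fraction of forget samples predicted as non-members.
    return float((conf_forget < thr).mean())

rng = np.random.default_rng(0)
conf_retain = rng.uniform(0.9, 1.0, 1000)  # members: high confidence
conf_test = rng.uniform(0.3, 0.8, 1000)    # non-members: lower confidence
conf_forget = rng.uniform(0.3, 0.8, 1000)  # a well-unlearned forget set
print(confidence_mia_efficacy(conf_retain, conf_test, conf_forget))
```

The closer the forget set's confidence distribution is to the test set's, the higher the efficacy, since more forget samples are classified as non-members.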

  4. I used the following command to get the unlearning result of the Dense model.

    python -u main_forget.py --save_dir omp_dense --mask omp_dense/0model_SA_best.pth.tar --unlearn retrain --num_indexes_to_replace 4500 --unlearn_epochs 160 --unlearn_lr 0.1

    As shown in Figure 2, the results under the Dense model are: UA=4.9, RA=99.52, TA=94.62.

    [Figure 2]

  5. The above results are reasonably close to those reported in the paper. However, I also ran separate GA tests on both the 95%-sparse model and the Dense model with the following commands. Sparse model command:

    python -u main_forget.py --save_dir omp --mask omp/1model_SA_best.pth.tar --unlearn GA --num_indexes_to_replace 4500 --unlearn_lr 0.0001 --unlearn_epochs 5

    Dense Model command:

    python -u main_forget.py --save_dir omp_dense --mask omp_dense/0model_SA_best.pth.tar --unlearn GA --num_indexes_to_replace 4500 --unlearn_lr 0.0001 --unlearn_epochs 5

As shown in Figure 3, the results for the 95%-sparse model are: UA=0.62, RA=99.39, TA=94.23. The UA value differs significantly from the 5.62±0.46 reported in the paper. Additionally, the MIA-Efficacy, whether taken as the average or a specific value, shows a considerable discrepancy from the reported 11.76±0.52.

[Figure 3: 95%-sparse model result]

As illustrated in Figure 4, the results for the Dense model are: UA=0.78, RA=99.52, TA=94.52. The UA value shows a significant difference from the 7.54±0.29 mentioned in the paper. Moreover, the average MIA-Efficacy is 8.5, which slightly deviates from the 10.04±0.31 reported in the paper.

[Figure 4: Dense model result]

  6. While running FF under the 95%-sparse model, since the specific value of alpha was not known, we set it to 10^(-8) based on the description in the 'Additional training details of MU' section of the paper. The result is shown in Figure 5.
    python -u main_forget.py --save_dir omp --mask omp/1model_SA_best.pth.tar --unlearn fisher_new --num_indexes_to_replace 4500 --alpha 0.00000001

    [Figure 5]

    As shown in Figure 5, the results exhibit some discrepancies compared to Figure 6 from the paper.

[Figure 6]

Questions:

  1. What are the specific parameter settings for each unlearning method?
  2. How is MIA-Efficacy calculated in SVC_MIA_forget_efficacy?
  3. What could be the reasons for the discrepancies in replicating the results?

Thank you for your time and help.

Best regards, David

jinghanjia commented 9 months ago

Thank you for your queries. Here are my responses to each of your points:

  1. Regarding the parameter settings for GA, I acknowledge there seems to be a typo in the paper. The learning rate of 1e-4 is only the default for class-wise unlearning with CIFAR-10 and ResNet-18; we plan to correct and clarify this in the appendix. Note that GA is quite sensitive to the learning rate, particularly when unlearning random data. We employed a grid search over the range [1e-5, 0.1] to identify the learning rate that minimizes the gap between retraining and GA. In light of your results in Figures 3 and 4, where the Unlearning Accuracy (UA) is relatively low, the learning rate was likely too small for effective unlearning; I recommend increasing it. Additionally, for Fisher unlearning, you might consider lowering alpha, possibly to around 1e-9: the unlearned model's TA is only 5.03%, which means alpha is too large.

  2. For MIA efficacy, we opted to use the confidence value as the metric.

  3. The primary factor contributing to discrepancies in replicating results seems to be the hyperparameters. The unlearning methods, including GA and Fisher, are highly sensitive to these parameters, and this sensitivity can vary across different random seeds and initial models. This hyperparameter sensitivity is precisely why we utilized a grid search strategy to identify a reasonably unlearned model.
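The grid search described above can be sketched as a small driver that enumerates candidate GA learning rates and the corresponding commands. The 5-point log-spaced grid is my assumption, not the authors' exact schedule; the flags mirror the commands earlier in this thread:

```python
import numpy as np

# Log-spaced learning-rate grid spanning the suggested [1e-5, 0.1] range.
lrs = np.logspace(-5, -1, num=5)  # 1e-5, 1e-4, 1e-3, 1e-2, 1e-1

commands = [
    "python -u main_forget.py --save_dir omp "
    "--mask omp/1model_SA_best.pth.tar --unlearn GA "
    f"--num_indexes_to_replace 4500 --unlearn_lr {lr:g} --unlearn_epochs 5"
    for lr in lrs
]
for cmd in commands:
    print(cmd)
# Pick the lr whose UA/RA/TA gap to the retrained model is smallest.
```

Each printed line can be run directly; the selection criterion (smallest gap to retraining) is the one described in the reply above.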

I hope these responses clarify your concerns and aid in your further experiments.

Davidjinzb commented 9 months ago

Thank you very much for your prompt response, which has been a great help to us. I am also looking forward to the updated appendix. Could you give us an idea of when it might be ready? We are preparing a review paper and, for its rigor, we wish to replicate some of the studies. If possible, could you share your hyperparameter settings with us in advance?

Thank you in advance for your consideration.

jinghanjia commented 9 months ago

Thank you for your attention to the details of our paper. We will update it ASAP.

Due to the high sensitivity of GA, FF, and IU methods to hyperparameters, which varies with different random seeds and initial models, we tailor the hyperparameters accordingly for each scenario:

FT: A learning rate of around 0.04 is used for random forgetting.

GA: Conducted grid searches over [1e-3, 0.01] for various seeds and initial models, a range determined to be effective for GA.

FF: Performed grid searches between [1e-9, 1e-8] for random data forgetting, based on our findings.

IU: Utilized a grid search range of [0.1, 5], as indicated by our experimental results.
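Putting those ranges together, one way to enumerate the grids is sketched below. The grid densities and the choice of log vs. linear spacing are my guesses, not the authors' exact values:

```python
import numpy as np

# Search grids built from the ranges listed above.
grids = {
    "FT": np.array([0.04]),              # lr around 0.04 for random forgetting
    "GA": np.logspace(-3, -2, num=4),    # lr in [1e-3, 0.01]
    "FF": np.logspace(-9, -8, num=4),    # alpha in [1e-9, 1e-8]
    "IU": np.linspace(0.1, 5.0, num=5),  # alpha in [0.1, 5]
}
for method, values in grids.items():
    print(method, [f"{v:g}" for v in values])
```

Log spacing suits GA and FF because their useful ranges span an order of magnitude; IU's [0.1, 5] range is narrow enough that linear spacing seems reasonable.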

I hope these responses clarify your concerns and aid in your further experiments.

Thank you~

Davidjinzb commented 9 months ago

Hi Jinghan,

Thank you for providing the range of parameters; it's been immensely helpful for our work. However, we've tested all the results for FF from 1e-9 to 1e-8, and they still show a deviation from the results in your paper. Below are the screenshots of our replication attempts. The numbers following "omp" in these figures represent the learning rates.


[Figure 1: Results of 1e-9, 2e-9, 3e-9]


[Figure 2: Results of 4e-9, 5e-9, 6e-9]


[Figure 3: Results of 7e-9, 8e-9, 9e-9]

We would greatly appreciate any additional insights or suggestions you might have to explain these discrepancies.

Thank you!

jinghanjia commented 9 months ago

Thank you for sharing your results. Apologies for any confusion regarding the hyperparameters. The range [1e-9, 1e-8] should actually be applied to alpha, not the learning rate. In the Fisher method, we don't use the learning rate; alpha is the sole hyperparameter that requires tuning. For example, you can adjust your Fisher scripts by appending '--alpha 1e-9' to them.
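For instance, an alpha sweep over that range could be scripted as follows. This is a sketch; the save_dir/mask paths mirror the sparse-model commands earlier in the thread:

```python
import numpy as np

# Sweep alpha over [1e-9, 1e-8] in steps of 1e-9; Fisher ('fisher_new')
# uses no learning rate, so alpha is the only flag to vary.
for alpha in np.linspace(1e-9, 1e-8, num=10):
    cmd = (
        "python -u main_forget.py --save_dir omp "
        "--mask omp/1model_SA_best.pth.tar --unlearn fisher_new "
        f"--num_indexes_to_replace 4500 --alpha {alpha:g}"
    )
    print(cmd)
```

This produces the same ten alpha values (1e-9 through 1e-8) that a step-of-1e-9 sweep covers.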

If you encounter any further issues, please don't hesitate to reach out.

Davidjinzb commented 8 months ago

Hey Jinghan

Thanks for your reply, and I'm sorry for the misunderstanding; there was a miscommunication in my last message. We did in fact adjust the alpha parameter as per your instructions, using the following command:

python -u main_forget.py --save_dir ${save_dir} --mask ${mask_path} --unlearn fisher_new --num_indexes_to_replace 4500 --alpha ${alpha}

jinghanjia commented 8 months ago

Thank you for the clarification. Could you provide the Fisher results for the dense model as well? By the way, for random forgetting, we use 'class_to_replace = -1'. Can you replicate the results using the following script from the original codebase:

    python -u main_forget.py --save_dir test --mask $mask_path --unlearn fisher_new --num_indexes_to_replace 4500 --class_to_replace -1 --seed 1 --alpha 1e-9

ljcc0930 commented 4 months ago

Issue closed. Please don't hesitate to raise another issue if you have any more questions, thanks.