TRAIS-Lab / dattri

`dattri` is a PyTorch library for developing, benchmarking, and deploying efficient data attribution algorithms.
https://trais-lab.github.io/dattri/
MIT License
28 stars 8 forks

[dattri.benchmark] upload the ImageNet training script as well as a `dattri_retrain` command #51

Closed. TheaperDeng closed this 4 months ago

TheaperDeng commented 4 months ago

Description

1. Motivation and Context

This PR

  1. adds a new dataset, ImageNet, to the benchmark module
  2. adds a new entry point, `dattri_retrain`.

2. Summary of the change

Here are some usage examples of the new entry point `dattri_retrain`.

Train ImageNet with LDS

dattri_retrain --dataset imagenet
               --model resnet18  # imagenet + resnet18 setting
               --mode lds  # lds mode
               --save_path ./experiments  # model save path
               --data_path ./data  # data loading/download path
               --partition 0,3,3  # subsets [0, 3) out of a total of 3
               --device cuda  # use cuda
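Since `--partition` is meant for parallel retraining, the ImageNet LDS run above could in principle be split across several workers. Here is a hypothetical dry-run sketch (the subset counts are made up, and `echo` only prints the commands instead of launching them):

```shell
# Hypothetical: split 9 LDS subsets across 3 workers, each handling
# subsets [start, start+3) out of 9 total. `echo` makes this a dry run;
# drop it (and add the remaining flags) to actually launch retraining.
for start in 0 3 6; do
  echo dattri_retrain --dataset imagenet --model resnet18 --mode lds \
       --partition "$start,$((start + 3)),9" --device cuda
done
```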

Train MNIST-10 with LOO

dattri_retrain --dataset mnist
               --model lr  # mnist + logistic regression setting
               --mode loo  # loo mode
               --save_path ./experiments  # model save path
               --data_path ./data # data loading/download path
               --device cuda  # use cuda

Usage Guide

usage: dattri_retrain [-h] [--dataset {mnist,imagenet}] [--model {lr,resnet18}] [--mode {loo,lds}] [--save_path SAVE_PATH] [--data_path DATA_PATH] [--seed SEED]
                      [--partition PARTITION] [--device DEVICE] [--extra_param EXTRA_PARAM]

Retrain models on various datasets.

options:
  -h, --help            show this help message and exit
  --dataset {mnist,imagenet}
                        The dataset to use for retraining. It should be one of ['mnist', 'imagenet'].
  --model {lr,resnet18}
                        The model to use for retraining. It should be one of ['lr', 'resnet18'].
  --mode {loo,lds}      The retraining mode to use. It should be one of ['loo', 'lds'].
  --save_path SAVE_PATH
                        The path to save the retrained model.
  --data_path DATA_PATH
                        The path to the dataset.
  --seed SEED           The seed for retraining.
  --partition PARTITION
                        The partition for retraining, in the format [start, end, total]. This is used for parallel retraining. If the mode is 'lds', the partition should be
                        [`start_id`, `start_id+subset_num`, `total_num_subsets`]. If the mode is 'loo', the partition should be [`start_id`, `end_id`, None]; the third element is
                        not used. The `indices` will be generated as range(`start_id`, `end_id`). By default the script runs on all the data, i.e., 100 subsets for 'lds'
                        and all the samples for 'loo'.
  --device DEVICE       The device to train the model on.
  --extra_param EXTRA_PARAM
                        Extra parameters to be passed to the retrain function. Must be in key=value format.
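To make the `--partition` semantics above concrete, here is a minimal sketch of how the flag could be parsed; `parse_partition` and the default of 100 subsets are assumptions for illustration, not dattri's actual implementation:

```python
def parse_partition(partition, mode, default_total=100):
    """Hypothetical parser for the --partition flag (not dattri's actual code).

    'lds' -> (start_id, start_id + subset_num, total_num_subsets)
    'loo' -> (start_id, end_id, None); the third element is unused.
    """
    if partition is None:
        # Assumed default: run everything (e.g. 100 subsets for 'lds').
        return 0, default_total, (default_total if mode == "lds" else None)
    start, end, total = (field.strip() for field in partition.split(","))
    start, end = int(start), int(end)
    total = int(total) if mode == "lds" else None
    # Each worker then retrains the models indexed by range(start, end).
    return start, end, total

print(parse_partition("0,3,3", mode="lds"))   # -> (0, 3, 3)
print(parse_partition("5,10,0", mode="loo"))  # -> (5, 10, None)
```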

3. What tests have been added/updated for the change?

tingwl0122 commented 4 months ago

Hi @TheaperDeng, should all the benchmark experiments share the same script?

TheaperDeng commented 4 months ago

> Hi @TheaperDeng, should all the benchmark experiments share the same script?

I should think so. But it seems hard for some complicated experiment settings, e.g., nanoGPT and the music transformer. I guess it's OK to let those complicated ones have their own scripts.

For this PR, I will

  1. split the MNIST dataset change into a separate PR
  2. refactor both PRs according to https://github.com/TRAIS-Lab/dattri/pull/53#issuecomment-2092014785

TheaperDeng commented 4 months ago

@tingwl0122 please also have a look. I think we may make "maestro_musictransformer" another setting later.

tingwl0122 commented 4 months ago

> @tingwl0122 please also have a look, I think we may make "maestro_musictransformer" another setting later.

so directly pair up dataset and model?

TheaperDeng commented 4 months ago

> > @tingwl0122 please also have a look, I think we may make "maestro_musictransformer" another setting later.
>
> so directly pair up dataset and model?

I think we can split `--setting` into `--model` and `--dataset`, and then assert that the combination is in our supported scope?

Actually, `--model` and `--dataset` seem clearer?
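For illustration, the supported-scope assertion discussed here could look roughly like this; the `SUPPORTED_SETTINGS` table and the `check_setting` name are hypothetical, not part of dattri:

```python
# Hypothetical sketch of validating a --dataset/--model pair; the
# supported set below is made up for illustration.
SUPPORTED_SETTINGS = {
    ("mnist", "lr"),
    ("imagenet", "resnet18"),
    ("maestro", "musictransformer"),
}

def check_setting(dataset: str, model: str) -> None:
    """Raise if the (dataset, model) combination is not supported."""
    if (dataset, model) not in SUPPORTED_SETTINGS:
        supported = ", ".join(f"{d}_{m}" for d, m in sorted(SUPPORTED_SETTINGS))
        raise ValueError(
            f"Unsupported combination {dataset}_{model}; supported: {supported}"
        )

check_setting("mnist", "lr")  # OK, no exception raised
```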

tingwl0122 commented 4 months ago

> > > @tingwl0122 please also have a look, I think we may make "maestro_musictransformer" another setting later.
> >
> > so directly pair up dataset and model?
>
> I think we can split --setting to --model and --dataset? And then we can assert if the combination is in our supported scope?
>
> Actually --model and --dataset seems more clear?

I think so. So basically the file names should follow `dataset_model.py` under `dattri/benchmark/dataset/`, right?