TRAIS-Lab / dattri

`dattri` is a PyTorch library for developing, benchmarking, and deploying efficient data attribution algorithms.
https://trais-lab.github.io/dattri/

[dattri.benchmark] Add nanoGPT and retrain function here #60

Closed SeanZh30 closed 3 months ago

SeanZh30 commented 3 months ago

Description

The main purpose of this PR is to provide the code related to nanoGPT.

1. Motivation and Context

Under the benchmark folder, a benchmark/models/nanoGPT directory was created. This PR mainly provides the code for retraining nanoGPT on the shakespeare_char dataset. For instructions on how to run it, please read benchmark/models/nanoGPT/readme.md. In addition, nanoGPT originally trained on randomly sampled batches; this PR modifies the original code, especially some functions in train.py (such as get_batch), so that it trains on all of the data, which should benefit the implementation of data attribution functions.
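To illustrate the idea behind the change, here is a minimal sketch of a non-overlapping batch loader. The function name and signature are hypothetical illustrations, not the PR's actual code; the original nanoGPT `get_batch` draws random start offsets (so samples can overlap), while this version tiles the data into disjoint blocks:

```python
import numpy as np
import torch

def get_batch_non_overlapping(data, batch_idx, block_size=256, batch_size=64):
    """Return a batch of contiguous, non-overlapping samples.

    Unlike the original nanoGPT ``get_batch``, which samples random start
    offsets (so training samples can overlap), this sketch tiles the dataset
    into disjoint blocks so each token belongs to exactly one sample --
    a cleaner setup for per-sample data attribution.
    """
    # start of this batch in the flat token array
    start = batch_idx * batch_size * block_size
    # disjoint start offsets, one per sample in the batch
    ix = start + np.arange(batch_size) * block_size
    x = torch.stack(
        [torch.from_numpy(data[i : i + block_size].astype(np.int64)) for i in ix]
    )
    # targets are the inputs shifted by one token
    y = torch.stack(
        [torch.from_numpy(data[i + 1 : i + 1 + block_size].astype(np.int64)) for i in ix]
    )
    return x, y
```

With this layout, batch index `k` covers tokens `[k * batch_size * block_size, (k + 1) * batch_size * block_size)`, so a sample index can be mapped deterministically back to a span of the original text.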

2. Summary of the change

  1. Add nanoGPT model-related code in dattri/benchmark/models/nanogpt/*
  2. Add retrain function based on shakespeare_char dataset in dattri/benchmark/shakespare.py

3. What tests have been added/updated for the change?

TheaperDeng commented 3 months ago

I will take a look first

jiaqima commented 3 months ago

@SeanZh30 probably better removing unnecessary files? e.g., the assets folder and the .ipynb files.

TheaperDeng commented 3 months ago

Now we support a new entry point

dattri_retrain_nanogpt --save_path ./experiment
                       --dataset 'shakespeare_char'/'tinystories'
                       --data_file TinyStoriesV2-GPT4-train.txt # optional, only valid for tinystories
                       --partition 0,5,5 # same as `dattri_retrain`
TheaperDeng commented 3 months ago

@jiaqima @xingjian-zhang Finally we made the nanoGPT retraining code ready.

Some bullet points:

  • The model architecture is not changed
  • Only LDS mode retraining is supported for nanoGPT
  • We changed the original dataloader so that samples do not overlap with each other (this makes more sense for data attribution)
  • Added a new entrypoint: [dattri.benchmark] Add nanoGPT and retrain function here #60 (comment)
  • Validated the generation quality
  • The indices and the model checkpoints are both saved; an index can be mapped back to the actual text sample

@SeanZh30 please also have a look. I made some changes to the API after our offline discussion.

SeanZh30 commented 3 months ago

> @jiaqima @xingjian-zhang Finally we made the nanoGPT retraining code ready.
>
> Some bullet points:
>
>   • The model architecture is not changed
>   • Only LDS mode retraining is supported for nanoGPT
>   • We changed the original dataloader so that samples do not overlap with each other (this makes more sense for data attribution)
>   • Added a new entrypoint: [dattri.benchmark] Add nanoGPT and retrain function here #60 (comment)
>   • Validated the generation quality
>   • The indices and the model checkpoints are both saved; an index can be mapped back to the actual text sample
>
> @SeanZh30 please also have a look. I made some changes to the API after our offline discussion.

I think everything looks fine but we may need to change the readme file in nanogpt to show the new entry.
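As context for readers, the LDS-mode retraining mentioned in the bullet points can be sketched roughly as follows: train several models on random subsets of the data and persist each subset's indices next to its checkpoint. The function name, arguments, and file layout below are illustrative assumptions, not the PR's actual API:

```python
import json
import random
from pathlib import Path

def lds_retrain(train_fn, num_samples, save_path,
                num_models=5, subset_ratio=0.5, seed=0):
    """Hypothetical sketch of LDS-mode retraining.

    Trains ``num_models`` models, each on a random subset of the training
    samples, and saves the subset indices alongside each checkpoint so that
    every index can later be mapped back to its original text sample.
    """
    rng = random.Random(seed)
    all_idx = list(range(num_samples))
    subset_size = int(subset_ratio * num_samples)
    for k in range(num_models):
        subset = sorted(rng.sample(all_idx, subset_size))
        model_dir = Path(save_path) / str(k)
        model_dir.mkdir(parents=True, exist_ok=True)
        # persist the sampled indices next to where the checkpoint will go
        (model_dir / "indices.json").write_text(json.dumps(subset))
        # user-supplied callable: trains nanoGPT on `subset` and saves its
        # checkpoint into `model_dir`
        train_fn(subset, model_dir)
```

The `--partition` flag of the entry point presumably controls which of these retraining runs a given invocation executes, allowing the runs to be spread across machines.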

TheaperDeng commented 3 months ago

Merging this PR to keep things rolling.