TRAIS-Lab / dattri

`dattri` is a PyTorch library for developing, benchmarking, and deploying efficient data attribution algorithms.
https://trais-lab.github.io/dattri/

[dattri.benchmark] Add nanoGPT and retrain function here #60

Closed SeanZh30 closed 3 months ago

SeanZh30 commented 3 months ago

Description

The main purpose of this PR is to provide the code related to nanoGPT.

1. Motivation and Context

Under the benchmark folder, a benchmark/models/nanoGPT directory was created. This PR mainly provides the code for retraining nanoGPT on the shakespeare_char dataset. For instructions on how to run it, please read benchmark/models/nanoGPT/readme.md. In addition, nanoGPT originally trained on randomly sampled batches; this PR modifies the original code, especially some functions in train.py (such as get_batch), so that it trains on all of the data, which should benefit the implementation of data attribution functions.
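To illustrate the idea behind the change, here is a minimal sketch of a non-overlapping batch loader. The function name and signature are hypothetical illustrations, not the PR's actual code; the original nanoGPT `get_batch` draws random start offsets (so samples can overlap), while this version tiles the data into disjoint blocks:

```python
import numpy as np
import torch

def get_batch_non_overlapping(data, batch_idx, block_size=256, batch_size=64):
    """Return a batch of contiguous, non-overlapping samples.

    Unlike the original nanoGPT ``get_batch``, which samples random start
    offsets (so training samples can overlap), this sketch tiles the dataset
    into disjoint blocks so each token belongs to exactly one sample --
    a cleaner setup for per-sample data attribution.
    """
    # start of this batch in the flat token array
    start = batch_idx * batch_size * block_size
    # disjoint start offsets, one per sample in the batch
    ix = start + np.arange(batch_size) * block_size
    x = torch.stack(
        [torch.from_numpy(data[i : i + block_size].astype(np.int64)) for i in ix]
    )
    # targets are the inputs shifted by one token
    y = torch.stack(
        [torch.from_numpy(data[i + 1 : i + 1 + block_size].astype(np.int64)) for i in ix]
    )
    return x, y
```

With this layout, batch index `k` covers tokens `[k * batch_size * block_size, (k + 1) * batch_size * block_size)`, so a sample index can be mapped deterministically back to a span of the original text.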

2. Summary of the change

  1. Add nanoGPT model-related code in dattri/benchmark/models/nanogpt/*
  2. Add retrain function based on shakespeare_char dataset in dattri/benchmark/shakespare.py

3. What tests have been added/updated for the change?

TheaperDeng commented 3 months ago

I will take a look first

jiaqima commented 3 months ago

@SeanZh30 probably better removing unnecessary files? e.g., the assets folder and the .ipynb files.

TheaperDeng commented 3 months ago

Now we support a new entry point

dattri_retrain_nanogpt --save_path ./experiment
                       --dataset 'shakespeare_char'/'tinystories'
                       --data_file TinyStoriesV2-GPT4-train.txt # optional, only valid for tinystories
                       --partition 0,5,5 # same as `dattri_retrain`
TheaperDeng commented 3 months ago

@jiaqima @xingjian-zhang Finally we made the nanoGPT retraining code ready.

Some bullet points:

  • The model architecture is not changed
  • Only LDS mode retraining is supported for nanoGPT
  • We changed the original dataloader so that samples do not overlap with each other (this makes more sense for data attribution)
  • Added a new entrypoint: [dattri.benchmark] Add nanoGPT and retrain function here #60 (comment)
  • Validated the generation quality
  • The indices and the model checkpoints are both saved; an index can be mapped back to the actual text sample

@SeanZh30 please also have a look. I made some changes to the API after our offline discussion.

SeanZh30 commented 3 months ago

> @jiaqima @xingjian-zhang Finally we made the nanoGPT retraining code ready.
>
> Some bullet points:
>
>   • The model architecture is not changed
>   • Only LDS mode retraining is supported for nanoGPT
>   • We changed the original dataloader so that samples do not overlap with each other (this makes more sense for data attribution)
>   • Added a new entrypoint: [dattri.benchmark] Add nanoGPT and retrain function here #60 (comment)
>   • Validated the generation quality
>   • The indices and the model checkpoints are both saved; an index can be mapped back to the actual text sample
>
> @SeanZh30 please also have a look. I made some changes to the API after our offline discussion.

I think everything looks fine but we may need to change the readme file in nanogpt to show the new entry.
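As context for readers, the LDS-mode retraining mentioned in the bullet points can be sketched roughly as follows: train several models on random subsets of the data and persist each subset's indices next to its checkpoint. The function name, arguments, and file layout below are illustrative assumptions, not the PR's actual API:

```python
import json
import random
from pathlib import Path

def lds_retrain(train_fn, num_samples, save_path,
                num_models=5, subset_ratio=0.5, seed=0):
    """Hypothetical sketch of LDS-mode retraining.

    Trains ``num_models`` models, each on a random subset of the training
    samples, and saves the subset indices alongside each checkpoint so that
    every index can later be mapped back to its original text sample.
    """
    rng = random.Random(seed)
    all_idx = list(range(num_samples))
    subset_size = int(subset_ratio * num_samples)
    for k in range(num_models):
        subset = sorted(rng.sample(all_idx, subset_size))
        model_dir = Path(save_path) / str(k)
        model_dir.mkdir(parents=True, exist_ok=True)
        # persist the sampled indices next to where the checkpoint will go
        (model_dir / "indices.json").write_text(json.dumps(subset))
        # user-supplied callable: trains nanoGPT on `subset` and saves its
        # checkpoint into `model_dir`
        train_fn(subset, model_dir)
```

The `--partition` flag of the entry point presumably controls which of these retraining runs a given invocation executes, allowing the runs to be spread across machines.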

TheaperDeng commented 3 months ago

Merging this PR to keep things rolling.