da03 / Internalize_CoT_Step_by_Step

https://huggingface.co/spaces/yuntian-deng/implicit-cot-math
https://huggingface.co/spaces/yuntian-deng/gpt2-multiplication
MIT License

Could you provide the command to reproduce the results on GSM8k? #2

Open Ber666 opened 3 weeks ago

Ber666 commented 3 weeks ago

Thanks for this amazing work! Could you provide the command to reproduce the results on GSM8k?

Ber666 commented 3 weeks ago

I'm running the following script with the arguments specified in the paper (or the default values for arguments not mentioned in the paper):

export FOLDER=data/gsm8k/
export MODEL=mistralai/Mistral-7B-v0.1
export EPOCHS=200
export LR=5e-5
export BSZ=16
export ACCUMULATE=2
export REMOVE_PER_EPOCH=8
export REMOVE_ALL_WHEN_REMOVE_BEYOND=39
export REMOVAL_SMOOTHING_LAMBDA=4
export REMOVAL_SIDE=left
export PRETRAIN_EPOCHS=0
export SEED=3456
export SAVE=train_models/gsm8k/mistral
mkdir -p $SAVE
TOKENIZERS_PARALLELISM=false CUDA_VISIBLE_DEVICES=0 stdbuf -oL -eL python src/train.py \
    --model ${MODEL} \
    --train_path ${FOLDER}/train.txt \
    --val_path ${FOLDER}/valid.txt \
    --epochs ${EPOCHS} \
    --lr ${LR} \
    --batch_size ${BSZ} \
    --accumulate ${ACCUMULATE} \
    --remove_per_epoch ${REMOVE_PER_EPOCH} \
    --remove_all_when_remove_beyond ${REMOVE_ALL_WHEN_REMOVE_BEYOND} \
    --removal_smoothing_lambda ${REMOVAL_SMOOTHING_LAMBDA} \
    --removal_side ${REMOVAL_SIDE} \
    --pretrain_epochs ${PRETRAIN_EPOCHS} \
    --seed ${SEED} \
    --reset_optimizer \
    --save_model ${SAVE} \
    > ${SAVE}/log.train 2>&1
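
For reference, here is a minimal sketch of how the removal flags above interact, assuming the stepwise schedule described in the paper (this is not code from the repo): roughly REMOVE_PER_EPOCH chain-of-thought tokens are removed each epoch, from the side given by REMOVAL_SIDE, plus a small random offset governed by REMOVAL_SMOOTHING_LAMBDA; once the scheduled count exceeds REMOVE_ALL_WHEN_REMOVE_BEYOND, the entire chain of thought is removed.

import math
import random

def scheduled_cot_removal(epoch, remove_per_epoch=8,
                          removal_smoothing_lambda=4.0,
                          remove_all_when_remove_beyond=39):
    """Assumed schedule: at epoch e, remove about e * remove_per_epoch CoT
    tokens, plus a per-example offset o with P(o) ~ exp(-lambda * o) that
    smooths the jump between epochs."""
    offset = 0
    while random.random() < math.exp(-removal_smoothing_lambda):
        offset += 1
    scheduled = epoch * remove_per_epoch + offset
    if scheduled > remove_all_when_remove_beyond:
        return float("inf")  # past the threshold: drop the entire CoT
    return scheduled

# With these defaults, epoch 5 already schedules ~40 tokens (> 39), so from
# roughly epoch 5 onward the model is trained with no chain of thought at all.
for e in range(7):
    print(e, scheduled_cot_removal(e))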

So far it has run for 25 epochs, but the best accuracy on the validation set is only around 0.25 (at epoch 12):

Disable Offset Val. PPL: 1.216820620826269; Accuracy: 0.252; Token Accuracy: 0.9625558257102966.
***best so far or removed more CoT tokens***
Saving to train_models/gsm8k/mistral/checkpoint_12

Could you take a look to see if anything is wrong with my setting? Thank you!

Ber666 commented 3 weeks ago

https://github.com/da03/Internalize_CoT_Step_by_Step/blob/a7a2d677bf9268b79f48f77953416b7a6d8bff99/src/train.py#L181

Should this line be config = ImplicitModelConfig(base_model=args.model) instead?

da03 commented 2 weeks ago

https://github.com/da03/Internalize_CoT_Step_by_Step/blob/a7a2d677bf9268b79f48f77953416b7a6d8bff99/src/train.py#L181

Should this line be config = ImplicitModelConfig(base_model=args.model) instead?

Thanks for your interest! Yes, that's right! In our experiments we always started from a pretrained model, so this error went unnoticed. We also refactored the code for the public release to make it cleaner.
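
For anyone patching this locally before pulling the fix, the corrected construction looks like the sketch below (only ImplicitModelConfig and args.model are taken from the thread; whether the config exposes a base_model attribute is an assumption):

# Build the config from the --model argument actually passed on the command line.
config = ImplicitModelConfig(base_model=args.model)
# Sanity check (assumes the config stores the name under .base_model):
assert config.base_model == args.model, config.base_model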

da03 commented 2 weeks ago

As for the command to reproduce the GSM8K results: since we refactored the code for the public release, I need to convert the command and make sure the results are reproducible before updating the README. In the meantime, for reproducibility, here's the original command we used to run the experiments in the paper:

export FOLDER=data/gsm8k
export MODEL=mistralai/Mistral-7B-v0.1
export EPOCHS=80
export LR=1e-5
export BSZ=16
export A=2
export SIDE=left
export PERE=8
export BEYOND=39
export TYPE=step
export LAMB=4
export PRETRAIN=0
export S=1234
export MAXLENT=150
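# ${MODEL#*/} strips everything up to the first '/': mistralai/Mistral-7B-v0.1 -> Mistral-7B-v0.1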
export M=${MODEL#*/}
export SAVE=train_models/gsm8k_seed/gpt2/teacher_s${SIDE}_p${PERE}_b${BEYOND}_e${EPOCHS}_m${M}_bsz${BSZ}_a${A}_t${TYPE}_maxlent${MAXLENT}_lr${LR}_acc${A}_pretrain${PRETRAIN}_l${LAMB}_seed$S
echo $SAVE
mkdir -p $SAVE
TOKENIZERS_PARALLELISM=false CUDA_VISIBLE_DEVICES=0 stdbuf -oL -eL python src/train.py \
    --train_path ${FOLDER}/train.txt \
    --val_path ${FOLDER}/valid.txt \
    --test_path ${FOLDER}/test.txt \
    --epochs $EPOCHS \
    --delete_per_epoch $PERE \
    --pretrain_epoch $PRETRAIN \
    --lr $LR \
    --batch_size $BSZ \
    --base_model $MODEL \
    --delete_side $SIDE \
    --delete_beyond $BEYOND \
    --delete_type $TYPE \
    --lamb $LAMB \
    --seed $S \
    --reset_optimizer \
    --accumulate $A \
    --max_len_train $MAXLENT \
    --save_model /mnt/$SAVE \
    > ${SAVE}/log.train 2>&1

Compared to your command, two major differences I noticed are: 1) the learning rate should be 1e-5 instead of 5e-5 (bf16 usually requires smaller learning rates); 2) bfloat16 should be used to avoid OOM (even on H100s with 80 GB of GPU memory). Also, there is a padding bug in the current code that I need to fix; I'll push the fix in a few minutes.
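
To illustrate point 2 (what --bf16 toggles inside train.py is an assumption here; the numbers below are just the standard bytes-per-parameter arithmetic):

import torch
from transformers import AutoModelForCausalLM

# Load Mistral-7B in bfloat16: fp32 weights alone are ~28 GB (7B params x 4 bytes),
# while bf16 halves that, leaving room for gradients and optimizer state on an
# 80 GB H100.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    torch_dtype=torch.bfloat16,
)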

da03 commented 2 weeks ago

(padding bug fix just pushed)

da03 commented 2 weeks ago

OK, I've just converted it to the new command format. Could you pull the latest code and run the command below?

export FOLDER=data/gsm8k
export MODEL=mistralai/Mistral-7B-v0.1
export EPOCHS=80
export LR=1e-5
export BSZ=16
export ACCUMULATE=2
export REMOVE_PER_EPOCH=8
export REMOVE_ALL_WHEN_REMOVE_BEYOND=39
export MAX_LEN_TRAIN=150
export REMOVAL_SMOOTHING_LAMBDA=4
export REMOVAL_SIDE=left
export PRETRAIN_EPOCHS=0
export SEED=1234
export SAVE=train_models/gsm8k
mkdir -p $SAVE
TOKENIZERS_PARALLELISM=false CUDA_VISIBLE_DEVICES=0 stdbuf -oL -eL python src/train.py \
    --model ${MODEL} \
    --train_path ${FOLDER}/train.txt \
    --val_path ${FOLDER}/valid.txt \
    --epochs ${EPOCHS} \
    --lr ${LR} \
    --batch_size ${BSZ} \
    --accumulate ${ACCUMULATE} \
    --remove_per_epoch ${REMOVE_PER_EPOCH} \
    --remove_all_when_remove_beyond ${REMOVE_ALL_WHEN_REMOVE_BEYOND} \
    --removal_smoothing_lambda ${REMOVAL_SMOOTHING_LAMBDA} \
    --removal_side ${REMOVAL_SIDE} \
    --pretrain_epochs ${PRETRAIN_EPOCHS} \
    --seed ${SEED} \
    --reset_optimizer \
    --bf16 \
    --max_len_train ${MAX_LEN_TRAIN} \
    --save_model ${SAVE} \
    > ${SAVE}/log.train 2>&1

Ber666 commented 2 weeks ago

Thanks, it works! One more question: How many epochs does it typically take to get the best accuracy?

da03 commented 2 weeks ago

Great! Our released model is checkpoint_8 (checkpoints are 0-indexed), so it took 9 epochs in that case.