Ber666 opened 3 weeks ago
I'm running the following script with the arguments specified in the paper (or the default values where the paper doesn't mention them):
export FOLDER=data/gsm8k/
export MODEL=mistralai/Mistral-7B-v0.1
export EPOCHS=200
export LR=5e-5
export BSZ=16
export ACCUMULATE=2
export REMOVE_PER_EPOCH=8
export REMOVE_ALL_WHEN_REMOVE_BEYOND=39
export REMOVAL_SMOOTHING_LAMBDA=4
export REMOVAL_SIDE=left
export PRETRAIN_EPOCHS=0
export SEED=3456
export SAVE=train_models/gsm8k/mistral
mkdir -p $SAVE
TOKENIZERS_PARALLELISM=false CUDA_VISIBLE_DEVICES=0 stdbuf -oL -eL python src/train.py \
--model ${MODEL} \
--train_path ${FOLDER}/train.txt \
--val_path ${FOLDER}/valid.txt \
--epochs ${EPOCHS} \
--lr ${LR} \
--batch_size ${BSZ} \
--accumulate ${ACCUMULATE} \
--remove_per_epoch ${REMOVE_PER_EPOCH} \
--remove_all_when_remove_beyond ${REMOVE_ALL_WHEN_REMOVE_BEYOND} \
--removal_smoothing_lambda ${REMOVAL_SMOOTHING_LAMBDA} \
--removal_side ${REMOVAL_SIDE} \
--pretrain_epochs ${PRETRAIN_EPOCHS} \
--seed ${SEED} \
--reset_optimizer \
--save_model ${SAVE} \
> ${SAVE}/log.train 2>&1
So far it has run 25 epochs, but the best accuracy on the validation set is only around 0.25 (epoch 12):
Disable Offset Val. PPL: 1.216820620826269; Accuracy: 0.252; Token Accuracy: 0.9625558257102966.
***best so far or removed more CoT tokens***
Saving to train_models/gsm8k/mistral/checkpoint_12
Could you take a look to see if anything is wrong with my setting? Thank you!
This line should be
config = ImplicitModelConfig(base_model=args.model)
instead?
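As a minimal sketch of the proposed fix: the idea is that the config should be built from the user-supplied --model argument rather than a hard-coded default. The real ImplicitModelConfig lives in the repo; the stand-in dataclass below is only illustrative.

```python
from dataclasses import dataclass
import argparse

# Stand-in for the repo's ImplicitModelConfig (illustrative only).
@dataclass
class ImplicitModelConfig:
    base_model: str

parser = argparse.ArgumentParser()
parser.add_argument("--model", type=str, default="gpt2")
args = parser.parse_args(["--model", "mistralai/Mistral-7B-v0.1"])

# The fix: pass args.model through, so --model actually takes effect
# instead of silently falling back to the default base model.
config = ImplicitModelConfig(base_model=args.model)
print(config.base_model)  # mistralai/Mistral-7B-v0.1
```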
Thanks for your interest! Yes, that's right! In our experiments we always started from a pretrained model, so this error went unnoticed. We also refactored the code for the public release to make it cleaner.
As for the command to reproduce the GSM8K results: since we refactored the code for the public release, I need to convert the command and verify the results are reproducible before updating the README. In the meantime, here's the original command we used to run the experiments in the paper:
export FOLDER=data/gsm8k
export MODEL=mistralai/Mistral-7B-v0.1
export EPOCHS=80
export LR=1e-5
export BSZ=16
export A=2
export SIDE=left
export PERE=8
export BEYOND=39
export TYPE=step
export LAMB=4
export PRETRAIN=0
export S=1234
export MAXLENT=150
export M=${MODEL#*/}
export SAVE=train_models/gsm8k_seed/gpt2/teacher_s${SIDE}_p${PERE}_b${BEYOND}_e${EPOCHS}_m${M}_bsz${BSZ}_a${A}_t${TYPE}_maxlent${MAXLENT}_lr${LR}_acc${A}_pretrain${PRETRAIN}_l${LAMB}_seed$S
echo $SAVE
mkdir -p $SAVE
TOKENIZERS_PARALLELISM=false CUDA_VISIBLE_DEVICES=0 stdbuf -oL -eL python src/train.py \
--train_path ${FOLDER}/train.txt \
--val_path ${FOLDER}/valid.txt \
--test_path ${FOLDER}/test.txt \
--epochs $EPOCHS \
--delete_per_epoch $PERE \
--pretrain_epoch $PRETRAIN \
--lr $LR \
--batch_size $BSZ \
--base_model $MODEL \
--delete_side $SIDE \
--delete_beyond $BEYOND \
--delete_type $TYPE \
--lamb $LAMB \
--seed $S \
--reset_optimizer \
--accumulate $A \
--max_len_train $MAXLENT \
--save_model /mnt/$SAVE \
> ${SAVE}/log.train 2>&1
Compared to your command, two major differences stand out: 1) the learning rate should be 1e-5 instead of 5e-5 (bf16 usually requires smaller learning rates); 2) bfloat16 should be used to avoid OOM (even on H100s with 80 GB of GPU memory). Also, there is a padding bug in the current code that I need to fix; I'll push the fix in a few minutes.
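For intuition on why bf16 matters here, a back-of-the-envelope estimate (assuming roughly 7.24B parameters for Mistral-7B and ignoring activations) shows that full fp32 training state alone already exceeds an 80 GB card:

```python
# Rough memory arithmetic for full fine-tuning; activations are ignored.
params = 7.24e9  # assumed approximate parameter count for Mistral-7B
GB = 1024**3

fp32_weights = params * 4 / GB        # weights in fp32 (4 bytes each)
bf16_weights = params * 2 / GB        # weights in bf16 (2 bytes each)
# fp32 training: weights + gradients + AdamW m,v = 16 bytes per parameter
fp32_training = params * 16 / GB

print(f"fp32 weights:  {fp32_weights:.0f} GB")
print(f"bf16 weights:  {bf16_weights:.0f} GB")
print(f"fp32 weights + grads + AdamW states: {fp32_training:.0f} GB")
```

Storing weights and gradients in bf16 roughly halves those terms, which is what brings the footprint back under 80 GB.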
(padding bug fix just pushed)
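The thread doesn't say what the padding bug was, but a common pitfall with decoder-only models is padding batched prompts on the right, so that generation continues after PAD tokens instead of after the prompt. A toy illustration (PAD and pad_batch are made up for this sketch):

```python
PAD = "<pad>"

def pad_batch(prompts, side):
    """Pad a batch of token lists to equal length on the given side."""
    width = max(len(p) for p in prompts)
    out = []
    for p in prompts:
        padding = [PAD] * (width - len(p))
        out.append(padding + p if side == "left" else p + padding)
    return out

batch = [["A", "B", "C"], ["X"]]
# Right padding: the next generated token would follow PAD, not "X".
print(pad_batch(batch, "right")[1])  # ['X', '<pad>', '<pad>']
# Left padding: the next generated token follows the actual prompt.
print(pad_batch(batch, "left")[1])   # ['<pad>', '<pad>', 'X']
```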
OK, I've just converted it to the new command format. Can you pull the latest code and run the command below?
export FOLDER=data/gsm8k
export MODEL=mistralai/Mistral-7B-v0.1
export EPOCHS=80
export LR=1e-5
export BSZ=16
export ACCUMULATE=2
export REMOVE_PER_EPOCH=8
export REMOVE_ALL_WHEN_REMOVE_BEYOND=39
export MAX_LEN_TRAIN=150
export REMOVAL_SMOOTHING_LAMBDA=4
export REMOVAL_SIDE=left
export PRETRAIN_EPOCHS=0
export SEED=1234
export SAVE=train_models/gsm8k
mkdir -p $SAVE
TOKENIZERS_PARALLELISM=false CUDA_VISIBLE_DEVICES=0 stdbuf -oL -eL python src/train.py \
--model ${MODEL} \
--train_path ${FOLDER}/train.txt \
--val_path ${FOLDER}/valid.txt \
--epochs ${EPOCHS} \
--lr ${LR} \
--batch_size ${BSZ} \
--accumulate ${ACCUMULATE} \
--remove_per_epoch ${REMOVE_PER_EPOCH} \
--remove_all_when_remove_beyond ${REMOVE_ALL_WHEN_REMOVE_BEYOND} \
--removal_smoothing_lambda ${REMOVAL_SMOOTHING_LAMBDA} \
--removal_side ${REMOVAL_SIDE} \
--pretrain_epochs ${PRETRAIN_EPOCHS} \
--seed ${SEED} \
--reset_optimizer \
--bf16 \
--max_len_train ${MAX_LEN_TRAIN} \
--save_model ${SAVE} \
> ${SAVE}/log.train 2>&1
Thanks, it works! One more question: how many epochs does it typically take to reach the best accuracy?
Great! Our released model is checkpoint_8, so it took 9 epochs in that case.
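The checkpoint-to-epoch arithmetic assumes the naming seen in the log above ("Saving to .../checkpoint_12" at epoch 12): directories are numbered from 0, one per completed epoch, so checkpoint_8 is the 9th epoch.

```python
# Assumed naming scheme: one checkpoint per epoch, numbered from 0.
checkpoints = [f"checkpoint_{epoch}" for epoch in range(9)]
print(checkpoints[-1])   # checkpoint_8
print(len(checkpoints))  # 9
```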
Thanks for this amazing work! Could you provide the command to reproduce the results on GSM8K?