problems about reproducing llama7b2-sft and llama7b2-rft-100

ziyuwan commented 1 year ago

Hello, I'm trying to reproduce your results for two settings with llama2-7b, but I cannot reach the scores reported in the paper.

llama2-7b SFT on the GSM8K training set (7.4K examples, 3 epochs): 41.6% in the paper. I trained twice and got test scores of 34.57% and 37.9%.
llama2-7b RFT on rft-k=100 (47K examples, 3 epochs): 47.5% in the paper. My test score is 42.22%.

By the way, when training on 8 NVIDIA A800 80GB GPUs, I always got torch.cuda.OutOfMemoryError, so I halved the per-GPU micro-batch size and doubled the gradient accumulation steps.
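As a quick sanity check on that change, here is a minimal sketch of the effective batch size arithmetic. The starting values (per-device batch 4, gradient accumulation 4) are taken from what the maintainer reports later in this thread; the function name is only illustrative. Equal effective batch sizes do not guarantee identical training dynamics, so this rules out only one possible cause.

```python
def effective_batch_size(num_gpus: int, per_device_batch: int, grad_accum_steps: int) -> int:
    """Number of training examples that contribute to one optimizer step."""
    return num_gpus * per_device_batch * grad_accum_steps


# Setting assumed from later in the thread: bsz=4, gradient_accumulation_steps=4 on 8 GPUs.
original = effective_batch_size(num_gpus=8, per_device_batch=4, grad_accum_steps=4)

# Setting used after the OOM workaround: bsz=2, gradient_accumulation_steps=8 on 8 GPUs.
adjusted = effective_batch_size(num_gpus=8, per_device_batch=2, grad_accum_steps=8)

print(original, adjusted)
```

Both settings give the same number of examples per optimizer step, so the workaround by itself should not change the learning-rate schedule or the number of updates per epoch.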

Is this because we are using different GPUs/environments? Could you please share a requirements.txt for your environment, or specific checkpoints/seeds, to help reproduce your results?

Thanks!

GanjinZero commented 1 year ago

What is your decode config?

GanjinZero commented 1 year ago

Please paste your train shell and test shell

ziyuwan commented 1 year ago

What is your decode config? Please paste your train shell and test shell

I use the original code in test_7b_13b.sh and I haven't changed anything in test.py, so I guess I'm using the default generation config.

For training I'm using the following script, which is almost the same as train_7b.sh:

export MODEL_PATH="meta-llama/Llama-2-7b-hf"
export SAVE_PATH="/data/ziyu/rft_model/llama2-7b-sft/"
export MASTER_ADDR="localhost"
export MASTER_PORT="1231"
export GLOO_SOCKET_IFNAME="lo"
export NCCL_SOCKET_IFNAME="lo"
export WANDB_DISABLED=true
wandb offline

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 -m torch.distributed.launch --master_addr ${MASTER_ADDR} --master_port ${MASTER_PORT} --nproc_per_node=8 --use_env train.py \
    --model_name_or_path $MODEL_PATH \
    --data_path ./data/train_use.jsonl \
    --bf16 True \
    --output_dir $SAVE_PATH \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1085 \
    --save_total_limit 40 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --cache_dir "/data/ziyu/hf_cache/huggingface/hub"

I also changed the tokenizer path in train.py: https://github.com/OFA-Sys/gsm8k-ScRel/blob/f4d01761ec03d88a39486399c4617d29ee1dca7f/train.py#L264

'llama_model_hf/llama-7b', -> "meta-llama/Llama-2-7b-hf",

GanjinZero commented 1 year ago

I will try to reproduce your result. I have never hit OOM with bsz=4.

ziyuwan commented 1 year ago

Thanks a lot!

I hope my Python environment helps find the cause:

transformers                  4.30.0
torch                         2.0.1+cu117

GanjinZero commented 1 year ago

My env: transformers 4.29.2 torch 1.12.1

GanjinZero commented 1 year ago

The reason lies in test.py, which assumes the LLaMA1 tokenizer and uses this token config:

DEFAULT_PAD_TOKEN = "[PAD]"
DEFAULT_EOS_TOKEN = "</s>"
DEFAULT_BOS_TOKEN = "</s>"
DEFAULT_UNK_TOKEN = "</s>"

You should use the LLaMA1 tokenizer to train your model.

Using the LLaMA1 tokenizer with my uploaded code, bsz=4, and gradient_accumulation_steps=4, I obtain 40.6.

ziyuwan commented 1 year ago

ok, I'll try it. Thanks for your help

waterhorse1 commented 1 year ago

@GanjinZero I have also tried two SFT settings on 8 * A800 80G.

The first is to train llama-1 (based on https://huggingface.co/huggyllama/llama-7b) with SFT using the tokens below, so I did not change the token set in test.py:

DEFAULT_PAD_TOKEN = "[PAD]"
DEFAULT_EOS_TOKEN = "</s>"
DEFAULT_BOS_TOKEN = "</s>"
DEFAULT_UNK_TOKEN = "</s>"

The only difference is that I also hit OOM with your hyperparameters, so I set per_device_train_batch_size to 2 and gradient_accumulation_steps to 8. I get the following result: ./raw_generation_greedy.json 403 30.55344958301744 1319

The second is to train llama-2 with SFT using the tokens below, and I also changed the corresponding token set in test.py:

DEFAULT_PAD_TOKEN = "[PAD]"
DEFAULT_EOS_TOKEN = "</s>"
DEFAULT_BOS_TOKEN = "<s>"
DEFAULT_UNK_TOKEN = "<unk>"

I get the following result: ./raw_generation_greedy.json 470 35.6330553449583 1319
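To make the difference between the two token sets explicit, here is a minimal sketch that compares them. The values are copied from the two configs quoted in this thread; the dictionary keys and the helper name `diff_special_tokens` are illustrative and are not taken from train.py or test.py.

```python
# Token set used by test.py, as quoted by the maintainer in this thread.
TEST_PY_TOKENS = {
    "pad_token": "[PAD]",
    "eos_token": "</s>",
    "bos_token": "</s>",
    "unk_token": "</s>",
}

# Token set used for the llama-2 run described above.
LLAMA2_RUN_TOKENS = {
    "pad_token": "[PAD]",
    "eos_token": "</s>",
    "bos_token": "<s>",
    "unk_token": "<unk>",
}


def diff_special_tokens(train_tokens: dict, test_tokens: dict) -> dict:
    """Return {token_name: (train_value, test_value)} for every mismatch."""
    names = sorted(set(train_tokens) | set(test_tokens))
    return {
        name: (train_tokens.get(name), test_tokens.get(name))
        for name in names
        if train_tokens.get(name) != test_tokens.get(name)
    }


for name, (train_value, test_value) in diff_special_tokens(
    LLAMA2_RUN_TOKENS, TEST_PY_TOKENS
).items():
    print(f"{name}: train={train_value!r} test={test_value!r}")
```

Running a check like this before evaluation would flag that bos_token and unk_token differ whenever training and testing do not share one token set.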

GanjinZero commented 1 year ago

That is really weird.

waterhorse1 commented 1 year ago

I am not sure whether the torch version or the huggingface version can influence this (we found that some bugs show up with huggingface version 4.31.0, so we reverted to 4.30.0). Could you please provide a requirements.txt file with all the necessary dependencies?

GanjinZero commented 1 year ago

My env: transformers 4.29.2 torch 1.12.1

Please refer to this now.

GanjinZero commented 1 year ago

I will try to reproduce with bsz 4 (gradient accumulation 4) and bsz 2 (gradient accumulation 8).

GanjinZero commented 1 year ago

With bsz = 4 * 4, the GPU shows at most 75GB of memory in use.

GanjinZero commented 1 year ago

gsm8k-sft-llama-7b-bsz-2-8/raw_generation_greedy_debug.json 462 35.02653525398029 1319
gsm8k-sft-llama-7b-bsz-4-4/raw_generation_greedy_debug.json 466 35.329795299469296 1319

GanjinZero commented 1 year ago

gsm8k-sft-llama2-7b-bsz-2-8/raw_generation_greedy_debug.json 546 41.39499620924943 1319
gsm8k-sft-llama2-7b-bsz-4-4/raw_generation_greedy_debug.json 535 40.56103108415466 1319
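For readers comparing numbers, each result entry in this thread appears to have the form "path, correct count, accuracy in percent, total test examples". That layout is inferred from the values (for example 546 of 1319 is about 41.39%) and is not confirmed by test.py. A minimal sketch for parsing one entry and re-deriving the accuracy, with illustrative function names:

```python
def parse_result_line(line: str) -> tuple:
    """Split one result entry into (path, correct, accuracy_percent, total).

    Assumes four whitespace-separated fields, as inferred from this thread.
    """
    path, correct, accuracy, total = line.split()
    return path, int(correct), float(accuracy), int(total)


def accuracy_percent(correct: int, total: int) -> float:
    """Accuracy as a percentage of correctly answered test problems."""
    return 100.0 * correct / total


line = "gsm8k-sft-llama2-7b-bsz-2-8/raw_generation_greedy_debug.json 546 41.39499620924943 1319"
path, correct, reported, total = parse_result_line(line)
recomputed = accuracy_percent(correct, total)
print(f"{path}: reported {reported:.2f}%, recomputed {recomputed:.2f}%")
```

Recomputing the percentage from the counts is a cheap way to confirm that two runs were scored against the same 1319 test problems.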

GanjinZero commented 1 year ago

I really have no idea why your code does not work. The only thing I notice is that I consistently use LLaMA1-7B as my tokenizer during fine-tuning.

GanjinZero commented 1 year ago

If you constantly get OOM, please check max_len.

waterhorse1 commented 1 year ago

I am trying your transformers and torch versions and get the following error:

ValueError: FSDP requires PyTorch >= 2.0.1

What is your accelerate version? To be honest, we would deeply appreciate it if you could give us a complete list of dependencies.

GanjinZero commented 1 year ago

I will provide it. Our environment has some company-related packages, so I need to remove them first.

waterhorse1 commented 1 year ago

Sure, thanks for your help

GanjinZero commented 1 year ago

absl-py==1.4.0
accelerate==0.21.0
addict==2.4.0
apex==0.1
cmake==3.18.2.post1
Cython @ file:///opt/conda/conda-bld/cython_1663692770955/work
datasets==2.14.3
debugpy==1.6.7
decorator==5.1.1
editdistance==0.6.2
einops==0.6.1
flash-attn==0.2.8
horovod==0.24.3
huggingface-hub==0.14.1
ipdb==0.13.13
ipykernel==6.23.1
ipython==8.1.0
jsonlines==3.1.0
numba==0.53.1
numpy==1.22.2
pandas==1.5.3
pyarrow==12.0.1
scikit-learn==1.1.3
scipy==1.9.3
sentencepiece==0.1.99
tokenizers==0.13.3
torch @ 1.12.1
torch-cluster==1.6.0
torch-geometric==2.1.0.post1
torch-scatter==2.0.9
torch-sparse==0.6.15
torch-spline-conv==1.2.1
torchacc
torchdata==0.4.1
torchvision==0.13.1+cu113
tornado==6.2
tqdm @ file:///opt/conda/conda-bld/tqdm_1664392687731/work
traitlets==5.5.0
transformers==4.29.2
triton==1.0.0
typing_extensions==4.4.0
urllib3 @ file:///croot/urllib3_1666298941550/work
wcwidth==0.2.5
xxhash==3.3.0
yacs==0.1.8
yapf==0.32.0
yarl==1.9.2
zipp==3.15.0
zstandard==0.21.0

ziyuwan commented 1 year ago

After using the same environment as yours, we managed to reproduce the result, and the CUDA OOM problem disappeared. Thanks, I'll close this issue.
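Since the fix came down to matching the environment, here is a minimal sketch for checking a local install against a few of the versions pinned in this thread. The selection of packages and the helper names are illustrative assumptions; only the version strings come from the list above. Local build suffixes such as "+cu117" are ignored when comparing.

```python
from importlib import metadata

# A subset of the pinned versions posted in this thread.
PINNED = {
    "transformers": "4.29.2",
    "torch": "1.12.1",
    "accelerate": "0.21.0",
    "tokenizers": "0.13.3",
    "sentencepiece": "0.1.99",
}


def installed_version(package: str):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned: dict, lookup=installed_version) -> dict:
    """Return {package: (wanted, found)} for every package that differs."""
    mismatches = {}
    for package, wanted in pinned.items():
        found = lookup(package)
        # Drop local build tags such as "+cu117" before comparing.
        if found is None or found.split("+")[0] != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for package, (wanted, found) in find_mismatches(PINNED).items():
        print(f"{package}: wanted {wanted}, found {found}")
```

The `lookup` parameter exists only so the comparison can be exercised without installing anything; by default it queries the current Python environment.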
