Closed ziyuwan closed 1 year ago
What is your decode config?
Please paste your train shell and test shell
I use the original code in test_7b_13b.sh and I haven't changed anything in test.py, so I guess I'm using the default generation config.
For training I'm using the following script, which is almost the same as train_7b.sh:
export MODEL_PATH="meta-llama/Llama-2-7b-hf"
export SAVE_PATH="/data/ziyu/rft_model/llama2-7b-sft/"
export MASTER_ADDR="localhost"
export MASTER_PORT="1231"
export GLOO_SOCKET_IFNAME="lo"
export NCCL_SOCKET_IFNAME="lo"
export WANDB_DISABLED=true
wandb offline
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 -m torch.distributed.launch --master_addr ${MASTER_ADDR} --master_port ${MASTER_PORT} --nproc_per_node=8 --use_env train.py \
--model_name_or_path $MODEL_PATH \
--data_path ./data/train_use.jsonl \
--bf16 True \
--output_dir $SAVE_PATH \
--num_train_epochs 3 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1085 \
--save_total_limit 40 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
--tf32 True \
--cache_dir "/data/ziyu/hf_cache/huggingface/hub"
I also changed the tokenizer path in train.py:
https://github.com/OFA-Sys/gsm8k-ScRel/blob/f4d01761ec03d88a39486399c4617d29ee1dca7f/train.py#L264
'llama_model_hf/llama-7b',
-> "meta-llama/Llama-2-7b-hf",
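For reference, a minimal sketch of how the batch settings in the script above combine, assuming plain data parallelism where every GPU processes its own micro-batch. The function name is hypothetical and not part of the repository. It compares the 2 x 8 setting from the script with the 4 x 4 setting mentioned later in the thread.

```python
def effective_batch_size(num_gpus: int, per_device_batch: int, grad_accum_steps: int) -> int:
    """Number of samples contributing to one optimizer step (assumes data parallelism)."""
    return num_gpus * per_device_batch * grad_accum_steps


# Setting from the script above: 8 GPUs, batch size 2 per device, 8 accumulation steps.
script_setting = effective_batch_size(8, 2, 8)

# Setting mentioned later in the thread: batch size 4 per device, 4 accumulation steps.
alternative_setting = effective_batch_size(8, 4, 4)

print(script_setting, alternative_setting)
```

Both settings give the same number of samples per optimizer step, so under this assumption halving the per-device batch size and doubling the accumulation steps should change memory use, not the optimization schedule.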
I will try to reproduce your result. I have never hit OOM with bsz=4.
Thanks a lot!
I hope my Python environment config helps find the cause:
transformers 4.30.0
torch 2.0.1+cu117
My env: transformers 4.29.2 torch 1.12.1
The cause is test.py, which uses the LLaMA1 tokenizer with this set of special tokens:
DEFAULT_PAD_TOKEN = "[PAD]"
DEFAULT_EOS_TOKEN = "</s>"
DEFAULT_BOS_TOKEN = "</s>"
DEFAULT_UNK_TOKEN = "</s>"
You should use the LLaMA1 tokenizer to train your model.
I used the LLaMA1 tokenizer with my uploaded code, bsz=4 and gradient_accumulation_steps=4, and obtained 40.6.
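A minimal sketch of a check for this kind of mismatch, assuming the special tokens used at training time and at test time are available as plain dictionaries. The dictionary names and the helper are hypothetical. The test-time values are the ones quoted above; the training-time values are illustrative only.

```python
# Special tokens hard-coded for evaluation (values quoted from test.py above).
TEST_TOKENS = {
    "pad_token": "[PAD]",
    "eos_token": "</s>",
    "bos_token": "</s>",
    "unk_token": "</s>",
}

# Illustrative training-time tokens (an assumption, not taken from any real run).
TRAIN_TOKENS = {
    "pad_token": "[PAD]",
    "eos_token": "</s>",
    "bos_token": "<s>",
    "unk_token": "<unk>",
}


def token_mismatches(train: dict, test: dict) -> dict:
    """Return {token_name: (train_value, test_value)} for every differing entry."""
    names = sorted(set(train) | set(test))
    return {n: (train.get(n), test.get(n)) for n in names if train.get(n) != test.get(n)}


mismatches = token_mismatches(TRAIN_TOKENS, TEST_TOKENS)
for name, (train_value, test_value) in mismatches.items():
    print(f"{name}: train={train_value!r} test={test_value!r}")
```

An empty result would mean that training and evaluation agree on all special tokens.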
ok, I'll try it. Thanks for your help
@GanjinZero I have also tried two SFT settings on 8 * A800 80G.
The first trains llama-1 (based on https://huggingface.co/huggyllama/llama-7b) with the tokens below, so I did not change the token set in test.py:
DEFAULT_PAD_TOKEN = "[PAD]"
DEFAULT_EOS_TOKEN = "</s>"
DEFAULT_BOS_TOKEN = "</s>"
DEFAULT_UNK_TOKEN = "</s>"
The only difference is that I also hit OOM with your hyperparameters, so I set per_device_train_batch_size to 2 and gradient_accumulation_steps to 8. I get the following result: ./raw_generation_greedy.json 403 30.55344958301744 1319
The second trains llama-2 with the tokens below, and I also changed the corresponding token set in test.py:
DEFAULT_PAD_TOKEN = "[PAD]"
DEFAULT_EOS_TOKEN = "</s>"
DEFAULT_BOS_TOKEN = "<s>"
DEFAULT_UNK_TOKEN = "<unk>"
I get the following result: ./raw_generation_greedy.json 470 35.6330553449583 1319
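A minimal sketch for reading result lines like the ones above, assuming the four fields are the output path, the number of correct answers, the accuracy in percent, and the total number of test problems. This field order is inferred from the numbers and is not confirmed by the thread. The helper name is hypothetical.

```python
def parse_result(line: str):
    """Split a result line into (path, correct, accuracy_percent, total)."""
    path, correct, accuracy, total = line.split()
    return path, int(correct), float(accuracy), int(total)


lines = [
    "./raw_generation_greedy.json 403 30.55344958301744 1319",
    "./raw_generation_greedy.json 470 35.6330553449583 1319",
]

differences = []
for line in lines:
    path, correct, accuracy, total = parse_result(line)
    recomputed = 100 * correct / total  # accuracy recomputed from the counts
    differences.append(abs(recomputed - accuracy))
    print(f"{correct}/{total} = {recomputed:.2f}% (reported {accuracy:.2f}%)")
```

If the assumption about the field order holds, the recomputed and reported accuracies agree up to floating-point error.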
That is really weird.
I am not sure whether the torch or transformers version can influence this (we found that some bugs show up with transformers 4.31.0, so we reverted to 4.30.0). Could you please provide a requirements.txt file with all the necessary dependencies?
My env: transformers 4.29.2 torch 1.12.1
Please refer to this now.
I will try to reproduce with bsz 4 * 4 and 2 * 8 (per-device batch size * gradient accumulation steps).
With bsz = 4 * 4, GPU memory usage peaks at about 75GB.
gsm8k-sft-llama-7b-bsz-2-8/raw_generation_greedy_debug.json 462 35.02653525398029 1319
gsm8k-sft-llama-7b-bsz-4-4/raw_generation_greedy_debug.json 466 35.329795299469296 1319
gsm8k-sft-llama2-7b-bsz-2-8/raw_generation_greedy_debug.json 546 41.39499620924943 1319
gsm8k-sft-llama2-7b-bsz-4-4/raw_generation_greedy_debug.json 535 40.56103108415466 1319
I really have no idea why your code does not work. The only thing I notice is that I consistently use LLaMA1-7B as my tokenizer during fine-tuning.
If you keep getting OOM, please check max_len.
I am trying your transformers and torch versions and get the following error:
ValueError: FSDP requires PyTorch >= 2.0.1
What is your accelerate version? To be honest, we would deeply appreciate it if you could offer us a complete list of dependencies.
I will provide it. Since our environment has some company-related packages, I need to remove them first.
Sure, thanks for your help
absl-py==1.4.0
accelerate==0.21.0
addict==2.4.0
apex==0.1
cmake==3.18.2.post1
Cython @ file:///opt/conda/conda-bld/cython_1663692770955/work
datasets==2.14.3
debugpy==1.6.7
decorator==5.1.1
editdistance==0.6.2
einops==0.6.1
flash-attn==0.2.8
horovod==0.24.3
huggingface-hub==0.14.1
ipdb==0.13.13
ipykernel==6.23.1
ipython==8.1.0
jsonlines==3.1.0
numba==0.53.1
numpy==1.22.2
pandas==1.5.3
pyarrow==12.0.1
scikit-learn==1.1.3
scipy==1.9.3
sentencepiece==0.1.99
tokenizers==0.13.3
torch @ 1.12.1
torch-cluster==1.6.0
torch-geometric==2.1.0.post1
torch-scatter==2.0.9
torch-sparse==0.6.15
torch-spline-conv==1.2.1
torchacc
torchdata==0.4.1
torchvision==0.13.1+cu113
tornado==6.2
tqdm @ file:///opt/conda/conda-bld/tqdm_1664392687731/work
traitlets==5.5.0
transformers==4.29.2
triton==1.0.0
typing_extensions==4.4.0
urllib3 @ file:///croot/urllib3_1666298941550/work
wcwidth==0.2.5
xxhash==3.3.0
yacs==0.1.8
yapf==0.32.0
yarl==1.9.2
zipp==3.15.0
zstandard==0.21.0
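A minimal sketch for comparing a local environment against a few of the pins above, using only the standard library. The selection of packages and the helper names are assumptions made for illustration. torch is left out because the list gives it as `torch @ 1.12.1` rather than an exact `==` pin, and installed builds may carry a local suffix.

```python
from importlib import metadata

# A few exact pins copied from the list above (the selection is illustrative).
PINNED = {
    "transformers": "4.29.2",
    "accelerate": "0.21.0",
    "tokenizers": "0.13.3",
    "sentencepiece": "0.1.99",
}


def installed_version(name: str):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned: dict, lookup=installed_version) -> dict:
    """Return {package: (pinned_version, found_version)} for every package that differs."""
    result = {}
    for name, wanted in pinned.items():
        found = lookup(name)
        if found != wanted:
            result[name] = (wanted, found)
    return result


for name, (wanted, found) in find_mismatches(PINNED).items():
    print(f"{name}: pinned {wanted}, found {found}")
```

The `lookup` parameter exists so the comparison can be exercised without touching the real environment, for example by passing the `get` method of a dictionary of versions.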
After using the same environment as yours, we managed to reproduce the result, and the CUDA OOM problem disappeared. Thanks, I'll close this issue.
Hello, I'm trying to reproduce your results for two settings with llama2-7b, but I cannot get scores as high as those mentioned in the paper.
Btw, while training on 8 Nvidia A800 80G GPUs, I always got torch.cuda.OutOfMemoryError, so I divided the micro-batch-size-per-gpu by 2 and doubled the gradient-accumulation-step. Is this because we are using different GPUs/environments? Could you please share a requirement.txt for your environment, or certain checkpoints/seeds, to help reproduce your result? Thanks!