zwRuan opened 5 months ago
I checked again, and it is the evaluation code that blows up the GPU memory as the amount of data increases:

```python
best_perplexity = evaluate_ppl(model, dev_set, tokenizer_llm, tokenizer_m2m, max_seq_len, max_gen_len, langs_map, augmentation)
```

and in `MindMerger.forward`:

```python
output = self.model_llm(inputs_embeds=llm_input_embedding, attention_mask=llm_input_mask, labels=labels)
```
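For reference, here is a minimal sketch of a memory-safe perplexity loop (the function and loader names are illustrative, not the repo's actual `evaluate_ppl`): running evaluation under `torch.no_grad()` and accumulating the loss as a Python float keeps GPU memory flat as the dev set grows.

```python
import torch

@torch.no_grad()  # no computation graph is kept during evaluation
def evaluate_ppl_sketch(model, dev_loader):
    # Illustrative sketch only: accumulate loss as Python floats so GPU memory
    # does not grow with the number of evaluation batches.
    model.eval()
    total_loss, num_batches = 0.0, 0
    for batch in dev_loader:
        output = model(**batch)           # assumes the model returns a .loss field
        total_loss += output.loss.item()  # .item() moves the scalar off the GPU
        num_batches += 1
    return torch.exp(torch.tensor(total_loss / num_batches)).item()
```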
Thank you for your attention to our work!
From your description, it seems that the code runs, but the GPU runs out of memory? What kind of GPU do you use? I used A100s with 80GB of memory in my experiments.
You can try assigning a smaller batch size to each GPU, such as setting `train_batch_size=128`, `train_micro_batch_size_per_gpu=1`, `gradient_accumulation=32`. (Note that `train_batch_size` = `train_micro_batch_size_per_gpu` × `gradient_accumulation` × the number of GPUs.)
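For example, with 4 GPUs the suggested numbers satisfy this constraint; the sketch below is just the arithmetic, with variable names mirroring the flags above:

```python
# Sketch of the DeepSpeed batch-size constraint, assuming 4 GPUs as above.
num_gpus = 4
train_micro_batch_size_per_gpu = 1
gradient_accumulation = 32

train_batch_size = train_micro_batch_size_per_gpu * gradient_accumulation * num_gpus
assert train_batch_size == 128
```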
Or try lowering the maximum sequence length, such as setting `max_seq_len=100` and `max_gen_len=100`, but this may degrade model performance.
Thank you for your answer. Could you please tell me how long it takes to train your model?
I used zero2_offload in the first stage, with num_4090_gpu (4) × per_gpu_batch_size (1) × acc (32) = 128, and the time was 3 × 10h (3 epochs, 10h per epoch).
I used zero3_no_offload in the second stage, with num_4090_gpu (4) × per_gpu_batch_size (1) × acc (32) = 128, and the tqdm estimate showed 3 × 90h (3 epochs, 90h per epoch).
I just changed the stage and offload parameters in your deepspeed_config.
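Roughly what I changed, written here as Python dicts standing in for the JSON config (the key names follow the DeepSpeed documentation and may not match the exact layout of your deepspeed_config file):

```python
# First stage: ZeRO-2 with optimizer offload to CPU.
zero2_offload = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},
    }
}

# Second stage: ZeRO-3 with the offload sections removed.
zero3_no_offload = {
    "zero_optimization": {
        "stage": 3,
    }
}
```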
Is the time for the second stage a bit too long? Does this time seem strange to you?
Hi,
When I use a single A100, the first stage takes 6.5h per epoch and the second stage takes 8.5h per epoch. We actually used 8 A100s to speed up our training. I haven't tested our method on the 4090, but it might be a little hard to train a model based on Llama-7B with 24GB GPUs.
To facilitate research, we plan to publish a smaller version of the model in the near future, implementing our method on a distilled Llama such as MobileLLaMA-1.4B-Base.
Thank you for this wonderful work!

```bash
deepspeed --master_port 50002 run_training.py --deepspeed --llm_path meta-math/MetaMath-7B-V1.0 \
    --mt_path google/mt5-xl --stage_name mapping --train_num 100000 --train_batch_size 128 \
    --train_micro_batch_size_per_gpu 8 --gradient_accumulation 4 --augmentation False --epoch_num 3 \
    --gpu 0,1,2,3 --max_seq_len 200 --max_gen_len 200 --train_batch_size 128 --eval_batch_size 2
```

This is your bash file, and the gpu is set to 0,1,2,3, but when I run this command the GPUs are not used, the code gets stuck at loading the checkpoint, and I notice that a lot of the log output is printed 8 times. When I run run_training.py directly instead, the GPU memory keeps growing and then explodes.
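For reference, here is a small debugging sketch that could be dropped near the top of run_training.py to see how many processes were launched and which GPUs each one can see (RANK, LOCAL_RANK, and CUDA_VISIBLE_DEVICES are standard environment variables set by the distributed launcher; seeing 8 different ranks would mean 8 processes were started):

```python
import os
import torch

# Debugging sketch: print once per launched process to see how many processes
# exist and which GPUs each one is allowed to use.
print(
    "RANK:", os.environ.get("RANK"),
    "LOCAL_RANK:", os.environ.get("LOCAL_RANK"),
    "CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"),
    "visible GPUs:", torch.cuda.device_count(),
)
```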