BlinkDL / RWKV-LM

RWKV is an RNN with transformer-level LLM performance. It can be directly trained like a GPT (parallelizable). So it's combining the best of RNN and transformer - great performance, fast inference, saves VRAM, fast training, "infinite" ctx_len, and free sentence embedding.
Apache License 2.0

OOM issue for both CPU and GPU in 1B5 model training #117

Closed · fubincom closed this issue 1 year ago

fubincom commented 1 year ago

Hello, I'm trying to fine-tune the RWKV 1B5 model with this command:

```
train.py --load_model ${FILE_DIR}/gpt_model/RWKV-4-Raven-1B5-v11.pth --wandb "" --proj_dir "out" \
  --data_file ${FILE_DIR}/train.npy --data_type "numpy" --vocab_size 50277 \
  --ctx_len 1024 --epoch_steps 5 --epoch_count 5 --epoch_begin 0 --epoch_save 2 \
  --micro_bsz 2 --n_layer 24 --n_embd 2048 --pre_ffn 0 --head_qk 0 \
  --lr_init 1e-5 --lr_final 1e-5 --warmup_steps 0 --beta1 0.9 --beta2 0.999 --adam_eps 1e-8 \
  --precision fp16 --strategy deepspeed_stage_2_offload --accelerator gpu --grad_cp 1 --devices 8
```

My GPUs are V100s with 32 GB of GPU memory each, and each machine has 40 GB of CPU memory. With deepspeed_stage_2_offload, training only survives about 2 steps before it is killed by a CPU out-of-memory error (even if I set more steps per epoch, it is still killed at the second step). Is something wrong with my experiment settings, or do I simply need more CPU memory for this?
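For reference, a rough back-of-envelope sketch of why stage-2 optimizer offload can exhaust 40 GB of host RAM for a 1.5B-parameter model. The byte counts below are illustrative assumptions, not exact DeepSpeed accounting:

```python
# Hedged estimate of CPU RAM consumed by ZeRO stage-2 optimizer offload.
# Assumption: Adam keeps fp32 master weights, momentum, and variance on the CPU,
# plus an fp32 gradient partition for the optimizer step, i.e. roughly
# 16 bytes per parameter spread across the node, before pinned communication
# buffers and per-rank Python/dataloader overhead are counted.

n_params = 1.5e9          # RWKV 1B5
bytes_per_param = 16      # 4 (fp32 weights) + 4 (momentum) + 4 (variance) + 4 (fp32 grads)

offload_gb = n_params * bytes_per_param / 1024**3
print(f"~{offload_gb:.0f} GB of CPU RAM just for offloaded optimizer state")
# ~22 GB on a 40 GB machine leaves little headroom for 8 training processes,
# pinned buffers, and the dataset, so a CPU OOM is plausible.
```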

Really appreciate your help.

[Screenshot: 2023-05-18 at 13:08:08]
BlinkDL commented 1 year ago

Use --strategy deepspeed_stage_2. You have enough VRAM, so don't use offload.
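For example, assuming the rest of the flags stay the same as in the original command, the launch would become:

```
train.py --load_model ${FILE_DIR}/gpt_model/RWKV-4-Raven-1B5-v11.pth --wandb "" --proj_dir "out" \
  --data_file ${FILE_DIR}/train.npy --data_type "numpy" --vocab_size 50277 \
  --ctx_len 1024 --epoch_steps 5 --epoch_count 5 --epoch_begin 0 --epoch_save 2 \
  --micro_bsz 2 --n_layer 24 --n_embd 2048 --pre_ffn 0 --head_qk 0 \
  --lr_init 1e-5 --lr_final 1e-5 --warmup_steps 0 --beta1 0.9 --beta2 0.999 --adam_eps 1e-8 \
  --precision fp16 --strategy deepspeed_stage_2 --accelerator gpu --grad_cp 1 --devices 8
```

With plain deepspeed_stage_2 the optimizer states stay partitioned across the GPUs instead of being offloaded to host RAM, so the 40 GB of CPU memory is no longer the bottleneck; with 8x 32 GB V100s there should be enough VRAM headroom for a 1.5B model.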