GanjinZero / RRHF

[NIPS2023] RRHF & Wombat

CUDA out of memory when trainer.model.state_dict() #30

Closed · Akiraxty closed this issue 1 year ago

Akiraxty commented 1 year ago

Hi, I'm training on four A100s. Training itself runs without any problems, but I hit an OOM at `trainer.model.state_dict()`. Do I need to increase the number of GPUs? The configuration is shown below; the model is 13B. Many thanks 🙏

```
python3 -m torch.distributed.launch --master_addr ${MASTER_ADDR} --master_port ${MASTER_PORT} \
    --nproc_per_node=4 --use_env train.py \
    --model_name_or_path $MODEL_PATH \
    --data_path $DATA_PATH \
    --output_dir $SAVE_PATH \
    --bf16 True \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 500 \
    --save_total_limit 40 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'OPTDecoderLayer' \
    --tf32 True --model_max_length 192 --rrhf_weight 1
```

The save fails with:

```
UserWarning: Failed to clone() tensor with name lm_head.weight. This may mean that this state_dict entry could point to invalid memory regions after returning from state_dict() call if this parameter is managed by FSDP. Please check clone implementation of lm_head.weight.
Error: CUDA out of memory. Tried to allocate 982.00 MiB (GPU 2; 79.35 GiB total capacity; 77.16 GiB already allocated; 272.19 MiB free; 77.16 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```

GanjinZero commented 1 year ago

I ran into a similar problem before when training alpaca. You can refer to this: https://github.com/lm-sys/FastChat/issues/256

Akiraxty commented 1 year ago

Thank you. I resolved this bug following https://github.com/tatsu-lab/stanford_alpaca/issues/81.
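For anyone hitting the same error: the workaround discussed in the linked FastChat and stanford_alpaca threads amounts to gathering the FSDP full state dict with CPU offload, so the unsharded 13B weights are never materialized on a single GPU when saving. Below is a minimal sketch (not a verbatim copy of either issue); it assumes a `transformers.Trainer` wrapping an FSDP-sharded model, and `save_fsdp_model_on_cpu` / `output_dir` are illustrative names rather than functions from this repo.

```python
from torch.distributed.fsdp import (
    FullStateDictConfig,
    FullyShardedDataParallel as FSDP,
    StateDictType,
)

def save_fsdp_model_on_cpu(trainer, output_dir: str):
    # Gather the full (unsharded) state dict, offloading it to CPU and
    # keeping the gathered copy on rank 0 only, instead of allocating it on GPU.
    cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
    with FSDP.state_dict_type(trainer.model, StateDictType.FULL_STATE_DICT, cfg):
        cpu_state_dict = trainer.model.state_dict()  # gathered on CPU, not GPU
    if trainer.args.should_save:
        # Reuse the Trainer's own save path so config handling stays intact.
        trainer._save(output_dir, state_dict=cpu_state_dict)
```

The trade-off is a slower checkpoint save in exchange for not needing roughly a full extra copy of the model in GPU memory at save time.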