Hi, I'm training on four A100s and training itself runs without any problem, but I hit OOM when `trainer.model.state_dict()` is called. Do I need to increase the number of GPUs? My configuration is shown below; the model is 13B. Many thanks 🙏
python3 -m torch.distributed.launch --master_addr ${MASTER_ADDR} --master_port ${MASTER_PORT} --nproc_per_node=4 --use_env train.py \
--model_name_or_path $MODEL_PATH \
--data_path $DATA_PATH \
--output_dir $SAVE_PATH \
--bf16 True \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 500 \
--save_total_limit 40 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'OPTDecoderLayer' \
--tf32 True --model_max_length 192 --rrhf_weight 1
UserWarning: Failed to clone() tensor with name lm_head.weight. This may mean that this state_dict entry could point to invalid memory regions after returning from state_dict() call if this parameter is managed by FSDP. Please check clone implementation of lm_head.weight. Error: CUDA out of memory. Tried to allocate 982.00 MiB (GPU 2; 79.35 GiB total capacity; 77.16 GiB already allocated; 272.19 MiB free; 77.16 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
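For reference, this OOM typically happens because FSDP's default full-state-dict gathering materializes the unsharded parameters on each GPU. A common workaround is to offload the gathered state dict to CPU and keep it on rank 0 only. Below is a minimal sketch using PyTorch's FSDP `state_dict_type` context manager; the `save_fsdp_full_state_dict` helper name is hypothetical and not part of the original script:

```python
import torch
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    StateDictType,
    FullStateDictConfig,
)

def save_fsdp_full_state_dict(trainer, save_path):
    """Gather a full (unsharded) state dict onto CPU and save it from rank 0.

    Without offload_to_cpu=True, every rank materializes the full 13B
    parameters on GPU during state_dict(), which can OOM even on 80 GiB
    cards when training memory is already near capacity.
    """
    model = trainer.model
    cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
    with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, cfg):
        # Parameters are gathered shard by shard and moved to CPU memory.
        state_dict = model.state_dict()
    if trainer.args.local_rank in (-1, 0):
        torch.save(state_dict, save_path)
```

With `rank0_only=True`, only rank 0 holds the assembled state dict, so adding more GPUs should not be necessary just for checkpointing.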