Training on a single A100 fails with torch.distributed.elastic.multiprocessing.api.SignalException: Process 2920830 got signal: 1

Zhang-Each commented 1 year ago

My compute resources are limited, so I tried training on a single A100. Training starts normally, but after a number of steps the following appears:

WARNING:torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2920926 closing signal SIGHUP
torch.distributed.elastic.multiprocessing.api.SignalException: Process 2920830 got signal: 1

Training then stops abruptly.

The shell script used for training is:

CUDA_VISIBLE_DEVICES=0 python3 -m torch.distributed.launch --master_addr ${MASTER_ADDR} --master_port ${MASTER_PORT} --nproc_per_node=1 --use_env train.py \
    --model_name_or_path $MODEL_PATH \
    --data_path $DATA_PATH \
    --bf16 True \
    --output_dir $SAVE_PATH \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 100 \
    --save_total_limit 40 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap offload" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True --model_max_length 128 --rrhf_weight 1

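Is there anything wrong with this training script? Could you provide a single-GPU training script that does not use distributed training?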

GanjinZero commented 1 year ago

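I haven't tried single-GPU training. Could you try removing the fsdp options? Alternatively, look into open-source LoRA fine-tuning code and integrate it into the RRHF training.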

Zhang-Each commented 11 months ago

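This issue is resolved. The actual cause was insufficient GPU memory: after removing the overly long samples from the training dataset, training runs normally (I am using a different dataset).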
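
For anyone hitting the same error, the filtering step described above could look roughly like the sketch below. It is not the code actually used in this thread. It assumes each training record is a dict with a "query" string and a "responses" list of strings, and it uses a whitespace word count as a stand-in for the model tokenizer; the function names and the `max_tokens` threshold are hypothetical, and in practice you would pass a `count_tokens` function backed by the tokenizer of your model.

```python
from typing import Callable, Dict, List


def whitespace_token_count(text: str) -> int:
    """Stand-in token counter (assumption): counts whitespace-separated words."""
    return len(text.split())


def longest_sequence(record: Dict, count_tokens: Callable[[str], int]) -> int:
    """Length of the longest query + response pair in one record."""
    query_len = count_tokens(record["query"])
    # A record with no responses falls back to the query length alone.
    return max(
        (query_len + count_tokens(resp) for resp in record["responses"]),
        default=query_len,
    )


def drop_long_records(
    records: List[Dict],
    max_tokens: int,
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> List[Dict]:
    """Keep only records whose longest query + response fits in max_tokens."""
    return [r for r in records if longest_sequence(r, count_tokens) <= max_tokens]


if __name__ == "__main__":
    # Toy records in the assumed format, for illustration only.
    data = [
        {"query": "a b", "responses": ["c d", "e"]},
        {"query": "a b c", "responses": ["d e f g h"]},
    ]
    kept = drop_long_records(data, max_tokens=4)
    print(f"kept {len(kept)} of {len(data)} records")
```

The filtered list can then be written back to a file and passed as `$DATA_PATH`. Filtering before training, rather than relying only on truncation, keeps the longest samples from driving peak memory use.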

GanjinZero / RRHF

Training on a single A100 fails with torch.distributed.elastic.multiprocessing.api.SignalException: Process 2920830 got signal: 1 #35