Training on a single machine with 2 GPUs fails with the following error:
Traceback (most recent call last):
  File "/home/d00620160/local/project/TencentPretrain/pretrain.py", line 139, in <module>
    main()
  File "/home/d00620160/local/project/TencentPretrain/pretrain.py", line 135, in main
    trainer.train_and_validate(args)
  File "/home/d00620160/local/project/TencentPretrain/tencentpretrain/trainer.py", line 147, in train_and_validate
    worker(args.local_rank, None, args)
  File "/home/d00620160/local/project/TencentPretrain/tencentpretrain/trainer.py", line 732, in worker
    trainer.train(args, local_rank, global_rank, train_loader, model_for_training, optimizer, scheduler)
  File "/home/d00620160/local/project/TencentPretrain/tencentpretrain/trainer.py", line 193, in train
    batch = list(next(loader_iter))
  File "/home/d00620160/local/project/TencentPretrain/tencentpretrain/utils/dataloader.py", line 187, in __iter__
    yield torch.LongTensor(src), \
TypeError: an integer is required (got type NoneType)
The training command is:
CUDA_VISIBLE_DEVICES=6,7 deepspeed pretrain.py --deepspeed --deepspeed_config models/deepspeed_zero3_config.json --enable_zero3 --pretrained_model_path models/llama2-7b.bin --dataset_path llama_support.pt --spm_model_path models/llama/tokenizer.model --config_path models/llama/7b_config.json --output_model_path models/llama_support_7b_dpw.bin --world_size 2 --gpu_ranks 0 1 --data_processor lm --deepspeed_checkpoint_activations --total_steps 300000 --save_checkpoint_steps 5000 --batch_size 1
Does this error mean there is a problem with the data, or with how the model is loaded?
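The traceback ends in dataloader.py at torch.LongTensor(src), not in the model-loading code, so one plausible reading is that src contains a None where a token id is expected. A minimal sketch for checking that, assuming src is a nested list of token ids; the helper find_none_positions and the sample batch are hypothetical and not part of TencentPretrain:

```python
from typing import List, Optional, Sequence, Tuple


def find_none_positions(
    batch: Sequence[Sequence[Optional[int]]],
) -> List[Tuple[int, int]]:
    """Return the (row, column) position of every None in a nested list of token ids."""
    return [
        (row, col)
        for row, ids in enumerate(batch)
        for col, token_id in enumerate(ids)
        if token_id is None
    ]


# Hypothetical batch standing in for `src`; the second row ends in None values.
example_src = [[1, 15, 27, 2], [1, 42, None, None]]
print(find_none_positions(example_src))  # prints [(1, 2), (1, 3)]
```

If a check like this reports positions only at the end of each row, the None values may come from padding, for example a padding token lookup that returns None with the tokenizer in use. That is an assumption to verify against the dataloader code, not a confirmed diagnosis.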