会报错,报错如下:
File "/home/yuangen_yu/CLUE/baselines/models_pytorch/classifier_pytorch/run_classifier.py", line 569, in
main()
File "/home/yuangen_yu/CLUE/baselines/models_pytorch/classifier_pytorch/run_classifier.py", line 504, in main
global_step, tr_loss = train(args, train_dataset, model, tokenizer)
File "/home/yuangen_yu/CLUE/baselines/models_pytorch/classifier_pytorch/run_classifier.py", line 113, in train
outputs = model(inputs)
File "/home/yuangen_yu/anaconda3/envs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in call
result = self.forward(*input, *kwargs)
File "/home/yuangen_yu/anaconda3/envs/transformers/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/yuangen_yu/anaconda3/envs/transformers/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/yuangen_yu/anaconda3/envs/transformers/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
output.reraise()
File "/home/yuangen_yu/anaconda3/envs/transformers/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
raise self.exc_type(msg)
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/yuangen_yu/anaconda3/envs/transformers/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
output = module(input, kwargs)
File "/home/yuangen_yu/anaconda3/envs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in call
result = self.forward(*input, *kwargs)
File "/home/yuangen_yu/CLUE/baselines/models_pytorch/classifier_pytorch/transformers/modeling_bert.py", line 897, in forward
head_mask=head_mask)
File "/home/yuangen_yu/anaconda3/envs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in call
result = self.forward(input, **kwargs)
File "/home/yuangen_yu/CLUE/baselines/models_pytorch/classifier_pytorch/transformers/modeling_bert.py", line 606, in forward
extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility
StopIteration
您好,我有8个GPU可以使用,pytorch为1.5版本,cuda10.1
当我运行python classifier_pytorch/run_classifier.py时,参数如下: --model_type=bert --model_name_or_path=chinese_L-12_H-768_A-12 --task_name=iflytek --do_train --do_eval --do_lower_case --data_dir=clue_data/iflytek_public/ --max_seq_length=128 --per_gpu_train_batch_size=16 --per_gpu_eval_batch_size=16 --learning_rate=2e-5 --num_train_epochs=3.0 --logging_steps=759 --save_steps=759 --output_dir=outputs/iflytek_output/ --overwrite_output_dir --seed=42
会报错,报错如下: File "/home/yuangen_yu/CLUE/baselines/models_pytorch/classifier_pytorch/run_classifier.py", line 569, in
main()
File "/home/yuangen_yu/CLUE/baselines/models_pytorch/classifier_pytorch/run_classifier.py", line 504, in main
global_step, tr_loss = train(args, train_dataset, model, tokenizer)
File "/home/yuangen_yu/CLUE/baselines/models_pytorch/classifier_pytorch/run_classifier.py", line 113, in train
outputs = model(inputs)
File "/home/yuangen_yu/anaconda3/envs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in call
result = self.forward(*input, *kwargs)
File "/home/yuangen_yu/anaconda3/envs/transformers/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/yuangen_yu/anaconda3/envs/transformers/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/yuangen_yu/anaconda3/envs/transformers/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
output.reraise()
File "/home/yuangen_yu/anaconda3/envs/transformers/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
raise self.exc_type(msg)
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/yuangen_yu/anaconda3/envs/transformers/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
output = module(input, kwargs)
File "/home/yuangen_yu/anaconda3/envs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in call
result = self.forward(*input, *kwargs)
File "/home/yuangen_yu/CLUE/baselines/models_pytorch/classifier_pytorch/transformers/modeling_bert.py", line 897, in forward
head_mask=head_mask)
File "/home/yuangen_yu/anaconda3/envs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in call
result = self.forward(input, **kwargs)
File "/home/yuangen_yu/CLUE/baselines/models_pytorch/classifier_pytorch/transformers/modeling_bert.py", line 606, in forward
extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility
StopIteration
但是当我指定一个GPU时,不会报错: os.environ["CUDA_VISIBLE_DEVICES"] = "0"
这个问题我运行huggingface的transformers库里的examples/gule.py案例也遇到过,初步怀疑是pytorch版本过高导致,当我pytroch版本降为1.2时,不指定GPU也不会报错。目前还不清楚torch1.5报错的确切原因。