RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED

gly99999 commented 2 years ago

The command I ran is CUDA_VISIBLE_DEVICES=0 python train.py --config config/conll_03_english.yaml --test. I have not modified the config file, and I get RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED

Traceback (most recent call last):
  File "train.py", line 87, in <module>
    student=config.create_student(nocrf=args.nocrf)
  File "/home/gly/python_workspace/ACE/flair/config_parser.py", line 235, in create_student
    return self.create_model(self.config,pretrained=self.load_pretrained(self.config), is_student=True)
  File "/home/gly/python_workspace/ACE/flair/config_parser.py", line 188, in create_model
    embeddings, word_map, char_map, lemma_map, postag_map=self.create_embeddings(config['embeddings'])
  File "/home/gly/python_workspace/ACE/flair/config_parser.py", line 163, in create_embeddings
    embedding_list.append(getattr(Embeddings,embedding.split('-')[0])(**embeddings[embedding]))
  File "/home/gly/python_workspace/ACE/flair/embeddings.py", line 1181, in __init__
    embedded_dummy = self.embed(dummy_sentence)
  File "/home/gly/python_workspace/ACE/flair/embeddings.py", line 97, in embed
    self._add_embeddings_internal(sentences)
  File "/home/gly/python_workspace/ACE/flair/embeddings.py", line 1218, in _add_embeddings_internal
    embeddings = self.ee.embed_batch(sentence_words)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/allennlp/commands/elmo.py", line 255, in embed_batch
    embeddings, mask = self.batch_to_embeddings(batch)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/allennlp/commands/elmo.py", line 197, in batch_to_embeddings
    bilm_output = self.elmo_bilm(character_ids)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/allennlp/modules/elmo.py", line 607, in forward
    token_embedding = self._token_embedder(inputs)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/allennlp/modules/elmo.py", line 376, in forward
    convolved = conv(character_embedding)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 202, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

These are my CUDA and torch versions; my Python is 3.7.4.

I tried disabling cuDNN in train.py:

import torch
torch.backends.cudnn.enabled = False

and then I got this error instead:

Traceback (most recent call last):
  File "train.py", line 88, in <module>
    student=config.create_student(nocrf=args.nocrf)
  File "/home/gly/python_workspace/ACE/flair/config_parser.py", line 235, in create_student
    return self.create_model(self.config,pretrained=self.load_pretrained(self.config), is_student=True)
  File "/home/gly/python_workspace/ACE/flair/config_parser.py", line 188, in create_model
    embeddings, word_map, char_map, lemma_map, postag_map=self.create_embeddings(config['embeddings'])
  File "/home/gly/python_workspace/ACE/flair/config_parser.py", line 163, in create_embeddings
    embedding_list.append(getattr(Embeddings,embedding.split('-')[0])(**embeddings[embedding]))
  File "/home/gly/python_workspace/ACE/flair/embeddings.py", line 1181, in __init__
    embedded_dummy = self.embed(dummy_sentence)
  File "/home/gly/python_workspace/ACE/flair/embeddings.py", line 97, in embed
    self._add_embeddings_internal(sentences)
  File "/home/gly/python_workspace/ACE/flair/embeddings.py", line 1218, in _add_embeddings_internal
    embeddings = self.ee.embed_batch(sentence_words)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/allennlp/commands/elmo.py", line 255, in embed_batch
    embeddings, mask = self.batch_to_embeddings(batch)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/allennlp/commands/elmo.py", line 197, in batch_to_embeddings
    bilm_output = self.elmo_bilm(character_ids)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/allennlp/modules/elmo.py", line 607, in forward
    token_embedding = self._token_embedder(inputs)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/allennlp/modules/elmo.py", line 376, in forward
    convolved = conv(character_embedding)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 202, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)

Thanks in advance for your reply~

wangxinyu0922 commented 2 years ago

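It looks like torch and the matching cudatoolkit were installed incorrectly. I suggest reinstalling from the official PyTorch website according to your CUDA version and trying again. Any version from 1.3.1 upward should work.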
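Before reinstalling, it can help to see which CUDA version the installed torch build was compiled for. The following is only a minimal sketch, not part of ACE: the function name `describe_torch_env` is hypothetical, and it relies only on standard PyTorch attributes (`torch.__version__`, `torch.version.cuda`, `torch.backends.cudnn.version()`, `torch.cuda.is_available()`). The small matrix multiply is just one way to exercise cuBLAS without loading any model.

```python
import importlib.util


def describe_torch_env():
    """Collect the version facts that matter when cuBLAS/cuDNN calls fail."""
    if importlib.util.find_spec("torch") is None:
        return {"torch": None}  # torch is not installed in this environment

    import torch

    info = {
        "torch": torch.__version__,            # e.g. "1.7.1+cu110"
        "built_for_cuda": torch.version.cuda,  # None for CPU-only builds
        "cudnn": torch.backends.cudnn.version(),
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["device"] = torch.cuda.get_device_name(0)
        try:
            # A tiny matrix multiply on the GPU goes through cuBLAS.
            x = torch.randn(8, 8, device="cuda")
            info["matmul_ok"] = bool(torch.isfinite(x @ x).all())
        except RuntimeError as err:
            info["matmul_ok"] = False
            info["matmul_error"] = str(err)
    return info


if __name__ == "__main__":
    for key, value in describe_torch_env().items():
        print(f"{key}: {value}")
```

If `matmul_ok` comes back false on a fresh interpreter, the problem is likely in the torch/CUDA installation rather than in the training code.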

gly99999 commented 2 years ago

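My machine's CUDA is 11.4. I installed torch 1.7.1 and cudatoolkit 11.0 from the official site with the command pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html and got the error torch.nn.modules.module.ModuleAttributeError: 'LSTM' object has no attribute '_flat_weights_names'. But the official site has no torch versions older than this that are built for CUDA 11.0 or later, so do I also need to change my system's CUDA version?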

Alternatively, could I run on the CPU? Which code would I need to change, and how long would this command take on the CPU? CUDA_VISIBLE_DEVICES=0 python train.py --config config/conll_03_english.yaml --test

wangxinyu0922 commented 2 years ago

I'm using torch 1.7.1 with CUDA 10.1 and it seems to work without problems. Where does this LSTM error occur?

I don't recommend using the CPU; it would probably take a very long time.

gly99999 commented 2 years ago

Traceback (most recent call last):
  File "train.py", line 163, in <module>
    predict_posterior=args.predict_posterior,
  File "/home/gly/python_workspace/ACE/flair/trainers/reinforcement_trainer.py", line 1406, in final_test
    self.model = self.model.load(base_path / "best-model.pt", device='cpu')
  File "/home/gly/python_workspace/ACE/flair/nn.py", line 106, in load
    model.to(device)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 612, in to
    return self._apply(convert)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 359, in _apply
    module._apply(fn)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 359, in _apply
    module._apply(fn)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 359, in _apply
    module._apply(fn)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 160, in _apply
    self._flat_weights = [(lambda wn: getattr(self, wn) if hasattr(self, wn) else None)(wn) for wn in self._flat_weights_names]
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 779, in __getattr__
    type(self).__name__, name))
torch.nn.modules.module.ModuleAttributeError: 'LSTM' object has no attribute '_flat_weights_names'

My system CUDA is 11.4, which should be backward compatible, right?
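On the compatibility question, a rough sketch of the check involved: the local tag in the wheel name (`+cu110`) encodes the CUDA version the build targets, and NVIDIA drivers can generally run binaries built for the same or an older CUDA version. The helper names `wheel_cuda_version` and `driver_can_run` are hypothetical, and the sketch assumes the usual tag convention in which the last digit is the minor version. It does not check whether the build supports the GPU's architecture.

```python
import re


def wheel_cuda_version(torch_version):
    """Extract (major, minor) from a local tag such as '1.7.1+cu110'."""
    match = re.search(r"\+cu(\d+)$", torch_version)
    if match is None:
        return None  # CPU-only build, or a version string without a tag
    digits = match.group(1)
    # Assumed convention: '110' -> 11.0, '101' -> 10.1, '92' -> 9.2
    return int(digits[:-1]), int(digits[-1])


def driver_can_run(torch_version, driver_cuda):
    """True if the wheel targets a CUDA version no newer than the driver's."""
    wheel = wheel_cuda_version(torch_version)
    return wheel is not None and wheel <= driver_cuda


print(wheel_cuda_version("1.7.1+cu110"))
print(driver_can_run("1.7.1+cu110", (11, 4)))
```

Under these assumptions, a cu110 wheel on a system reporting CUDA 11.4 passes the check, so the system CUDA version itself should not need to change.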

wangxinyu0922 commented 2 years ago

This is probably because the LSTM in the saved model is incompatible between torch 1.3 and 1.7. You can first check whether training runs normally without --test:

CUDA_VISIBLE_DEVICES=0 python train.py --config config/conll_03_english.yaml

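If you really need the pretrained model for prediction, I still suggest finding a way to use torch 1.3.1. You can look up some of the solutions available online, such as this one.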

gly99999 commented 2 years ago

This is what I get when I train directly without --test. It's rather strange.

2022-04-06 22:28:25,251 ================================== Start episode 1 ==================================
['/home/gly/.flair/embeddings/lm-jw300-backward-v0.1.pt', '/home/gly/.flair/embeddings/lm-jw300-forward-v0.1.pt', '/home/gly/.flair/embeddings/news-backward-0.4.1.pt', '/home/gly/.flair/embeddings/news-forward-0.4.1.pt', '/home/yongjiang.jy/.cache/torch/transformers/bert-base-cased', '/home/yongjiang.jy/.flair/embeddings/xlm-roberta-large-finetuned-conll03-english', 'Char', 'Word: en', 'Word: glove', 'bert-base-multilingual-cased', 'elmo-original']
tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.], device='cuda:0')
tensor([0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000,
        0.5000, 0.5000], device='cuda:0', grad_fn=<SigmoidBackward>)
2022-04-06 22:28:25,260 ----------------------------------------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/gly/python_workspace/ACE/flair/trainers/reinforcement_trainer.py", line 686, in train
    loss = self.model.forward_loss(student_input)
  File "/home/gly/python_workspace/ACE/flair/models/sequence_tagger_model.py", line 1844, in forward_loss
    features = self.forward(data_points)
  File "/home/gly/python_workspace/ACE/flair/models/sequence_tagger_model.py", line 820, in forward
    self.embeddings.embed(sentences)
  File "/home/gly/python_workspace/ACE/flair/embeddings.py", line 189, in embed
    embedding.embed(sentences)
  File "/home/gly/python_workspace/ACE/flair/embeddings.py", line 97, in embed
    self._add_embeddings_internal(sentences)
  File "/home/gly/python_workspace/ACE/flair/embeddings.py", line 661, in _add_embeddings_internal
    embeddings = self.embed_sentences(sentences)
  File "/home/gly/python_workspace/ACE/flair/embeddings.py", line 652, in embed_sentences
    pack_char_seqs = pack_padded_sequence(input=char_embeds, lengths=char_lengths, batch_first=False, enforce_sorted=False)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/torch/nn/utils/rnn.py", line 244, in pack_padded_sequence
    _VF._pack_padded_sequence(input, lengths, batch_first)
RuntimeError: 'lengths' argument should be a 1D CPU int64 tensor, but got 1D cuda:0 Long tensor
> /home/gly/python_workspace/ACE/flair/trainers/reinforcement_trainer.py(703)train()
-> torch.nn.utils.clip_grad_norm_(self.model.parameters(), 5.0)
(Pdb) c
Traceback (most recent call last):
  File "train.py", line 360, in <module>
    getattr(trainer,'train')(**train_config)
  File "/home/gly/python_workspace/ACE/flair/trainers/reinforcement_trainer.py", line 703, in train
    torch.nn.utils.clip_grad_norm_(self.model.parameters(), 5.0)
UnboundLocalError: local variable 'loss' referenced before assignment

wangxinyu0922 commented 2 years ago

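This is again caused by differences in the LSTM functions between torch 1.3.1 and 1.7.1. I have updated the code to fix it. You can also edit line 652 of your flair/embeddings.py directly: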

pack_char_seqs = pack_padded_sequence(input=char_embeds, lengths=char_lengths.to('cpu'), batch_first=False, enforce_sorted=False)

gly99999 commented 2 years ago

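Hi, after modifying the code I can train. I trained for a few rounds, stopped training with Ctrl+C, and saw that my model had been saved. Then I ran with --test and got the problem below. 😭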

Traceback (most recent call last):
  File "train.py", line 163, in <module>
    predict_posterior=args.predict_posterior,
  File "/home/gly/python_workspace/ACE/flair/trainers/reinforcement_trainer.py", line 1462, in final_test
    self.gpu_friendly_assign_embedding([loader], selection = self.model.selection)
  File "/home/gly/python_workspace/ACE/flair/trainers/distillation_trainer.py", line 1171, in gpu_friendly_assign_embedding
    embedding.embed(sentences)
  File "/home/gly/python_workspace/ACE/flair/embeddings.py", line 97, in embed
    self._add_embeddings_internal(sentences)
  File "/home/gly/python_workspace/ACE/flair/embeddings.py", line 2952, in _add_embeddings_internal
    self._add_embeddings_to_sentences(sentences)
  File "/home/gly/python_workspace/ACE/flair/embeddings.py", line 3041, in _add_embeddings_to_sentences
    subtokenized_sentence = self.tokenizer.tokenize(tokenized_string)

wangxinyu0922 commented 2 years ago

Please post the complete traceback. I can't tell what is wrong from this one.

gly99999 commented 2 years ago

Is this enough? Sorry for the trouble.

[2022-04-07 17:00:58,157 INFO] loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-vocab.txt from cache at /home/gly/.cache/torch/transformers/96435fa287fbf7e469185f1062386e05a075cadbf6838b74da22bf64b080bc32.99bcd55fc66f4f3360bc49ba472b940b8dcf223ea6a345deb969d607ca900729
2022-04-07 17:01:01,282 Testing using best model ...
2022-04-07 17:01:01,286 Setting embedding mask to the best action: tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.], device='cuda:0')
['/home/gly/.flair/embeddings/lm-jw300-backward-v0.1.pt', '/home/gly/.flair/embeddings/lm-jw300-forward-v0.1.pt', '/home/gly/.flair/embeddings/news-backward-0.4.1.pt', '/home/gly/.flair/embeddings/news-forward-0.4.1.pt', '/home/yongjiang.jy/.cache/torch/transformers/bert-base-cased', '/home/yongjiang.jy/.flair/embeddings/xlm-roberta-large-finetuned-conll03-english', 'Char', 'Word: en', 'Word: glove', 'bert-base-multilingual-cased', 'elmo-original']
2022-04-07 17:01:02,668 /home/gly/.flair/embeddings/lm-jw300-backward-v0.1.pt 43087046
2022-04-07 17:01:12,048 /home/gly/.flair/embeddings/lm-jw300-forward-v0.1.pt 43087046
2022-04-07 17:01:28,571 /home/gly/.flair/embeddings/news-backward-0.4.1.pt 18257500
2022-04-07 17:01:43,615 /home/gly/.flair/embeddings/news-forward-0.4.1.pt 18257500
2022-04-07 17:01:58,789 /home/yongjiang.jy/.cache/torch/transformers/bert-base-cased 108310272
2022-04-07 17:01:58,789 mean
Traceback (most recent call last):
  File "train.py", line 163, in <module>
    predict_posterior=args.predict_posterior,
  File "/home/gly/python_workspace/ACE/flair/trainers/reinforcement_trainer.py", line 1464, in final_test
    self.gpu_friendly_assign_embedding([loader], selection = self.model.selection)
  File "/home/gly/python_workspace/ACE/flair/trainers/distillation_trainer.py", line 1171, in gpu_friendly_assign_embedding
    embedding.embed(sentences)
  File "/home/gly/python_workspace/ACE/flair/embeddings.py", line 97, in embed
    self._add_embeddings_internal(sentences)
  File "/home/gly/python_workspace/ACE/flair/embeddings.py", line 2952, in _add_embeddings_internal
    self._add_embeddings_to_sentences(sentences)
  File "/home/gly/python_workspace/ACE/flair/embeddings.py", line 3041, in _add_embeddings_to_sentences
    subtokenized_sentence = self.tokenizer.tokenize(tokenized_string)
AttributeError: 'NoneType' object has no attribute 'tokenize'

wangxinyu0922 commented 2 years ago

I have modified flair/trainers/reinforcement_trainer.py. Please try again.

gly99999 commented 2 years ago

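After the change, I found that saving the model directly with Ctrl+C produces the problem below. I reverted the code and it still seems to happen.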

2022-04-07 23:20:14,546 Exiting from training early.
2022-04-07 23:20:14,546 Saving model ...
2022-04-07 23:21:01,679 Done.
['/home/gly/.cache/torch/transformers/bert-base-cased', '/home/gly/.flair/embeddings/lm-jw300-backward-v0.1.pt', '/home/gly/.flair/embeddings/lm-jw300-forward-v0.1.pt', '/home/gly/.flair/embeddings/news-backward-0.4.1.pt', '/home/gly/.flair/embeddings/news-forward-0.4.1.pt', '/home/gly/.flair/embeddings/xlm-roberta-large-finetuned-conll03-english', 'Char', 'Word: en', 'Word: glove', 'bert-base-multilingual-cased', 'elmo-original']
tensor([True, True, True, True, True, True, True, True, True, True, True],
       device='cuda:0')
2022-04-07 23:21:01,806 Final State dictionary: {}
Traceback (most recent call last):
  File "train.py", line 360, in <module>
    getattr(trainer,'train')(**train_config)
  File "/home/gly/python_workspace/ACE/flair/trainers/reinforcement_trainer.py", line 1097, in train
    self.model.selection=self.best_action
AttributeError: 'ReinforcementTrainer' object has no attribute 'best_action'

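Then, when I add --test, I get the problem below: the config cannot be found. At first I trained without changing embedding_name in the YAML file. The original embedding_name was /home/yongjiang.jy/.cache/torch/transformers/bert-base-cased, and the error was the same as the one below, except that it said it could not find /home/yongjiang.jy/.cache/torch/transformers/bert-base-cased. I wondered whether the model I had trained earlier had saved embedding_name as /home/yongjiang.jy/.cache/torch/transformers/bert-base-cased and that was the cause, so I changed embedding_name to /home/gly/.cache/torch/transformers/bert-base-cased, but I still get the error below. I also deleted the .cache directory and tried again, with the same result. Is there a cache somewhere that I have not cleared that is causing this?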

[2022-04-07 23:24:59,695 INFO] loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-vocab.txt from cache at /home/gly/.cache/torch/transformers/96435fa287fbf7e469185f1062386e05a075cadbf6838b74da22bf64b080bc32.99bcd55fc66f4f3360bc49ba472b940b8dcf223ea6a345deb969d607ca900729
2022-04-07 23:25:07,784 Testing using best model ...
2022-04-07 23:25:07,857 Setting embedding mask to the best action: tensor([1., 0., 0., 0., 1., 1., 0., 1., 1., 1., 1.], device='cuda:0')
['/home/gly/.cache/torch/transformers/bert-base-cased', '/home/gly/.flair/embeddings/lm-jw300-backward-v0.1.pt', '/home/gly/.flair/embeddings/lm-jw300-forward-v0.1.pt', '/home/gly/.flair/embeddings/news-backward-0.4.1.pt', '/home/gly/.flair/embeddings/news-forward-0.4.1.pt', '/home/gly/.flair/embeddings/xlm-roberta-large-finetuned-conll03-english', 'Char', 'Word: en', 'Word: glove', 'bert-base-multilingual-cased', 'elmo-original']
Traceback (most recent call last):
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/transformers/configuration_utils.py", line 242, in get_config_dict
    raise EnvironmentError
OSError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 163, in <module>
    predict_posterior=args.predict_posterior,
  File "/home/gly/python_workspace/ACE/flair/trainers/reinforcement_trainer.py", line 1468, in final_test
    embedding.tokenizer = AutoTokenizer.from_pretrained(name, do_lower_case=True)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/transformers/tokenization_auto.py", line 206, in from_pretrained
    config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/transformers/configuration_auto.py", line 203, in from_pretrained
    config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/transformers/configuration_utils.py", line 251, in get_config_dict
    raise EnvironmentError(msg)
OSError: Can't load config for '/home/gly/.cache/torch/transformers/bert-base-cased'. Make sure that:

- '/home/gly/.cache/torch/transformers/bert-base-cased' is a correct model identifier listed on 'https://huggingface.co/models'

- or '/home/gly/.cache/torch/transformers/bert-base-cased' is the correct path to a directory containing a config.json file

wangxinyu0922 commented 2 years ago

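The first problem is that you exited too early: the model does not save the best action until it has finished training the first episode (not epoch) and obtained the model accuracy. You can try copying the state from the pretrained model into your model's save path and see whether it runs.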

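The second problem: embedding_name is there to make sure that loading my pretrained model does not fail. If you train the model yourself, you can delete all the embedding_name entries. To set the path of your model, change the model field under each embedding, for example: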

TransformerWordEmbeddings-1:
    model: /home/gly/.cache/torch/transformers/bert-base-cased 
    layers: -1,-2,-3,-4
    pooling_operation: mean

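If the embedding still cannot be loaded this way, you may need to check whether /home/gly/.cache/torch/transformers/bert-base-cased really contains a correctly downloaded model, or just use model: bert-base-cased so that transformers automatically loads the model it has downloaded.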
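To check the directory as suggested, a small sketch like the one below may help. It is based on the error message above, which says a local path must be a directory containing a config.json file. The function name `check_model_setting` and the returned strings are hypothetical, and the last branch only reflects the assumption that transformers treats a non-path value as a model identifier.

```python
from pathlib import Path


def check_model_setting(model):
    """Classify a `model:` value before handing it to transformers."""
    path = Path(model).expanduser()
    if path.is_dir():
        if (path / "config.json").is_file():
            return "local directory with config.json"
        return "local directory WITHOUT config.json"
    if path.is_absolute() or model.startswith(("~", ".")):
        # Looks like a filesystem path, but nothing is there.
        return "path does not exist"
    return "not a local path; treated as a model identifier"


if __name__ == "__main__":
    for value in ("/home/gly/.cache/torch/transformers/bert-base-cased",
                  "bert-base-cased"):
        print(value, "->", check_model_setting(value))
```

If the path is reported as missing or as lacking config.json, switching to `model: bert-base-cased` as suggested above sidesteps the local path entirely.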

gly99999 commented 2 years ago

It works now, thank you!

Alibaba-NLP / ACE

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED #31