Closed gly99999 closed 2 years ago
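It looks like torch and the matching cudatoolkit were installed incorrectly. I suggest reinstalling from the PyTorch website according to your CUDA version; any version from 1.3.1 upward should work.
My machine has CUDA 11.4, so I installed torch 1.7.1 with cudatoolkit 11.0 from the official site.
Install command: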
pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
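I get this error:
torch.nn.modules.module.ModuleAttributeError: 'LSTM' object has no attribute '_flat_weights_names'
However, the torch versions on the official site that are older than this one have no builds for CUDA 11.0 or above. Does that mean I also need to change my system CUDA version?
Alternatively, could I run on CPU? Which parts of the code would I need to change, and how long would this command take on CPU?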
CUDA_VISIBLE_DEVICES=0 python train.py --config config/conll_03_english.yaml --test
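On the version question, it may help to separate the CUDA version a wheel was built for from the CUDA version installed on the system. The sketch below is only an illustration: `split_wheel_version` and `installed_build` are hypothetical helper names, not part of ACE or torch. The first reads the `+cu110` style suffix used in the install command above; the second reports what the installed torch says about itself, if torch is importable. As far as I know, the pip wheels ship their own CUDA runtime, so a newer system driver is generally not a problem, but treat that as an assumption to verify on your machine.

```python
import re


def split_wheel_version(spec):
    """Split a wheel version such as '1.7.1+cu110' into (version, CUDA version).

    The CUDA part is None for '+cpu' builds or when no suffix is present.
    """
    match = re.fullmatch(r"(\d+(?:\.\d+)*)(?:\+(cu\d{2,}|cpu))?", spec)
    if match is None:
        raise ValueError(f"unrecognised version string: {spec!r}")
    version, tag = match.groups()
    if tag is None or tag == "cpu":
        return version, None
    digits = tag[2:]  # '110' means 11.0, '101' means 10.1
    return version, f"{digits[:-1]}.{digits[-1]}"


def installed_build():
    """Report the installed torch build, or None if torch is not importable."""
    try:
        import torch
    except ImportError:
        return None
    return {
        "torch": torch.__version__,               # e.g. '1.7.1+cu110'
        "cuda_build": torch.version.cuda,         # CUDA the wheel was built for
        "gpu_visible": torch.cuda.is_available(),  # whether a GPU can be used
    }


print(split_wheel_version("1.7.1+cu110"))
print(installed_build())
```

If `gpu_visible` is true, the driver and the wheel are at least able to talk to each other, and the LSTM error is more likely to come from somewhere else.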
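I am using torch 1.7.1 with CUDA 10.1 and it seems to work fine. Where does this LSTM error occur?
I don't recommend using the CPU; it would probably take a very long time.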
Traceback (most recent call last):
File "train.py", line 163, in <module>
predict_posterior=args.predict_posterior,
File "/home/gly/python_workspace/ACE/flair/trainers/reinforcement_trainer.py", line 1406, in final_test
self.model = self.model.load(base_path / "best-model.pt", device='cpu')
File "/home/gly/python_workspace/ACE/flair/nn.py", line 106, in load
model.to(device)
File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 612, in to
return self._apply(convert)
File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 359, in _apply
module._apply(fn)
File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 359, in _apply
module._apply(fn)
File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 359, in _apply
module._apply(fn)
File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 160, in _apply
self._flat_weights = [(lambda wn: getattr(self, wn) if hasattr(self, wn) else None)(wn) for wn in self._flat_weights_names]
File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 779, in __getattr__
type(self).__name__, name))
torch.nn.modules.module.ModuleAttributeError: 'LSTM' object has no attribute '_flat_weights_names'
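My system CUDA is 11.4, so it should be backward compatible, right?
This is probably an incompatibility between torch 1.3 and torch 1.7 in the LSTM stored in the saved model. You could first check whether training runs normally without --test: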
CUDA_VISIBLE_DEVICES=0 python train.py --config config/conll_03_english.yaml
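If you do need the pretrained model for prediction, I suggest finding a way to use torch 1.3.1. You can look up solutions online, such as this one.
This is what I get when training directly without --test. It's rather strange.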
2022-04-06 22:28:25,251 ================================== Start episode 1 ==================================
['/home/gly/.flair/embeddings/lm-jw300-backward-v0.1.pt', '/home/gly/.flair/embeddings/lm-jw300-forward-v0.1.pt', '/home/gly/.flair/embeddings/news-backward-0.4.1.pt', '/home/gly/.flair/embeddings/news-forward-0.4.1.pt', '/home/yongjiang.jy/.cache/torch/transformers/bert-base-cased', '/home/yongjiang.jy/.flair/embeddings/xlm-roberta-large-finetuned-conll03-english', 'Char', 'Word: en', 'Word: glove', 'bert-base-multilingual-cased', 'elmo-original']
tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.], device='cuda:0')
tensor([0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000,
0.5000, 0.5000], device='cuda:0', grad_fn=<SigmoidBackward>)
2022-04-06 22:28:25,260 ----------------------------------------------------------------------------------------------------
Traceback (most recent call last):
File "/home/gly/python_workspace/ACE/flair/trainers/reinforcement_trainer.py", line 686, in train
loss = self.model.forward_loss(student_input)
File "/home/gly/python_workspace/ACE/flair/models/sequence_tagger_model.py", line 1844, in forward_loss
features = self.forward(data_points)
File "/home/gly/python_workspace/ACE/flair/models/sequence_tagger_model.py", line 820, in forward
self.embeddings.embed(sentences)
File "/home/gly/python_workspace/ACE/flair/embeddings.py", line 189, in embed
embedding.embed(sentences)
File "/home/gly/python_workspace/ACE/flair/embeddings.py", line 97, in embed
self._add_embeddings_internal(sentences)
File "/home/gly/python_workspace/ACE/flair/embeddings.py", line 661, in _add_embeddings_internal
embeddings = self.embed_sentences(sentences)
File "/home/gly/python_workspace/ACE/flair/embeddings.py", line 652, in embed_sentences
pack_char_seqs = pack_padded_sequence(input=char_embeds, lengths=char_lengths, batch_first=False, enforce_sorted=False)
File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/torch/nn/utils/rnn.py", line 244, in pack_padded_sequence
_VF._pack_padded_sequence(input, lengths, batch_first)
RuntimeError: 'lengths' argument should be a 1D CPU int64 tensor, but got 1D cuda:0 Long tensor
> /home/gly/python_workspace/ACE/flair/trainers/reinforcement_trainer.py(703)train()
-> torch.nn.utils.clip_grad_norm_(self.model.parameters(), 5.0)
(Pdb) c
Traceback (most recent call last):
File "train.py", line 360, in <module>
getattr(trainer,'train')(**train_config)
File "/home/gly/python_workspace/ACE/flair/trainers/reinforcement_trainer.py", line 703, in train
torch.nn.utils.clip_grad_norm_(self.model.parameters(), 5.0)
UnboundLocalError: local variable 'loss' referenced before assignment
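The two tracebacks above appear to be one failure rather than two. The RuntimeError is raised while `loss` is being computed, the trainer seems to catch it and drop into pdb, and after `c` execution continues to code that expects `loss` to exist. The sketch below reproduces that pattern in plain Python. `train_step` and `failing_forward` are hypothetical stand-ins, not the actual ACE code.

```python
def train_step(compute_loss):
    try:
        loss = compute_loss()
    except RuntimeError as err:
        # Hypothetical handler: report the error and carry on,
        # similar to continuing from a debugger prompt.
        print(f"caught: {err}")
    # If compute_loss raised, 'loss' was never assigned.
    return loss


def failing_forward():
    raise RuntimeError("'lengths' argument should be a 1D CPU int64 tensor")


outcome = "no error"
try:
    train_step(failing_forward)
except UnboundLocalError as err:
    outcome = type(err).__name__

print(outcome)
```

So the UnboundLocalError should go away once the underlying RuntimeError is fixed.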
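This is again caused by differences in the LSTM functions between torch 1.3.1 and 1.7.1. I have updated the code to fix it; you can also edit line 652 of your flair/embeddings.py directly: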
pack_char_seqs = pack_padded_sequence(input=char_embeds, lengths=char_lengths.to('cpu'), batch_first=False, enforce_sorted=False)
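A minimal standalone sketch of the same fix, assuming torch is installed. The tensor shapes and the helper name `pack_chars` are made up for illustration; only the `pack_padded_sequence` call mirrors the line above. Newer torch versions require `lengths` to be a 1D CPU int64 tensor even when the input lives on the GPU, so moving only the lengths with `.to('cpu')` is enough.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence


def pack_chars(char_embeds, char_lengths):
    # The input may stay on the GPU; only the lengths are moved to the CPU.
    return pack_padded_sequence(
        input=char_embeds,
        lengths=char_lengths.to("cpu"),
        batch_first=False,
        enforce_sorted=False,
    )


device = "cuda" if torch.cuda.is_available() else "cpu"
# Illustrative shapes: (max_len=4, batch=3, dim=8)
char_embeds = torch.zeros(4, 3, 8, device=device)
char_lengths = torch.tensor([4, 2, 3], device=device)

packed = pack_chars(char_embeds, char_lengths)
print(packed.data.shape, packed.batch_sizes.tolist())
```

On a CPU-only machine the `.to("cpu")` call does nothing, so the same code runs in both setups.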
Hi, after changing the code I can train. I trained for a few rounds, stopped with Ctrl+C, and saw that my model was saved. Then I ran with --test and hit this problem. 😭
Traceback (most recent call last):
File "train.py", line 163, in <module>
predict_posterior=args.predict_posterior,
File "/home/gly/python_workspace/ACE/flair/trainers/reinforcement_trainer.py", line 1462, in final_test
self.gpu_friendly_assign_embedding([loader], selection = self.model.selection)
File "/home/gly/python_workspace/ACE/flair/trainers/distillation_trainer.py", line 1171, in gpu_friendly_assign_embedding
embedding.embed(sentences)
File "/home/gly/python_workspace/ACE/flair/embeddings.py", line 97, in embed
self._add_embeddings_internal(sentences)
File "/home/gly/python_workspace/ACE/flair/embeddings.py", line 2952, in _add_embeddings_internal
self._add_embeddings_to_sentences(sentences)
File "/home/gly/python_workspace/ACE/flair/embeddings.py", line 3041, in _add_embeddings_to_sentences
subtokenized_sentence = self.tokenizer.tokenize(tokenized_string)
Please post the full traceback; I can't tell from this one.
Is this enough? Thanks for your help.
[2022-04-07 17:00:58,157 INFO] loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-vocab.txt from cache at /home/gly/.cache/torch/transformers/96435fa287fbf7e469185f1062386e05a075cadbf6838b74da22bf64b080bc32.99bcd55fc66f4f3360bc49ba472b940b8dcf223ea6a345deb969d607ca900729
2022-04-07 17:01:01,282 Testing using best model ...
2022-04-07 17:01:01,286 Setting embedding mask to the best action: tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.], device='cuda:0')
['/home/gly/.flair/embeddings/lm-jw300-backward-v0.1.pt', '/home/gly/.flair/embeddings/lm-jw300-forward-v0.1.pt', '/home/gly/.flair/embeddings/news-backward-0.4.1.pt', '/home/gly/.flair/embeddings/news-forward-0.4.1.pt', '/home/yongjiang.jy/.cache/torch/transformers/bert-base-cased', '/home/yongjiang.jy/.flair/embeddings/xlm-roberta-large-finetuned-conll03-english', 'Char', 'Word: en', 'Word: glove', 'bert-base-multilingual-cased', 'elmo-original']
2022-04-07 17:01:02,668 /home/gly/.flair/embeddings/lm-jw300-backward-v0.1.pt 43087046
2022-04-07 17:01:12,048 /home/gly/.flair/embeddings/lm-jw300-forward-v0.1.pt 43087046
2022-04-07 17:01:28,571 /home/gly/.flair/embeddings/news-backward-0.4.1.pt 18257500
2022-04-07 17:01:43,615 /home/gly/.flair/embeddings/news-forward-0.4.1.pt 18257500
2022-04-07 17:01:58,789 /home/yongjiang.jy/.cache/torch/transformers/bert-base-cased 108310272
2022-04-07 17:01:58,789 mean
Traceback (most recent call last):
File "train.py", line 163, in <module>
predict_posterior=args.predict_posterior,
File "/home/gly/python_workspace/ACE/flair/trainers/reinforcement_trainer.py", line 1464, in final_test
self.gpu_friendly_assign_embedding([loader], selection = self.model.selection)
File "/home/gly/python_workspace/ACE/flair/trainers/distillation_trainer.py", line 1171, in gpu_friendly_assign_embedding
embedding.embed(sentences)
File "/home/gly/python_workspace/ACE/flair/embeddings.py", line 97, in embed
self._add_embeddings_internal(sentences)
File "/home/gly/python_workspace/ACE/flair/embeddings.py", line 2952, in _add_embeddings_internal
self._add_embeddings_to_sentences(sentences)
File "/home/gly/python_workspace/ACE/flair/embeddings.py", line 3041, in _add_embeddings_to_sentences
subtokenized_sentence = self.tokenizer.tokenize(tokenized_string)
AttributeError: 'NoneType' object has no attribute 'tokenize'
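I modified flair/trainers/reinforcement_trainer.py; please try again.
After the change, I found that stopping with Ctrl+C to save the model triggers the problem below. I reverted the code, and the problem still seems to occur.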
2022-04-07 23:20:14,546 Exiting from training early.
2022-04-07 23:20:14,546 Saving model ...
2022-04-07 23:21:01,679 Done.
['/home/gly/.cache/torch/transformers/bert-base-cased', '/home/gly/.flair/embeddings/lm-jw300-backward-v0.1.pt', '/home/gly/.flair/embeddings/lm-jw300-forward-v0.1.pt', '/home/gly/.flair/embeddings/news-backward-0.4.1.pt', '/home/gly/.flair/embeddings/news-forward-0.4.1.pt', '/home/gly/.flair/embeddings/xlm-roberta-large-finetuned-conll03-english', 'Char', 'Word: en', 'Word: glove', 'bert-base-multilingual-cased', 'elmo-original']
tensor([True, True, True, True, True, True, True, True, True, True, True],
device='cuda:0')
2022-04-07 23:21:01,806 Final State dictionary: {}
Traceback (most recent call last):
File "train.py", line 360, in <module>
getattr(trainer,'train')(**train_config)
File "/home/gly/python_workspace/ACE/flair/trainers/reinforcement_trainer.py", line 1097, in train
self.model.selection=self.best_action
AttributeError: 'ReinforcementTrainer' object has no attribute 'best_action'
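Then, when I add --test, I get the problem below: the config cannot be found. At first I trained without changing embedding_name in the yaml file. The original embedding_name was /home/yongjiang.jy/.cache/torch/transformers/bert-base-cased, and the error was the same as below, except that it said /home/yongjiang.jy/.cache/torch/transformers/bert-base-cased could not be found. I suspected the problem was that the previously trained model had saved embedding_name as /home/yongjiang.jy/.cache/torch/transformers/bert-base-cased, so I changed embedding_name to /home/gly/.cache/torch/transformers/bert-base-cased, but I still get the error below. I also deleted the .cache directory and tried again, with the same result. Could some cache that I have not cleared be causing this?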
[2022-04-07 23:24:59,695 INFO] loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-vocab.txt from cache at /home/gly/.cache/torch/transformers/96435fa287fbf7e469185f1062386e05a075cadbf6838b74da22bf64b080bc32.99bcd55fc66f4f3360bc49ba472b940b8dcf223ea6a345deb969d607ca900729
2022-04-07 23:25:07,784 Testing using best model ...
2022-04-07 23:25:07,857 Setting embedding mask to the best action: tensor([1., 0., 0., 0., 1., 1., 0., 1., 1., 1., 1.], device='cuda:0')
['/home/gly/.cache/torch/transformers/bert-base-cased', '/home/gly/.flair/embeddings/lm-jw300-backward-v0.1.pt', '/home/gly/.flair/embeddings/lm-jw300-forward-v0.1.pt', '/home/gly/.flair/embeddings/news-backward-0.4.1.pt', '/home/gly/.flair/embeddings/news-forward-0.4.1.pt', '/home/gly/.flair/embeddings/xlm-roberta-large-finetuned-conll03-english', 'Char', 'Word: en', 'Word: glove', 'bert-base-multilingual-cased', 'elmo-original']
Traceback (most recent call last):
File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/transformers/configuration_utils.py", line 242, in get_config_dict
raise EnvironmentError
OSError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "train.py", line 163, in <module>
predict_posterior=args.predict_posterior,
File "/home/gly/python_workspace/ACE/flair/trainers/reinforcement_trainer.py", line 1468, in final_test
embedding.tokenizer = AutoTokenizer.from_pretrained(name, do_lower_case=True)
File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/transformers/tokenization_auto.py", line 206, in from_pretrained
config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/transformers/configuration_auto.py", line 203, in from_pretrained
config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/transformers/configuration_utils.py", line 251, in get_config_dict
raise EnvironmentError(msg)
OSError: Can't load config for '/home/gly/.cache/torch/transformers/bert-base-cased'. Make sure that:
- '/home/gly/.cache/torch/transformers/bert-base-cased' is a correct model identifier listed on 'https://huggingface.co/models'
- or '/home/gly/.cache/torch/transformers/bert-base-cased' is the correct path to a directory containing a config.json file
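The first problem is that you exited too early. The model does not save the best action until it has finished the first episode (not epoch) and obtained the model accuracy. You could copy the state from the pretrained model into your model's save path and see whether it runs.
For the second problem, embedding_name exists to make sure my pretrained model loads without errors. If you train your own model, you can delete every embedding_name. To set the path of your model, edit the model field under each embedding, for example: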
TransformerWordEmbeddings-1:
  model: /home/gly/.cache/torch/transformers/bert-base-cased
  layers: -1,-2,-3,-4
  pooling_operation: mean
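If the embedding still cannot be loaded in this setup, check that /home/gly/.cache/torch/transformers/bert-base-cased actually contains a correctly downloaded model, or simply use model: bert-base-cased so that transformers automatically loads the model it has downloaded.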
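The OSError above says a local path must be a directory containing a config.json file, so one quick check is whether that file exists. The sketch below uses only the standard library. `has_config_json` is a hypothetical helper name, and the temporary directory merely stands in for the real model path.

```python
import tempfile
from pathlib import Path


def has_config_json(model_dir):
    """Return True if model_dir contains a config.json file."""
    return (Path(model_dir).expanduser() / "config.json").is_file()


# Demonstration on a throwaway directory instead of the real model path.
with tempfile.TemporaryDirectory() as tmp:
    before = has_config_json(tmp)
    (Path(tmp) / "config.json").write_text("{}")
    after = has_config_json(tmp)

print(before, after)
```

In practice you would call `has_config_json` on the path given under model in the yaml. If it returns False, the directory is probably not a complete download, and the plain identifier bert-base-cased may be the simpler choice.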
It works now, thank you!
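The command I ran is CUDA_VISIBLE_DEVICES=0 python train.py --config config/conll_03_english.yaml --test, with the config file unmodified, and I get RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED
These are my CUDA and torch versions; my Python is 3.7.4.
I tried disabling cudnn in train.py,
and this is the problem I get:
Thanks for the reply~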