Rasa2.0 training with GPU failed. Invalid argument

jk123vip commented 3 years ago

Rasa version: 2.8.1

Rasa SDK version (if used & relevant): 2.8.1

Rasa X version (if used & relevant):

Python version: 3.7.10

Operating system (windows, osx, ...): Windows10 & CentOS

Issue: Training on GPU with "rasa train" fails with the error below, while training on CPU works without problems. tensorflow-gpu version: 2.3.0. CUDA version: 10.1.105.

Error (including full traceback):

>rasa train
2021-08-02 12:18:13 INFO     rasa.model  - Data (domain) for Core model section changed.
2021-08-02 12:18:13 INFO     rasa.model  - Data (messages) for NLU model section changed.
Training NLU model...
2021-08-02 12:18:15 INFO     transformers.file_utils  - PyTorch version 1.9.0+cpu available.
2021-08-02 12:18:15 INFO     transformers.file_utils  - TensorFlow version 2.3.3 available.
2021-08-02 12:18:15 INFO     transformers.tokenization_utils  - Model name 'hfl/chinese_roberta_wwm_ext' not found in model shortcut name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased, TurkuNLP/bert-base-finnish-cased-v1, TurkuNLP/bert-base-finnish-uncased-v1, wietsedv/bert-base-dutch-cased). Assuming 'hfl/chinese_roberta_wwm_ext' is a path, a model identifier, or url to a directory containing tokenizer files.
2021-08-02 12:18:15 INFO     transformers.tokenization_utils  - Didn't find file hfl/chinese_roberta_wwm_ext\added_tokens.json. We won't load it.    
2021-08-02 12:18:15 INFO     transformers.tokenization_utils  - Didn't find file hfl/chinese_roberta_wwm_ext\special_tokens_map.json. We won't load it.
2021-08-02 12:18:15 INFO     transformers.tokenization_utils  - Didn't find file hfl/chinese_roberta_wwm_ext\tokenizer_config.json. We won't load it.
2021-08-02 12:18:15 INFO     transformers.tokenization_utils  - loading file hfl/chinese_roberta_wwm_ext\vocab.txt
2021-08-02 12:18:15 INFO     transformers.tokenization_utils  - loading file None
2021-08-02 12:18:15 INFO     transformers.configuration_utils  - loading configuration file hfl/chinese_roberta_wwm_ext\config.json
2021-08-02 12:18:15 INFO     transformers.configuration_utils  - Model config BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per_head": 128,
  "pooler_type": "first_token_transform",
  "type_vocab_size": 2,
  "vocab_size": 21128
}

2021-08-02 12:18:15 INFO     transformers.modeling_tf_utils  - loading weights file hfl/chinese_roberta_wwm_ext\pytorch_model.bin
2021-08-02 12:18:22 INFO     transformers.modeling_tf_pytorch_utils  - Loading PyTorch weights from E:\Project\xxx\xxx-rasa2.0\hfl\chinese_roberta_wwm_ext\pytorch_model.bin
2021-08-02 12:18:22 INFO     transformers.modeling_tf_pytorch_utils  - PyTorch checkpoint contains 119,108,746 parameters
2021-08-02 12:18:23 INFO     transformers.modeling_tf_pytorch_utils  - Loaded 102,267,648 parameters in the TF 2.0 model.
2021-08-02 12:18:23 INFO     transformers.modeling_tf_pytorch_utils  - Weights or buffers not loaded from PyTorch model: {'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias'}
2021-08-02 12:18:23 INFO     rasa.nlu.components  - Added 'LanguageModelFeaturizer' to component cache. Key 'LanguageModelFeaturizer-bert-0188383f103a0471163ca472b71b7ca9'.
c:\anaconda3\envs\rasa2\lib\site-packages\rasa\utils\train_utils.py:646: UserWarning: constrain_similarities is set to `False`. It is recommended to 
set it to `True` when using cross-entropy loss. It will be set to `True` by default, Rasa Open Source 3.0.0 onwards.
  category=UserWarning,
2021-08-02 12:18:25 INFO     rasa.shared.nlu.training_data.training_data  - Training data stats:
2021-08-02 12:18:25 INFO     rasa.shared.nlu.training_data.training_data  - Number of intent examples: 1853 (24 distinct intents)

2021-08-02 12:18:25 INFO     rasa.shared.nlu.training_data.training_data  -   Found intents: '', '', '', '', '', '', 'exit', '', 'affirm', '', '', '', '', '', 'chitchat_goodbye', 'chitchat_who_are_you', '', 'chitchat_what_can_you_do', '', '', '', 'chitchat_greet', '', 'chitchat_thanks'
2021-08-02 12:18:25 INFO     rasa.shared.nlu.training_data.training_data  - Number of response examples: 0 (0 distinct responses)
2021-08-02 12:18:25 INFO     rasa.shared.nlu.training_data.training_data  - Number of entity examples: 1257 (6 distinct entities)
2021-08-02 12:18:25 INFO     rasa.shared.nlu.training_data.training_data  -   Found entity types: '', '', '', '', 'name', ''
2021-08-02 12:18:25 INFO     rasa.nlu.model  - Starting to train component JiebaTokenizer
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\~1\AppData\Local\Temp\jieba.cache
Loading model cost 0.760 seconds.
Prefix dict has been built successfully.
2021-08-02 12:18:26 INFO     rasa.nlu.model  - Finished training component.
2021-08-02 12:18:26 INFO     rasa.nlu.model  - Starting to train component LanguageModelFeaturizer
2021-08-02 12:18:31 INFO     rasa.nlu.model  - Finished training component.
2021-08-02 12:18:31 INFO     rasa.nlu.model  - Starting to train component RegexFeaturizer
2021-08-02 12:18:31 INFO     rasa.nlu.model  - Finished training component.
2021-08-02 12:18:31 INFO     rasa.nlu.model  - Starting to train component DIETClassifier
c:\anaconda3\envs\rasa2\lib\site-packages\rasa\shared\utils\io.py:97: UserWarning: Misaligned entity annotation in message '跟踪目标流程' with intent 'okr_follow'. Make sure the start and end values of entities ([(2, 4, '目标')]) in the training data match the token boundaries ([(0, 4, '跟踪目标'), (4, 6, '流程')]). Common causes:
  1) entities include trailing whitespaces or punctuation
  2) the tokenizer gives an unexpected result, due to languages such as Chinese that don't use whitespace for word separation
  More info at https://rasa.com/docs/rasa/training-data-format#nlu-training-data
Epochs:   0%|                                                                                                               | 0/100 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "c:\anaconda3\envs\rasa2\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\anaconda3\envs\rasa2\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "c:\Anaconda3\envs\rasa2\Scripts\rasa.exe\__main__.py", line 7, in <module>
  File "c:\anaconda3\envs\rasa2\lib\site-packages\rasa\__main__.py", line 117, in main
    cmdline_arguments.func(cmdline_arguments)
  File "c:\anaconda3\envs\rasa2\lib\site-packages\rasa\cli\train.py", line 59, in <lambda>
    train_parser.set_defaults(func=lambda args: run_training(args, can_exit=True))
  File "c:\anaconda3\envs\rasa2\lib\site-packages\rasa\cli\train.py", line 103, in run_training
    finetuning_epoch_fraction=args.epoch_fraction,
  File "c:\anaconda3\envs\rasa2\lib\site-packages\rasa\api.py", line 124, in train
    loop,
  File "c:\anaconda3\envs\rasa2\lib\site-packages\rasa\utils\common.py", line 296, in run_in_loop
    result = loop.run_until_complete(f)
  File "c:\anaconda3\envs\rasa2\lib\asyncio\base_events.py", line 587, in run_until_complete
    return future.result()
  File "c:\anaconda3\envs\rasa2\lib\site-packages\rasa\model_training.py", line 119, in train_async
    finetuning_epoch_fraction=finetuning_epoch_fraction,
  File "c:\anaconda3\envs\rasa2\lib\site-packages\rasa\model_training.py", line 299, in _train_async_internal
    finetuning_epoch_fraction=finetuning_epoch_fraction,
  File "c:\anaconda3\envs\rasa2\lib\site-packages\rasa\model_training.py", line 342, in _do_training
    finetuning_epoch_fraction=finetuning_epoch_fraction,
  File "c:\anaconda3\envs\rasa2\lib\site-packages\rasa\model_training.py", line 765, in _train_nlu_with_validated_data
    **additional_arguments,
  File "c:\anaconda3\envs\rasa2\lib\site-packages\rasa\nlu\train.py", line 116, in train
    interpreter = trainer.train(training_data, **kwargs)
  File "c:\anaconda3\envs\rasa2\lib\site-packages\rasa\nlu\model.py", line 221, in train
    component.train(working_data, self.config, **context)
  File "c:\anaconda3\envs\rasa2\lib\site-packages\rasa\nlu\classifiers\diet_classifier.py", line 887, in train
    shuffle=False,  # we use custom shuffle inside data generator
  File "c:\anaconda3\envs\rasa2\lib\site-packages\tensorflow\python\keras\engine\training.py", line 108, in _method_wrapper
    return method(self, *args, **kwargs)
  File "c:\anaconda3\envs\rasa2\lib\site-packages\rasa\utils\tensorflow\temp_keras_modules.py", line 191, in fit
    tmp_logs = train_function(iterator)
  File "c:\anaconda3\envs\rasa2\lib\site-packages\tensorflow\python\eager\def_function.py", line 780, in __call__
    result = self._call(*args, **kwds)
  File "c:\anaconda3\envs\rasa2\lib\site-packages\tensorflow\python\eager\def_function.py", line 807, in _call
    return self._stateless_fn(*args, **kwds)  # pylint: disable=not-callable
  File "c:\anaconda3\envs\rasa2\lib\site-packages\tensorflow\python\eager\function.py", line 2829, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "c:\anaconda3\envs\rasa2\lib\site-packages\tensorflow\python\eager\function.py", line 1848, in _filtered_call
    cancellation_manager=cancellation_manager)
  File "c:\anaconda3\envs\rasa2\lib\site-packages\tensorflow\python\eager\function.py", line 1924, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "c:\anaconda3\envs\rasa2\lib\site-packages\tensorflow\python\eager\function.py", line 550, in call
    ctx=ctx)
  File "c:\anaconda3\envs\rasa2\lib\site-packages\tensorflow\python\eager\execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument:  Incompatible shapes: [64,27] vs. [64,26]
         [[{{node cond/else/_13/cond/add_1}}]]
         [[crf/cond/else/_1/crf/cond/Cast/_272]]
  (1) Invalid argument:  Incompatible shapes: [64,27] vs. [64,26]
         [[{{node cond/else/_13/cond/add_1}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_60261]

Function call stack:
train_function -> train_function

Epochs:   0%|                                                                                                               | 0/100 [00:10<?, ?it/s]
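
The "Misaligned entity annotation" warning in the log above can be reproduced offline with a small check. This is a minimal sketch, not Rasa's own implementation: the function name `misaligned_entities` and the `(start, end, text)` tuple layout are assumptions made for illustration, and the sample values are copied from the warning for the message '跟踪目标流程'. Whether this misalignment is related to the "Incompatible shapes" error is not established in this thread.

```python
from typing import List, Tuple

# (start, end, text) character span; a hypothetical layout chosen for this sketch.
Span = Tuple[int, int, str]


def misaligned_entities(tokens: List[Span], entities: List[Span]) -> List[Span]:
    """Return entities whose start or end does not fall on a token boundary."""
    starts = {start for start, _, _ in tokens}
    ends = {end for _, end, _ in tokens}
    return [e for e in entities if e[0] not in starts or e[1] not in ends]


# Values copied from the warning in the log.
tokens = [(0, 4, "跟踪目标"), (4, 6, "流程")]
entities = [(2, 4, "目标")]

# The entity starts at offset 2, which is inside the first token, so it is reported.
print(misaligned_entities(tokens, entities))
```

If the tokenizer had split the message into '跟踪', '目标', and '流程', the same check would return an empty list, because the entity would then start and end on token boundaries.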

Command or request that led to error:

Content of configuration file (config.yml) (if relevant):

Content of domain file (domain.yml) (if relevant):

Definition of Done

[ ] Reproduce problem
[ ] Scope possible solutions
[ ] @koernerfelicia assigned reviewer

sara-tagger commented 3 years ago

Thanks for the issue, @alopez will get back to you about it soon!

You may find help in the docs and the forum, too 🤗

jupyterjazz commented 3 years ago

Hi @jk123vip, can you please share your config.yml file? We need it to reproduce the problem.

kedz commented 2 years ago

@jk123vip, we are going to mark this as closed for now, but if you are able to share the config, we can reopen it.

Mousaic commented 2 years ago

language: zh
pipeline:
  - name: HFTransformersNLP
    model_name: bert
    model_weights: bert-base-chinese
    cache_dir: data/bert-base-chinese
  - name: rasa_chinese.nlu.tokenizers.lm_tokenizer.LanguageModelTokenizer
    tokenizer_url: 'http://127.0.0.1:8000/'
  - name: LanguageModelFeaturizer
  - name: DIETClassifier
    epochs: 50
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 50
policies:
  - name: TEDPolicy
    epochs: 200
    max_history: 8

This is the NLU config.

RasaHQ / rasa

Rasa2.0 training with GPU failed. Invalid argument #9249
