Rasa train nlu, bert, finetuned weight, Error: from_pt set to False

tinlokkoo commented 4 years ago

Description of Problem:

The Rasa CLI does not support loading a BERT model that was trained in PyTorch. It fails with the following error: OSError: Error no file named ['pytorch_model.bin', 'tf_model.h5'] found in directory ..\model\deploy_transformer or from_pt set to False

My config.yml:

- name: HFTransformersNLP
  model_weights: "C:\\work_directory\\nlp\\model\\transformers\bert"
  model_name: "bert"

My directory in the model folder:

config.json
model.ckpt.data-00000-of-00001
model.ckpt.index
model.ckpt.meta
vocab.txt
pytorch_model.bin

Overview of the Solution:

As a workaround, I currently have to add from_pt=True by hand at line 84 of rasa/nlu/utils/hugging_face/hf_transformers.py in order to load a PyTorch model. A better solution would be to make this configurable in the config.yml file.
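For context, from_pt is a keyword argument of from_pretrained on the TensorFlow model classes in transformers: it tells the loader to read pytorch_model.bin and convert the weights while loading, which requires PyTorch to be installed. The sketch below is not Rasa's code. needs_from_pt and load_tf_bert are hypothetical helpers that pick the flag from the files present in a model directory.

```python
import os


def needs_from_pt(model_dir: str) -> bool:
    """Return True when model_dir holds PyTorch weights but no TensorFlow weights."""
    has_tf = os.path.isfile(os.path.join(model_dir, "tf_model.h5"))
    has_pt = os.path.isfile(os.path.join(model_dir, "pytorch_model.bin"))
    if not (has_tf or has_pt):
        raise FileNotFoundError(
            f"no tf_model.h5 or pytorch_model.bin in {model_dir!r}"
        )
    return not has_tf


def load_tf_bert(model_dir: str):
    """Load BERT into TensorFlow, converting PyTorch weights only if needed."""
    # Imported lazily so needs_from_pt can be used without transformers installed.
    from transformers import TFBertModel

    return TFBertModel.from_pretrained(model_dir, from_pt=needs_from_pt(model_dir))
```

With the directory listed above (pytorch_model.bin but no tf_model.h5), needs_from_pt would return True, which matches the manual patch described here.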

Examples (if relevant):

Blockers (if relevant):

Definition of Done:

sara-tagger commented 4 years ago

Thanks for submitting this feature request 🚀 @akelad will get back to you about it soon! ✨

dakshvar22 commented 4 years ago

@tinlokkoo That parameter is deliberately set to False because we don't support pytorch as a backend currently. Is this a custom trained bert model that you are trying to load?

tinlokkoo commented 4 years ago

Yes, I pre-trained a Bert Model in PyTorch and I want to plug that into rasa

dakshvar22 commented 4 years ago

@tinlokkoo You can use this script to convert the pytorch trained model to tensorflow and then use that model in your rasa pipeline.

tinlokkoo commented 4 years ago

Thanks @dakshvar22. That was the wrong script, which uses .ckpt. This is the right script.

By the way, for those who may have the same problem: the transformers scripts have bugs. Please refer to this and this to fix them.
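As an alternative to the conversion script, the same from_pt=True mechanism can be used once, offline, to write TensorFlow weights to disk. This is a swapped-in technique, not the script linked above, and it assumes both PyTorch and TensorFlow are installed. convert_pt_dir_to_tf, copy_tokenizer_files and TOKENIZER_FILES are hypothetical names, and the list of tokenizer files is an assumption. Only the base encoder is loaded by TFBertModel, so task-specific heads are not carried over.

```python
import os
import shutil

# Assumed set of non-weight files a BERT directory may carry; adjust as needed.
TOKENIZER_FILES = ("vocab.txt", "tokenizer_config.json", "special_tokens_map.json")


def copy_tokenizer_files(src_dir: str, dst_dir: str) -> list:
    """Copy whichever tokenizer files exist in src_dir; return the names copied."""
    os.makedirs(dst_dir, exist_ok=True)
    copied = []
    for name in TOKENIZER_FILES:
        src = os.path.join(src_dir, name)
        if os.path.isfile(src):
            shutil.copy(src, os.path.join(dst_dir, name))
            copied.append(name)
    return copied


def convert_pt_dir_to_tf(src_dir: str, dst_dir: str) -> None:
    """Read pytorch_model.bin from src_dir and write tf_model.h5 to dst_dir."""
    # Lazy import: this step needs transformers, torch and tensorflow.
    from transformers import TFBertModel

    os.makedirs(dst_dir, exist_ok=True)
    model = TFBertModel.from_pretrained(src_dir, from_pt=True)
    model.save_pretrained(dst_dir)  # writes config.json and tf_model.h5
    copy_tokenizer_files(src_dir, dst_dir)
```

The resulting directory can then be referenced from model_weights without touching Rasa's source.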

raff-run commented 4 years ago

Excuse me @tinlokkoo or @dakshvar22, could you verify whether I did the conversion correctly, or share the steps you used to convert the BERT model to TF2? I used the right script and applied the first fix (the second is already applied on master, so the linked script already contains it) and was able to convert a BERT model to .h5. However, when I reference it in the pipeline under model_weights, the error "'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte" occurs:

2020-07-28 17:43:07 WARNING  transformers.tokenization_utils_base  - Calling BertTokenizer.from_pretrained() with the path to a single file or url is deprecated
2020-07-28 17:43:07 INFO     transformers.tokenization_utils_base  - loading file ./convert/tf/converted_model-tf_model.h5
Traceback (most recent call last):
  File "c:\users\user\appdata\local\programs\python\python36\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\users\user\appdata\local\programs\python\python36\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\user\.virtualenvs\Rasa_poc-5z5_TGNA\Scripts\rasa.exe\__main__.py", line 7, in <module>
  File "c:\users\user\.virtualenvs\rasa_poc-5z5_tgna\lib\site-packages\rasa\__main__.py", line 92, in main
    cmdline_arguments.func(cmdline_arguments)
  File "c:\users\user\.virtualenvs\rasa_poc-5z5_tgna\lib\site-packages\rasa\cli\train.py", line 76, in train
    additional_arguments=extract_additional_arguments(args),
  File "c:\users\user\.virtualenvs\rasa_poc-5z5_tgna\lib\site-packages\rasa\train.py", line 50, in train
    additional_arguments=additional_arguments,
  File "c:\users\user\appdata\local\programs\python\python36\lib\asyncio\base_events.py", line 484, in run_until_complete
    return future.result()
  File "c:\users\user\.virtualenvs\rasa_poc-5z5_tgna\lib\site-packages\rasa\train.py", line 101, in train_async
    additional_arguments,
  File "c:\users\user\.virtualenvs\rasa_poc-5z5_tgna\lib\site-packages\rasa\train.py", line 188, in _train_async_internal
    additional_arguments=additional_arguments,
  File "c:\users\user\.virtualenvs\rasa_poc-5z5_tgna\lib\site-packages\rasa\train.py", line 245, in _do_training
    persist_nlu_training_data=persist_nlu_training_data,
  File "c:\users\user\.virtualenvs\rasa_poc-5z5_tgna\lib\site-packages\rasa\train.py", line 482, in _train_nlu_with_validated_data
    persist_nlu_training_data=persist_nlu_training_data,
  File "c:\users\user\.virtualenvs\rasa_poc-5z5_tgna\lib\site-packages\rasa\nlu\train.py", line 75, in train
    trainer = Trainer(nlu_config, component_builder)
  File "c:\users\user\.virtualenvs\rasa_poc-5z5_tgna\lib\site-packages\rasa\nlu\model.py", line 145, in __init__
    self.pipeline = self._build_pipeline(cfg, component_builder)
  File "c:\users\user\.virtualenvs\rasa_poc-5z5_tgna\lib\site-packages\rasa\nlu\model.py", line 157, in _build_pipeline
    component = component_builder.create_component(component_cfg, cfg)
  File "c:\users\user\.virtualenvs\rasa_poc-5z5_tgna\lib\site-packages\rasa\nlu\components.py", line 781, in create_component
    component = registry.create_component_by_config(component_config, cfg)
  File "c:\users\user\.virtualenvs\rasa_poc-5z5_tgna\lib\site-packages\rasa\nlu\registry.py", line 246, in create_component_by_config
    return component_class.create(component_config, config)
  File "c:\users\user\.virtualenvs\rasa_poc-5z5_tgna\lib\site-packages\rasa\nlu\components.py", line 489, in create
    return cls(component_config)
  File "c:\users\user\.virtualenvs\rasa_poc-5z5_tgna\lib\site-packages\rasa\nlu\utils\hugging_face\hf_transformers.py", line 47, in __init__
    self._load_model()
  File "c:\users\user\.virtualenvs\rasa_poc-5z5_tgna\lib\site-packages\rasa\nlu\utils\hugging_face\hf_transformers.py", line 81, in _load_model
    self.model_weights, cache_dir=self.cache_dir
  File "c:\users\user\.virtualenvs\rasa_poc-5z5_tgna\lib\site-packages\transformers\tokenization_utils_base.py", line 1140, in from_pretrained
    return cls._from_pretrained(*inputs, **kwargs)
  File "c:\users\user\.virtualenvs\rasa_poc-5z5_tgna\lib\site-packages\transformers\tokenization_utils_base.py", line 1287, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "c:\users\user\.virtualenvs\rasa_poc-5z5_tgna\lib\site-packages\transformers\tokenization_bert.py", line 192, in __init__
    self.vocab = load_vocab(vocab_file)
  File "c:\users\user\.virtualenvs\rasa_poc-5z5_tgna\lib\site-packages\transformers\tokenization_bert.py", line 104, in load_vocab
    tokens = reader.readlines()
  File "c:\users\user\appdata\local\programs\python\python36\lib\codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte

My steps were:

Downloaded the right script, named it convert.py
Applied the first fix to the script
Downloaded config.json and pytorch_model.bin from the model page
Used python convert.py --tf_dump_path ./tf/ --model_type bert-large-cased-whole-word-masking-finetuned-squad --pytorch_checkpoint_path ./bert/pytorch_model.bin --config_file ./bert/config.json

Changed my config.yml to point to the local model

- name: HFTransformersNLP
  # Name of the language model to use
  model_name: "bert"
  # Pre-Trained weights to be loaded
  model_weights: "./tf/converted_model-tf_model.h5"

I'm a complete beginner at using Rasa (and NLP in general) so maybe I missed something? Thank you in advance!

tinlokkoo commented 4 years ago

@raff-run May I know which BERT model you downloaded?

raff-run commented 4 years ago

I tried this one (BertForMaskedLM) https://huggingface.co/neuralmind/bert-base-portuguese-cased, which apparently has a tensorflow version but I couldn't get it to work, and this other one (BertForQuestionAnswering) https://huggingface.co/mrm8488/bert-base-portuguese-cased-finetuned-squad-v1-pt

tinlokkoo commented 4 years ago

@raff-run I think I know the problem. Judging from the traceback, transformers failed while reading the vocab file. Please check:

1) Do you have vocab.txt in ./convert/tf?
2) Can you open vocab.txt with 'utf-8' encoding? Try the following to see whether there is an encoding problem:

with open('<your_path>/convert/tf/vocab.txt', encoding='utf-8') as f:
    f.read()

If 2) does not work, then you have to fix the vocab file or change your encoding.
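The encoding check suggested here can be wrapped in a small helper that also reports the first bytes of the file when decoding fails, which makes it easier to see what kind of file was actually opened. This is a minimal sketch; check_vocab is a hypothetical name and the path is whatever your setup uses.

```python
def check_vocab(vocab_path: str) -> int:
    """Read a vocab file as UTF-8 and return the number of lines (tokens)."""
    try:
        with open(vocab_path, encoding="utf-8") as f:
            return sum(1 for _ in f)
    except UnicodeDecodeError as err:
        # Show the leading bytes so a binary file is easy to recognise.
        with open(vocab_path, "rb") as f:
            head = f.read(8)
        raise ValueError(
            f"{vocab_path!r} is not valid UTF-8; first bytes: {head!r}"
        ) from err
```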

raff-run commented 4 years ago

Sorry for the late reply! Just checked and it seems fine, with 29k lines:

The file is indeed encoded in utf-8 and the script ran without issues, too. After running it as is, I changed f.read() to print(f.readlines()) and it printed everything okay.

raff-run commented 4 years ago

However, I think we're on the right track now! I added a print command into the code that tries to read the vocab file: And it printed this:

For some reason it gets the h5 file as the vocab file, and since the following functions don't treat this string:

It tries to read a h5 file as the vocabulary file and fails:

Edit: Found the problem. If you put model_weights: "convert/tf/tf_model.h5", the script will try to load the model as the vocab file, but will load the files correctly if you put the directory instead, like model_weights: "convert/tf". Thank you for the help, @tinlokkoo! I hope this saves someone else from a similar problem.
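The byte 0x89 in the error message is consistent with this diagnosis: HDF5 files, which is the format behind .h5 weight files, normally begin with the 8-byte signature \x89HDF\r\n\x1a\n. The sketch below shows one way to detect such a file and fall back to its directory. is_hdf5 and as_model_dir are hypothetical helpers, not part of Rasa or transformers.

```python
import os

# Standard 8-byte signature at the start of an HDF5 file.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def is_hdf5(path: str) -> bool:
    """Return True if the file starts with the HDF5 signature."""
    with open(path, "rb") as f:
        return f.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE


def as_model_dir(model_weights: str) -> str:
    """Return a directory suitable for model_weights.

    If a file path (such as the .h5 file itself) is given, return the
    directory that contains it; otherwise return the value unchanged.
    """
    if os.path.isfile(model_weights):
        return os.path.dirname(model_weights) or "."
    return model_weights
```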

For whoever has the same problem: Also be aware that to convert you need config.json and the .bin, but to use it on Rasa, you need basically everything from the model page on huggingface!
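A quick pre-flight check along these lines can catch an incomplete directory before training starts. The file list is an assumption about the minimum a converted BERT model needs (config, vocabulary, TensorFlow weights), not a list taken from Rasa; missing_model_files and REQUIRED_FILES are hypothetical names.

```python
import os

# Assumed minimum for a converted BERT model; extend to match your model page.
REQUIRED_FILES = ("config.json", "vocab.txt", "tf_model.h5")


def missing_model_files(model_dir: str, required=REQUIRED_FILES) -> list:
    """Return the names from `required` that are not present in model_dir."""
    return [
        name
        for name in required
        if not os.path.isfile(os.path.join(model_dir, name))
    ]
```

An empty list means every file in the assumed set is present.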

ctlgcustodio commented 2 years ago

Can you share the directory tree of your Rasa project? I'm trying to load a model using cache_dir and model_weights, but I think I'm not defining my path correctly.

ctlgcustodio commented 2 years ago

@raff-run

raff-run commented 2 years ago

Sure! Here:

That said this was before I migrated the project to rasa 2.0. After that I'd only use spacy, so I don't know if it still works.

ctlgcustodio commented 2 years ago

> Sure! Here:
>
> That said, this was before I migrated the project to rasa 2.0. After that I'd only use spacy, so I don't know if it still works.

Apparently, HFTransformersNLP is deprecated in rasa version 3.0, so I switched to rasa 2.8 with transformers 2.11.0 and used the BERT model without the HFTransformersNLP component. I had some difficulty loading the weights and the model: we trained with a config.yml that specifies cache_dir (cache_path) and model_weights (bert-base-multilingual-cased, for example), but when I tried to load the trained model, it could not find the files.

To solve this, I had to change the "cache_dir" and "model_weights" entries in the metadata.json inside model.tar.gz (generated by training) to the absolute paths of the cache directory and the model. Otherwise, if I move model.tar.gz and the weights to a folder other than the one used for training, the model fails to load. I didn't find this documented anywhere. So if I use a Docker container that changes the workdir, the transformers library can't find the model with the cache_path and model_name used in training. Changing the entries worked. Thanks for sharing, I appreciate it.
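The metadata.json workaround described above could be scripted roughly as follows. This is a sketch under assumptions: it supposes that metadata.json holds a top-level "pipeline" list of component configurations and that the archive is a gzip-compressed tar, neither of which is confirmed in this thread. patch_pipeline and patch_model_archive are hypothetical helpers.

```python
import json
import os
import tarfile
import tempfile


def patch_pipeline(metadata: dict, cache_dir: str, model_weights: str) -> int:
    """Overwrite cache_dir/model_weights in each pipeline entry that has them.

    Returns the number of values changed. Assumes a top-level "pipeline" list.
    """
    changed = 0
    for component in metadata.get("pipeline", []):
        if "cache_dir" in component:
            component["cache_dir"] = cache_dir
            changed += 1
        if "model_weights" in component:
            component["model_weights"] = model_weights
            changed += 1
    return changed


def patch_model_archive(src_tar: str, dst_tar: str,
                        cache_dir: str, model_weights: str) -> None:
    """Unpack src_tar, patch every metadata.json found, and repack to dst_tar."""
    with tempfile.TemporaryDirectory() as tmp:
        with tarfile.open(src_tar, "r:gz") as tar:
            tar.extractall(tmp)
        for root, _dirs, files in os.walk(tmp):
            if "metadata.json" in files:
                path = os.path.join(root, "metadata.json")
                with open(path, encoding="utf-8") as f:
                    metadata = json.load(f)
                patch_pipeline(metadata, cache_dir, model_weights)
                with open(path, "w", encoding="utf-8") as f:
                    json.dump(metadata, f, indent=2)
        with tarfile.open(dst_tar, "w:gz") as tar:
            for name in sorted(os.listdir(tmp)):
                tar.add(os.path.join(tmp, name), arcname=name)
```

Writing to a separate dst_tar keeps the original archive intact in case the assumed layout does not match.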

RasaHQ / rasa

Rasa train nlu, bert, finetuned weight, Error: from_pt set to False #6071