How to specify a folder for the model and tokenizer

PaddlePaddle / PaddleNLP

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting wide-range of NLP tasks from research to industrial applications, including 🗂Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis etc.

https://paddlenlp.readthedocs.io

Apache License 2.0

How to specify a folder for the model and tokenizer #763

Closed littletomatodonkey closed 3 years ago

littletomatodonkey commented 3 years ago

I copied the model and vocabulary files into the current directory and ran the commands below, but the tokenizer and model could not be built:

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

Error message:

ValueError: Calling BertTokenizer.from_pretrained() with a model identifier or the path to a directory instead. The supported model identifiers are as follows: dict_keys(['bert-base-uncased', 'bert-large-uncased', 'bert-base-cased', 'bert-large-cased', 'bert-base-multilingual-uncased', 'bert-base-multilingual-cased', 'bert-base-chinese', 'bert-wwm-chinese', 'bert-wwm-ext-chinese', 'macbert-large-chinese', 'macbert-base-chinese', 'simbert-base-chinese'])

The folder contains bert-base-uncased-vocab.txt and bert-base-uncased.pdparams.

linjieccc commented 3 years ago

Hello, there is no need to copy the model and vocabulary files into the current directory. Running

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

automatically downloads bert-base-uncased-vocab.txt and bert-base-uncased.pdparams to $HOME/.paddlenlp/models/bert-base-uncased and then loads the tokenizer and model.
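
To see what was actually downloaded, the cache directory mentioned above can be inspected. The following is a minimal sketch that assumes the cache location is $HOME/.paddlenlp/models/<model name>, as stated in this reply; `cached_model_dir` and `list_cached_files` are hypothetical helper names and are not part of PaddleNLP.

```python
from pathlib import Path


def cached_model_dir(model_name, home=None):
    """Return the directory where, per this thread, PaddleNLP caches a model.

    Assumption: the cache root is $HOME/.paddlenlp/models.
    `home` can be overridden, which is mainly useful for testing.
    """
    root = Path(home) if home is not None else Path.home()
    return root / ".paddlenlp" / "models" / model_name


def list_cached_files(model_name, home=None):
    """Return the sorted file names in the cache directory, or [] if it is absent."""
    directory = cached_model_dir(model_name, home)
    if not directory.is_dir():
        return []
    return sorted(p.name for p in directory.iterdir() if p.is_file())


if __name__ == "__main__":
    print(cached_model_dir("bert-base-uncased"))
    print(list_cached_files("bert-base-uncased"))
```

An empty list here would suggest that nothing has been downloaded yet, or that the cache lives somewhere else in the installed version.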

littletomatodonkey commented 3 years ago

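What I actually want is to load a BERT model that I trained myself, but pointing from_pretrained() at a local folder raises the error above. Is loading from a local folder path currently unsupported?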

linjieccc commented 3 years ago

Currently, loading your own tokenizer and model requires config files (tokenizer_config.json and model_config.json). If your model was saved through the save_pretrained() interface, you can load the tokenizer and model with:

tokenizer = BertTokenizer.from_pretrained('/path/to/your/save_pretrained/directory')
model = BertModel.from_pretrained('/path/to/your/save_pretrained/directory')
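
Before calling from_pretrained() on a local directory, it may help to check that the expected files are present. This is a minimal sketch, not PaddleNLP's own validation: the two config file names come from this reply, the weights and vocabulary file names are the ones a save_pretrained() directory is reported to contain later in this thread, and `missing_pretrained_files` is a hypothetical helper name. Other PaddleNLP versions may use different file names.

```python
from pathlib import Path

# File names as reported in this thread for a save_pretrained() directory.
# Assumption: other PaddleNLP versions may expect different names.
REQUIRED_FILES = (
    "model_config.json",
    "model_state.pdparams",
    "tokenizer_config.json",
    "vocab.txt",
)


def missing_pretrained_files(directory):
    """Return the required file names that are not present in `directory`."""
    base = Path(directory)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_pretrained_files("/path/to/your/save_pretrained/directory")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("Directory looks complete.")
```

A folder holding only bert-base-uncased-vocab.txt and bert-base-uncased.pdparams, as in the original question, would be reported as missing all four files, which is consistent with the ValueError above.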

littletomatodonkey commented 3 years ago

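It was not saved through that interface; I only converted the model from torch. Where are the BERT config files? I could not find them in the repo or in the ~/.paddlenlp directory.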

linjieccc commented 3 years ago

Load the bert-base-uncased tokenizer and model:

from paddlenlp.transformers import BertModel, BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

Save them to ./trained_model with save_pretrained(). Afterwards, ./trained_model contains model_config.json, model_state.pdparams, tokenizer_config.json, and vocab.txt:

tokenizer.save_pretrained('trained_model')
model.save_pretrained('trained_model')

Load the tokenizer and model from the path that contains the config files, model weights, and vocab:

tokenizer = BertTokenizer.from_pretrained('trained_model')
model = BertModel.from_pretrained('trained_model')
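
For weights converted from torch, one possible way to apply the steps above is to reuse the config files written by save_pretrained() and place the converted weights and vocabulary next to them under the expected file names. This is only a sketch and was not confirmed by the maintainers in this thread: it assumes the converted weights match the architecture and parameter names of the stock bert-base-uncased model, and `assemble_model_dir` is a hypothetical helper name.

```python
import shutil
from pathlib import Path

# Config files written by save_pretrained(), as listed in this thread.
CONFIG_FILES = ("model_config.json", "tokenizer_config.json")


def assemble_model_dir(template_dir, weights_path, vocab_path, out_dir):
    """Build a from_pretrained()-style directory around converted weights.

    template_dir: directory produced by save_pretrained() (source of configs).
    weights_path: converted .pdparams file (assumed compatible with the config).
    vocab_path:   vocabulary file that belongs to the converted model.
    """
    template = Path(template_dir)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Reuse the config files from the stock model.
    for name in CONFIG_FILES:
        shutil.copyfile(template / name, out / name)
    # Store weights and vocabulary under the names listed in this thread.
    shutil.copyfile(weights_path, out / "model_state.pdparams")
    shutil.copyfile(vocab_path, out / "vocab.txt")
    return out
```

The resulting directory could then be passed to BertTokenizer.from_pretrained() and BertModel.from_pretrained() in the same way as 'trained_model' above. If the converted model uses a different hidden size, layer count, or vocabulary size, model_config.json would need to be edited to match.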

littletomatodonkey commented 3 years ago

OK, thanks for the reply, it works now~