PaddlePaddle / PaddleNLP

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting wide-range of NLP tasks from research to industrial applications, including 🗂Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis etc.
https://paddlenlp.readthedocs.io
Apache License 2.0
12.1k stars, 2.94k forks

How do I point the model and tokenizer at a local folder? #763

Closed. littletomatodonkey closed this issue 3 years ago

littletomatodonkey commented 3 years ago

I copied the model and vocabulary files into the current directory and ran the commands below, but the tokenizer and model could not be built:

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

The error message is:

ValueError: Calling BertTokenizer.from_pretrained() with a model identifier or the path to a directory instead. The supported model identifiers are as follows: dict_keys(['bert-base-uncased', 'bert-large-uncased', 'bert-base-cased', 'bert-large-cased', 'bert-base-multilingual-uncased', 'bert-base-multilingual-cased', 'bert-base-chinese', 'bert-wwm-chinese', 'bert-wwm-ext-chinese', 'macbert-large-chinese', 'macbert-base-chinese', 'simbert-base-chinese'])

The folder contains bert-base-uncased-vocab.txt and bert-base-uncased.pdparams.

linjieccc commented 3 years ago

Hello, there is no need to copy the model and vocabulary into the current directory. Running

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

automatically downloads bert-base-uncased-vocab.txt and bert-base-uncased.pdparams to $HOME/.paddlenlp/models/bert-base-uncased and then loads the tokenizer and model.
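To confirm what was actually downloaded, the cache directory named above can be inspected directly. This is a minimal sketch using only the standard library; the helper name default_cache_dir is hypothetical, and the path layout is taken from this reply rather than from the PaddleNLP source, so it may differ between versions.

```python
import os

def default_cache_dir(model_name: str) -> str:
    """Build the cache path described in the reply above.

    Assumption: the layout is $HOME/.paddlenlp/models/<model_name>.
    """
    return os.path.join(os.path.expanduser("~"), ".paddlenlp", "models", model_name)

cache_dir = default_cache_dir("bert-base-uncased")
if os.path.isdir(cache_dir):
    # List whatever files the download step left behind.
    print(sorted(os.listdir(cache_dir)))
else:
    print("nothing downloaded yet at:", cache_dir)
```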

littletomatodonkey commented 3 years ago

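What I actually want is to load a BERT model that I trained myself, and I found that pointing from_pretrained() at a local folder raises an error. Is loading from a local folder path not supported at the moment?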

linjieccc commented 3 years ago

Currently, loading your own tokenizer and model requires config files (tokenizer_config.json and model_config.json). If your model was saved through the save_pretrained() interface, you can call

tokenizer = BertTokenizer.from_pretrained('/path/to/your/save_pretrained/directory')
model = BertModel.from_pretrained('/path/to/your/save_pretrained/directory')

to load the tokenizer and model.
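Before calling from_pretrained() on a local directory, it can help to check that the two config files mentioned above are present. The following is a rough sketch with the standard library only; missing_config_files is a hypothetical helper, not part of PaddleNLP, and the file names are the ones given in this reply.

```python
import os

# File names as given in the reply above.
CONFIG_FILES = ("tokenizer_config.json", "model_config.json")

def missing_config_files(directory: str) -> list:
    """Return the expected config files that are absent from `directory`."""
    return [
        name for name in CONFIG_FILES
        if not os.path.isfile(os.path.join(directory, name))
    ]

missing = missing_config_files("/path/to/your/save_pretrained/directory")
if missing:
    print("from_pretrained() is likely to fail; missing:", missing)
```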

littletomatodonkey commented 3 years ago

The model was not saved through that interface; I only converted it from torch. Where are the BERT config files? I could not find them in the repo or under ~/.paddlenlp.

linjieccc commented 3 years ago

Load the bert-base-uncased tokenizer and model:

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

Save them to ./trained_model with save_pretrained(). Afterwards ./trained_model contains model_config.json, model_state.pdparams, tokenizer_config.json, and vocab.txt:

tokenizer.save_pretrained('trained_model')
model.save_pretrained('trained_model')

Load the tokenizer and model from a path that contains the config files, the model weights, and the vocab:

tokenizer = BertTokenizer.from_pretrained('trained_model')
model = BertModel.from_pretrained('trained_model')
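For weights converted from torch, one possible way to reuse the steps above is to copy the generated config files next to the converted weights under the file names listed earlier. This is only a sketch under assumptions: assemble_model_dir is a hypothetical helper, and it presumes the converted checkpoint has the same architecture and parameter names as the model whose config files are borrowed, which the thread does not confirm.

```python
import os
import shutil

def assemble_model_dir(reference_dir, weights_path, vocab_path, out_dir):
    """Lay out `out_dir` like the save_pretrained() output described above.

    reference_dir: directory written by save_pretrained() (source of the JSON configs)
    weights_path:  converted weights file, e.g. bert-base-uncased.pdparams
    vocab_path:    vocabulary file, e.g. bert-base-uncased-vocab.txt
    """
    os.makedirs(out_dir, exist_ok=True)
    # Reuse the config files generated by save_pretrained().
    for name in ("model_config.json", "tokenizer_config.json"):
        shutil.copyfile(os.path.join(reference_dir, name), os.path.join(out_dir, name))
    # Copy the converted weights and vocab under the expected file names.
    shutil.copyfile(weights_path, os.path.join(out_dir, "model_state.pdparams"))
    shutil.copyfile(vocab_path, os.path.join(out_dir, "vocab.txt"))
    return sorted(os.listdir(out_dir))

# Example call (paths are placeholders):
# assemble_model_dir("trained_model", "bert-base-uncased.pdparams",
#                    "bert-base-uncased-vocab.txt", "my_bert")
# The resulting "my_bert" directory could then be passed to from_pretrained().
```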

littletomatodonkey commented 3 years ago

Got it, thanks for the reply. It works now.