PaddlePaddle / PaddleNLP

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting wide-range of NLP tasks from research to industrial applications, including 🗂Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis etc.
https://paddlenlp.readthedocs.io
Apache License 2.0
12.13k stars 2.94k forks

Training an in_batch_negative model on top of a SimCSE model in PaddleNLP #2575

Closed yrg5101 closed 1 year ago

yrg5101 commented 2 years ago

Environment: 1) PaddleNLP 2.3.0.dev, PaddlePaddle 2.3.0 2) OS: Windows, Python 3.7

My current plan is:

1. Train SimCSE on top of ERNIE 3.0: change the from_pretrained calls below to ernie-3.0, then produce a SimCSE model.

pretrained_model = ppnlp.transformers.ErnieModel.from_pretrained(
    args.model_name_or_path,
    hidden_dropout_prob=args.dropout,
    attention_probs_dropout_prob=args.dropout)
print("loading model from {}".format(args.model_name_or_path))
tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained('ernie-1.0')

2. Train the in_batch_negative model, starting from the SimCSE model produced in step 1:

pretrained_model = ppnlp.transformers.ErnieModel.from_pretrained(
    'path/to/simcse_model')
tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained(
    'path/to/simcse_model')

3. Build the ranking model: after obtaining embeddings from the model above, still use ernie-gram-zh for ranking.

pretrained_model = ppnlp.transformers.ErnieGramModel.from_pretrained(
    'ernie-gram-zh')
tokenizer = ppnlp.transformers.ErnieGramTokenizer.from_pretrained(
    'ernie-gram-zh')

Is there any problem with this approach, or is there anything that could be improved? Thanks.
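For readers unfamiliar with the objective behind step 2, here is a minimal pure-Python sketch of an in-batch negatives loss: each query treats its own paired title as the positive and every other title in the batch as a negative. This is a toy illustration under stated assumptions, not PaddleNLP's implementation; the function names and the `scale` value are hypothetical.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def in_batch_negative_loss(queries: Sequence[Sequence[float]],
                           titles: Sequence[Sequence[float]],
                           scale: float = 20.0) -> float:
    """Mean cross-entropy over a batch where titles[i] is the positive for
    queries[i] and all other titles act as negatives.

    `scale` multiplies the cosine similarities before the softmax; the value
    used here is an assumed hyperparameter, not taken from the thread.
    """
    if not queries or len(queries) != len(titles):
        raise ValueError("queries and titles must be non-empty and aligned")

    q = [l2_normalize(v) for v in queries]
    t = [l2_normalize(v) for v in titles]

    total = 0.0
    for i, qi in enumerate(q):
        # Scaled cosine similarity between query i and every title in the batch.
        logits = [scale * sum(a * b for a, b in zip(qi, tj)) for tj in t]
        # Numerically stable log-sum-exp.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the diagonal entry as the correct class.
        total += log_sum_exp - logits[i]
    return total / len(q)
```

Because the negatives come from the same batch, larger batches give each query more negatives without extra labeling; the embeddings themselves would come from the encoder trained in step 1.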

tianxin1860 commented 2 years ago

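Your current plan is exactly the approach we recommend. Nice work. We look forward to hearing about your results; feel free to report any problems or contribute a PR.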

yrg5101 commented 2 years ago

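> Your current plan is exactly the approach we recommend. Nice work. We look forward to hearing about your results; feel free to report any problems or contribute a PR.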

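We used ernie-gram-zh for ranking but found it fairly slow. On a machine with 32 GB of RAM and a 4-core CPU, 300 samples take 1-2 minutes, which seems slow, so we have the following questions:

1. On a multi-core CPU, can inference with the ernie-gram-zh model be sped up? How?

2. Can the ernie-gram-zh model be pruned and quantized? How? Should we follow the pruning and quantization procedure for the ERNIE 3.0 model? https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-3.0

3. Can ernie-gram-zh be replaced with ernie-3.0-medium-zh for ranking?

We also tried this replacement, but it failed with the error shown below.

We use PaddlePaddle/PaddleNLP/tree/develop/applications/neural_search/ranking/ernie_matching/ for ranking and now want to replace the original ernie-gram-zh model with ernie-3.0-medium-zh. In export_model.py, which exports the dynamic-graph model to a static-graph model, we made the following change: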

Before:

if __name__ == "__main__":
    # If you want to use ernie1.0 model, please uncomment the following code
    # tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained('ernie-1.0')
    # pretrained_model = ppnlp.transformers.ErnieModel.from_pretrained("ernie-1.0")

    pretrained_model = ppnlp.transformers.ErnieGramModel.from_pretrained(
        'ernie-gram-zh')
    tokenizer = ppnlp.transformers.ErnieGramTokenizer.from_pretrained(
        'ernie-gram-zh')
    model = PairwiseMatching(pretrained_model)

After:

if __name__ == "__main__":
    # If you want to use ernie1.0 model, please uncomment the following code
    # tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained('ernie-1.0')
    # pretrained_model = ppnlp.transformers.ErnieModel.from_pretrained("ernie-1.0")

    pretrained_model = ppnlp.transformers.ErnieGramModel.from_pretrained(
        'ernie-gram-zh')
    tokenizer = ppnlp.transformers.ErnieGramTokenizer.from_pretrained(
        'ernie-3.0-medium-zh')
    model = PairwiseMatching(pretrained_model)

Error:

[2022-06-20 22:57:05,782] [ INFO] - Downloading https://bj.bcebos.com/paddlenlp/models/community/ernie-3.0-medium-zh\model_state.pdparams and saved to C:\Users\Administrator.paddlenlp\models\ernie-3.0-medium-zh
[2022-06-20 22:57:05,783] [ INFO] - Downloading model_state.pdparams from https://bj.bcebos.com/paddlenlp/models/community/ernie-3.0-medium-zh\model_state.pdparams
[2022-06-20 22:57:05,998] [ ERROR] - Downloading from https://bj.bcebos.com/paddlenlp/models/community/ernie-3.0-medium-zh\model_state.pdparams failed with code 404!
Traceback (most recent call last):
  File "C:\Users\Administrator\Desktop\tx\PaddleNLP\paddlenlp\transformers\model_utils.py", line 253, in from_pretrained
    file_path, default_root)
  File "C:\Users\Administrator\Desktop\tx\PaddleNLP\paddlenlp\utils\downloader.py", line 164, in get_path_from_url
    fullpath = _download(url, root_dir, md5sum)
  File "C:\Users\Administrator\Desktop\tx\PaddleNLP\paddlenlp\utils\downloader.py", line 201, in _download
    "{}!".format(url, req.status_code))
RuntimeError: Downloading from https://bj.bcebos.com/paddlenlp/models/community/ernie-3.0-medium-zh\model_state.pdparams failed with code 404!

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/Users/Administrator/Desktop/tx/PaddleNLP/applications/neural_search/ranking/ernie_matching/export_model.py", line 40, in <module>
    'ernie-3.0-medium-zh')
  File "C:\Users\Administrator\Desktop\tx\PaddleNLP\paddlenlp\transformers\model_utils.py", line 257, in from_pretrained
    f"Can't load weights for '{pretrained_model_name_or_path}'.\n"
RuntimeError: Can't load weights for 'ernie-3.0-medium-zh'. Please make sure that 'ernie-3.0-medium-zh' is:

tianxin1860 commented 2 years ago
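  1. For multi-core CPU speedup, see the CPU-accelerated inference approach in the ERNIE 3.0 example: https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-3.0/deploy/python#11-CPU%E7%AB%AF
  2. ERNIE-Gram-zh can be pruned and quantized by following the example at https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-3.0. However, we recommend applying the speedup from step 1 first and considering pruning and quantization only if performance still falls short of expectations.
  3. ernie-gram-zh can be replaced with ernie-3.0-medium-zh for ranking. Just change the code as follows: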
pretrained_model = ppnlp.transformers.ErnieModel.from_pretrained(
    'ernie-3.0-medium-zh')
tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained(
    'ernie-3.0-medium-zh')
model = PairwiseMatching(pretrained_model)
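
As a rough illustration of point 1, the sketch below configures a Paddle Inference predictor for multi-threaded CPU execution with oneDNN (MKL-DNN) enabled. It assumes a static-graph model has already been exported; whether the linked ERNIE 3.0 script uses exactly these settings is not confirmed here, and the helper names `choose_num_threads` and `build_cpu_predictor` are hypothetical.

```python
import os
from typing import Optional


def choose_num_threads(requested: Optional[int] = None) -> int:
    """Clamp the requested thread count to the cores the OS reports."""
    available = os.cpu_count() or 1
    if requested is None:
        return available
    return max(1, min(requested, available))


def build_cpu_predictor(model_file: str,
                        params_file: str,
                        num_threads: Optional[int] = None):
    """Create a CPU predictor from an exported static-graph model.

    `model_file` and `params_file` are the paths of the exported model and
    parameter files (placeholders supplied by the caller).
    """
    # Imported lazily so choose_num_threads() works without Paddle installed.
    from paddle import inference

    config = inference.Config(model_file, params_file)
    config.disable_gpu()      # run on CPU only
    config.enable_mkldnn()    # use oneDNN kernels where available
    config.set_cpu_math_library_num_threads(choose_num_threads(num_threads))
    return inference.create_predictor(config)
```

On the 4-core machine described above, `build_cpu_predictor(model_file, params_file, num_threads=4)` would let the math library use all four cores. Measuring throughput before and after is the simplest way to decide whether pruning and quantization are still needed.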
github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 60 days with no activity.

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 14 days since being marked as stale.