Question about text-similarity scores when testing with the Python example code in ERNIE\applications\tasks\text_matching

PaddlePaddle / PaddleNLP


Question about text-similarity scores when testing with the Python example code in ERNIE\applications\tasks\text_matching #2680

Closed lbz0920 closed 1 year ago

lbz0920 commented 2 years ago

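Please ask your question

I compared two short texts that are completely different: run_infer.py:50 * 9640 ('在家电脑做什么兼职好呢\t海尔全自动洗衣机', '[0.22760319709777832, 0.7723968029022217]'). (The pair reads "What part-time work can I do on a computer at home?" and "Haier fully automatic washing machine".) What do these two result fields mean? Is there documentation for them? How can I obtain a similarity score for the two texts?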

paddle-bot-old[bot] commented 2 years ago

Hi! We have received your issue and will arrange for a technician to answer it as soon as possible, so please be patient. Please check again that you have provided a clear problem description, code to reproduce it, your environment and versions, and any error messages. You can also look for an answer in the official API documentation, the FAQ, past GitHub issues, and the AI community. Have a nice day!

chenxiaozeng commented 2 years ago

Hello, you could open an issue in the ERNIE repo, where the relevant maintainers will answer. You can also try PaddleNLP's text similarity feature: https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/model_zoo/taskflow.md#%E6%96%87%E6%9C%AC%E7%9B%B8%E4%BC%BC%E5%BA%A6

tianxin1860 commented 2 years ago

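Quoting lbz0920:

"Please ask your question

I compared two short texts that are completely different: run_infer.py:50 * 9640 ('在家电脑做什么兼职好呢\t海尔全自动洗衣机', '[0.22760319709777832, 0.7723968029022217]'). What do these two result fields mean? Is there documentation for them? How can I obtain a similarity score for the two texts?"

The output contains 2 probability values: the first is the probability that the two texts are semantically similar, and the second is the probability that they are semantically dissimilar. The 2 values sum to 1.0.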

lbz0920 commented 2 years ago

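INFO: 06-30 09:51:47: run_infer.py:50 1904 ('海尔全自动洗衣机用着不错\t海尔全自动洗衣机好用', '[0.033826589584350586, 0.9661734104156494]')
INFO: 06-30 09:51:47: run_infer.py:50 1904 ('分数混合运算三\t什么方法解酒最快', '[0.596703052520752, 0.40329691767692566]')

The first test pair ("The Haier fully automatic washing machine works well" and "The Haier fully automatic washing machine is easy to use") has the same meaning, so why is the value 0.033? The second pair ("Mixed operations with fractions 3" and "What is the fastest way to sober up") is completely different, yet the similarity probability is 0.596. Is there a problem with the trained model? I used ernie_3.0_base_ch.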

tianxin1860 commented 2 years ago

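There is no problem. My first reply stated it backwards by mistake. The correct meaning is: the first probability is the probability that the two texts are semantically dissimilar, and the second is the probability that they are semantically similar. The 2 values sum to 1.0.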
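With the corrected ordering, the similarity score is the second value of the printed pair. The following is a minimal sketch of reading it from the string form shown in the run_infer.py log lines above; the helper name `similarity_score` is hypothetical and is not part of ERNIE or PaddleNLP.

```python
import ast


def similarity_score(raw_output: str) -> float:
    """Return p_similar from a '[p_dissimilar, p_similar]' string (hypothetical helper)."""
    probs = ast.literal_eval(raw_output)  # safely parse the list literal
    if not isinstance(probs, (list, tuple)) or len(probs) != 2:
        raise ValueError(f"expected two probabilities, got: {raw_output!r}")
    p_dissimilar, p_similar = (float(p) for p in probs)
    # Per the reply above, the two values should sum to 1.0 (up to rounding).
    if abs(p_dissimilar + p_similar - 1.0) > 1e-4:
        raise ValueError("probabilities do not sum to 1.0")
    return p_similar


# Value copied from the first log line posted in this thread.
score = similarity_score("[0.033826589584350586, 0.9661734104156494]")
print(f"similarity = {score:.4f}")  # prints: similarity = 0.9662
```

Read this way, the washing-machine pair scores about 0.97 and the unrelated pair about 0.40.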

lbz0920 commented 2 years ago

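('在家电脑做什么兼职好呢\t海尔全自动洗衣机', '[0.22760319709777832, 0.7723968029022217]'): these two short texts are completely different, so why is the semantic-similarity probability as high as 0.77? How can I obtain a similarity score for the texts?
How should text-similarity comparison be implemented in C++? Should I use the ERNIE project (a model trained from the ernie_3.0_base_ch pretrained model) with the Paddle Inference library, or the PaddleNLP project (the ernie_gram model) with the Paddle Inference library? Is this approach correct? I have been unsure all along whether to train with the ERNIE 3.0 model or with PaddleNLP's ernie_gram model, and would appreciate guidance. Should I train with the ERNIE\applications\tasks\text_matching example code and the ErnieMatchingSiamesePairwise model, or use a model trained with PaddleNLP\applications\neural_search\ranking\ernie_matching (PaddleNLP\examples\text_matching\ernie_matching)?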

w5688414 commented 2 years ago

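Quoting lbz0920:

"('在家电脑做什么兼职好呢\t海尔全自动洗衣机', '[0.22760319709777832, 0.7723968029022217]'): these two short texts are completely different, so why is the semantic-similarity probability as high as 0.77? How can I obtain a similarity score for the texts?

How should text-similarity comparison be implemented in C++? Should I use the ERNIE project (a model trained from the ernie_3.0_base_ch pretrained model) with the Paddle Inference library, or the PaddleNLP project (the ernie_gram model) with the Paddle Inference library? Is this approach correct? I have been unsure all along whether to train with the ERNIE 3.0 model or with PaddleNLP's ernie_gram model, and would appreciate guidance. Should I train with the ERNIE\applications\tasks\text_matching example code and the ErnieMatchingSiamesePairwise model, or use a model trained with PaddleNLP\applications\neural_search\ranking\ernie_matching (PaddleNLP\examples\text_matching\ernie_matching)?"

1. Which model are you using? Has it been trained on anything?
2. ernie-3.0 and ernie_gram differ only in how they were pretrained; there is little difference in how they are used, so you can choose whichever works better in your case. ERNIE\applications\tasks\text_matching, PaddleNLP\applications\neural_search\ranking\ernie_matching, and PaddleNLP\examples\text_matching\ernie_matching can all be used. The first two are recommended first; compare their results and choose the best model.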

lbz0920 commented 2 years ago

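1. ERNIE\applications\tasks\text_matching: I trained with the recommended ernie_3.0_base_ch pretrained model using run_train.py and the lcqmc dataset. The train.tsv of lcqmc has 0/1 labels, but the code requires a neg_title column. Is training this way a problem? Should I modify the ERNIE\applications\tasks\text_matching code so that the 0/1 label in the third column of the lcqmc test.tsv is converted into a random neg_title? In which file should that change be made? Or is there a more suitable text-similarity dataset?
2. PaddleNLP\applications\neural_search\ranking\ernie_matching: I used the pretrained model that the program downloads automatically (.paddlenlp\models\ernie-gram-zh), with train_pairwise.py and the lcqmc dataset. Can ernie-gram-zh plus the lcqmc dataset produce a pairwise model this way? Is the method correct?
3. I want to output a text-similarity score. Does that mean I cannot use the ernie_gram_zh_pointwise_matching_model.tar model? That model outputs 0 or 1, not scores such as 0.9 or 0.8.
4. Can ERNIE\applications\tasks\text_matching use a model trained with PaddleNLP\applications\neural_search\ranking\ernie_matching? Can the two use each other's trained models interchangeably?
5. For long-text similarity comparison, should the long text be split into segments of at most 512 characters, compared segment by segment, and the per-segment similarity scores then merged by some algorithm into one overall score? Is there an example of this? I would appreciate one.
6. For comparing texts by similarity score, which model and dataset would you recommend, and is there a download link?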
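The split-score-merge idea raised in question 5 can be sketched as follows. This is only an illustration under stated assumptions, not a method confirmed by the maintainers: `split_into_chunks`, `long_text_similarity`, and `toy_char_overlap` are hypothetical names, the 512-character limit is taken from the question, and the merge rule (best match per chunk, then the mean) is one arbitrary choice among many. In practice `score_pair` would wrap a trained matching model; if that model encodes both texts in a single sequence, the length limit applies to the pair, so a smaller `max_len` may be needed.

```python
from typing import Callable, List


def split_into_chunks(text: str, max_len: int = 512) -> List[str]:
    """Split text into consecutive pieces of at most max_len characters."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]


def long_text_similarity(
    text_a: str,
    text_b: str,
    score_pair: Callable[[str, str], float],
    max_len: int = 512,
) -> float:
    """Assumed merge rule: for each chunk of A take its best score against
    any chunk of B, then average those best scores. Note it is asymmetric."""
    chunks_a = split_into_chunks(text_a, max_len)
    chunks_b = split_into_chunks(text_b, max_len)
    if not chunks_a or not chunks_b:
        return 0.0
    best_scores = [max(score_pair(a, b) for b in chunks_b) for a in chunks_a]
    return sum(best_scores) / len(best_scores)


def toy_char_overlap(a: str, b: str) -> float:
    """Stand-in scorer (character-set Jaccard overlap); NOT a semantic model."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


print(split_into_chunks("abcdefghij", 4))
print(long_text_similarity("abcdabcd", "abcdabcd", toy_char_overlap, max_len=4))
```

Swapping the mean for a maximum or a length-weighted average changes how much a single matching passage influences the overall score, so the merge rule should be chosen against labeled examples from the target domain.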

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 60 days with no activity.

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 14 days since being marked as stale.