Results in examples/text_matching/simcse cannot be reproduced

PaddlePaddle / PaddleNLP

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting wide-range of NLP tasks from research to industrial applications, including 🗂Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis etc.

https://paddlenlp.readthedocs.io

Apache License 2.0

Results in examples/text_matching/simcse cannot be reproduced #3245

Closed jeffzhengye closed 1 year ago

jeffzhengye commented 2 years ago

The results in examples/text_matching/simcse cannot be reproduced. The dataset and hyperparameters in the README are already fixed, so running the example as-is should reproduce them, yet I cannot match the README results: nearly every dataset is about 10 percentage points lower. What could be causing this? Here is my training command:

python -u -m paddle.distributed.launch --gpus '0' \
    train.py \
    --device gpu \
    --save_dir ./checkpoints/ \
    --batch_size 64 \
    --learning_rate 5E-5 \
    --epochs 1 \
    --save_steps 100 \
    --eval_steps 100 \
    --max_seq_length 64 \
    --dropout 0.3 \
    --dup_rate 0.32 \
    --warmup_proportion 0.1 \
    --train_set_file "./senteval_cn/LCQMC/train.txt" \
    --test_set_file "./senteval_cn/LCQMC/dev.tsv"

JunnYu commented 2 years ago

Hi, I found that PR https://github.com/PaddlePaddle/PaddleNLP/pull/2728 changed the pretrained model from ernie-1.0 to ernie-3.0-medium-zh. To reproduce the README results, change the pretrained model in train.py back to ernie-1.0.

I did not run it to completion, but after changing the model the scores are no longer 10 points lower.

jeffzhengye commented 2 years ago

I'll try it, but in principle the gap should not be this large. Besides, 3.0 should perform better, shouldn't it?

JunnYu commented 2 years ago

ernie-3.0-medium-zh has "num_hidden_layers": 6, while ernie-1.0 (base) has "num_hidden_layers": 12. The difference in depth accounts for some of the difference in quality.

jeffzhengye commented 2 years ago

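Depth does not seem to be the key factor. I switched to "ernie-3.0-base-zh", which also has 12 layers, and the results are still about 10 points lower. Could there be some other cause?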

Below is a comparison of the ernie-1.0 and ernie-3.0-base-zh configurations.

"ernie-1.0": {
    "attention_probs_dropout_prob": 0.1,
    "hidden_act": "relu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "initializer_range": 0.02,
    "max_position_embeddings": 513,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "type_vocab_size": 2,
    "vocab_size": 18000,
    "pad_token_id": 0,
},
"ernie-3.0-base-zh": {
    "attention_probs_dropout_prob": 0.1,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "initializer_range": 0.02,
    "max_position_embeddings": 2048,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "task_type_vocab_size": 3,
    "type_vocab_size": 4,
    "use_task_id": True,
    "vocab_size": 40000
},

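To see at a glance where the two configurations diverge, a small diff can help. The sketch below is only an illustration: the dictionaries are copied from the comment above, while `diff_configs` and the `ABSENT` placeholder are names made up for this example and are not part of PaddleNLP.

```python
# Configurations copied from the comment above.
ERNIE_1_0 = {
    "attention_probs_dropout_prob": 0.1,
    "hidden_act": "relu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "initializer_range": 0.02,
    "max_position_embeddings": 513,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "type_vocab_size": 2,
    "vocab_size": 18000,
    "pad_token_id": 0,
}
ERNIE_3_0_BASE_ZH = {
    "attention_probs_dropout_prob": 0.1,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "initializer_range": 0.02,
    "max_position_embeddings": 2048,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "task_type_vocab_size": 3,
    "type_vocab_size": 4,
    "use_task_id": True,
    "vocab_size": 40000,
}

ABSENT = "<absent>"  # hypothetical placeholder for keys present in only one config


def diff_configs(left, right):
    """Return {key: (left_value, right_value)} for every key whose value differs."""
    diff = {}
    for key in sorted(set(left) | set(right)):
        a = left.get(key, ABSENT)
        b = right.get(key, ABSENT)
        if a != b:
            diff[key] = (a, b)
    return diff


if __name__ == "__main__":
    for key, (v1, v3) in diff_configs(ERNIE_1_0, ERNIE_3_0_BASE_ZH).items():
        print(f"{key}: ernie-1.0={v1!r}  ernie-3.0-base-zh={v3!r}")
```

For these two dictionaries the differing keys are hidden_act, max_position_embeddings, pad_token_id, task_type_vocab_size, type_vocab_size, use_task_id and vocab_size. The layer count and both dropout probabilities are identical, which is consistent with the observation that depth alone does not explain the gap.
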
JunnYu commented 2 years ago

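There are several possible causes:

1. How the two models were pretrained: for example, whether knowledge was integrated (ernie-1.0 incorporates knowledge information during pretraining) and whether an NSP task was used.
2. The datasets and the amount of training time used to pretrain each model.
3. The downstream fine-tuning hyperparameters may work very well for ernie-1.0 but poorly for 3.0. (This is the most likely cause!)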

JunnYu commented 2 years ago

After I changed dropout=0.3 to dropout=0.1, the ernie-3.0 base result is no longer 10 points lower; in my test it reached 0.5895.

python -u -m paddle.distributed.launch --gpus '5' \
    train.py \
    --device gpu \
    --save_dir ./checkpoints/ \
    --batch_size 64 \
    --learning_rate 5E-5 \
    --epochs 1 \
    --save_steps 100 \
    --eval_steps 100 \
    --max_seq_length 64 \
    --dropout 0.1 \
    --train_set_file "./senteval_cn/LCQMC/train.txt" \
    --test_set_file "./senteval_cn/LCQMC/dev.tsv"
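Since the result appears to hinge on `--dropout`, one way to check the sensitivity is to generate the same command for several dropout values. The following is a rough sketch under stated assumptions: the flags and paths mirror the command above (which, unlike the original report, omits `--dup_rate` and `--warmup_proportion`), while `build_command`, the per-run checkpoint directory naming, and the extra value 0.2 are invented for illustration. The script only prints the commands; it does not launch training.

```python
import shlex

# Flags taken from the command above; values are kept as written there.
BASE_ARGS = {
    "device": "gpu",
    "save_dir": "./checkpoints/",
    "batch_size": 64,
    "learning_rate": "5E-5",
    "epochs": 1,
    "save_steps": 100,
    "eval_steps": 100,
    "max_seq_length": 64,
    "train_set_file": "./senteval_cn/LCQMC/train.txt",
    "test_set_file": "./senteval_cn/LCQMC/dev.tsv",
}


def build_command(dropout, gpu="0"):
    """Build the argv list for one training run with the given dropout."""
    args = dict(BASE_ARGS)
    # Hypothetical layout: one checkpoint directory per dropout value,
    # so that runs do not overwrite each other.
    args["save_dir"] = f"./checkpoints/dropout_{dropout}"
    args["dropout"] = dropout
    cmd = ["python", "-u", "-m", "paddle.distributed.launch", "--gpus", gpu, "train.py"]
    for name, value in args.items():
        cmd += [f"--{name}", str(value)]
    return cmd


if __name__ == "__main__":
    # 0.1 and 0.3 come from the thread; 0.2 is an assumed intermediate value.
    for dropout in (0.1, 0.2, 0.3):
        print(shlex.join(build_command(dropout)))
```

Running each printed command several times would also show how much of the difference is run-to-run variance rather than the hyperparameter itself.
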

jeffzhengye commented 2 years ago

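@JunnYu I have also found that the results are very sensitive to the hyperparameters. Interestingly, as in your experiment, the model is already close to its best result at around global step 700, while one epoch has more than 10,000 steps (batch_size=32, LCQMC). In other words, the model is tuned after seeing only a very small fraction of the data. That is hard to explain, and the run-to-run variance is also large. Deep learning models usually improve when trained on more data.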

PS: If I train for the full epoch, the final spearman_corr I get is negative.

JunnYu commented 2 years ago

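I suspect that because this task is relatively simple, a small sample of the data is enough to reach good results. With too much training, the model overfits and its quality drops sharply (I also found that the results are very poor after a full epoch).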
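If the score peaks early and then collapses, one common safeguard is to keep the best evaluation score and stop once it has not improved for a few evaluations. The sketch below is a generic illustration and not the logic in train.py: `BestScoreTracker` and `patience` are hypothetical names, and the scores in the demo are made-up numbers, not results from this thread.

```python
class BestScoreTracker:
    """Track the best evaluation score and signal when to stop training."""

    def __init__(self, patience):
        self.patience = patience          # evaluations allowed without improvement
        self.best_score = float("-inf")
        self.best_step = None
        self.bad_evals = 0

    def update(self, step, score):
        """Record one evaluation; return True when training should stop."""
        if score > self.best_score:
            self.best_score = score
            self.best_step = step
            self.bad_evals = 0            # improvement resets the counter
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


if __name__ == "__main__":
    # Synthetic scores for illustration only: a rise followed by a decline.
    history = [(100, 0.40), (200, 0.55), (300, 0.58), (400, 0.57), (500, 0.50), (600, 0.30)]
    tracker = BestScoreTracker(patience=3)
    for step, score in history:
        if tracker.update(step, score):
            print(f"stopping at step {step}; best step was {tracker.best_step}")
            break
```

With `--eval_steps 100`, a patience of 3 would mean stopping after 300 steps without improvement. The right value is a judgment call, given the variance reported above.
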

jeffzhengye commented 2 years ago

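I still think something is wrong with this implementation. I swapped in another ERNIE model, "rocketqa-zh-dureader-query-encoder": without training, corr=0.6774420635192209; training for 1000 steps: best checkpoint has been updated: from last best_score 0.6571129036944603 --> new score 0.6774420635192209. That is much better than the current best result, but performance dropped on both downstream tasks I tried.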

So: 1. Is the implementation wrong? 2. Or does the Spearman metric provide no guidance for downstream tasks?

w5688414 commented 2 years ago

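RocketQA is a retrieval model for open-domain question answering and has already been trained on supervised data. Which tasks did you test? Are the downstream tasks related to the current task? The Spearman coefficient we use is the metric used in the paper.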
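For readers unfamiliar with the metric, here is a minimal pure-Python sketch of Spearman rank correlation between predicted similarity scores and gold labels. It is not the evaluation code from the example; the function names are chosen for this illustration, and the demo values are made up.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Made-up cosine similarities and binary labels for four sentence pairs.
    similarities = [0.9, 0.2, 0.7, 0.4]
    labels = [1, 0, 1, 0]
    print(spearman(similarities, labels))
```

Because only ranks enter the computation, the score measures whether similar pairs are ordered above dissimilar ones, not whether the similarity values themselves are calibrated. That may be one reason a higher score does not automatically carry over to a different downstream task.
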

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 60 days with no activity.

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 14 days since being marked as stale.