Some questions about the metric when reproducing NER on CoNLL2003

THUDM / P-tuning-v2

An optimized deep prompt tuning strategy comparable to fine-tuning across scales and tasks

Apache License 2.0

Some questions about the metric when reproducing NER on CoNLL2003 #43

Closed Luohuarucanxue closed 1 year ago

Luohuarucanxue commented 1 year ago

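While reproducing PT2 on CoNLL2003 NER, I used the data from Hugging Face and computed the score with the metric provided in the source code. The roberta-large model returned an f1_score of 95+ on the validation set. I then ran full fine-tuning on roberta-large, and after only 3 epochs it exceeded the fine-tuning F1 baseline reported in the paper by 1%. My question: are the results reported in the paper the overall_f1 returned directly by the seqeval metric, or do they go through some additional computation? Also, could you provide the original dataset files used to train PT2 on CoNLL2004? Thanks!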

Xiao9905 commented 1 year ago

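@Luohuarucanxue Hi,

For the NER and SRL tasks, we did not use the datasets that Huggingface Datasets downloads automatically. Please follow the instructions in our README to download the files used for training on CoNLL03 and CoNLL04.

Judging from the results on Papers with Code, the highest F1 currently reported on CoNLL03 appears to be only 94.6. My guess is that there is a problem with the data or the script provided by Huggingface datasets.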
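
To make the metric question concrete, here is a minimal sketch of an entity-level, micro-averaged F1 of the kind that seqeval's overall_f1 is generally understood to report. It is not seqeval's code and not the script used in this repository. The helper names `extract_spans` and `entity_f1` are hypothetical, and the lenient handling of a stray `I-` tag is an assumption. Comparing a reproduced score against a number like this can help tell a metric mismatch from a data mismatch.

```python
from typing import List, Set, Tuple

# (sentence index, entity type, start token, end token)
Span = Tuple[int, str, int, int]


def extract_spans(tags: List[str], sent_idx: int = 0) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sentence (hypothetical helper)."""
    spans: Set[Span] = set()
    start, ent_type = None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing sentinel closes an open span
        prefix, _, label = tag.partition("-")
        # An open span ends at O, at a new B-, or at an I- of a different type.
        if start is not None and (prefix != "I" or label != ent_type):
            spans.add((sent_idx, ent_type, start, i - 1))
            start, ent_type = None, None
        # B- opens a span; a stray I- also opens one (assumed lenient behaviour).
        if prefix in ("B", "I") and start is None:
            start, ent_type = i, label
    return spans


def entity_f1(y_true: List[List[str]], y_pred: List[List[str]]) -> float:
    """Micro-averaged F1 over exact (type, start, end) span matches."""
    gold: Set[Span] = set()
    pred: Set[Span] = set()
    for idx, (t, p) in enumerate(zip(y_true, y_pred)):
        gold |= extract_spans(t, idx)
        pred |= extract_spans(p, idx)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example: 3 of 4 tokens are right, but only 1 of 2 entities matches exactly.
y_true = [["B-PER", "I-PER", "O", "B-LOC"]]
y_pred = [["B-PER", "I-PER", "O", "B-ORG"]]
print(entity_f1(y_true, y_pred))
```

In the toy example the token-level accuracy is 0.75 while the entity-level F1 is lower, since a span counts as correct only when its type and boundaries both match. A score that looks too high is therefore worth checking against both the data files and the level at which the metric is computed.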
