Is the pretraining data available?

lemuria-wchen commented 3 years ago

I mean the data mentioned in the paper: "Large-scale conversation datasets – Twitter (Cho et al., 2014) and Reddit (Zhou et al., 2018; Galley et al., 2019) are employed for pretraining, which results in 8.3 million training samples in total."

sserdoubleh commented 3 years ago

You can use the sources cited in the paper:

Twitter: https://github.com/marsan-ma/chat_corpus

Reddit: http://coai.cs.tsinghua.edu.cn/hml/dataset/#commonsense and https://github.com/mgalley/DSTC7-End-to-End-Conversation-Modeling/tree/master/data_extraction

lemuria-wchen commented 3 years ago

http://coai.cs.tsinghua.edu.cn/hml/dataset/#commonsense

This link is dead.

lemuria-wchen commented 3 years ago

So it looks like we have to crawl the Reddit data ourselves?

PaddlePaddle / Research

Is the pretraining data available? #109