Open lemuria-wchen opened 3 years ago
http://coai.cs.tsinghua.edu.cn/hml/dataset/#commonsense
This link no longer works.
-----Original Message----- From: sserdoubleh notifications@github.com Sent: 2020-11-25 14:20:07 (Wednesday) To: PaddlePaddle/Research Research@noreply.github.com Cc: lemuria-wchen 18110980003@fudan.edu.cn, Author author@noreply.github.com Subject: Re: [PaddlePaddle/Research] Is there any data available for pretraining? (#109)
You can refer to the datasets cited in the paper:
Twitter: https://github.com/marsan-ma/chat_corpus
Reddit: http://coai.cs.tsinghua.edu.cn/hml/dataset/#commonsense https://github.com/mgalley/DSTC7-End-to-End-Conversation-Modeling/tree/master/data_extraction
It looks like the Reddit data has to be crawled yourself?
I mean the data mentioned in the paper: "Large-scale conversation datasets – Twitter (Cho et al., 2014) and Reddit (Zhou et al., 2018; Galley et al., 2019) are employed for pretraining, which results in 8.3 million training samples in total."