How was the Newsroom dataset for Summarization preprocessed? Or does it come already preprocessed when downloaded?

fastnlp / fastNLP

fastNLP: A Modularized and Extensible NLP Framework. Currently still in incubation.

https://gitee.com/fastnlp/fastNLP

Apache License 2.0


Closed: fseasy closed this issue 4 years ago

fseasy commented 4 years ago

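Hi, and thank you very much. From the link here I was able to download the Newsroom dataset, which is otherwise very hard to obtain. Looking at the downloaded data and the description in the README: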

Download links for the preprocessed public datasets (CNN/DailyMail, Newsroom, arXiv, PubMed):

How exactly was this preprocessing done?

Judging from the result, tokenization and sentence splitting appear to have been applied. Which tools were used for each of these two steps? Many thanks!

yhcc commented 4 years ago

The tokenization was done with StanfordCoreNLP, and the sentence splitting with nltk.
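
For readers who want to reproduce something similar, here is a minimal sketch of a split-then-tokenize pipeline. It is not the script actually used for the released data. The order of the two steps, the function names `preprocess`, `naive_sent_split`, `naive_tokenize`, and `load_nltk_splitter`, and the regex stand-ins are all assumptions made for illustration. The only real library call is `nltk.tokenize.sent_tokenize`, which needs nltk and its Punkt data installed. No CoreNLP call is shown: a CoreNLP-backed tokenizer would be passed in through the `tokenize` argument.

```python
import re
from typing import Callable, List, Optional


def naive_sent_split(text: str) -> List[str]:
    """Stand-in splitter (assumption): break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def naive_tokenize(sentence: str) -> List[str]:
    """Stand-in tokenizer (assumption): runs of word characters or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def preprocess(
    text: str,
    sent_split: Callable[[str], List[str]] = naive_sent_split,
    tokenize: Callable[[str], List[str]] = naive_tokenize,
) -> List[List[str]]:
    """Split a document into sentences, then tokenize each sentence."""
    return [tokenize(sentence) for sentence in sent_split(text)]


def load_nltk_splitter() -> Optional[Callable[[str], List[str]]]:
    """Return nltk's sent_tokenize, or None if nltk or its Punkt data is unavailable."""
    try:
        from nltk.tokenize import sent_tokenize
        sent_tokenize("Probe sentence. Another one.")  # fails here if Punkt data is missing
    except Exception:
        return None
    return sent_tokenize


if __name__ == "__main__":
    # Prefer nltk for sentence splitting, as described in the reply above;
    # fall back to the regex stand-in when nltk is not set up.
    splitter = load_nltk_splitter() or naive_sent_split
    doc = "The storm hit on Monday. Schools were closed!"
    for tokens in preprocess(doc, sent_split=splitter):
        print(" ".join(tokens))
```

Passing the splitter and tokenizer as arguments keeps the pipeline independent of any particular tool, so the regex stand-ins can be swapped for nltk and a CoreNLP client without changing `preprocess`. The stand-ins will not match StanfordCoreNLP's tokenization on contractions, hyphenated words, or abbreviations, so output produced with them should not be expected to line up with the released files.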