Tencent / NeuralNLP-NeuralClassifier

An Open-source Neural Hierarchical Multi-label Text Classification Toolkit

Thanks for sharing this! Just a question about the sample dataset provided. #42

wangcongcong123 closed this issue 4 years ago

wangcongcong123 commented 4 years ago

I went to the website (http://manikvarma.org/downloads/XC/XMLRepository.html) to download the rcv1-2 dataset, but I could only find a numeric form of it, i.e. the samples are given as feature representations instead of raw text. I'm curious how you converted it to the raw tokens used in your repository: data/rcv1_*.json -> "doc_token"

Thanks.

coderbyr commented 4 years ago

That is not the version of the RCV1 dataset (Reuters Corpus, Volume 1) that we used. You should be able to find a raw-token version instead.
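
For anyone comparing the two versions, here is a minimal sketch of how one might check whether a file such as data/rcv1_*.json carries raw tokens under "doc_token". It assumes each line of the file is a standalone JSON object and that "doc_token" holds a list of strings; neither is confirmed in this thread. The helper names and the sample record are hypothetical.

```python
import json


def read_doc_tokens(lines):
    """Yield the "doc_token" list from each non-empty JSON line.

    Assumption: one JSON object per line, with a "doc_token" key.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        yield record.get("doc_token", [])


def looks_like_raw_text(tokens):
    """Return True if at least one token is a non-numeric string."""
    return any(
        isinstance(t, str) and not t.replace(".", "", 1).isdigit()
        for t in tokens
    )


# Made-up record for illustration only; not taken from the repository.
sample = ['{"doc_label": ["A--B"], "doc_token": ["stocks", "rose", "today"]}']
tokens = next(read_doc_tokens(sample))
print(tokens, looks_like_raw_text(tokens))
```

To run this against a real file, pass an open file handle instead of `sample`, for example `read_doc_tokens(open(path, encoding="utf-8"))`, where `path` points to one of the JSON files.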