Tencent / NeuralNLP-NeuralClassifier

An Open-source Neural Hierarchical Multi-label Text Classification Toolkit

About RCV1-v2 Dataset Train-Dev-Test Split #26

Closed MemoriesJ closed 4 years ago

MemoriesJ commented 4 years ago

Hi,

Thank you for providing this useful toolkit. I have some questions about the train/dev/test split for the RCV1-v2 dataset.

I noticed that you follow the standard split provided by Lewis et al. (2004). However, that split does not define a standard dev set.

Could you give me the IDs (or the date range in the tag) of the raw documents in the dev set? Or, if you used a random split, could you tell me where the dev set is drawn from (i.e., the training set)?

coderbyr commented 4 years ago

Currently, it is not easy to get a public, processed RCV1 dataset online unless you follow this (https://trec.nist.gov/data/reuters/reuters.html). We downloaded the original RCV1 XML dataset, extracted the labels and texts, and then split it into train and test sets according to the standard train records.
The dev set is randomly drawn from the train set.
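
For readers reproducing this step, here is a minimal sketch of splitting extracted records into train and test by a set of standard training IDs. It is not the toolkit's actual preprocessing code: the function name `split_by_train_ids` and the `doc_id` field are hypothetical, and it assumes the records have already been parsed from the XML into dictionaries.

```python
from typing import Dict, Iterable, List, Set, Tuple


def split_by_train_ids(
    records: Iterable[Dict],
    train_ids: Set[int],
) -> Tuple[List[Dict], List[Dict]]:
    """Put a record in train if its ID is in train_ids, otherwise in test.

    Assumption: every record carries a hypothetical "doc_id" key holding
    the document ID taken from the raw data.
    """
    train, test = [], []
    for record in records:
        if record["doc_id"] in train_ids:
            train.append(record)
        else:
            test.append(record)
    return train, test
```

Passing the ID set in as an argument keeps the sketch independent of how the standard train records are stored.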

MemoriesJ commented 4 years ago

Hi. I can access the original data; thanks for pointing that out. But it seems the dev samples in your "data/rcv1_dev.json" are drawn from the test set... Could you provide the split ratio you used for drawing the dev set from the training set?

coderbyr commented 4 years ago

The datasets under "data/" are toy data. We randomly drew 10% of the original train set as dev and kept the rest as train.
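
A random 10% hold-out like this can be sketched as follows. This is an illustration only, not the script actually used: the function name `draw_dev_split`, the fixed seed, and rounding the dev size down are all assumptions.

```python
import random
from typing import List, Sequence, Tuple


def draw_dev_split(
    train_records: Sequence,
    dev_ratio: float = 0.1,
    seed: int = 42,
) -> Tuple[List, List]:
    """Randomly hold out a fraction of the train records as a dev set.

    Assumptions: the dev size is rounded down, and a fixed seed is used
    so that the split is reproducible.
    """
    indices = list(range(len(train_records)))
    random.Random(seed).shuffle(indices)

    n_dev = int(len(train_records) * dev_ratio)
    dev_indices = set(indices[:n_dev])

    # Preserve the original record order within each subset.
    train = [r for i, r in enumerate(train_records) if i not in dev_indices]
    dev = [r for i, r in enumerate(train_records) if i in dev_indices]
    return train, dev
```

With 23149 train records and a ratio of 0.1, this yields 2314 dev records and leaves 20835 for training.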

MemoriesJ commented 4 years ago

Hi,

Thanks again for your patience. The original train/test split is 23149 vs. 781264, so can I understand the train/dev/test split you used to be roughly (23149-2314) vs. 2314 vs. 781264?

coderbyr commented 4 years ago

Yes, you're right.