Roche / BalancedLossNLP


Dataset split and performance results #8

Open zqudm opened 1 year ago

zqudm commented 1 year ago

Hi, thank you for your excellent research.

The first question is about the way in which the dataset is split. Taking Reuters-21578 as an example: according to the README.md, the training and testing datasets can be downloaded from a website such as Kaggle. In other words, the dataset is already separated into training and testing sets in advance. Now I have another dataset; how should I split my dataset with imbalanced target labels into training, validation, and testing sets?

Furthermore, the code in dataset_prep.py continues to split the training dataset into train, validation, and test sets. I noted that the train_test_split function is used here, that is,
data_train, data_val = train_test_split(data_train_all, random_state=123, test_size=1000)

I just wonder why train_test_split is used here instead of iterative_train_test_split from http://scikit.ml/index.html, since this is a multi-label dataset.

The third question is: would you please explain the performance results listed in Table 2 of your paper in detail? Specifically, how are the Total miF/maF, Head(≥35) miF/maF, Med(8-35) miF/maF, and Tail(≤8) miF/maF computed? In particular, is the model built once with the training dataset and then tested on the total test data and on the head, med, and tail sub-datasets? Is this right?

thanks.

yangjenhao commented 1 year ago

I have a similar question: why are the Reuters Total miF/maF and the Reuters Multi-label miF/maF on the last page of the paper different? Thanks.

blessu commented 1 year ago

Hi, @zqudm, thanks for your questions. Please find my answers below (Q1 and Q2 are quite related so they are answered together).

The first question is about the way in which the dataset is split. Taking Reuters-21578 as an example: according to the README.md, the training and testing datasets can be downloaded from a website such as Kaggle. In other words, the dataset is already separated into training and testing sets in advance. Now I have another dataset; how should I split my dataset with imbalanced target labels into training, validation, and testing sets?

Furthermore, the code in dataset_prep.py continues to split the training dataset into train, validation, and test sets. I noted that the train_test_split function is used here, that is, data_train, data_val = train_test_split(data_train_all, random_state=123, test_size=1000)

I just wonder why train_test_split is used here instead of iterative_train_test_split from http://scikit.ml/index.html, since this is a multi-label dataset.

I see your point about better data stratification of multi-label datasets, and I agree iterative stratification is quite decent. Unfortunately, we used random sampling for the train-val-test splits, because (1) we didn't come across the scikit-multilearn package you mentioned, and (2) the number of labels and instances in PubMed is quite large, and we were evaluating by groups of labels (rather than by ranking). Back to the first question: if you have another dataset with statistics similar to Reuters, I would also vote for iterative stratification; for the "less decent" random sampling, you may refer to the dataset_prep step of the PubMed dataset.
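For reference, a minimal sketch of iterative stratification with scikit-multilearn (not from the repository; the variable names, toy data, and the 10% split fractions are illustrative, and it assumes the documents are already vectorized into a feature matrix X with a binary label indicator matrix y):

```python
import numpy as np
from skmultilearn.model_selection import iterative_train_test_split

# Toy data: X is (n_samples, n_features), y is a binary (n_samples, n_labels) matrix.
X = np.random.rand(100, 20)
y = (np.random.rand(100, 5) > 0.7).astype(int)

# First carve out a test set, then split the remainder into train/validation,
# keeping the per-label frequencies roughly balanced across the splits.
X_rest, y_rest, X_test, y_test = iterative_train_test_split(X, y, test_size=0.1)
X_train, y_train, X_val, y_val = iterative_train_test_split(X_rest, y_rest, test_size=0.1)
```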

The third question is: would you please explain the performance results listed in Table 2 of your paper in detail? Specifically, how are the Total miF/maF, Head(≥35) miF/maF, Med(8-35) miF/maF, and Tail(≤8) miF/maF computed? In particular, is the model built once with the training dataset and then tested on the total test data and on the head, med, and tail sub-datasets? Is this right?

You are mostly right. First, the model is trained on the training set; then, the classification threshold is selected by the best micro-F1 score on the validation set; finally, performance is evaluated on the test set. Please note that the terms total, head, med, and tail in the performance evaluation refer to labels rather than instances (they are not "sub-datasets"). For example, the Reuters test set has ~3000 instances and ~90 labels: "Total miF/maF" evaluates the overall miF/maF of all 90 labels on the 3000 instances, while "Head(≥35) miF/maF" evaluates the overall miF/maF of the 30 head labels (each with ≥35 instances in the training set) on the same 3000 instances.
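As an illustration of this pipeline, a minimal sketch (not the repository's code; it assumes sigmoid probabilities from the trained model, binary indicator targets, and hypothetical variable names and threshold grid):

```python
import numpy as np
from sklearn.metrics import f1_score

# val_probs / test_probs: (n_instances, n_labels) predicted probabilities
# val_true / test_true:   (n_instances, n_labels) binary ground-truth matrices
# train_true:             (n_train, n_labels) training label matrix, used for grouping

# 1) Pick one global threshold that maximizes micro-F1 on the validation set.
thresholds = np.arange(0.05, 0.95, 0.05)
best_t = max(thresholds,
             key=lambda t: f1_score(val_true, (val_probs >= t).astype(int), average='micro'))

# 2) Evaluate on the full test set with that threshold.
test_pred = (test_probs >= best_t).astype(int)
total_miF = f1_score(test_true, test_pred, average='micro')
total_maF = f1_score(test_true, test_pred, average='macro')

# 3) Group-wise scores use the same test instances but only a subset of label columns,
#    e.g. head labels = labels with >=35 instances in the training set.
head_idx = np.where(train_true.sum(axis=0) >= 35)[0]
head_miF = f1_score(test_true[:, head_idx], test_pred[:, head_idx], average='micro')
head_maF = f1_score(test_true[:, head_idx], test_pred[:, head_idx], average='macro')
```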

Please let me know if there are any questions.

blessu commented 1 year ago

Hi @YangJenHao, thanks for your question.

If you are referring to Table 3 in the Appendix, the evaluation is mainly described in the section "A.2 Additional Effectiveness Check", especially:

For the Reuters dataset, we split the test instances into two groups, 2583 instances with only one label and 436 instances with multiple labels.

Let's take the Reuters test set with ~3000 instances and ~90 labels as an example again. "Total miF/maF" evaluates the overall miF/maF of all 90 labels on the 3000 instances, so the numbers in this column are exactly the same as in Table 2. "Multi-label miF/maF" evaluates the overall miF/maF of the 90 labels on the 436 instances (each with ≥2 labels in the test set).
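A minimal sketch of this instance-level grouping (again assuming binary indicator matrices test_true/test_pred as above; the variable names are illustrative):

```python
import numpy as np
from sklearn.metrics import f1_score

# Keep only test instances that carry two or more gold labels,
# then score all labels on that subset (the "Multi-label" column).
multi_mask = test_true.sum(axis=1) >= 2
multi_miF = f1_score(test_true[multi_mask], test_pred[multi_mask], average='micro')
multi_maF = f1_score(test_true[multi_mask], test_pred[multi_mask], average='macro')
```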