THUDM / GATNE

Source code and dataset for KDD 2019 paper "Representation Learning for Attributed Multiplex Heterogeneous Network"

Evaluation threshold causes information leakage #53

Closed: supfisher closed this issue 4 years ago

supfisher commented 4 years ago

Hi. In the evaluation function, you set the threshold to the true_num-th value of the sorted list of predicted scores. However, how can we know the number of true edges in advance, before making predictions? Isn't this information leakage during the training process? The evaluation procedure is confusing. Please take a look and give an explanation. Thanks.
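For reference, here is a minimal NumPy sketch of the strategy I mean (my own illustration, not the repository's exact code): the cutoff is the true_num-th largest predicted score, so exactly as many edges are predicted positive as there are true edges.

```python
import numpy as np

def f1_with_known_positive_count(scores, labels):
    """F1 when the number of true edges (true_num) is known at evaluation time.
    Assumes the test set contains at least one true edge."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    true_num = int(labels.sum())            # number of true edges, known here
    threshold = np.sort(scores)[-true_num]  # true_num-th largest predicted score
    preds = (scores >= threshold).astype(int)
    tp = int(((preds == 1) & (labels == 1)).sum())
    if tp == 0:
        return 0.0
    precision, recall = tp / preds.sum(), tp / true_num
    return 2 * precision * recall / (precision + recall)
```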

cenyk1230 commented 4 years ago

Hi @supfisher,

Thank you for your attention. For unsupervised network embedding, some previous methods assume that the number of labels for the test data is given when computing the F1 score (see references [27, 29, 37] in our paper). It does not cause information leakage, since this information is not used for model training. By the way, the other two metrics, i.e., ROC-AUC and PR-AUC, do not rely on the threshold at all.
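As a quick illustration with scikit-learn (the scores and labels below are made up): both AUC metrics are computed from the raw scores over all possible cutoffs, so no single threshold, and hence no true_num, is involved.

```python
from sklearn.metrics import roc_auc_score, average_precision_score

labels = [1, 0, 1, 1, 0, 0]              # hypothetical ground-truth edges
scores = [0.9, 0.4, 0.7, 0.2, 0.6, 0.1]  # hypothetical predicted scores

print(roc_auc_score(labels, scores))             # ROC-AUC, threshold-free
print(average_precision_score(labels, scores))   # PR-AUC, threshold-free
```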

v587su commented 4 years ago

> Hi @supfisher,
>
> Thank you for your attention. For unsupervised network embedding, some previous methods assume that the number of labels for the test data is given when computing the F1 score (see references [27, 29, 37] in our paper). It does not cause information leakage, since this information is not used for model training. By the way, the other two metrics, i.e., ROC-AUC and PR-AUC, do not rely on the threshold at all.

Hi @cenyk1230, so why do they need to know the number of true labels? If the test set consisted only of true cases, this evaluation would always yield a perfect score, which makes it meaningless.

cenyk1230 commented 4 years ago

Hi @v587su,

In my view, this strategy provides a relatively fair comparison between different methods, because every method reports its best F1-score (the point where precision, recall, and F1-score are all equal) based on its own predictions. If you set a fixed threshold (e.g., 0.5) for all methods, the distributions of their predicted scores may affect the results and lead to an unfair comparison. If you don't trust this F1-score strategy, you can rely on the two AUC metrics instead.
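To spell out why precision, recall, and F1 coincide here: the threshold admits exactly true_num predicted positives, so precision = TP / true_num and recall = TP / true_num are the same number, and so is their harmonic mean. A quick sanity check on random data (my own sketch, not the repository's code):

```python
import numpy as np

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)   # random "ground truth"
scores = rng.random(1000)                # random "predicted scores"

true_num = int(labels.sum())
threshold = np.sort(scores)[-true_num]   # admit exactly true_num positives
preds = (scores >= threshold).astype(int)

tp = int(((preds == 1) & (labels == 1)).sum())
print(tp / preds.sum(), tp / true_num)   # precision == recall (no score ties)
```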