d-ailin / GDN

Implementation code for the paper "Graph Neural Network-Based Anomaly Detection in Multivariate Time Series" (AAAI 2021)
MIT License

Some questions about the experimental results and threshold selection #59

Open cloudlessky opened 1 year ago

cloudlessky commented 1 year ago

Hello, thanks for sharing your excellent work! I have some questions about the experimental results and threshold selection. On the SWaT dataset, I ran experiments with the two threshold selection methods you provided (roughly as sketched below). When I use the maximum error on the validation set as the threshold, F1 is 0.5. When I use the second method and search for the optimal threshold on the test set, F1 is 0.80, which matches the results reported in the paper. Based on this, I have two questions:

  1. Which threshold selection method corresponds to the results reported in your paper? In your experiments, do the results of the two threshold selection methods differ greatly?

  2. As for the second threshold selection method, I understand that it selects the threshold that yields the highest F1 under the assumption that the test-set anomaly labels are known. But I have a question: test-set labels are not available in real-world settings, so is this reasonable? I see that other works in recent years also adopt the optimal-threshold method, so should we focus on the best F1 that is achievable in theory?
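
For reference, here is a minimal sketch of how I implemented the two strategies for the experiments above; the variable names (`val_scores`, `test_scores`, `test_labels`) are my own placeholders, not identifiers from the GDN codebase:

```python
import numpy as np
from sklearn.metrics import f1_score

def threshold_from_validation(val_scores):
    # Method 1: use the maximum anomaly score on the (normal-only) validation set.
    return np.max(val_scores)

def best_f1_threshold(test_scores, test_labels, n_candidates=400):
    # Method 2: sweep candidate thresholds and keep the one giving the best F1
    # on the test set (this requires the test labels).
    candidates = np.linspace(test_scores.min(), test_scores.max(), n_candidates)
    best_t, best_f1 = candidates[0], -1.0
    for t in candidates:
        f1 = f1_score(test_labels, (test_scores >= t).astype(int))
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```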

I look forward to receiving your reply. Thank you very much!

DevBySam7 commented 1 year ago

I have the same issue with the results. I can't get close to the results in the paper with the validation threshold, which is why I'm also curious about question one.

cloudlessky commented 1 year ago

This is an automatic reply from QQ Mail. Hello, your email has been received. Have a nice day.

d-ailin commented 1 year ago

Thanks for your interest in our work.

  1. The reported results are based on the validation-set threshold, so they may vary with different random seeds. For some seeds, the validation-based and best-F1 results are very close, but for other seeds there can be some variation.

  2. Yes, some works use the best F1 as the evaluation metric. Since F1 scores require threshold selection, I think a better evaluation could also include threshold-agnostic metrics, e.g., AUROC, together with F1-related metrics.
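
As a rough illustration (not code from this repo), AUROC could be reported alongside F1 with scikit-learn; `test_scores`, `test_labels`, and `threshold` below are placeholder names, and the threshold can come from either selection method:

```python
from sklearn.metrics import roc_auc_score, f1_score

def report_metrics(test_scores, test_labels, threshold):
    # test_scores: continuous anomaly scores (numpy array), test_labels: 0/1 ground truth.
    auroc = roc_auc_score(test_labels, test_scores)  # threshold-agnostic
    f1 = f1_score(test_labels, (test_scores >= threshold).astype(int))
    return {"AUROC": auroc, "F1": f1}
```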