GastonGarciaGonzalez / DC-VAE

The following scripts are available to reproduce the experiments carried out in...
Apache License 2.0

F1 and recall calculation #1

Open pmarantis opened 1 year ago

pmarantis commented 1 year ago

Hello, I would like to ask which function calculates the F1 and recall scores, since these metrics are not reported when running the test.py file. As a side note, I see that the dataset statistics in Table 1 (regarding the number of anomalies) reported in these two versions of the paper (https://www.colibri.udelar.edu.uy/jspui/bitstream/20.500.12008/31392/1/GMFGAC22.pdf and https://kdd-milets.github.io/milets2022/papers/MILETS_2022_paper_3436.pdf) are slightly different. Is there a reason for this (perhaps a different way of counting the anomalies)? Thank you in advance for your response!

GastonGarciaGonzalez commented 1 year ago

Hello @pmarantis,

the metrics were reported in these papers. The metric is specific to time series; it was published in:

N. Tatbul, T. J. Lee, S. Zdonik, M. Alam, and J. Gottschlich, “Precision and Recall for Time Series,” 32nd Conference on Neural Information Processing Systems (NeurIPS), 2018.

And the repository is here: https://github.com/CompML/PRTS

From "Mining Multivariate Time-Series for Anomaly Detection in Mobile Networks: On the Usage of Variational Auto Encoders and Dilated Convolutions":

"We therefore take the extended definitions of recall and precision as defined in [18] to generalize for ranges of anomalies, considering a correct detection if at least one of the samples between the start and the end of the actual anomaly are flagged by the model."

I hope that helps you.

GastonGarciaGonzalez commented 1 year ago

The difference comes from an update of the dataset after talking with the owner of the data: for example, in cases where we found two identical anomalies and one was labeled while the other was not.

pmarantis commented 1 year ago

@GastonGarciaGonzalez Thank you again for the response. I am aware of the metrics used and have already installed the prts library. My question is mainly about which function of the code returns the results (F1 and recall) reported in the paper, since the prts library is only used in the dc-vae script and recall and precision are not returned. Also, the evaluate.py script only returns the loss, reconstruction and KL terms, but not the F1, recall and precision metrics.

On another note, the alpha_definition.py script returns:

Alpha selection…
Alpha up: [2. 2. 4. 4. 6. 3. 3. 4. 2. 2. 3. 2.]
Alpha down: [2. 2. 3. 2. 2. 2. 2. 2. 2. 2. 3. 2.]
max_F1: [0.15302912 0.28447039 0.24721313 0.51601846 0.12329023 0.09295889 0.48826641 0.5064329 0.04081078 0.21319088 0.36475077 0. ]

indicating that the F1 values for the training set are pretty low. Is this normal behavior, or could it be an issue with the way I am running the scripts? (I am using the telco dataset provided in the repository with the settings as they are; I also tested with T=512 but the results were similar.)

Finally, regarding the dataset: after the update, which of the statistics and labelings did you find to be more accurate, since there is a substantial difference in the number of anomalies?

[Screenshot 2023-10-05 122824] [Screenshot 2023-10-05 123006]

In the dataset provided with the repository I found 3080 labeled anomalies in the testing set, indicating that about 1% of the testing data is anomalous. Thank you again for your time and help!
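For context, this is my current understanding of how the per-series alpha up / alpha down values above would translate into point-wise detections and a per-series F1. All the names here (x, mu, sigma, labels, alpha_up, alpha_down) and the synthetic data are my own assumptions for illustration, not the actual identifiers or logic of the DC-VAE scripts, so please correct me if the thresholding works differently:

```python
import numpy as np
from sklearn.metrics import f1_score

def detect_and_score(x, mu, sigma, labels, alpha_up, alpha_down):
    """x, mu, sigma, labels: arrays of shape (T,) for a single time series."""
    # Flag a sample as anomalous when it falls outside the predictive interval
    # mu - alpha_down*sigma .. mu + alpha_up*sigma (my assumption of the rule).
    pred = ((x > mu + alpha_up * sigma) | (x < mu - alpha_down * sigma)).astype(int)
    return f1_score(labels, pred, zero_division=0), pred

# Synthetic example: mostly normal samples plus a few injected outliers.
rng = np.random.default_rng(0)
T = 1000
mu = np.zeros(T)
sigma = np.ones(T)
x = rng.normal(mu, sigma)
labels = np.zeros(T, dtype=int)
labels[[100, 400, 700]] = 1
x[[100, 400, 700]] += 6.0        # injected anomalies well outside the interval

f1, pred = detect_and_score(x, mu, sigma, labels, alpha_up=4.0, alpha_down=4.0)
print(f"F1 = {f1:.3f}, anomalous fraction = {pred.mean():.3%}")
```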