arunppsg / TadGAN

Code for the paper "TadGAN: Time Series Anomaly Detection Using Generative Adversarial Networks"
MIT License
153 stars 34 forks

TadGAN does not work with the default setup #5

Open rruizdeaustri opened 3 years ago

rruizdeaustri commented 3 years ago

Hi,

I have tried to run the code with the current setup (number of epochs is 30) but I get

File "TadGAN/anomaly_detection.py", line 129, in find_scores
    precision = tp / (tp + fp)
ZeroDivisionError: division by zero

Any ideas about what is going on?

With Kind Regards, Roberto

arunppsg commented 3 years ago

Hi Roberto,

The dataset example-2_cpc_results.csv does not contain any positive (anomalous) points, hence tp=0. The model also classifies every point as negative, hence fp=0. The denominator tp + fp is therefore zero, which raises the error.

The attached dataset is not the right one for evaluating the model (sorry for the unnecessary hurdle), since it does not contain any anomalous points. I need to replace it with some other time series anomaly detection dataset. You can see here how to use the code with other datasets.
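Until the dataset is replaced, the division can be guarded. The following is only a minimal sketch of that idea, not the code in anomaly_detection.py; the helper name `safe_precision_recall` and the choice to return 0.0 for an undefined ratio are assumptions made for illustration.

```python
def safe_precision_recall(tp, fp, fn):
    """Return (precision, recall), falling back to 0.0 when a denominator is zero.

    Hypothetical helper: returning 0.0 for an undefined ratio is a convention
    chosen here for illustration, not something the repository defines.
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# No anomalies in the labels and none predicted: tp = fp = fn = 0.
print(safe_precision_recall(0, 0, 0))
```

With this guard the evaluation step finishes on a dataset without anomalies, although the resulting metrics carry no information in that case.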

Thanks, Arun

rruizdeaustri commented 3 years ago

Hi Arun,

Ok, then I'll try with another dataset.

Thanks a lot !

Best, Rbt

rruizdeaustri commented 3 years ago

Hi Arun,

I have labeled the nyc_taxi.csv dataset from NAB and have a question about how your code splits the data. As it is, 70% of the data is used for training and 30% for testing, but with this split the training data contain anomalies for this particular dataset. Since the method is unsupervised, shouldn't anomalies be excluded from training? I guess we want to learn the distribution of the, say, normal samples, right?

Thanks a lot !!

All the best, Roberto

arunppsg commented 3 years ago

Hi Roberto,

The anomalies are excluded from the training process. The anomaly values are used only for evaluation, not during training. Training uses only the time series signals. The generator learns the distribution of normal samples.
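For anyone who wants to drop labeled anomalous points from the training portion explicitly, here is a minimal sketch. It is not the repository's code: the function name `split_series`, the 0/1 label convention, and the integer `train_percent` parameter are assumptions for illustration. Note that dropping points breaks the regular spacing of the series, which may or may not be acceptable for a given dataset.

```python
def split_series(values, labels, train_percent=70):
    """Chronological split; labeled anomalies are removed from the training part only.

    values: list of floats, labels: list of 0/1 flags (1 = anomalous).
    Both names and the label convention are assumptions for this sketch.
    """
    if len(values) != len(labels):
        raise ValueError("values and labels must have the same length")
    cut = len(values) * train_percent // 100
    # Keep only points labeled normal in the training portion.
    train_values = [v for v, y in zip(values[:cut], labels[:cut]) if y == 0]
    # The test portion keeps everything, including its labels, for evaluation.
    return train_values, values[cut:], labels[cut:]


values = [float(i) for i in range(10)]
labels = [0, 0, 1, 0, 0, 0, 0, 0, 1, 0]
train_values, test_values, test_labels = split_series(values, labels)
print(train_values, test_values, test_labels)
```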

Cheers, Arun.

rruizdeaustri commented 3 years ago

Hi Arun,

Yes, this is what I expected, though in a blog post about the model in Orion I have seen that they use the whole time series (including anomalous timesteps). That is why I got confused.

I will split the data, pick just the normal data, and let you know whether the code works with this dataset as it does with the "official" implementation in Orion.

BTW, have you tried this dataset? I could send it to you in the right format for your code.

Thanks a lot !!

Best, Rbt

arunppsg commented 3 years ago

Hi Roberto,

Thanks for your interest. Training GANs is highly unstable and requires more computational power, which I currently do not have access to.

Best, Arun.

rruizdeaustri commented 3 years ago

Hi Arun,

In fact, I have been training the model, and the performance on this dataset is really poor compared with what is reported on the Orion webpage for the, say, official version.

I have used the default hyperparameters, which are identical to the ones used in the report by the Orion guys:

Accuracy 0.79
Precision 1.00
Recall 0.07
F1 Score 0.13

Any advice to improve this?
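As a quick consistency check on these numbers, the F1 score is the harmonic mean of precision and recall, so a recall of 0.07 caps it at a low value even with perfect precision. A small sketch (the helper name `f1_score` is only illustrative):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision 1.00 and recall 0.07, as reported above.
print(round(f1_score(1.00, 0.07), 2))  # 0.13
```

In other words, the model flags very few points, all of them correct, but misses most of the labeled anomalies.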

Thanks a lot !!!

Rbt

arunppsg commented 3 years ago

Hi Rbt,

I observed the same result. But the loss does seem to move in the right direction over successive epochs. I don't have any particular advice other than the following:

Best, Arun.

amanuel2 commented 3 years ago

Can one of you send a CSV file that works with this source code? (I get the same error) I can't find any online.

natkhosh commented 3 years ago

Hi, could you please send me your dataset? I'll try to use it in my diploma work. I have the same problem with datasets (I tried NAB too).

rruizdeaustri commented 3 years ago

Hi Arun,

Maybe I can send you the data and you can add it to the repo?

Rbt

arunppsg commented 3 years ago

Adding your data to the repo would be great. You can make a pull request with that data and I will merge it.

Best, Arun.

rruizdeaustri commented 3 years ago

Hi Arun,

I have created a branch called rruiz-branch, added the file nyc_taxi_new.csv to it, and made a pull request. Could you please merge it?

Best, Rbt

arunppsg commented 3 years ago

You need to create a pull request. I don't see any pull request currently.

AugustComte commented 2 years ago

Hi @arunppsg,

Firstly, thank you for this, it's super cool. I am new to this and have a few questions, which I hope are not too stupid, if you can indulge me.

Looking through this, I notice that both this and the Orion examples only use a value and a date column. Is it possible to make this work with additional regressors/columns, so-called Xregs, e.g. temperature, sales price, etc.?

Secondly, is it necessary to have labelled anomalies? The anomaly labels in my datasets were obtained from the deviation between the true value and the value predicted by an RNN, and I am expecting TadGAN to be better. So it does not seem appropriate to measure the GAN's performance against the results of the RNN; I was under the impression that TadGAN was unsupervised. All I really want is to get the anomaly scores. Does that mean I would need to delete the evaluation section of the code, or will it run regardless and output the outlier scores? Where can I get these?

Again, sorry if these are poor questions. I'm not sure I entirely understand the code.

Best August

arunppsg commented 2 years ago

Hello August,

  1. You can also use other variables, but for that you might need to change the model architecture. I am not sure how to change it. Maybe I will think it through and get back to you after some time. In the current architecture there is only one regressor; it is normalized first, and the input is then a window of data points (window size: 100 * 1). Consider reading through this paper on using multivariate time series with RNNs.
  2. Labelled anomalies are not necessary, since it is an unsupervised approach. Labels are only required to evaluate the model. Anomaly scores are computed as the product of the reconstruction error and the critic score. See the test function in anomaly_detection.py for the anomaly scores. To use it without labels, just create a dummy column called anomaly, or modify the code in main.py and anomaly_detection.py.
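The two points above can be sketched roughly as follows. This is an illustrative approximation, not the code in main.py or anomaly_detection.py: the function names `make_windows`, `anomaly_scores`, and `add_dummy_labels` are hypothetical, the stride of 1 is an assumption, and the reconstruction errors and critic scores are assumed to be already computed per time step.

```python
def make_windows(series, window_size=100):
    """Slice a univariate series into overlapping windows of length window_size.

    A stride of 1 is an assumption; series shorter than the window give [].
    """
    return [series[i:i + window_size]
            for i in range(len(series) - window_size + 1)]


def anomaly_scores(reconstruction_errors, critic_scores):
    """Element-wise product of reconstruction error and critic score."""
    if len(reconstruction_errors) != len(critic_scores):
        raise ValueError("both inputs must have the same length")
    return [r * c for r, c in zip(reconstruction_errors, critic_scores)]


def add_dummy_labels(rows, column="anomaly"):
    """Attach an all-zero label column so unlabeled data can pass through evaluation."""
    return [{**row, column: 0} for row in rows]


print(len(make_windows(list(range(105)))))
print(anomaly_scores([0.5, 2.0], [2.0, 3.0]))
print(add_dummy_labels([{"value": 1.0}]))
```

With a dummy all-zero label column the evaluation metrics are meaningless (and precision can again be undefined), so only the scores themselves should be read in that setup.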

Thanks!

The-Boyy commented 2 years ago

Excuse me, can you send me your dataset? Thanks!