GuansongPang / deviation-network

Source code of the KDD19 paper "Deep anomaly detection with deviation networks", weakly/partially supervised anomaly detection, few-shot anomaly detection, semi-supervised anomaly detection
GNU General Public License v3.0

Using deviation networks on custom datasets #3

Closed jakubkarczewski closed 4 years ago

jakubkarczewski commented 4 years ago

Hey again!

Do you have any advice on how to use deviation networks on real-world datasets?

My main concerns with real-world data:

  • Because of feature engineering, many columns are dependent, so the central limit theorem doesn't quite apply (the variables are not independent).
  • There are many missing values; imputation with the mean and mode helps, but it may skew the distribution even more.

With that in mind, do you have any advice on how to approach new datasets?

Can you think of any method that would let me check whether my dataset's anomaly scores fit a normal distribution and, if not, guide me towards another type of distribution?

Did you consider other distributions, or other distribution parameters, during your research?

I've seen that you coauthored another paper that solves a similar problem. Do you plan to release its implementation?

Thanks again and cheers, Kuba

GuansongPang commented 4 years ago

Hi Kuba,

I think running standard cross-validation before deploying the model is one way to approach it. There is no way for us to know the true distribution of the anomaly scores, but the Gaussian prior generally holds. I haven't considered other priors. The code for my other work may be released after the paper is accepted.
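For the question of whether anomaly scores look roughly Gaussian, a rough diagnostic on held-out scores could look like the sketch below. It is not part of this repository: the function name `normality_diagnostics` and the synthetic score lists are assumptions made for illustration, and the numbers it returns are descriptive only (no p-values), so they can suggest that a Gaussian assumption is doubtful but cannot confirm the true distribution.

```python
import random
import statistics
from statistics import NormalDist


def normality_diagnostics(scores):
    """Hypothetical helper: skewness, excess kurtosis and the largest gap
    between the empirical CDF and the CDF of a normal fitted to the scores."""
    n = len(scores)
    mu = statistics.fmean(scores)
    sigma = statistics.pstdev(scores, mu)
    # Standardise so that the moments are scale free.
    z = [(s - mu) / sigma for s in scores]
    skew = sum(v ** 3 for v in z) / n                 # about 0 for a Gaussian
    excess_kurtosis = sum(v ** 4 for v in z) / n - 3  # about 0 for a Gaussian
    # Kolmogorov-Smirnov style distance to the fitted normal.
    fitted = NormalDist(mu, sigma)
    ks_distance = 0.0
    for i, s in enumerate(sorted(scores)):
        cdf = fitted.cdf(s)
        ks_distance = max(ks_distance, abs(cdf - i / n), abs((i + 1) / n - cdf))
    return {
        "skew": skew,
        "excess_kurtosis": excess_kurtosis,
        "ks_distance": ks_distance,
    }


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-ins for held-out anomaly scores, not real model output.
    gaussian_like = [rng.gauss(0.0, 1.0) for _ in range(2000)]
    skewed = [rng.expovariate(1.0) for _ in range(2000)]
    print("gaussian-like:", normality_diagnostics(gaussian_like))
    print("skewed:       ", normality_diagnostics(skewed))
```

Values of `skew` and `excess_kurtosis` far from zero, or a large `ks_distance`, would hint that the scores depart from a Gaussian shape. Running the check separately on each cross-validation fold would show whether that impression is stable.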

As for feature engineering and handling missing values, they are outside my expertise.

Thanks, Guansong