SeldonIO / alibi-detect

Algorithms for outlier, adversarial and drift detection
https://docs.seldon.io/projects/alibi-detect/en/stable/

Tried the VAE example on the KDD dataset, got a very bad result, why? #394

Closed qihuagao closed 2 years ago

qihuagao commented 2 years ago

First of all, thank you for sharing this great project; it has helped me a lot.

I followed the procedure from the following link: https://docs.seldon.io/projects/alibi-detect/en/latest/examples/od_vae_kddcup.html, but I got a very bad result.

Is anything missing from this doc? I only got an F1 score of 0.3407, while the doc reports 0.9754.

Looking forward to any response.

5.0% outliers
New threshold: 0.2997805822912754
(1000, 18) (1000,)
10.0% outliers
['instance_score', 'feature_score', 'is_outlier']
F1 score: 0.3407

RobertSamoilescu commented 2 years ago

@qihuagao, can you please provide some code so we can reproduce the results you got? I ran the notebook myself and still got an F1 score of 0.9754.

Thanks!

qihuagao commented 2 years ago

> @qihuagao, can you please provide some code so we can reproduce the results you got? I ran the notebook myself and still got an F1 score of 0.9754.
>
> Thanks!

Thank you very much for the response. I just followed the link; the code is below.

import os
import logging
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.metrics import confusion_matrix, f1_score
import tensorflow as tf
tf.keras.backend.clear_session()
from tensorflow.keras.layers import Dense, InputLayer

from alibi_detect.datasets import fetch_kdd
from alibi_detect.models.tensorflow.losses import elbo
from alibi_detect.od import OutlierVAE
from alibi_detect.utils.data import create_outlier_batch
from alibi_detect.utils.fetching import fetch_detector
from alibi_detect.utils.saving import save_detector, load_detector
from alibi_detect.utils.visualize import plot_instance_score, plot_feature_outlier_tabular, plot_roc

logger = tf.get_logger()
logger.setLevel(logging.ERROR)

kddcup = fetch_kdd(percent10=True)  # only load 10% of the dataset
print(kddcup.data.shape, kddcup.target.shape)

# sample a training batch without outliers
np.random.seed(0)
normal_batch = create_outlier_batch(kddcup.data, kddcup.target, n_samples=400000, perc_outlier=0)
X_train, y_train = normal_batch.data.astype('float'), normal_batch.target
print(X_train.shape, y_train.shape)
print('{}% outliers'.format(100 * y_train.mean()))

# standardize the features with the training mean and standard deviation
mean, stdev = X_train.mean(axis=0), X_train.std(axis=0)
X_train = (X_train - mean) / stdev

load_outlier_detector = False
filepath = 'my_dir'  # change to directory (absolute path) where model is downloaded
detector_type = 'outlier'
dataset = 'kddcup'
detector_name = 'OutlierVAE'
filepath = os.path.join(filepath, detector_name)
if load_outlier_detector:  # load pretrained outlier detector
    od = fetch_detector(filepath, detector_type, dataset, detector_name)
else:  # define model, initialize, train and save outlier detector
    n_features = X_train.shape[1]
    latent_dim = 2

    encoder_net = tf.keras.Sequential(
        [
            InputLayer(input_shape=(n_features,)),
            Dense(20, activation=tf.nn.relu),
            Dense(15, activation=tf.nn.relu),
            Dense(7, activation=tf.nn.relu)
        ])

    decoder_net = tf.keras.Sequential(
        [
            InputLayer(input_shape=(latent_dim,)),
            Dense(7, activation=tf.nn.relu),
            Dense(15, activation=tf.nn.relu),
            Dense(20, activation=tf.nn.relu),
            Dense(n_features, activation=None)
        ])

    # initialize outlier detector
    od = OutlierVAE(threshold=None,  # threshold for outlier score
                    score_type='mse',  # use MSE of reconstruction error for outlier detection
                    encoder_net=encoder_net,  # can also pass VAE model instead
                    decoder_net=decoder_net,  # of separate encoder and decoder
                    latent_dim=latent_dim,
                    samples=5)
    # train
    od.fit(X_train,
           loss_fn=elbo,
           cov_elbo=dict(sim=.01),
           epochs=30,
           verbose=True)

    # save the trained outlier detector
    save_detector(od, filepath)

# sample a batch with a known percentage of outliers to infer the threshold
np.random.seed(0)
perc_outlier = 5
threshold_batch = create_outlier_batch(kddcup.data, kddcup.target, n_samples=1000, perc_outlier=perc_outlier)
X_threshold, y_threshold = threshold_batch.data.astype('float'), threshold_batch.target
X_threshold = (X_threshold - mean) / stdev
print('{}% outliers'.format(100 * y_threshold.mean()))
od.infer_threshold(X_threshold, threshold_perc=100-perc_outlier)
print('New threshold: {}'.format(od.threshold))
save_detector(od, filepath)

# sample a test batch with 10% outliers and detect them
np.random.seed(1)
outlier_batch = create_outlier_batch(kddcup.data, kddcup.target, n_samples=1000, perc_outlier=10)
X_outlier, y_outlier = outlier_batch.data.astype('float'), outlier_batch.target
X_outlier = (X_outlier - mean) / stdev
print(X_outlier.shape, y_outlier.shape)
print('{}% outliers'.format(100 * y_outlier.mean()))
od_preds = od.predict(X_outlier,
                      outlier_type='instance',    # use 'feature' or 'instance' level
                      return_feature_score=True,  # scores used to determine outliers
                      return_instance_score=True)
print(list(od_preds['data'].keys()))
labels = outlier_batch.target_names
y_pred = od_preds['data']['is_outlier']
f1 = f1_score(y_outlier, y_pred)
print('F1 score: {:.4f}'.format(f1))

# plot the confusion matrix
cm = confusion_matrix(y_outlier, y_pred)
df_cm = pd.DataFrame(cm, index=labels, columns=labels)
sns.heatmap(df_cm, annot=True, cbar=True, linewidths=.5)
plt.show()

jklaise commented 2 years ago

@qihuagao can you post what your environment is, i.e. Python version and versions of libraries (e.g. via `pip freeze` if using pip)?
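
For reference, a quick way to print the versions that matter here (a convenience snippet, not from the thread; `alibi_detect.__version__` and `tf.__version__` are standard attributes):

```python
# print the Python, TensorFlow and alibi-detect versions in use
import sys
import tensorflow as tf
import alibi_detect

print('python:', sys.version)
print('tensorflow:', tf.__version__)
print('alibi-detect:', alibi_detect.__version__)
```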

RobertSamoilescu commented 2 years ago

@qihuagao, thanks for sharing the code. We managed to replicate the issue. It seems the VAE is not training properly (we cannot tell for sure), as the lower score only shows up when training the model from scratch. We will look more into it and come back with an answer as soon as possible.

qihuagao commented 2 years ago

`pip freeze`:

```
absl-py==0.14.1 alibi-detect==0.7.2 astunparse==1.6.3 cached-property==1.5.2 cachetools==4.2.4 certifi==2021.5.30 charset-normalizer==2.0.6 clang==5.0 click==8.0.3 cloudpickle==2.0.0 cycler==0.10.0 dataclasses==0.8 decorator==5.1.0 dill==0.3.4 dm-tree==0.1.6 filelock==3.3.1 flatbuffers==1.12 gast==0.4.0 google-auth==1.35.0 google-auth-oauthlib==0.4.6 google-pasta==0.2.0 grpcio==1.34.1 h5py==3.1.0 huggingface-hub==0.0.19 idna==3.2 imageio==2.9.0 importlib-metadata==4.8.1 joblib==1.0.1 keras==2.6.0 keras-nightly==2.5.0.dev2021032900 Keras-Preprocessing==1.1.2 kiwisolver==1.3.1 Markdown==3.3.4 matplotlib==3.3.4 networkx==2.5.1 numpy==1.19.5 oauthlib==3.1.1 opencv-python==4.5.4.58 opt-einsum==3.3.0 packaging==21.0 pandas==1.1.5 Pillow==8.4.0 protobuf==3.18.1 pyasn1==0.4.8 pyasn1-modules==0.2.8 pyparsing==3.0.0 python-dateutil==2.8.2 pytz==2021.3 PyWavelets==1.1.1 PyYAML==6.0 regex==2021.10.23 requests==2.26.0 requests-oauthlib==1.3.0 rsa==4.7.2 sacremoses==0.0.46 scikit-image==0.17.2 scikit-learn==0.24.2 scipy==1.5.4 seaborn==0.11.2 six==1.15.0 sklearn==0.0 tensorboard==2.6.0 tensorboard-data-server==0.6.1 tensorboard-plugin-wit==1.8.0 tensorflow==2.5.1 tensorflow-estimator==2.5.0 tensorflow-probability==0.12.2 termcolor==1.1.0 threadpoolctl==3.0.0 tifffile==2020.9.3 tokenizers==0.10.3 tqdm==4.62.3 transformers==4.11.3 typing-extensions==3.7.4.3 urllib3==1.26.7 Werkzeug==2.0.2 wrapt==1.12.1 zipp==3.6.0
```

RobertSamoilescu commented 2 years ago

@qihuagao, we performed the following experiments:

We trained 5 VAEs and used them to detect the outliers, obtaining the following F1 scores: [0.39, 0.96, 0.41, 0.59, 0.39].
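
A rough sketch of this repeated-training experiment (my reconstruction; `build_and_fit_vae` is a hypothetical helper wrapping the model definition and `od.fit(...)` call from the code above):

```python
# train several VAE detectors from scratch and compare their F1 scores
f1_scores = []
for run in range(5):
    od = build_and_fit_vae(X_train)  # hypothetical: rebuild the nets and fit as above
    od.infer_threshold(X_threshold, threshold_perc=100 - perc_outlier)
    preds = od.predict(X_outlier, outlier_type='instance')
    f1_scores.append(f1_score(y_outlier, preds['data']['is_outlier']))
print(f1_scores)  # e.g. [0.39, 0.96, 0.41, 0.59, 0.39] in the runs above
```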

We selected the first VAE (which obtained 0.39) and the second VAE (which obtained 0.96) and plotted histograms of the instance score (iscore), depicted in the following figure:

[Screenshot from 2021-12-10 16-04-44: histograms of the instance scores for the two VAEs]

In both cases, we can identify 3 modes in the outlier score distribution. The problematic mode is the leftmost one, which lies below the inferred threshold. The existence of these modes is also noted in the current notebook.
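
A minimal sketch for reproducing this kind of plot, using the prediction dictionary and inferred threshold from the code above:

```python
# histogram of instance-level outlier scores against the inferred threshold
import matplotlib.pyplot as plt

iscore = od_preds['data']['instance_score']
plt.hist(iscore, bins=50)
plt.axvline(od.threshold, color='red', linestyle='--', label='threshold')
plt.xlabel('instance score')
plt.ylabel('count')
plt.legend()
plt.show()
```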

Unfortunately, this is an artifact of both the dataset and the method itself. We will follow up on this issue with a PR to ensure consistency across multiple executions, as the current implementation does not support it.
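
What "consistency across multiple executions" involves is my assumption here, but at a minimum it means fixing the NumPy and TensorFlow seeds before building and training the detector, e.g.:

```python
# fix the seeds so repeated runs are deterministic; this does not by itself
# remove the VAE's sensitivity to its initialization
import numpy as np
import tensorflow as tf

np.random.seed(0)
tf.random.set_seed(0)
# ... then build encoder_net / decoder_net and call od.fit(...) as above
```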

Please let me know if you need further clarification or support. Thank you!

RobertSamoilescu commented 2 years ago

Issue to ensure consistency across multiple executions: #407

qihuagao commented 2 years ago

> Issue to ensure consistency across multiple executions: #407

Thank you for sharing the result.

IMO, ensuring consistency can help to clarify this, but it definitely does not solve the problem; it only hides the insufficiency of the model. An F1 score below 0.5 is almost useless for any application. It would be better to improve the model's stability, e.g. by optimizing beta or the model complexity, so that this notebook could serve as a baseline VAE model in this scenario.

Thanks again for sharing this great project. @RobertSamoilescu

RobertSamoilescu commented 2 years ago

@qihuagao, to complete the previous comment: we did experiment with various hyperparameters (beta & sim), but the results were similar.
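
For illustration, a hedged sketch of the kind of sweep being discussed (beta is the KL weight in the `OutlierVAE` constructor, `sim` the value passed via `cov_elbo` to `fit`; `make_encoder`/`make_decoder` are hypothetical helpers returning fresh networks per run):

```python
# grid search over beta and sim, keeping the detector with the best F1
best_f1, best_od = 0.0, None
for beta in [0.1, 1.0, 10.0]:
    for sim in [0.01, 0.05]:
        od = OutlierVAE(threshold=None,
                        score_type='mse',
                        encoder_net=make_encoder(),  # hypothetical helper
                        decoder_net=make_decoder(),  # hypothetical helper
                        latent_dim=latent_dim,
                        samples=5,
                        beta=beta)
        od.fit(X_train, loss_fn=elbo, cov_elbo=dict(sim=sim), epochs=30, verbose=False)
        od.infer_threshold(X_threshold, threshold_perc=100 - perc_outlier)
        f1 = f1_score(y_outlier, od.predict(X_outlier)['data']['is_outlier'])
        if f1 > best_f1:
            best_f1, best_od = f1, od
```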