Anaxagor / applyBN


Thinking through use cases for the data anomaly detection module #3

Open Anaxagor opened 3 months ago

Anaxagor commented 3 months ago

For tabular data, a simple anomaly detection algorithm will most likely be implemented. For time series, the DBN learning algorithm will likely need substantial rework; we can also consider detecting anomalies based on changes in the DBN structure. In general, a BN can be treated as an auxiliary tool for anomaly detection.

Roman223 commented 3 weeks ago

Intro

The solution to be integrated is based on the article (Sha Lu et al., "Effective Outlier Detection based on Bayesian Network and Proximity"), namely ODBP. In this algorithm the BN must be linear so that the mean of a multivariate random variable (MRV) can be estimated as a linear combination of its parents' values.
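To restate the linearity requirement (our notation, not taken verbatim from the article): for a node $X_i$ with parent set $\mathrm{Pa}(X_i)$, the conditional mean is modeled as

$$\mu_{X_i} = \beta_0 + \sum_{X_j \in \mathrm{Pa}(X_i)} \beta_j x_j,$$

and the model outlier factor is the Z-score $|x_i - \mu_{X_i}| / \sigma_{X_i}$.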

Algorithm

If we want to use not only density estimation but also proximity estimation for outlier detection, we need a way to combine them. In this work the authors suggest simply summing up all outlier factors from the detectors: the greater the factor, the greater the chance that a given sample is an outlier.

Foundation

A density estimator stores information about how densely points are packed in space. In these terms, an outlier is a point whose values deviate too much from the expectation of the MRV. Note that if the values lie on the boundary of the point cloud, such a point is hard to detect. The authors solve this issue by adjusting the proximity metric.

Essence

Model estimation

Theirs: Given data D and the corresponding BN, for every node that has parents, estimate its mean as a linear combination of its parents' values. It is assumed that for a sample with a fixed child variable, the mean converges to its true value while the variance converges to zero. Any step away from it is treated as an anomaly either in the child variable or in the parent variables (or both). The final outlier factor is simply the Z-score for the given child node. Therefore, another limitation comes into play: we must guarantee a normal distribution for our estimations. Although no proof was provided, this statement probably holds by virtue of the linearity of the BN.

Ours: Since we have mixed data, we have no guarantee of linearity at all. Moreover, we cannot apply the Z-score with mean estimation in general, because for a conditional node we store a different model for every combination of parent values (not only Gaussian, it can be any), but we can still estimate the mean and thus measure the deviation from it with some error metric (such as MAE or MSE). Discrete RVs are encoded, and the codes are used as new RVs. Here is what is done:

import numpy as np
from sklearn.preprocessing import StandardScaler

node = self.estimator.bn[node_name]
diff = []
dist = self.estimator.bn.distributions[node_name]
parents = node.cont_parents + node.disc_parents

for _, row in X.iterrows():
    pvalues = row[parents].to_dict()

    pvals_bamt_style = [pvalues[parent] for parent in parents]
    cond_dist = self.estimator.bn.get_dist(node_name, pvals=pvalues)
    # not well written, subject to change after the bamt 2 release
    if isinstance(cond_dist, tuple):
        # get_dist returned (mean, variance) for a continuous node
        cond_mean = cond_dist[0]
    else:
        # otherwise cond_dist is a vector of class probabilities;
        # collect the discrete parent values to index the stored distribution
        dispvals = [pval for pval in pvals_bamt_style if isinstance(pval, str)]

        if "vals" in dist.keys():
            classes = dist["vals"]
        elif "classes" in dist.keys():
            classes = dist["classes"]
        elif "hybcprob" in dist.keys():
            if "classes" in dist["hybcprob"][str(dispvals)]:
                classes = dist["hybcprob"][str(dispvals)]["classes"]
            else:
                raise Exception("no classes found in hybcprob entry")
        else:
            raise Exception("unknown distribution format")
        # get the list of classes for the random variable
        # to calculate the weighted sum (expectation)
        classes_coded = np.asarray([self.encoding[node_name][class_name] for class_name in classes])
        cond_mean = classes_coded @ np.asarray(cond_dist).T
    if isinstance(row[node_name], str):
        # absolute error (MAE term) for a discrete, encoded node
        diff.append(abs(cond_mean - self.encoding[node_name][row[node_name]]))
    else:
        # absolute error (MAE term) for a continuous node
        diff.append(abs(cond_mean - row[node_name]))

# subject to vary, explanation in text
scaler = StandardScaler()
diff_scaled = scaler.fit_transform(np.asarray(diff).reshape(-1, 1))
return diff_scaled

Proximity estimation

Theirs: The idea is simple: randomly choose t columns from the initial dataset, then compute the Local Outlier Factor (LOF) for the resulting subsample.

LOF takes values less than 1 if the local density (the density of the neighbors) is high (so the point is not anomalous), around 1 if it is the same, and greater than 1 if the local density is low (outliers).

Again, it is assumed here that the obtained values are normally distributed and the Z-score can be applied.

A problem arises when columns are discrete: LOF cannot be calculated for them. This operation is the body of a loop; a single iteration is called a "detector" by the authors, and the number of detectors is a hyperparameter. PS: a simple explanation of LOF can be found here.

Ours: For now we simply drop discrete columns if they appear in a random subsample.
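A minimal sketch of one such detector, assuming scikit-learn's LocalOutlierFactor; the function name, t, and n_neighbors are our choices, not the module's actual API:

import numpy as np
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

def proximity_detector(X: pd.DataFrame, t: int, rng: np.random.Generator) -> np.ndarray:
    # discrete columns are dropped: LOF needs a metric space
    cont_cols = X.select_dtypes(include=np.number).columns
    cols = rng.choice(cont_cols, size=min(t, len(cont_cols)), replace=False)
    lof = LocalOutlierFactor(n_neighbors=20)
    lof.fit(X[cols])
    # sklearn stores -LOF; negate so that values > 1 mark outliers
    return -lof.negative_outlier_factor_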

On outlier score

After obtaining both local scores we need to merge them into one. The authors' suggestion is simply to sum up all non-zero $score_{ij}$ ($i$ is the sample index, $j$ the detector index).

On shapes

All local scores have shape (n, d), where n is the number of samples and d is the number of detectors / number of child nodes. There are two summations: first over the d-axis, then the proximity scores (n,) and the model scores (n,) are added together. The final score has shape (n,).
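A toy numpy sketch of the summations described above (shapes only; the random arrays stand in for real scores):

import numpy as np

n, d = 1000, 45
prox_scores = np.random.rand(n, d)   # (n, d): one column per proximity detector
model_scores = np.random.rand(n, d)  # (n, d): one column per child node

prox_total = prox_scores.sum(axis=1)    # first summation, shape (n,)
model_total = model_scores.sum(axis=1)  # shape (n,)
final_score = prox_total + model_total  # final score, shape (n,)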

First experiments will be posted as separate comments with the corresponding discussion.

Roman223 commented 1 week ago

Experiments with synthetic data

The ODBP algorithm was first tested on synthetic data in order to get a brief picture of its actual ability to distinguish outliers. As stated in the article, we need two types of outliers, namely density-based and model-based. But since we have nonlinear data we need to change the model scorer, so we have to use pointwise expectation estimation. Therefore, we used scalers carefully here. Note that during computation we removed proximity factors less than 0, since they correspond to inliers. Also, we scaled by the variance estimated from the data (i.e. conditional).
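A sketch of the two adjustments just mentioned (the variable names are ours, and the random arrays are toy stand-ins for the real intermediate results):

import numpy as np

prox_factors = np.random.randn(1000, 45)    # toy stand-in for detector outputs
diff = np.random.randn(1000)                # toy stand-in for model residuals
cond_variance = np.random.rand(1000) + 0.1  # toy stand-in for conditional variances

# proximity factors below zero correspond to inliers, so drop them
prox_factors = np.clip(prox_factors, 0.0, None)

# scale the model residuals by the conditional variance from the data
model_scores = diff / np.sqrt(cond_variance)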

Design

Four features were created in total: three of them are independent and one is a random linear combination of two of the independent features ($f = a x_1 + b x_2 + a b$, where $x_1, x_2$ are features from the data and $a, b$ are random constants). To test the nonlinear case we used $f = a x_1^2 + b x_2 + a b$.

import numpy as np
import pandas as pd

# linear dependency from the text: f = a*x1 + b*x2 + a*b
def f(x1, x2, a, b):
    return a * x1 + b * x2 + a * b

df = pd.DataFrame({"feature1": np.random.normal(scale=VAR_F1, size=1000, loc=MEAN_F1),
                   "feature2": np.random.normal(scale=VAR_F2, size=1000, loc=MEAN_F2),
                   "feature4": np.random.normal(scale=A, size=1000, loc=B),
                   })

df["feature3"] = f(df["feature1"], df["feature2"], A, B)

Anomaly injections

Out-of-density

We injected normal noise into 5% of the data.
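A minimal sketch of such an injection; which feature receives the noise is our assumption, and PROX_SHIFTS and ANOMALY_PROBA are the constants defined in the experiments below:

import numpy as np

idx = np.random.choice(df.index, size=int(ANOMALY_PROBA * len(df)), replace=False)
df.loc[idx, "feature1"] += np.random.normal(size=len(idx), **PROX_SHIFTS)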

Out-of-dependence

We learned a linear regression from feature1 to feature2, took some random points from feature1, added random noise to them, and predicted feature2; the result is points shifted off the regression line.
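A sketch of this procedure, assuming scikit-learn's LinearRegression; reusing PROX_SHIFTS as the noise parameters is our choice:

import numpy as np
from sklearn.linear_model import LinearRegression

# fit feature1 -> feature2 on the clean data
reg = LinearRegression().fit(df[["feature1"]], df["feature2"])

# perturb feature1 for a few random rows and predict feature2 from the
# perturbed values: the resulting points lie off the regression line
idx = np.random.choice(df.index, size=int(0.05 * len(df)), replace=False)
noisy_x1 = df.loc[idx, ["feature1"]] + np.random.normal(size=(len(idx), 1), **PROX_SHIFTS)
df.loc[idx, "feature2"] = reg.predict(noisy_x1)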

Thresholding

To obtain labels we varied the threshold from 1 up to max(outlier_scores), divided this range into 100 points, and evaluated the score at each one.
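A sketch of this sweep, assuming scikit-learn's f1_score; outlier_scores and y_true below are toy stand-ins for the real scores and ground-truth labels:

import numpy as np
from sklearn.metrics import f1_score

outlier_scores = np.random.rand(1000) * 10   # toy stand-in for the final scores
y_true = (outlier_scores > 9.5).astype(int)  # toy stand-in for ground truth

thresholds = np.linspace(1, outlier_scores.max(), 100)
f1s = [f1_score(y_true, (outlier_scores > t).astype(int)) for t in thresholds]
best_f1 = max(f1s)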

Exp.1.1 (ONLY out-of-density anomalies and linear case)

MEAN_F1, VAR_F1 = 4.5, 10
MEAN_F2, VAR_F2 = 3.5, 5
A, B = 5, 10

PROX_SHIFTS = {"loc": 200, "scale": 40}
ANOMALY_PROBA = 0.05

Number of evaluations = 100

| Scoring function for BN | F1 score |
|---|---|
| K2 * | 0.827 ± 0.083 |
| MI | 0.461 ± 0.161 |
| BIC | 0.390 ± 0.059 |

* K2 with pp.KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')

Exp.1.2 (ONLY out-of-density anomalies, hard case, linear data)

MEAN_F1, VAR_F1 = 4.5, 10
MEAN_F2, VAR_F2 = 3.5, 5
A, B = 5, 10

PROX_SHIFTS = {"loc": 150, "scale": 100} # <- change here to make it difficult to distinguish
ANOMALY_PROBA = 0.05

Number of evaluations = 100

| Scoring function for BN | F1 score |
|---|---|
| K2 * | 0.661 ± 0.081 |
| MI | 0.38 ± 0.115 |
| BIC | 0.343 ± 0.056 |

* K2 with pp.KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')

Additional

np.random.seed(20)

Sample pairplot with anomalies (PROX_SHIFTS = {"loc": 100, "scale": 40}): [image]

The F1 score here is 0.762.


Sample pairplot with anomalies (PROX_SHIFTS = {"loc": 150, "scale": 100}): [image]

The F1 score here is 0.651.

Conclusion

In the linear case this method is very sensitive to the distribution of the anomalies: the more the anomalies vary, the lower the model's F1 score.

Spoiler: in the nonlinear case the result seems to be the same.

Roman223 commented 1 week ago

Exp.1.3 (Nonlinear data)

MEAN_F1, VAR_F1 = 4.5, 10
MEAN_F2, VAR_F2 = 3.5, 5
A, B = 5, 10

PROX_SHIFTS = {"loc": 4000, "scale": 100} # <- loc increased because the nonlinear function has a wider range
ANOMALY_PROBA = 0.05

| Scoring function for BN | F1 score |
|---|---|
| K2 * | 0.264 ± 0.052 |
| MI | 0.153 ± 0.050 |
| BIC | 0.186 ± 0.042 |

A more detailed analysis shows that with the given data generators the learned BN structure has only one edge in most cases (feature1 -> feature3). Our first suggestion here was to set the structure manually. So:

# info: the bamt node descriptor prepared earlier
detector.estimator.bn.add_nodes(info)
detector.estimator.bn.set_structure(edges=[["feature1", "feature3"], ["feature2", "feature3"]])
detector.estimator.bn.fit_parameters(df)

And here we obtained a 0.903 ± 0.032 F1 score (over 100 evaluations)! It turns out that in the nonlinear case the impact of structure learning becomes huge, and we must be sure that the trained structure corresponds to the data well enough to detect outliers.

Additionally, we made the task harder:

MEAN_F1, VAR_F1 = 4.5, 10
MEAN_F2, VAR_F2 = 3.5, 5
A, B = 5, 10

PROX_SHIFTS = {"loc": 2000, "scale": 100}
ANOMALY_PROBA = 0.05

result: 0.598 ± 0.024

Plots

Sample with huge shifts: [image]

Sample with small shifts: [image]

Conclusion

It seems that despite the huge shifts, the ODBP method tends to produce false positives when data points overlap.

Roman223 commented 2 days ago

Experiments with real datasets (data with only continuous features)

Datasets:

  1. Cardiotocography dataset (labeled)
  2. Ecoli dataset (among the 8 classes, omL, imL, and imS are the minority classes and are used as outliers)
  3. Wisconsin-Breast Cancer (Diagnostics) dataset (there are two classes, benign and malignant; the malignant class is downsampled to 21 points, which are considered outliers, while points in the benign class are considered inliers)

Design

  1. StratifiedShuffleSplit and StratifiedKFold were used to conduct cross-validation (see the sketch after this list).
  2. Only K2 was used.
  3. BNs with mixtures were also evaluated. During the model step, diff is X - cond_mean (no scaler applied).
  4. Thresholding was done by brute force.
  5. The number of proximity detectors is 45.
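A minimal sketch of the cross-validation loop from item 1; ODBPDetector, its fit/score_samples interface, and the X, y inputs are hypothetical placeholders, not the module's actual API:

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score

f1s = []
for train_idx, test_idx in StratifiedKFold(n_splits=20).split(X, y):
    detector = ODBPDetector(n_proximity_detectors=45)  # hypothetical interface
    detector.fit(X.iloc[train_idx])
    scores = detector.score_samples(X.iloc[test_idx])
    # brute-force thresholding as in item 4
    thresholds = np.linspace(1, scores.max(), 100)
    f1s.append(max(f1_score(y[test_idx], (scores > t).astype(int)) for t in thresholds))
print(f"{np.mean(f1s):.5f} ± {np.std(f1s):.5f}")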

Cardio (df.shape = (1831, 21))

With no mixtures

| CV (StratifiedShuffleSplit) | F1 score |
|---|---|
| 5 | 0.56035 ± 0.12931 |
| 10 | 0.5797 ± 0.04045 |
| 15 | 0.59161 ± 0.10089 |
| 20 | 0.57891 ± 0.06204 |

| CV (StratifiedKFold) | F1 score |
|---|---|
| 5 | 0.45327 ± 0.16103 |
| 10 | 0.59645 ± 0.25846 |
| 15 | 0.6596 ± 0.20554 |
| 20 | 0.75584 ± 0.21167 |

With mixtures

| CV (StratifiedShuffleSplit) | F1 score |
|---|---|
| 5 | 0.55688 ± 0.09371 |
| 10 | 0.52679 ± 0.08216 |
| 15 | 0.58938 ± 0.06002 |
| 20 | 0.58414 ± 0.08683 |

| CV (StratifiedKFold) | F1 score |
|---|---|
| 5 | 0.41645 ± 0.14037 |
| 10 | 0.57498 ± 0.26223 |
| 15 | 0.64924 ± 0.19153 |
| 20 | 0.735 ± 0.22238 |

Ecoli (df.shape = (336, 7))

With no mixtures

| CV (StratifiedShuffleSplit) | F1 score |
|---|---|
| 5 | 0.88 ± 0.09798 |
| 10 | 0.78562 ± 0.35948 |
| 15 | 0.92 ± 0.22601 |
| 20 | 0.77943 ± 0.36025 |

| CV (StratifiedKFold) | F1 score |
|---|---|
| 5 | 0.88 ± 0.09798 |
| 10 | 0.9 ± 0.3 (per-fold: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0]) |
| 15 | 0.54667 ± 0.48699 (per-fold: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2, 1.0, 1.0]) |
| 20 | 0.17504 ± 0.29517 |

With mixtures

| CV (StratifiedShuffleSplit) | F1 score |
|---|---|
| 5 | 0.59444 ± 0.38313 |
| 10 | 0.47558 ± 0.36933 |
| 15 | 0.68431 ± 0.38111 |
| 20 | 0.77123 ± 0.2998 |

| CV (StratifiedKFold) | F1 score |
|---|---|
| 5 | 0.76 ± 0.13064 |
| 10 | 0.81667 ± 0.32016 |
| 15 | 0.56667 ± 0.47842 |
| 20 | 0.05176 ± 0.05742 |

Roman223 commented 2 days ago

WBC (df.shape = (378, 30))

With no mixtures

| CV (StratifiedShuffleSplit) | F1 score |
|---|---|
| 5 | 0.674 ± 0.073 |
| 10 | 0.645 ± 0.19 |
| 15 | 0.701 ± 0.185 |
| 20 | 0.724 ± 0.191 |

| CV (StratifiedKFold) | F1 score |
|---|---|
| 5 | 0.656 ± 0.185 |
| 10 | 0.674 ± 0.159 |
| 15 | 0.695 ± 0.217 |
| 20 | 0.755 ± 0.256 |

With mixtures

| CV (StratifiedShuffleSplit) | F1 score |
|---|---|
| 5 | 0.693 ± 0.053 |
| 10 | 0.638 ± 0.206 |
| 15 | 0.697 ± 0.197 |
| 20 | 0.729 ± 0.198 |

| CV (StratifiedKFold) | F1 score |
|---|---|
| 5 | 0.690 ± 0.150 |
| 10 | 0.723 ± 0.127 |
| 15 | 0.752 ± 0.238 |
| 20 | 0.443 ± 0.333 |

Conclusion

Cardio

  1. The lower the std, the lower the mean; this is not a good trend.
  2. High scores are not stable.

Ecoli

  1. High scores with small stds. The algorithm performs well on this data.

WBC

  1. Stds are still wide, but narrower than on Cardio. Mean values are mostly around 0.7.

It seems that in most cases a CV of around 15-20 correlates with a high F1 score (though perhaps this holds only for this pool of datasets).