The foundation of the solution to be integrated is the article (Sha Lu et al., Effective Outlier Detection based on Bayesian Network and Proximity), namely ODBP. In this algorithm the BN must be linear in order to estimate the mean of a multivariate random variable (MRV) as a linear combination of its parents' values.
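In our reading, the linearity assumption means that for every node $X_i$ with parents $\mathrm{Pa}(X_i)$ the conditional mean is modeled as

$$\mathbb{E}[X_i \mid \mathrm{Pa}(X_i)] = \beta_{i0} + \sum_{X_j \in \mathrm{Pa}(X_i)} \beta_{ij} X_j,$$

with the coefficients $\beta_{ij}$ fitted per node; the outlier factor is then the Z-score of the observed value under this model (see below).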
If we want to use not only density estimation but also proximity estimation for outlier detection, we need a way to combine them. In this work the authors suggest simply summing up all outlier factors from the detectors: the greater the factor, the greater the chance that a given sample is an outlier.
Density estimators store information about how densely points are placed in space. In these terms, an outlier is a point whose values deviate too strongly from the expectation of the MRV. Note that if the values lie on the boundary of the point cloud, such a point is hard to detect. The authors solve this issue by adjusting the proximity metric.
Theirs: Given data D and a corresponding BN, for every node that has parents, estimate its mean as a linear combination of the parents' values. It is assumed here that for a sample with a fixed child variable, the mean converges to its true value while the variance converges to zero. Any step away is treated as an anomaly either in the child variable or in the parent variables (or both). The final outlier factor is simply the Z-score for the given child node. Therefore, another limitation comes into play: we must guarantee a normal distribution for our estimates. Although no proof is provided, this statement probably holds by virtue of the linearity of the BN.

Ours: Since we have mixed data, we have no guarantee of linearity at all. Moreover, we cannot apply the Z-score together with mean estimation in general, because for a conditional node we store a different model for every combination of parents (not only Gaussian, it can be anything), but we can still estimate the mean and thus measure the deviation from it with some error metric (such as MAE or MSE). Discrete RVs are encoded, and their categories are used as new RVs. Here is what is done:
# module-level imports assumed:
import numpy as np
from sklearn.preprocessing import StandardScaler

node = self.estimator.bn[node_name]
diff = []
dist = self.estimator.bn.distributions[node_name]
parents = node.cont_parents + node.disc_parents
for _, row in X.iterrows():
    pvalues = row[parents].to_dict()
    pvals_bamt_style = [pvalues[parent] for parent in parents]
    cond_dist = self.estimator.bn.get_dist(node_name, pvals=pvalues)
    # not well written, subject to change after the bamt 2 release
    if isinstance(cond_dist, tuple):
        # continuous node: get_dist returned (mean, variance)
        cond_mean = cond_dist[0]
    else:
        # discrete node: get_dist returned class probabilities
        dispvals = []
        for pval in pvals_bamt_style:
            if isinstance(pval, str):
                dispvals.append(pval)
        if "vals" in dist.keys():
            classes = dist["vals"]
        elif "classes" in dist.keys():
            classes = dist["classes"]
        elif "hybcprob" in dist.keys():
            if "classes" in dist["hybcprob"][str(dispvals)]:
                classes = dist["hybcprob"][str(dispvals)]["classes"]
            else:
                raise Exception("no classes found in hybcprob distribution")
        else:
            raise Exception("unknown distribution format")
        # encode the classes of the random variable to compute
        # the weighted sum (expectation) over class probabilities
        classes_coded = np.asarray([self.encoding[node_name][class_name] for class_name in classes])
        cond_mean = classes_coded @ np.asarray(cond_dist).T
    if isinstance(row[node_name], str):
        # MAE between the conditional mean and the encoded observed class
        diff.append(abs(cond_mean - self.encoding[node_name][row[node_name]]))
    else:
        # MAE between the conditional mean and the observed value
        diff.append(abs(cond_mean - row[node_name]))
# subject to vary, explanation in the text
scaler = StandardScaler()
diff_scaled = scaler.fit_transform(np.asarray(diff).reshape(-1, 1))
return diff_scaled
Theirs: The idea is simple: randomly choose t columns from the initial dataset, then compute the Local Outlier Factor (LOF) on the resulting subsample.
LOF takes values less than 1 if the local density (the density of the neighbors) is high (so the point is not anomalous), around 1 if it is the same, and greater than 1 if the local density is low (outliers).
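For reference, the standard definition, with $N_k(p)$ the set of $k$ nearest neighbors of point $p$ and $\mathrm{lrd}_k$ the local reachability density:

$$\mathrm{LOF}_k(p) = \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \frac{\mathrm{lrd}_k(o)}{\mathrm{lrd}_k(p)}$$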
Again, it is assumed here that the obtained values are normally distributed and the Z-score can be applied.
A problem arises when columns are discrete: LOF cannot be calculated on them. This operation is the body of a loop; a single iteration is called a "detector" by the authors, and the number of detectors is a hyperparameter. PS: a simple explanation of LOF is here. Ours: for now we just drop discrete columns if they appear in a random subsample, as in the sketch below.
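A minimal sketch of this proximity part, assuming sklearn's LocalOutlierFactor and illustrative values for t and the number of detectors:

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

def proximity_scores(X: pd.DataFrame, n_detectors: int = 10, t: int = 2) -> np.ndarray:
    """One LOF 'detector' per iteration, each on t randomly chosen columns."""
    rng = np.random.default_rng()
    scores = np.zeros((len(X), n_detectors))
    for j in range(n_detectors):
        cols = list(rng.choice(X.columns, size=t, replace=False))
        # ours: drop discrete columns if they appear in the random subsample
        cols = [c for c in cols if pd.api.types.is_numeric_dtype(X[c])]
        if not cols:
            continue  # all chosen columns were discrete, this detector contributes nothing
        lof = LocalOutlierFactor(n_neighbors=20)
        lof.fit(X[cols])
        # sklearn negates LOF, so flip the sign back: > 1 means low local density
        scores[:, j] = -lof.negative_outlier_factor_
    return scores
```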
After obtaining both local scores we need to merge them into one. The authors' suggestion is simply to sum up all non-zero $score_{ij}$ ($i$ is the sample index, $j$ is the detector index).
All local scores have shape (n, d), where n is the number of samples and d is the number of detectors / the number of child nodes. There are two summations: first over the d-axis, then the proximity scores (n,) and the model scores (n,) are added together. The final score has shape (n,), as in the sketch below.
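A minimal sketch of this aggregation; dropping non-positive proximity factors follows the note in the experiments below:

```python
import numpy as np

def combine_scores(model_scores: np.ndarray, prox_scores: np.ndarray) -> np.ndarray:
    """model_scores and prox_scores have shape (n, d); the result is (n,)."""
    # keep only positive proximity factors: values <= 0 correspond to inliers
    prox_total = np.where(prox_scores > 0, prox_scores, 0.0).sum(axis=1)
    model_total = model_scores.sum(axis=1)
    return model_total + prox_total
```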
The first experiments will be posted as separate posts with the corresponding discussion.
The ODBP algorithm was first tested on synthetic data in order to obtain a brief picture of its actual ability to distinguish outliers. As stated in the article, we need two types of outliers, namely density-based and model-based. But since we have nonlinear data we need to change the model scorer, so we have to use pointwise expectation estimation. Therefore, we applied scalers carefully here. Note that during the computation we removed proximity factors less than 0, since they correspond to inliers. Also, we scaled on the variance from the data (i.e. conditional).
Four features were created in total: 3 of them are independent and 1 is a random linear combination of 2 independent features ($f = a x_1 + b x_2 + a b$, where $x_1, x_2$ are features from the data and $a, b$ are random constants). To test nonlinear cases we used $f = a x_1^2 + b x_2 + a b$.
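For reference, plausible definitions matching these formulas (our reconstruction of `f`, used in the generator below; the nonlinear variant is named here only for illustration):

```python
def f(x1, x2, a, b):
    # linear case: f = a*x1 + b*x2 + a*b
    return a * x1 + b * x2 + a * b

def f_nonlinear(x1, x2, a, b):
    # nonlinear case: f = a*x1**2 + b*x2 + a*b
    return a * x1 ** 2 + b * x2 + a * b
```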
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "feature1": np.random.normal(scale=VAR_F1, size=1000, loc=MEAN_F1),
    "feature2": np.random.normal(scale=VAR_F2, size=1000, loc=MEAN_F2),
    "feature4": np.random.normal(scale=A, size=1000, loc=B),
})
df["feature3"] = f(df["feature1"], df["feature2"], A, B)
We injected normal noise into 5% of the data: we fit a linear regression from feature1 to feature2, took some random points of feature1, added random noise to them, and predicted feature2; the result is points shifted along the regression line, as in the sketch below.
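A sketch of this injection under our reading (both coordinates of the chosen points are replaced, so they slide along the fitted line; `PROX_SHIFTS` and `ANOMALY_PROBA` are the constants listed below):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

reg = LinearRegression().fit(df[["feature1"]], df["feature2"])
n_anom = int(ANOMALY_PROBA * len(df))
idx = np.random.choice(df.index, size=n_anom, replace=False)
# shift the chosen points along feature1, then project them back onto the line
df.loc[idx, "feature1"] += np.random.normal(size=n_anom, **PROX_SHIFTS)
df.loc[idx, "feature2"] = reg.predict(df.loc[idx, ["feature1"]])
labels = np.isin(df.index, idx).astype(int)  # ground-truth outlier labels
```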
To obtain predicted labels we varied the threshold from 1 up to max(outlier_scores): this range is divided into 100 points and the score is evaluated at each, as sketched below.
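A minimal sketch of this evaluation loop, assuming `outlier_scores` and the ground-truth `labels` from the steps above:

```python
import numpy as np
from sklearn.metrics import f1_score

thresholds = np.linspace(1, outlier_scores.max(), num=100)
f1_per_threshold = [
    f1_score(labels, (outlier_scores >= t).astype(int)) for t in thresholds
]
best_f1 = max(f1_per_threshold)
```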
MEAN_F1, VAR_F1 = 4.5, 10
MEAN_F2, VAR_F2 = 3.5, 5
A, B = 5, 10
PROX_SHIFTS = {"loc": 200, "scale": 40}
ANOMALY_PROBA = 0.05
Number of evaluations = 100
Scoring function for BN | F1 score |
---|---|
K2 * | 0.827 ± 0.083 |
MI | 0.461 ± 0.161 |
BIC | 0.390 ± 0.059 |
* K2 with pp.KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
MEAN_F1, VAR_F1 = 4.5, 10
MEAN_F2, VAR_F2 = 3.5, 5
A, B = 5, 10
PROX_SHIFTS = {"loc": 150, "scale": 100} # <- change here to make it difficult to distinguish
ANOMALY_PROBA = 0.05
Number of evaluations = 100
Scoring function for BN | F1 score |
---|---|
K2 * | 0.661 ± 0.081 |
MI | 0.38 ± 0.115 |
BIC | 0.343 ± 0.056 |
* K2 with pp.KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
np.random.seed(20)
Sample pairplot with anomalies (PROX_SHIFTS = {"loc": 100, "scale": 40})
The F1 score here is 0.762.
Sample pairplot with anomalies (PROX_SHIFTS = {"loc": 150, "scale": 100})
The F1 score here is 0.651.
In the linear case this method is very sensitive to the distribution of the anomalies: the more the anomalies vary, the lower the model's F1 score.
Spoiler: in the nonlinear case the result seems to be the same.
MEAN_F1, VAR_F1 = 4.5, 10
MEAN_F2, VAR_F2 = 3.5, 5
A, B = 5, 10
PROX_SHIFTS = {"loc": 4000, "scale": 100} # <- loc was increased because nonlinear function has wider range
ANOMALY_PROBA = 0.05
Scoring function for BN | F1 score |
---|---|
K2 * | 0.264 ± 0.052 |
MI | 0.153 ± 0.050 |
BIC | 0.186 ± 0.042 |
A more detailed analysis shows that with the given data generators the BN structure has only 1 edge in most cases (feature1 -> feature3). Our first suggestion here was to set the structure manually:
detector.estimator.bn.add_nodes(info)
detector.estimator.bn.set_structure(edges=[["feature1", "feature3"], ["feature2", "feature3"]])
detector.estimator.bn.fit_parameters(df)
And here we obtained an F1 score of 0.903 ± 0.032 (over 100 evaluations)! It turns out that in the nonlinear case the impact of structure learning is huge, and we must be sure that the learned structure corresponds to the data well enough to detect outliers.
Additionally, we made the task harder:
MEAN_F1, VAR_F1 = 4.5, 10
MEAN_F2, VAR_F2 = 3.5, 5
A, B = 5, 10
PROX_SHIFTS = {"loc": 2000, "scale": 100}
ANOMALY_PROBA = 0.05
Result: 0.598 ± 0.024.
Sample with huge shifts
Sample with small shifts
It seems that despite the huge shifts, the ODBP method tends to produce false positives when the data points overlap.
Datasets:
X - cond_mean (no scaler applied)

CV (StratifiedShuffleSplit) | F1 score |
---|---|
5 | 0.56035 ± 0.12931 |
10 | 0.5797 ± 0.04045 |
15 | 0.59161 ± 0.10089 |
20 | 0.57891 ± 0.06204 |
CV (StratifiedKFold) | F1 score |
---|---|
5 | 0.45327 ± 0.16103 |
10 | 0.59645 ± 0.25846 |
15 | 0.6596 ± 0.20554 |
20 | 0.75584 ± 0.21167 |
CV (StratifiedShuffleSplit) | F1 score |
---|---|
5 | 0.55688 ± 0.09371 |
10 | 0.52679 ± 0.08216 |
15 | 0.58938 ± 0.06002 |
20 | 0.58414 ± 0.08683 |
CV (StratifiedKFold) | F1 score |
---|---|
5 | 0.41645 ± 0.14037 |
10 | 0.57498 ± 0.26223 |
15 | 0.64924 ± 0.19153 |
20 | 0.735 ± 0.22238 |
CV (StratifiedShuffleSplit) | F1 score |
---|---|
5 | 0.88 ± 0.09798 |
10 | 0.78562 ± 0.35948 |
15 | 0.92 ± 0.22601 |
20 | 0.77943 ± 0.36025 |
CV (StratifiedKFold) | F1 score |
---|---|
5 | 0.88 ± 0.09798 |
10 | 0.9 ± 0.3 (per-fold F1: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0]) |
15 | 0.54667 ± 0.48699 (per-fold F1: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2, 1.0, 1.0]) |
20 | 0.17504 ± 0.29517 |
CV (StratifiedShuffleSplit) | F1 score |
---|---|
5 | 0.59444 ± 0.38313 |
10 | 0.47558 ± 0.36933 |
15 | 0.68431 ± 0.38111 |
20 | 0.77123 ± 0.2998 |
CV (StratifiedKFold) | F1 score |
---|---|
5 | 0.76 ± 0.13064 |
10 | 0.81667 ± 0.32016 |
15 | 0.56667 ± 0.47842 |
20 | 0.05176 ± 0.05742 |
CV (StratifiedShuffleSplit) | F1 score |
---|---|
5 | 0.674 ± 0.073 |
10 | 0.645 ± 0.19 |
15 | 0.701 ± 0.185 |
20 | 0.724 ± 0.191 |
CV (StratifiedKFold) | F1 score |
---|---|
5 | 0.656 ± 0.185 |
10 | 0.674 ± 0.159 |
15 | 0.695 ± 0.217 |
20 | 0.755 ± 0.256 |
CV (StratifiedShuffleSplit) | F1 score |
---|---|
5 | 0.693 ± 0.053 |
10 | 0.638 ± 0.206 |
15 | 0.697 ± 0.197 |
20 | 0.729 ± 0.198 |
CV (StratifiedKFold) | F1 score |
---|---|
5 | 0.690 ± 0.150 |
10 | 0.723 ± 0.127 |
15 | 0.752 ± 0.238 |
20 | 0.443 ± 0.333 |
It seems that in most cases CV around 15-20 correlates with a high F1 score (though perhaps this holds only for this pool of datasets); a sketch of the evaluation is below.
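For reference, a hypothetical sketch of how the CV scores in the tables above could be computed; `detector` with fit/predict methods is an assumption about our wrapper, not a confirmed API:

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

def cv_f1(detector, X, y, n_splits=15):
    # y holds the ground-truth outlier labels, used only for
    # stratification and scoring; the detector is fit without labels
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True)
    scores = []
    for train_idx, test_idx in folds.split(X, y):
        detector.fit(X.iloc[train_idx])              # hypothetical API
        preds = detector.predict(X.iloc[test_idx])   # hypothetical API
        scores.append(f1_score(y[test_idx], preds))
    return np.mean(scores), np.std(scores)
```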
For tabular data, most likely a simple anomaly detection algorithm will be implemented. For time series, the DBN (dynamic Bayesian network) learning algorithm will most likely require substantial rework; the task of anomaly detection based on changes in the DBN structure can also be considered. The BN can be treated as an auxiliary tool for anomaly detection.