Kanatoko / XBOS-anomaly-detection

XBOS Anomaly Detection

Some questions #3

Closed nkaenzig closed 3 years ago

nkaenzig commented 6 years ago

I think there are some issues with the code as is:

First of all, I think

    cluster_score=dict(assign.groupby('cluster').apply(len).apply(lambda x:x/length))

should be replaced by

    cluster_score = dict(assign['cluster'].value_counts().apply(lambda x: x / length))
    for i in range(self.n_clusters):
        if i not in cluster_score:
            cluster_score[i] = 0

to prevent key errors.
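The same guard against missing cluster keys can be sketched without pandas. This is only an illustration, not the repository's code: the helper name `cluster_fractions` is hypothetical, and it assumes the cluster labels are non-negative integers below `n_clusters` and that at least one sample is present.

```python
import numpy as np

def cluster_fractions(labels, n_clusters):
    """Return {cluster_index: fraction of samples}, including empty clusters."""
    labels = np.asarray(labels, dtype=int)
    # minlength reserves one slot per cluster, even for clusters with no samples
    counts = np.bincount(labels, minlength=n_clusters)
    return {i: float(counts[i]) / len(labels) for i in range(n_clusters)}

# Cluster 2 receives no samples here, yet it still gets an entry (0.0)
scores = cluster_fractions([0, 0, 1, 3], n_clusters=4)
print(scores)
```

Because every index from 0 to `n_clusters - 1` is present in the result, a later lookup such as `scores[2]` cannot raise a `KeyError`.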

Second:

    for column in data.columns:
        kmeans = KMeans(n_clusters=self.n_clusters, max_iter=self.max_iter, random_state=0)
        self.kmeans[column] = kmeans
        kmeans.fit(data[column].values.reshape(-1, 1))

Here you train nr_features k-means models on the first nr_features rows of the data. In other words, each of these models is trained on only one sample of the data. What is the motivation behind this?

Third:

    sorted_centers = sorted(kmeans.cluster_centers_)
    max_distance = (sorted_centers[-1] - sorted_centers[0])[0]

...To me this doesn't seem to compute the max distance between your centers.
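One way to check the third point: when each k-means model is fitted on a single column, its centers are one-dimensional, and in one dimension the largest pairwise distance is simply the largest center minus the smallest. The sketch below compares the sorted-difference expression against a brute-force pairwise maximum. The center values are made up for illustration; only the `(n_clusters, 1)` shape mirrors `KMeans.cluster_centers_`.

```python
import numpy as np

# Hypothetical 1-D cluster centers, shaped (n_clusters, 1)
centers = np.array([[4.0], [-1.5], [10.0], [2.5]])

# Expression from the snippet above: sort, then largest minus smallest
sorted_centers = sorted(centers)
max_distance = (sorted_centers[-1] - sorted_centers[0])[0]

# Brute force: absolute difference between every pair of centers
pairwise = np.abs(centers - centers.T)  # shape (n_clusters, n_clusters)
brute_force = pairwise.max()

print(max_distance, brute_force)
```

This equivalence holds only for one-dimensional centers. With multi-dimensional centers, sorting rows this way would not be meaningful, and a pairwise distance computation would be needed instead.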

Kanatoko commented 6 years ago

Thank you very much for your warm message. I mainly use a Java implementation of XBOS in production, and this Python version is just a PoC. (The Java version is not OSS.)

I think this Python version works for my data and this is the proof. https://www.kaggle.com/kanatoko/unsupervised-anomaly-detection-xbos-hbos-iforest

Kanatoko commented 6 years ago

Do you want me to add more validation of the input data?

Kanatoko commented 6 years ago

> Here you train nr_features k-means models on the first nr_features rows of the data. In other words, each of these models is trained on only one sample of the data. This doesn't make any sense to me.

XBOS assumes independence of the features, the same as HBOS. Does this help?
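To make the per-feature fitting concrete, here is a small shape check. It is a sketch with made-up data, using plain NumPy column slicing as a stand-in for `data[column].values.reshape(-1, 1)`. Each slice holds one feature across all samples, so each k-means model would see every row but only one column.

```python
import numpy as np

# Hypothetical data: 5 samples (rows) x 2 features (columns)
X = np.array([[1.0, 100.0],
              [2.0, 110.0],
              [3.0, 105.0],
              [2.5, 500.0],
              [1.5, 102.0]])

shapes = []
for j in range(X.shape[1]):
    # NumPy stand-in for data[column].values.reshape(-1, 1)
    column = X[:, j].reshape(-1, 1)
    shapes.append(column.shape)
    print(f"feature {j}: shape {column.shape}")
```

Each column has shape `(n_samples, 1)`, which is the layout scikit-learn estimators expect for a dataset with a single feature.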

nkaenzig commented 6 years ago

Thanks a lot for your comments.

> XBOS assumes independence of the features, the same as HBOS. Does this help?

I'm not sure how HBOS works. Running K-means on each feature vector under the independence assumption indeed makes more sense.

Thanks for the notebook you referenced.

I tested XBOS, HBOS (your implementation), and Isolation Forest on the CICIDS2017 dataset. To prevent key errors in XBOS, I had to apply the modification from my first code snippet.

While HBOS and Isolation Forest could both detect some of the outliers, the XBOS model did not. Maybe something is wrong with my setup.