Closed nkaenzig closed 3 years ago
Thank you very much for your warm message. I mainly use the Java implementation of XBOS in production; this Python version is just a PoC. (The Java version is not OSS.)
I think this Python version works for my data and this is the proof. https://www.kaggle.com/kanatoko/unsupervised-anomaly-detection-xbos-hbos-iforest
Do you want me to add more validation for data?
> Here you train nr_features kmeans models on the first nr_features rows of the data. In other words, these models are each trained using only one sample of the data. This doesn't make any sense to me.
XBOS assumes independence of the features, the same as HBOS does. Does this help?
Thanks a lot for your comments.
> XBOS assumes independence of the features, the same as HBOS does. Does this help?
I'm not sure how HBOS works. Running K-means on each feature vector under the independence assumption indeed makes more sense.
Thanks for the notebook you referenced.
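For anyone else following along: HBOS builds one histogram per feature and combines the per-feature densities into a score, which is where the independence assumption comes in. A minimal sketch of that idea (simplified; real HBOS implementations differ in binning and normalization):

```python
import numpy as np

def hbos_scores(X, n_bins=10):
    """HBOS-style score: per-feature histogram density, combined as a
    sum of negative log densities (features treated as independent)."""
    X = np.asarray(X, dtype=float)
    scores = np.zeros(X.shape[0])
    for j in range(X.shape[1]):
        hist, edges = np.histogram(X[:, j], bins=n_bins, density=True)
        # map each value to its bin; inner edges keep indices in [0, n_bins - 1]
        idx = np.clip(np.digitize(X[:, j], edges[1:-1]), 0, n_bins - 1)
        scores += -np.log(np.maximum(hist[idx], 1e-12))  # guard against log(0)
    return scores

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), [[10.0, 10.0]]])  # one planted outlier
scores = hbos_scores(X)
# the planted outlier lands in a low-density bin in both features,
# so it gets the highest (or tied-highest) score
```

Because each feature is scored independently and the per-feature scores are simply summed, a joint anomaly that looks normal in every single feature can be missed; that trade-off is what buys HBOS (and XBOS) their speed.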
I tested XBOS, HBOS (your implementation), and Isolation Forest on the CICIDS2017 dataset. To prevent KeyErrors in XBOS, I had to implement the modification in the first code snippet.
While HBOS and Isolation Forest could both detect some of the outliers, the XBOS model did not; maybe something in my setup is wrong.
I think there are some issues with the code as is:
First of all, I think
```python
cluster_score=dict(assign.groupby('cluster').apply(len).apply(lambda x:x/length))
```
should be replaced by
```python
cluster_score = dict(assign['cluster'].value_counts().apply(lambda x: x / length))
for i in range(self.n_clusters):
    if i not in cluster_score:
        cluster_score[i] = 0
```
... to prevent KeyErrors when a cluster receives no samples.
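A quick way to see the problem: with a hypothetical assignment where one cluster id receives no samples, the counts dictionary is missing that key unless it is backfilled:

```python
import pandas as pd

# hypothetical cluster assignment: cluster 2 received no samples
assign = pd.DataFrame({'cluster': [0, 0, 1, 0, 1]})
length = len(assign)
n_clusters = 3  # stand-in for self.n_clusters

# relative frequency per observed cluster; empty clusters are absent here
cluster_score = dict(assign['cluster'].value_counts().apply(lambda x: x / length))

# backfill empty clusters with 0 so later lookups never raise a KeyError
for i in range(n_clusters):
    if i not in cluster_score:
        cluster_score[i] = 0
```

Without the backfill loop, `cluster_score[2]` raises a KeyError the first time a scoring step touches an empty cluster.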
Second:
```python
for column in data.columns:
    kmeans = KMeans(n_clusters=self.n_clusters, max_iter=self.max_iter, random_state=0)
    self.kmeans[column] = kmeans
    kmeans.fit(data[column].values.reshape(-1, 1))
```
Here you train nr_features kmeans models on the first nr_features rows of the data. In other words, these models are each trained using only one sample of the data. What is the motivation behind this?
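For reference, fitting one 1-D KMeans per feature over all rows of that column would look roughly like this (a sketch with made-up data, not the repository's code):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# made-up two-feature data set
data = pd.DataFrame({
    'f1': [0.0, 0.1, 5.0, 5.1, 10.0],
    'f2': [1.0, 1.1, 1.2, 8.0, 8.1],
})

models = {}
for column in data.columns:
    km = KMeans(n_clusters=2, n_init=10, random_state=0)
    # reshape(-1, 1): every row of this single column, as an (n_samples, 1) array
    km.fit(data[column].values.reshape(-1, 1))
    models[column] = km

# each model has seen all 5 rows of its column, not just one sample
```

This matches the independence assumption: each feature gets its own univariate clustering, and the per-feature results are combined afterwards.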
Third:
```python
sorted_centers = sorted(kmeans.cluster_centers_)
max_distance = (sorted_centers[-1] - sorted_centers[0])[0]
```
...To me this doesn't seem to compute the maximum distance between your centers.
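If the goal is the largest pairwise distance between centers, computing it explicitly avoids relying on sorting a list of arrays. A sketch for 1-D centers (hypothetical values, not taken from the repository):

```python
import numpy as np

# hypothetical 1-D cluster centers, shape (n_clusters, 1) as sklearn returns them
centers = np.array([[0.1], [2.5], [1.3]])

# all pairwise absolute differences, then the maximum
pairwise = np.abs(centers - centers.T)  # (3, 1) - (1, 3) broadcasts to (3, 3)
max_distance = pairwise.max()
```

For one-dimensional centers this coincides with `centers.max() - centers.min()`, so the explicit version mainly makes the intent unambiguous and sidesteps the array-vs-array comparisons that `sorted` performs.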