elki-project / elki

ELKI Data Mining Toolkit
https://elki-project.github.io/
GNU Affero General Public License v3.0
780 stars 321 forks source link

Implementation of ODIN does not comply with the original definition #61

Closed lenaWitterauf closed 5 years ago

lenaWitterauf commented 5 years ago

In the publishing paper of ODIN, ODIN is defined as global outlier detection approach. The ODIN outlier score is calculated as indegree of an observation regarding the weighted kNN-Graph. The weights in the kNN-Graph are calculated with distances between the particular observations.

The implementation in ELKI however uses the non-weighted kNN-Graph to calculate the indegree for the ODIN score ( more precise,all weights in the kNN-Graph are equal to 1/k). This change not only does not comply with the original paper. It also makes ODIN a local outlier detection approach.

Source: V. Hautamaki, I. Karkkainen and P. Franti, "Outlier detection using k-nearest neighbour graph," Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004., Cambridge, 2004, pp. 430-433 Vol.3. doi: 10.1109/ICPR.2004.1334558

kno10 commented 5 years ago

Please refer to "Algorithm 1 ODIN" in that paper. As the name emphasizes, it uses the in-degree, not weights, of the kNN graph. So it is unweighted and local (why do you think it is global?). It does not use the edge weights (in fact, the authors never properly define weights that would work for this purpose, because large distances have high weights and that would actually spoil the result). The distances are only used in the alternate method MeanDIST, which supposedly is equivalent to the one previously published by Angiulli and Pizzuti.

lenaWitterauf commented 5 years ago

Thanks for clarifying so quickly! I read ODIN being titled as 'global method' in another paper which too used ELKI [1] and confused both definitions, sorry for that.

[1] Campos, Guilherme O., et al. "On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study." Data Mining and Knowledge Discovery 30.4 (2016): 891-927.

kno10 commented 5 years ago

See also "Improving K-Means by Outlier Removal" http://cs.joensuu.fi/~villeh/35400978.pdf by the ODIN authors. They consider it to be local: "Outlier Detection using Indegree Number(ODIN) [9] is a local density-based outlier detection algorithm." and "where ind(xi) is the indegree of the vertex xi, i.e. the number of edges pointing to xi" - clearly unweighted. In that paper the authors do, however, use a different scoring. Instead of directly using the in-degree, they use 1/(ind(xi)+1) there, such that inliers have low scores. In ELKI - and that is documented in the class description - we do scale the degrees by 1/k, to make them easier to compare across different k. That is supposedly the most useful version for users that experiment with this method.

kno10 commented 5 years ago

I appreciate double checking of the implementations. Some code was contributed by students, and over time we have found many such details. Unfortunately, some papers are also unclear about details, or may even contain two different version under the same name.