elki-project / elki

ELKI Data Mining Toolkit
https://elki-project.github.io/
GNU Affero General Public License v3.0

Unsupervised models and prediction #99

Open davnn opened 2 years ago

davnn commented 2 years ago

Hi, I would like to use ELKI to validate our OutlierDetection.jl algorithms. It appears that for unsupervised algorithms there is no facility to learn a model from data and predict on another dataset, right? Has this been done somewhere previously?

kno10 commented 2 years ago

Most unsupervised learning does not consider prediction as part of the task. For the simplest models, such as k-means or kNN, prediction is possible, but for most it is not possible in a consistent way. This includes common models such as the local outlier factor: sklearn implements a "predict" function for LOF, but it will "predict" different values than the local outlier factor would usually assign to these points. Similarly, affinity propagation is a closed-world approach, and "predicting" using the nearest neighbor (as done in sklearn, if I recall correctly) yields worse results and ignores much of the algorithm's idea; instead it mixes it with a k-means-like approach (and then the algorithm ends up performing not much better than k-means).
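This mismatch is easy to observe with scikit-learn itself (a minimal sketch assuming the `sklearn.neighbors.LocalOutlierFactor` API; this is not ELKI code): scoring the training points in `novelty=True` mode yields different values than the in-sample `negative_outlier_factor_`, because in novelty mode each training point also counts itself as a neighbor.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(50, 2)), [[8.0, 8.0]]])  # one clear outlier

# Classic closed-world LOF: scores for the fitted data themselves.
lof = LocalOutlierFactor(n_neighbors=10)
lof.fit(X)
in_sample = -lof.negative_outlier_factor_

# "fit then predict" mode: the very same points scored as if they were new queries.
lof_novelty = LocalOutlierFactor(n_neighbors=10, novelty=True)
lof_novelty.fit(X)
predicted = -lof_novelty.score_samples(X)

print(np.allclose(in_sample, predicted))  # the two score vectors disagree
```

The scikit-learn documentation itself cautions against scoring training data in novelty mode for exactly this reason.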

I suggest to not approach these algorithms from a supervised "fit then predict" mindset at all, it does not do them justice.

davnn commented 2 years ago

Thanks a lot for your feedback!

> I suggest to not approach these algorithms from a supervised "fit then predict" mindset at all, it does not do them justice.

I agree that a post-hoc nearest-neighbors approach to enable prediction does not feel right and does not do the algorithms justice. However, I'm mainly interested in unsupervised outlier detection algorithms that can be used for prediction fairly naturally. As far as I can tell, this includes most of the neighbor-based approaches, where you can simply compare a prediction to the points in the training set.
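For neighbor-based scores, this "compare a query to the training set" idea really is straightforward. A minimal sketch in Python (illustrative names, not ELKI's or OutlierDetection.jl's API): the mean distance to the k nearest training points serves directly as an outlier score for unseen queries.

```python
import numpy as np

class KNNOutlier:
    """Plain kNN-distance outlier detector with a natural fit/predict split."""

    def __init__(self, k=5):
        self.k = k

    def fit(self, X):
        self.X_ = np.asarray(X, dtype=float)  # just store the training data
        return self

    def score(self, Q):
        # Distance of every query point to every training point.
        d = np.linalg.norm(np.asarray(Q, dtype=float)[:, None, :]
                           - self.X_[None, :, :], axis=2)
        d.sort(axis=1)
        # Mean distance to the k nearest training points: large = outlying.
        return d[:, :self.k].mean(axis=1)

rng = np.random.default_rng(42)
train = rng.normal(size=(100, 2))
queries = np.array([[0.0, 0.0], [10.0, 10.0]])
scores = KNNOutlier(k=5).fit(train).score(queries)
print(scores[1] > scores[0])  # True: the far-away query scores higher
```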

Edit: I also wouldn't say, for example, that the prediction disturbs the meaning of the LOF score.

kno10 commented 2 years ago

It doesn't disturb the meaning, but the results of prediction will not agree with the results of the original algorithm. So it is an extension: there is more than one way to do this, and it's not part of the published work.

kno10 commented 2 years ago

And in many cases it will even be better to depart from, e.g., LOF and try to cleanly transfer the key idea into an open-world "train-test split" setting, rather than hack this into the original closed-world setting. This makes things both a lot easier and more efficient than trying to stay close to the original LOF. The problem is, there is no public data set suitable for evaluating any of this in a meaningful way.
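One way such a clean transfer could look (a hedged sketch, not ELKI code and not the only possible design): compute the training points' local reachability densities once at fit time, then score a query purely against the training set, without re-running the closed-world algorithm.

```python
import numpy as np

class OpenWorldLOF:
    """Sketch of LOF's density-ratio idea recast as fit-on-reference-data,
    score-arbitrary-queries. Illustrative only; one of several possible designs."""

    def __init__(self, k=5):
        self.k = k

    def fit(self, X):
        self.X_ = np.asarray(X, dtype=float)
        d = np.linalg.norm(self.X_[:, None, :] - self.X_[None, :, :], axis=2)
        np.fill_diagonal(d, np.inf)            # a point is not its own neighbor
        idx = np.argsort(d, axis=1)[:, :self.k]
        knn_d = np.take_along_axis(d, idx, axis=1)
        self.kdist_ = knn_d[:, -1]             # distance to the k-th neighbor
        reach = np.maximum(knn_d, self.kdist_[idx])
        self.lrd_ = 1.0 / reach.mean(axis=1)   # local reachability density
        return self

    def score(self, Q):
        d = np.linalg.norm(np.asarray(Q, dtype=float)[:, None, :]
                           - self.X_[None, :, :], axis=2)
        idx = np.argsort(d, axis=1)[:, :self.k]
        knn_d = np.take_along_axis(d, idx, axis=1)
        reach = np.maximum(knn_d, self.kdist_[idx])
        lrd_q = 1.0 / reach.mean(axis=1)
        # LOF-style ratio: neighbors' density vs. the query's own density.
        return self.lrd_[idx].mean(axis=1) / lrd_q

rng = np.random.default_rng(0)
train = rng.normal(size=(100, 2))
scores = OpenWorldLOF(k=5).fit(train).score(np.array([[0.0, 0.0], [10.0, 10.0]]))
print(scores[1] > scores[0])  # True: the distant query gets a much larger score
```

Because the training-side densities are fixed at fit time, scoring a query is a single neighbor search plus a ratio, which is exactly what makes this cheaper than embedding the query into the closed-world computation.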

kno10 commented 2 years ago

@davnn For implementing a set of outlier detection algorithms in Julia, it may be worth implementing the generalized pattern introduced here:

Schubert, E., Zimek, A. & Kriegel, H.-P. Local outlier detection reconsidered: a generalized view on locality with applications to spatial, video, and network outlier detection. Data Min Knowl Disc 28, 190–237 (2014). https://doi.org/10.1007/s10618-012-0300-z

This avoids redundancies across implementations. It's not very elegantly possible in Java right now, though, because generics cannot be primitives, and most of the time we will be dealing with primitive values (e.g., a local density estimate). There is a partial implementation of this pattern in ELKI for the parallelized versions, but the code has to work around a lot of Java limitations. For a Julia implementation, however, this may be an interesting starting point: approach it as a multi-stage mapping process, with these stages shared across different detectors.
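To make the staged idea concrete, here is a hypothetical sketch in Python (the stage names and function signatures are my own illustration, not the paper's or ELKI's API): neighborhood search, density model, and comparison are separate stages, so different detectors reuse them.

```python
import numpy as np

def neighborhoods(X, k):
    """Stage 1 (context): k-nearest-neighbor indices and distances."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # exclude each point from its own neighborhood
    idx = np.argsort(d, axis=1)[:, :k]
    return idx, np.take_along_axis(d, idx, axis=1)

def density_model(knn_d):
    """Stage 2 (model): a simple inverse mean-distance density estimate."""
    return 1.0 / knn_d.mean(axis=1)

def ratio_comparison(model, idx):
    """Stage 3 (comparison): each point's neighbors' average model vs. its own."""
    return model[idx].mean(axis=1) / model

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(60, 2)), [[6.0, 6.0]]])  # index 60 is far away
idx, knn_d = neighborhoods(X, k=5)

# Two detectors assembled from the shared stages:
knn_score = knn_d.mean(axis=1)                          # plain kNN-distance score
lof_like = ratio_comparison(density_model(knn_d), idx)  # LOF-flavoured density ratio
print(int(np.argmax(knn_score)), int(np.argmax(lof_like)))  # both flag index 60
```

The appeal of the pattern is visible even in this toy version: adding a new detector means swapping one stage, not rewriting the neighbor search, and in a language like Julia the stages can specialize on primitive element types without the boxing issues Java generics impose.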

davnn commented 2 years ago

Thanks for the hint! I've already used that paper as a reference for the different approaches a couple of times ;). I'll investigate how those generalizations could be implemented further down the line.

kno10 commented 2 years ago

If you used that paper as a reference, I would appreciate if it were included in the references list of the package. ;-)