SeldonIO / alibi-detect

Algorithms for outlier, adversarial and drift detection
https://docs.seldon.io/projects/alibi-detect/en/stable/

[Question] Can I use VAE when my training data contains both inlier and outlier images without a label? #561

Closed. Bennievdbuurt closed this issue 2 years ago.

Bennievdbuurt commented 2 years ago

Does the VAE model work when I train it on both inlier and outlier images, without the images having been labelled as inlier versus outlier? In other words, is the VAE model able to distinguish inlier from outlier images during the training stage, or does it treat all the images it has been trained on as inlier images?

If not, are there alternatives for training an outlier detection model (for example clustering)?

Thanks.

arnaudvl commented 2 years ago

Hi @Bennievdbuurt. While it could be fine if there are only a very limited number of outliers in the training data, it is not recommended: the VAE model cannot distinguish inlier from outlier images during training and treats all training images as inliers. Training on data assumed to be inlier-only is the common practice for the various outlier detectors in the library.
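
For illustration, a minimal sketch of that usual workflow with `OutlierVAE`: fit on data assumed to contain (mostly) inliers, infer a threshold from a reference set, then score new instances. The 32x32x3 image shape, the encoder/decoder architectures, the latent dimension and all hyperparameters below are placeholder assumptions, not recommendations:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import (Conv2D, Conv2DTranspose, Dense,
                                     InputLayer, Reshape)
from alibi_detect.od import OutlierVAE

# Placeholder data: replace with your own images scaled to [0, 1].
X_train = np.random.rand(256, 32, 32, 3).astype(np.float32)  # assumed (mostly) inliers
X_ref = np.random.rand(64, 32, 32, 3).astype(np.float32)     # reference set for the threshold
X_test = np.random.rand(16, 32, 32, 3).astype(np.float32)    # new instances to score

latent_dim = 1024  # assumption: size of the VAE latent space

# Illustrative encoder/decoder; adapt to your image shape.
encoder_net = tf.keras.Sequential([
    InputLayer(input_shape=(32, 32, 3)),
    Conv2D(64, 4, strides=2, padding='same', activation='relu'),
    Conv2D(128, 4, strides=2, padding='same', activation='relu'),
    Conv2D(512, 4, strides=2, padding='same', activation='relu'),
])
decoder_net = tf.keras.Sequential([
    InputLayer(input_shape=(latent_dim,)),
    Dense(4 * 4 * 128),
    Reshape(target_shape=(4, 4, 128)),
    Conv2DTranspose(256, 4, strides=2, padding='same', activation='relu'),
    Conv2DTranspose(64, 4, strides=2, padding='same', activation='relu'),
    Conv2DTranspose(3, 4, strides=2, padding='same', activation='sigmoid'),
])

od = OutlierVAE(
    score_type='mse',
    encoder_net=encoder_net,
    decoder_net=decoder_net,
    latent_dim=latent_dim,
    samples=4,
)

# Train on data treated as inliers (this is exactly why contaminated
# training data is a problem).
od.fit(X_train, epochs=30, verbose=True)

# Infer the threshold from a reference set assumed to be ~95% inliers.
od.infer_threshold(X_ref, threshold_perc=95)

preds = od.predict(X_test, outlier_type='instance', return_instance_score=True)
print(preds['data']['is_outlier'])  # 1 = flagged as outlier
```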

Bennievdbuurt commented 2 years ago

Many thanks for your quick response. Basically I am in the "1. Unsupervised Outlier Detection" situation described on this website of PyOD (https://pyod.readthedocs.io/en/latest/relevant_knowledge.html). Unfortunately, this library doesn't support image outlier detection. Do you have any recommendations on how to proceed using alibi-detect or any other repo?

arnaudvl commented 2 years ago

Realistically, it will be a very tough problem without knowing more about what your outliers would look like and how many you could expect. Assume the following setup: you apply some sort of dimensionality reduction step (e.g. via a hidden layer of your model, or a self-supervised representation such as SimCLR) followed by a clustering algorithm, and you have a bunch of instances which could be inliers or outliers. When could you expect this setup to work reasonably well? Realistically, only when your outliers look a lot more like each other than like the inliers, which is often not the case.

Consider the case where we have 10 classes and some outliers for each class. These outliers might still look a lot more like inlier instances from their respective class than like instances from the other classes, and so end up clustered together with the normal instances. The outliers therefore need to already look very different from the inliers of all classes for this to work. On top of that, you also need a rough idea of how many different types of outliers to expect in order to set the number of clusters correctly. If there is some diversity within a certain class (e.g. images of that class taken in very different lighting conditions), then increasing the number of clusters (initially intended to capture the outliers) might just split the cluster of fairly diverse normal instances instead of isolating the outliers.

It's very hard to tell in advance and give a general recommendation. Given that you already need quite a lot of domain-specific knowledge to build a successful unsupervised outlier detector on a mix of inliers and outliers, you might be better off setting domain-specific rules, or simply labelling a few instances and turning the problem into at least a semi-supervised one.
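
For what it's worth, here is a rough sketch of the dimensionality-reduction-plus-clustering setup described above: extract a representation, cluster it, and flag instances far from every cluster centre. The feature extractor (an ImageNet-pretrained MobileNetV2 stands in for a hidden layer of your own model or a self-supervised representation), the number of clusters and the distance cutoff are all assumptions you would need to tune with domain knowledge:

```python
import numpy as np
import tensorflow as tf
from sklearn.cluster import KMeans

# Placeholder data: replace with your own images (here 32x32x3, values in [0, 255]).
X = np.random.randint(0, 255, size=(512, 32, 32, 3)).astype(np.float32)

# Dimensionality reduction via a pretrained network's pooled features.
# MobileNetV2 is only an example backbone; a representation learned on your
# own data (e.g. SimCLR) would likely be more informative.
backbone = tf.keras.applications.MobileNetV2(
    input_shape=(96, 96, 3), include_top=False, pooling='avg', weights='imagenet')
X_resized = tf.image.resize(X, (96, 96))
feats = backbone.predict(
    tf.keras.applications.mobilenet_v2.preprocess_input(X_resized))

# Cluster the representations. n_clusters is a guess that should reflect how
# many inlier modes (and outlier types) you expect.
n_clusters = 10
km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(feats)

# Score each instance by its distance to the nearest cluster centre and flag
# the most distant ones. The 95th-percentile cutoff is an arbitrary choice.
dist_to_nearest = km.transform(feats).min(axis=1)
threshold = np.percentile(dist_to_nearest, 95)
is_outlier = dist_to_nearest > threshold
print(f'Flagged {is_outlier.sum()} of {len(X)} instances as candidate outliers.')
```

As noted above, this only works if the outliers are genuinely far from all inlier modes in the chosen representation; diverse inlier classes can easily swallow the extra clusters instead.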

ascillitoe commented 2 years ago

Closing for now. Please feel free to re-open if you have any follow-up questions!