SeldonIO / alibi-detect

Algorithms for outlier, adversarial and drift detection
https://docs.seldon.io/projects/alibi-detect/en/stable/

What to do when the image dataset is huge? #510

Closed yogendra-yatnalkar closed 2 years ago

yogendra-yatnalkar commented 2 years ago

Hi team, first of all, thanks a lot for the amazing work.

I was working on drift detection for an image dataset and used the K-S drift detector on a small dataset for prototyping. It worked great, but I was wondering: if in the future I have a dataset containing millions of images, what can be done?

For fitting the drift artifacts on large datasets, it would not be possible to load that many images into memory. Is there any way to deal with this?


Please note: I see two ways it could be done, but did not find support for the points below. Please let me know if this is possible:

  1. Retraining/updating the drift artifacts: update on batches of images that fit into memory
  2. Using a tf.data pipeline or PyTorch DataLoader directly as the input source
arnaudvl commented 2 years ago

Hi @yogendra-yatnalkar , thanks for the kind words. A few thoughts on scaling drift detectors to millions of images:

  1. The marginal benefit of checking for drift on millions vs. a few hundred thousand images is small. So in practice you can always randomly subsample your original data (see the sketch after this list).
  2. I am internally testing KeOps to accelerate and scale some of our existing kernel-based drift detectors such as the MMD detector. So far it looks promising and allows effortless scaling to 500k instances in each of the reference and test sets. It can also drastically speed up the detectors on GPU (easily an order of magnitude or even more). So we could integrate KeOps as a backend (similar to the PyTorch and TensorFlow backends) into existing detectors such as MMD to deal with larger datasets. Therefore KeOps + random downsampling to a few 100k instances might work in your case.
  3. You are right in your suggestions: if we really need to scale to millions of images, then we would have to pass dataloaders to the detector instead of the current list/np.ndarray format and update the relevant test statistics (e.g. K-S or MMD^2) incrementally, batch by batch. This might require rewriting some of the detectors' internals. As mentioned before though, the marginal benefit might be rather small.
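
For illustration, a minimal sketch of point 1 (random subsampling before fitting a detector), assuming the reference and test images are stored on disk as `.npy` arrays that can be memory-mapped; the file names and sample sizes here are hypothetical:

```python
import numpy as np
from alibi_detect.cd import MMDDrift

# Hypothetical memory-mapped image arrays of shape (N, H, W, C),
# so the full datasets never need to be loaded into RAM at once.
x_ref_all = np.load('reference_images.npy', mmap_mode='r')
x_test_all = np.load('test_images.npy', mmap_mode='r')

# Keep the subsample modest: the MMD kernel matrix is quadratic in n,
# so raw MMD at a few 100k instances really needs something like KeOps.
n_sample = 10_000
rng = np.random.default_rng(0)
idx_ref = np.sort(rng.choice(len(x_ref_all), size=n_sample, replace=False))
idx_test = np.sort(rng.choice(len(x_test_all), size=n_sample, replace=False))

# Fancy indexing materialises only the sampled rows in memory.
x_ref = x_ref_all[idx_ref].astype(np.float32)
x_test = x_test_all[idx_test].astype(np.float32)

# Fit the MMD detector on the subsampled reference set and test for drift.
cd = MMDDrift(x_ref, backend='pytorch', p_val=.05)
preds = cd.predict(x_test)
print(preds['data']['is_drift'], preds['data']['p_val'])
```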

Please also note that it is typically not recommended to use univariate detectors such as the K-S detector on high-dimensional data such as images, although I am glad it seems to be working well in this case. :)

yogendra-yatnalkar commented 2 years ago

@arnaudvl Thank you for the detailed explanation. Understood the point that if we need to detect drift, we can subsample a big dataset and work with that.

Before closing this thread, I would like to ask a beginner's question: which detector would you prefer for image-related tasks? Thanks in advance.

arnaudvl commented 2 years ago

Definitely not a beginner question since it can be quite tricky! It depends a bit on what you are trying to detect drift on. As explained here, we can understand drift as a change in P(x,y) (with x the input images and y the ground truth) between the reference and test data. This change can happen because P(x) changed (covariate shift), P(y) changed (target drift) or P(y|x) changed (concept drift). Note that multiple types of drift can happen at once! So it's usually a good idea to take that into account when setting up a monitoring system.
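
For reference, the factorisation underlying this taxonomy (standard notation, nothing alibi-detect specific):

```latex
P(x, y) = P(y \mid x)\,P(x) = P(x \mid y)\,P(y)
% covariate shift:     P(x) changes,    P(y|x) stays fixed
% target/label drift:  P(y) changes,    P(x|y) stays fixed
% concept drift:       P(y|x) itself changes
```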

We often cannot directly detect changes in P(y) since we typically don't have immediate access to the test data ground truth. We can however proxy this via e.g. detecting drift on the model prediction distribution. So if the model predicts classes (univariate), use the Chi^2 detector; if it predicts a probability distribution, you can use the K-S detector for low dimensions (e.g. binary classification) or the MMD or LSDD detectors. The latter two also have online equivalents (MMD online and LSDD online). Check the benefits of online detectors here. Another useful detector on model predictions is the uncertainty drift detector, which can serve as a proxy for model performance deterioration. Note that these methods do not rely on the data modality of the input, just the model predictions.
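
A minimal sketch of the first option, drift on the predicted class distribution; the predictions below are simulated placeholders for what `clf.predict(...)` would return on your own reference and test data:

```python
import numpy as np
from alibi_detect.cd import ChiSquareDrift

# Placeholder predictions: in practice these come from your classifier,
# e.g. y_ref = clf.predict(x_ref), y_test = clf.predict(x_test).
rng = np.random.default_rng(0)
y_ref = rng.integers(0, 10, size=(2000, 1))   # predicted classes, reference set
y_test = rng.integers(0, 10, size=(500, 1))   # predicted classes, test window

# Chi^2 detector on the univariate, categorical prediction distribution.
cd = ChiSquareDrift(y_ref, p_val=.05)
preds = cd.predict(y_test)
print(preds['data']['is_drift'], preds['data']['p_val'])
```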

To detect possible changes in P(x) where x consists of images, we typically want to apply a dimensionality reduction step first. More on that here. For images this can be for instance the encodings from a pretrained autoencoder. This notebook contains a worked example on medical image data doing just that. Then almost any detector can be used (e.g. MMD or LSDD) on the encodings. Alternatively, if you don't want this encoding step, you can directly train a domain classifier which learns to distinguish the reference from the test set or use a (deep) learned kernel. An example for both on images can be found here.
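
A sketch of that encode-then-detect pattern using alibi-detect's untrained autoencoder (UAE) preprocessing utility with the TensorFlow backend; the image shape, layer sizes and random data are arbitrary placeholders:

```python
from functools import partial
import numpy as np
import tensorflow as tf
from alibi_detect.cd import MMDDrift
from alibi_detect.cd.tensorflow import UAE, preprocess_drift

# Placeholder data: batches of 32x32 RGB images.
rng = np.random.default_rng(0)
x_ref = rng.random((500, 32, 32, 3)).astype(np.float32)
x_test = rng.random((200, 32, 32, 3)).astype(np.float32)

# Untrained encoder mapping images to a low-dimensional space;
# a pretrained encoder could be dropped in instead.
encoding_dim = 32
encoder_net = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(32, 32, 3)),
    tf.keras.layers.Conv2D(64, 4, strides=2, padding='same', activation='relu'),
    tf.keras.layers.Conv2D(128, 4, strides=2, padding='same', activation='relu'),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(encoding_dim),
])
preprocess_fn = partial(preprocess_drift, model=UAE(encoder_net=encoder_net),
                        batch_size=64)

# MMD detector operates on the encodings rather than the raw pixels.
cd = MMDDrift(x_ref, backend='tensorflow', p_val=.05, preprocess_fn=preprocess_fn)
preds = cd.predict(x_test)
print(preds['data']['is_drift'], preds['data']['p_val'])
```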

Lastly, if you do have access to the labels, then you can directly apply a supervised drift detector such as the Cramér-von Mises detector or Fisher's Exact Test. These methods again don't rely on the data modality of the input, just on the labels.
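
For completeness, a sketch of the supervised route, testing for drift on the per-instance correct/incorrect indicator derived from predictions and (possibly delayed) ground-truth labels; the data below is simulated:

```python
import numpy as np
from alibi_detect.cd import FETDrift

# Placeholder 0/1 indicators of whether the model got each instance right,
# i.e. (y_pred == y_true) computed once the true labels arrive.
rng = np.random.default_rng(0)
correct_ref = rng.binomial(1, 0.92, size=(2000, 1))   # ~92% accuracy, reference
correct_test = rng.binomial(1, 0.80, size=(500, 1))   # accuracy drop, test window

# Fisher's Exact Test detector on the binary correctness indicator;
# alternative='less' flags drift when test accuracy falls below the reference.
cd = FETDrift(correct_ref, p_val=.05, alternative='less')
preds = cd.predict(correct_test)
print(preds['data']['is_drift'])
```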