dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Feature request: Data deduplication and Near-deduplication #6700

Open torronen opened 1 year ago

torronen commented 1 year ago

With current Microsoft.ML, developers may need to reduce the size of huge datasets (#6679), or at least it may be advisable to do so: for many problems and algorithms, hyperparameter tuning is important and may improve results more than adding data, and a smaller dataset makes it feasible to complete AutoML tuning. The fact that many of the algorithms are not fully parallelizable makes this even more important.

When sampling a dataset, I assume there are two key criteria: representativeness and coverage. The partial sample should represent the dataset as a whole, but it should also include edge cases.

At the moment, I am taking a random subsample using the SplitTrainTestSet method. I assume it is representative of the whole dataset by virtue of randomness.
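For illustration, here is a minimal sketch of that random-subsampling step using ML.NET's `DataOperationsCatalog.TrainTestSplit`; the file path, `ModelInput` POCO, seed, and 10% fraction are placeholder assumptions, not values from this issue:

```csharp
// Sketch: keep a random ~10% subsample of a large dataset for AutoML tuning.
// ModelInput, the file path, and the fraction are illustrative placeholders.
using Microsoft.ML;

var mlContext = new MLContext(seed: 42);

// Load the full dataset (ModelInput is a POCO describing the columns).
IDataView fullData = mlContext.Data.LoadFromTextFile<ModelInput>(
    "data/full-dataset.csv", separatorChar: ',', hasHeader: true);

// TrainTestSplit routes rows at random; testFraction: 0.9 leaves roughly 10% in TrainSet.
var split = mlContext.Data.TrainTestSplit(fullData, testFraction: 0.9, seed: 42);
IDataView subsample = split.TrainSet; // smaller, randomly sampled view
```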

However, I would like a way to make sure I am not removing edge cases by accident.

Feature requests:

- Data deduplication (removal of exact duplicate rows)
- Near-deduplication (removal of rows that are nearly identical)

Question:

I expect that not everything can or should be supported by Microsoft.ML, so I would appreciate any ideas and insights on how best to complete these tasks within the .NET ecosystem. I think exact data deduplication might be a good addition to Microsoft.ML, and if so, near-deduplication as well.
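As a possible starting point outside the library, exact deduplication can already be done in plain .NET by keying each row on its feature values. The following is only a sketch; the `DataRow` record, its columns, and the key format are assumptions made for illustration:

```csharp
// Sketch of exact row deduplication before loading data into an IDataView.
// DataRow and its fields are illustrative; the key is a stable string of raw values.
using System.Collections.Generic;
using System.Linq;

public record DataRow(float[] Features, bool Label);

public static class Dedup
{
    // Keeps the first occurrence of each identical feature vector + label.
    public static List<DataRow> RemoveExactDuplicates(IEnumerable<DataRow> rows)
    {
        var seen = new HashSet<string>();
        var result = new List<DataRow>();
        foreach (var row in rows)
        {
            // "R" round-trips float values exactly, so equal rows map to equal keys.
            string key = string.Join(",", row.Features.Select(f => f.ToString("R")))
                         + "|" + row.Label;
            if (seen.Add(key))
                result.Add(row);
        }
        return result;
    }
}
```

Near-deduplication would need a similarity notion on top of this (rounded-feature hashing, MinHash, clustering, ...), which is the part that is harder to do without library support.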

michaelgsharp commented 9 months ago

@luisquintanilla this sounds like something we should add to the backlog to at least keep track of. I'll add it to our "Future" milestone so we can track it.

luisquintanilla commented 9 months ago

> I'll add it to our "Future" milestone so we can track it.

I'm fine with this. It's a good idea.

torronen commented 9 months ago

Thanks! This is just a note about noise, as I do not know where else to write it. It relates to the removal of near-duplicates, and more generally to whether large amounts of data improve performance for the current algorithms in Microsoft.ML:

A while ago I concluded that adding data might not improve performance much, and I have been taking samples from my datasets since. However, in cases where there is a lot of noise, using as much data as possible might be quite helpful. In some cases noise can be removed before training, but there seem to be no good rules for when to use each method (smoothing, removal of outliers, ...). Also, not all of these are possible in live environments: what should be done if a live measurement falls into an "outlier" category that was excluded from training?
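For the outlier-removal part specifically, ML.NET can already drop rows whose value in a single numeric column falls outside a range. A hedged sketch follows; the column name and bounds are invented for illustration and would in practice come from domain knowledge or a robust statistic computed beforehand, and this does not address multivariate outliers or the live-measurement question above:

```csharp
// Sketch: drop rows whose "Visits" value lies outside an assumed plausible range.
// The column name and bounds are illustrative placeholders.
using Microsoft.ML;

public static class NoiseFiltering
{
    public static IDataView DropOutOfRangeRows(MLContext mlContext, IDataView data)
    {
        // FilterRowsByColumn keeps rows where lowerBound <= value < upperBound
        // for one numeric column; it does not handle multivariate outliers.
        return mlContext.Data.FilterRowsByColumn(
            data, columnName: "Visits", lowerBound: 0, upperBound: 10_000);
    }
}
```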

Things related to humans often seem noisy, at least when working with a relatively small number of people. For example, a web shop product might get lots of visitors because someone shared the link in a big social media group with an intriguing title, but that does not necessarily indicate more interest in the product itself.

So: representativeness and coverage, but also something that helps prevent noise from affecting the results; I am not sure about the correct term. Otherwise, there is a risk that the model overfits to noise if we remove too many "near duplicates". At the very least, we should try to guess which one is the least noisy (the average, or the median?) and keep that one, as in the sketch below.
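One hedged way to make that concrete: bucket rows by a rounded-feature key (a crude similarity proxy; MinHash or clustering would be more robust) and keep, per bucket, the row closest to the bucket's per-feature median. The `DataRow` record and the rounding granularity are assumptions for illustration only.

```csharp
// Sketch: group near-duplicates by a rounded-feature key and keep, per group,
// the row closest to the group's per-feature median (a guess at "least noisy").
// DataRow and the rounding granularity are illustrative assumptions.
using System;
using System.Collections.Generic;
using System.Linq;

public record DataRow(float[] Features, bool Label);

public static class NearDedup
{
    public static List<DataRow> KeepMedianRepresentative(
        IEnumerable<DataRow> rows, int decimals = 1)
    {
        return rows
            .GroupBy(r => string.Join(",",
                r.Features.Select(f => Math.Round(f, decimals))) + "|" + r.Label)
            .Select(g =>
            {
                var group = g.ToList();
                int dims = group[0].Features.Length;

                // Per-feature median of the group (upper median for even counts).
                var median = new float[dims];
                for (int d = 0; d < dims; d++)
                {
                    var sorted = group.Select(r => r.Features[d]).OrderBy(v => v).ToList();
                    median[d] = sorted[sorted.Count / 2];
                }

                // Keep the row with the smallest squared distance to the median.
                return group
                    .OrderBy(r => r.Features.Zip(median, (a, b) => (a - b) * (a - b)).Sum())
                    .First();
            })
            .ToList();
    }
}
```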

I am still thinking about this topic, so my thoughts are a bit unorganized yet, but it seems fairly important.