Open · ayushdg opened 6 days ago
Examples of removing duplicates using a merge:
https://gist.github.com/VibhuJawa/7c780209bdcad9ac7615bd84b86cde58
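For reference, a minimal sketch of what a merge-based removal can look like with dask_cudf (not the gist verbatim; the `id` column name, the file paths, and the `broadcast=True` hint are assumptions):

```python
import dask_cudf

dataset = dask_cudf.read_parquet("input_dataset/")    # documents, keyed by an "id" column
duplicates = dask_cudf.read_parquet("duplicate_ids/")  # ids flagged for removal

# Tag every duplicate id, broadcast the (small) duplicate list to each worker,
# and keep only the rows that found no match in it.
duplicates = duplicates[["id"]].assign(_is_dup=1)
merged = dataset.merge(duplicates, on="id", how="left", broadcast=True)
deduped = merged[merged["_is_dup"].isna()].drop(columns=["_is_dup"])

deduped.to_parquet("deduped_dataset/")
```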
My best suggestion here, if we want to skip doing a broadcast merge, is to do a batched index merge (like we do in the connected components (CC) stage); I think that's the most scalable option (see the sketch below).
We can also decide based on a heuristic: the list of IDs is already in distributed GPU memory, so we can switch between the two, which means we don't compromise on performance in the short term.
We can also play tricks with filtering, if we wanted to, by first creating a map of file -> dataset_ids and then doing a filter on each dataset.
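A rough sketch of the batched idea (illustrative only, not the CC-stage implementation; `id`, `batch_size`, and the paths are assumptions): split the duplicate list into small groups of partitions and remove each group with its own broadcast merge, so no single step needs the full list in one place.

```python
import dask_cudf

dataset = dask_cudf.read_parquet("input_dataset/")
dup_ids = dask_cudf.read_parquet("duplicate_ids/")["id"]

batch_size = 32  # duplicate-list partitions handled per batch (assumed knob)
for start in range(0, dup_ids.npartitions, batch_size):
    # Each batch is a bounded slice of the duplicate list.
    batch = dup_ids.partitions[start : start + batch_size].to_frame().assign(_is_dup=1)
    dataset = dataset.merge(batch, on="id", how="left", broadcast=True)
    dataset = dataset[dataset["_is_dup"].isna()].drop(columns=["_is_dup"])
    # Optionally persist() here to bound the task graph between batches.

dataset.to_parquet("deduped_dataset/")
```

The heuristic switch mentioned above could then just compare the duplicate-list size against available worker memory and pick either the single broadcast merge or the batched loop.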
**Is your feature request related to a problem? Please describe.**
The current deduplication examples suggest calling `compute()` on the list of duplicate documents produced via exact/fuzzy deduplication and then using the computed list to filter out input documents. This doesn't work when the duplicate list is too large to fit on the client. Ideally, Curator can provide additional classes/methods to remove the documents in the duplicate list more efficiently.

**Describe the solution you'd like**
A broadcast merge approach like the one suggested by @VibhuJawa works well enough at the 4-8 TB scale, where the duplicate list is small enough to be broadcast to each worker, and is worth implementing first. Longer term, there may be a need for smarter partitioning of the duplicate list so that different files/subsets can handle their own list of duplicates differently.
**Describe alternatives you've considered**
N/A
**Additional context**
The Zyda-2 tutorial and the pre-training data tutorial both contain alternate approaches to calling `compute()`, since it's memory intensive.
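For context, a rough sketch of the compute-based pattern described in the problem statement above, which those tutorials work around (column names and paths are assumptions); it requires the entire duplicate list to fit on the client:

```python
import dask_cudf

dataset = dask_cudf.read_parquet("input_dataset/")
duplicates = dask_cudf.read_parquet("duplicate_ids/")

# compute() materializes the full duplicate list on the client; this is the step
# that fails once the list no longer fits in client memory.
dup_ids = duplicates["id"].compute()
deduped = dataset[~dataset["id"].isin(dup_ids)]
deduped.to_parquet("deduped_dataset/")
```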