Open · ayushdg opened 6 days ago
Examples of removing duplicates using a merge:
https://gist.github.com/VibhuJawa/7c780209bdcad9ac7615bd84b86cde58
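For reference, a minimal sketch of what a merge-based removal can look like with dask_cudf (not the gist verbatim; the `id` column name, the file paths, and the `broadcast=True` hint are assumptions):

```python
import dask_cudf

dataset = dask_cudf.read_parquet("input_dataset/")    # documents, keyed by an "id" column
duplicates = dask_cudf.read_parquet("duplicate_ids/")  # ids flagged for removal

# Tag every duplicate id, broadcast the (small) duplicate list to each worker,
# and keep only the rows that found no match in it.
duplicates = duplicates[["id"]].assign(_is_dup=1)
merged = dataset.merge(duplicates, on="id", how="left", broadcast=True)
deduped = merged[merged["_is_dup"].isna()].drop(columns=["_is_dup"])

deduped.to_parquet("deduped_dataset/")
```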
My best suggestion here, if we want to skip doing a broadcast merge, is to do a batched index merge (like we do in the connected components (CC) stage); I think that's the most scalable option (see the sketch below).
We can also decide based on a heuristic: the list of IDs is already in distributed GPU memory, so we can switch between the two, which means we don't compromise on performance in the short term.
We can also play tricks with filtering, if we wanted to, by first creating a map of file -> dataset_ids and then doing a filter on each dataset.
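A rough sketch of the batched idea (illustrative only, not the CC-stage implementation; `id`, `batch_size`, and the paths are assumptions): split the duplicate list into small groups of partitions and remove each group with its own broadcast merge, so no single step needs the full list in one place.

```python
import dask_cudf

dataset = dask_cudf.read_parquet("input_dataset/")
dup_ids = dask_cudf.read_parquet("duplicate_ids/")["id"]

batch_size = 32  # duplicate-list partitions handled per batch (assumed knob)
for start in range(0, dup_ids.npartitions, batch_size):
    # Each batch is a bounded slice of the duplicate list.
    batch = dup_ids.partitions[start : start + batch_size].to_frame().assign(_is_dup=1)
    dataset = dataset.merge(batch, on="id", how="left", broadcast=True)
    dataset = dataset[dataset["_is_dup"].isna()].drop(columns=["_is_dup"])
    # Optionally persist() here to bound the task graph between batches.

dataset.to_parquet("deduped_dataset/")
```

The heuristic switch mentioned above could then just compare the duplicate-list size against available worker memory and pick either the single broadcast merge or the batched loop.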
**Is your feature request related to a problem? Please describe.**
The current deduplication examples suggest calling `compute()` on the list of duplicate documents produced via exact/fuzzy deduplication and then using the computed list to filter out input documents. This doesn't work when the duplicate list is too large to fit on the client. Ideally, Curator can provide additional classes/methods to remove the documents in the duplicate list more efficiently.

**Describe the solution you'd like**
A broadcast merge approach like the one suggested by @VibhuJawa works well enough at the 4-8 TB scale, where the duplicate list is small enough to be broadcast to each worker, and is worth implementing first. Longer term, there may be a need for smarter partitioning of the duplicate list so that different files/subsets can handle their own list of duplicates differently.
**Describe alternatives you've considered**
N/A
**Additional context**
The Zyda-2 tutorial and the pre-training data tutorial both contain alternate approaches to calling `compute()`, since it's memory intensive.
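For context, a rough sketch of the compute-based pattern described in the problem statement above, which those tutorials work around (column names and paths are assumptions); it requires the entire duplicate list to fit on the client:

```python
import dask_cudf

dataset = dask_cudf.read_parquet("input_dataset/")
duplicates = dask_cudf.read_parquet("duplicate_ids/")

# compute() materializes the full duplicate list on the client; this is the step
# that fails once the list no longer fits in client memory.
dup_ids = duplicates["id"].compute()
deduped = dataset[~dataset["id"].isin(dup_ids)]
deduped.to_parquet("deduped_dataset/")
```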