NVIDIA / NeMo-Curator

Scalable toolkit for data curation
Apache License 2.0

Move common dedup utils and remove unused code #42

Closed ayushdg closed 2 months ago

ayushdg commented 2 months ago

This PR removes the nemo_curator/gpu_deduplication folder in favor of using all code from either the fuzzy_dedup module or fuzzy_dedup_utils. (A few methods were left behind.)

It also moves the fuzzy dedup scripts into a new subfolder with a README covering the order of execution and example usage. It adds a caution to the gpu_deduplication Slurm example currently in examples; that example will be removed in a follow-up and replaced by a Python-only API example.