NVIDIA / NeMo-Curator

Scalable data preprocessing and curation toolkit for LLMs
Apache License 2.0

removing fuzzy duplicates bug in single node tutorial #210

Closed yyu22 closed 1 month ago

yyu22 commented 2 months ago

Describe the bug

In single_gpu_tutorial.ipynb, the code for removing fuzzy duplicates is not removing the duplicates.

Steps/Code to reproduce bug

Code blocks 84 and 85 in single_gpu_tutorial.ipynb:

#Loads result from fuzzy dedup wrapper
fuzzy_duplicates = pd.read_parquet(fuzzy_dedup_output_dir)

#Generate list of near duplicate document ID
fuzzy_docs_to_remove = fuzzy_duplicates.drop_duplicates(subset=['group'], keep='first')

#Remove near duplicates
result = result[~result[input_id_field].isin(fuzzy_docs_to_remove[input_id_field])]

fuzzy_duplicates.drop_duplicates(subset=['group'], keep='first') keeps the first occurrence in each duplicate group and drops the rest, so fuzzy_docs_to_remove ends up holding exactly one document per group. As a result, result[~result[input_id_field].isin(fuzzy_docs_to_remove[input_id_field])] removes only that first occurrence from each group, and the remaining duplicates stay in result.
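A toy reproduction may make the effect easier to see. The data below is invented for illustration, and the column name `id` is only a stand-in for input_id_field; neither comes from the tutorial notebook.

```python
import pandas as pd

# Hypothetical stand-ins for the tutorial's variables (not the real data).
input_id_field = "id"
result = pd.DataFrame({"id": ["a", "b", "c", "d", "e"]})

# Documents a, b, c form one near-duplicate group; d and e form another.
fuzzy_duplicates = pd.DataFrame(
    {"id": ["a", "b", "c", "d", "e"], "group": [0, 0, 0, 1, 1]}
)

# Same logic as the tutorial cells: this keeps one row per group.
fuzzy_docs_to_remove = fuzzy_duplicates.drop_duplicates(subset=["group"], keep="first")
buggy_result = result[~result[input_id_field].isin(fuzzy_docs_to_remove[input_id_field])]

removed_ids = fuzzy_docs_to_remove[input_id_field].tolist()
remaining_ids = buggy_result[input_id_field].tolist()
print(removed_ids)    # ['a', 'd']
print(remaining_ids)  # ['b', 'c', 'e']
```

In this sketch, group 0 still has two members (b and c) after the "removal", which is the behavior described above.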

Expected behavior

fuzzy_docs_to_remove should contain the later duplicates in each group, not the first occurrence:

fuzzy_docs_to_remove = fuzzy_duplicates[fuzzy_duplicates.duplicated(subset=['group'], keep='first')]
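As a rough check of the proposed fix, the sketch below applies the `duplicated` mask to made-up data. The column name `id`, the sample rows, and the extra document `f` (one that appears in no duplicate group) are assumptions for illustration, not values from the tutorial.

```python
import pandas as pd

# Hypothetical stand-ins for the tutorial's variables (not the real data).
input_id_field = "id"
result = pd.DataFrame({"id": ["a", "b", "c", "d", "e", "f"]})
fuzzy_duplicates = pd.DataFrame(
    {"id": ["a", "b", "c", "d", "e"], "group": [0, 0, 0, 1, 1]}
)

# duplicated(keep="first") marks every row except the first one in each group.
is_later_duplicate = fuzzy_duplicates.duplicated(subset=["group"], keep="first")
fuzzy_docs_to_remove = fuzzy_duplicates[is_later_duplicate]

# Drop only the later duplicates; one representative per group survives.
deduped = result[~result[input_id_field].isin(fuzzy_docs_to_remove[input_id_field])]

kept_ids = deduped[input_id_field].tolist()
print(kept_ids)  # ['a', 'd', 'f']
```

With this mask, each group keeps its first document and documents outside any group are left untouched.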

arhamm1 commented 1 month ago

@nicoleeeluo Can you help triage this?