Describe the bug
In single_gpu_tutorial.ipynb, the code for removing fuzzy duplicates is not removing the duplicates.
Steps/Code to reproduce bug
Code blocks 84 and 85 in single_gpu_tutorial.ipynb:
#Loads result from fuzzy dedup wrapper
fuzzy_duplicates = pd.read_parquet(fuzzy_dedup_output_dir)
#Generate list of near duplicate document ID
fuzzy_docs_to_remove = fuzzy_duplicates.drop_duplicates(subset=['group'], keep='first')
#Remove near duplicates
result = result[~result[input_id_field].isin(fuzzy_docs_to_remove[input_id_field])]
fuzzy_duplicates.drop_duplicates(subset=['group'], keep='first') keeps the first row of each group and drops the rest, so fuzzy_docs_to_remove ends up holding the one document per group that should be kept. The later filter, result[~result[input_id_field].isin(fuzzy_docs_to_remove[input_id_field])], therefore removes only that first occurrence from each group and leaves the other duplicates in result.
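The behavior can be reproduced on toy data. This is a minimal sketch, assuming the fuzzy dedup output has one row per duplicated document with an ID column and a `group` column; the column name `id` and the sample values are made up for illustration and are not taken from the notebook.

```python
import pandas as pd

input_id_field = "id"  # assumed column name, for illustration only

# Toy dataset: six documents, "f" has no near duplicates.
result = pd.DataFrame({"id": ["a", "b", "c", "d", "e", "f"]})

# Toy fuzzy dedup output (assumed shape): a/b/c form group 0, d/e form group 1.
fuzzy_duplicates = pd.DataFrame(
    {"id": ["a", "b", "c", "d", "e"], "group": [0, 0, 0, 1, 1]}
)

# Same logic as the notebook: keeps only the FIRST row of each group.
fuzzy_docs_to_remove = fuzzy_duplicates.drop_duplicates(subset=["group"], keep="first")

# Removes "a" and "d", while "b", "c" and "e" (the actual duplicates) survive.
remaining = result[~result[input_id_field].isin(fuzzy_docs_to_remove[input_id_field])]
remaining_ids = remaining[input_id_field].tolist()
print(remaining_ids)
```

In this toy case three documents from group 0 and group 1 that are near duplicates of each other are still present after the filter, and the only copies removed are the ones that would normally be kept.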
Expected behavior
fuzzy_docs_to_remove should contain every duplicate in a group except the first occurrence, not the first occurrence itself:
fuzzy_docs_to_remove = fuzzy_duplicates[fuzzy_duplicates.duplicated(subset=['group'], keep='first')]
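A filter based on `duplicated` can be checked on toy data. This is a sketch under the same kind of assumptions as the report (one row per duplicated document, with an ID column and a `group` column); the column name `id` and the sample values are hypothetical, and the result is assigned to `fuzzy_docs_to_remove` so that the removal step stays unchanged.

```python
import pandas as pd

input_id_field = "id"  # assumed column name, for illustration only

# Toy dataset: six documents, "f" has no near duplicates.
result = pd.DataFrame({"id": ["a", "b", "c", "d", "e", "f"]})

# Toy fuzzy dedup output (assumed shape): a/b/c form group 0, d/e form group 1.
fuzzy_duplicates = pd.DataFrame(
    {"id": ["a", "b", "c", "d", "e"], "group": [0, 0, 0, 1, 1]}
)

# duplicated(keep="first") is True for every row in a group EXCEPT the first,
# so this selects exactly the documents that should be removed.
fuzzy_docs_to_remove = fuzzy_duplicates[
    fuzzy_duplicates.duplicated(subset=["group"], keep="first")
]

# Remove near duplicates; one representative per group is kept.
deduped = result[~result[input_id_field].isin(fuzzy_docs_to_remove[input_id_field])]
deduped_ids = deduped[input_id_field].tolist()
print(deduped_ids)
```

With this selection, one document per group ("a" and "d") and the unique document ("f") remain, which matches the expected behavior described above.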