Add Fuzzy Matching Feature to Enhance Duplicate Detection

anushka4124 commented 3 weeks ago

I would like to propose a new feature to enhance the duplicate detection capabilities of TwinTrim by introducing a fuzzy matching technique. Currently, the tool relies on strict hashing to identify duplicate files, which works well for exact duplicates. However, many users may encounter near-duplicate files: files that are not identical but share high content similarity (e.g., different versions of a document, or edited images).

The fuzzy matching feature would:

Use similarity metrics (such as Jaccard index or Levenshtein distance) to detect files with slight differences.
Help identify near-duplicates, reducing redundancy and improving storage efficiency.
Allow users to set a similarity threshold to control the sensitivity of the fuzzy matching.
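
To make the idea of a similarity metric with a user-set threshold concrete, here is a minimal sketch using the Jaccard index over character shingles. It is only an illustration, not the code in fuzzy.py: the function names (shingles, jaccard_similarity, is_near_duplicate), the shingle length of 3, and the default threshold of 0.8 are all assumptions made for this example.

```python
def shingles(text: str, k: int = 3) -> set:
    """Return the set of overlapping k-character substrings of text."""
    if len(text) < k:
        # Short inputs: treat the whole string as a single shingle.
        return {text} if text else set()
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard_similarity(a: str, b: str, k: int = 3) -> float:
    """Jaccard index of the two shingle sets: |A & B| / |A | B|."""
    set_a, set_b = shingles(a, k), shingles(b, k)
    if not set_a and not set_b:
        # Two empty inputs are considered identical.
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def is_near_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    """Hypothetical helper: True when similarity meets the chosen threshold."""
    return jaccard_similarity(a, b) >= threshold
```

Raising the threshold toward 1.0 makes the check stricter (closer to exact matching), while lowering it flags more loosely related files. This sketch only covers text content; images would need a different metric.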

I have already implemented an initial version of this feature by creating a new fuzzy.py file. This includes a method _find_fuzzyduplicates that uses fuzzy matching to detect near-duplicates.

If you like the idea, please assign this issue to me so I can continue developing the feature, @Kota-Karthik.

Kota-Karthik commented 3 weeks ago

@anushka4124 It sounds good. You can make the changes, but try to retain the folder structure. Also add a separate flag, --fuzzy-duplicates. But remember, this flag shouldn't make scanning for and finding fuzzy duplicates take much time. @techy4shri Please add the necessary labels.
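
As a rough illustration of how an opt-in --fuzzy-duplicates flag could stay cheap, the sketch below registers the flag and narrows the set of file pairs before any expensive similarity comparison by only pairing files of similar size. Everything here is hypothetical: argparse is used purely for illustration (the project's actual CLI wiring may differ), and build_parser, candidate_pairs, and the 10% size tolerance are assumptions for this example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser exposing the opt-in fuzzy flag."""
    parser = argparse.ArgumentParser(prog="twintrim-sketch")
    parser.add_argument(
        "--fuzzy-duplicates",
        action="store_true",
        help="also look for near-duplicate files (slower than exact hashing)",
    )
    return parser


def candidate_pairs(sizes: dict, tolerance: float = 0.1):
    """Yield (path_a, path_b) pairs whose sizes differ by at most `tolerance`.

    `sizes` maps a file path to its size in bytes. Sorting by size lets the
    inner loop stop early, so most dissimilar pairs are never compared.
    """
    ordered = sorted(sizes.items(), key=lambda item: item[1])
    for i, (path_a, size_a) in enumerate(ordered):
        for path_b, size_b in ordered[i + 1:]:
            if size_b > size_a * (1 + tolerance):
                break  # every later file is even larger; stop scanning
            yield path_a, path_b
```

Only the pairs produced by candidate_pairs would then be passed to the slower similarity check, and only when the flag is set. Size is a coarse filter, so it can miss near-duplicates whose sizes differ a lot; that is the trade-off for speed in this sketch.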

anushka4124 commented 3 weeks ago

I have already opened a PR with all the changes required for fuzzy duplicate file detection. Please have a look at it, @Kota-Karthik!

techy4shri commented 3 weeks ago

@anushka4124 You cannot make a PR without being assigned to an issue first, so I will be rejecting that PR. Make a new PR and follow the template to ensure it gets reviewed and merged successfully.

anushka4124 commented 3 weeks ago

Thank you so much for assigning this task to me!

I have opened another pull request. Kindly take a look, @Kota-Karthik @techy4shri.

Kota-Karthik / twinTrim

Add Fuzzy Matching Feature to Enhance Duplicate Detection #84