Feature: Add support for finding duplicates

Worgle123 commented 1 year ago

The Idea:

I have pretty massive numbers of files, and I have been constantly frustrated about the default duplicate finder, which only appears to find duplicate names. I would love to see this feature find it's way to Files, as at the moment, I have to use paid clunky software, which doesn't even appear to do a great job. Something that had a look at file contents or even just size would be amazing!

How I should work.

The feature should have several layers of scanning. It could be initiated either automatically if certain parameters (like name, or instead, name and then size) were activated, or within the right click menu of a folder/disk, or maybe within the right click menu if a user had selected 1 or more files (this would mean it would only scan for duplicates of the selected files).

It could start by finding all files with a not just identical, but similar name, and then also be set up to do a more resource intensive scan of file sizes find duplicates. It would only progress to scanning file sizes after it had found the ones with similar names. It would only scan the ones with similar names, and could have adjustable sensitivity to reduce it's performance impact. Maybe a system could automatically set this to a recommended sensitivity depending on file types. To give an example, when searching only through images, you generally have a decent sized difference between images, so sensitivity could be turned down. When looking through text documents though, they tend to have a relatively consistent file size, so close to maximum sensitivity would be needed. It may (if it is not to hard) be able to pick a recommended average score if there is some difference in file types. Even if it would not work if there was a large difference, it would help with the process if, say, there were just 1 or 2 rogue files in a scan. This would mean the system would not break just for the sake of 1 or 2 files.

It would also have to be able to judge whether they are in the same format or not, as otherwise there could be the accidental clash of (for an example: CR3 and .JPEG files).

If it found any matches, it could then progress to scanning the actual contents of the file, or (probably better) just opening it in a separate window for the user to judge the similarity. After the scan was completed, it would then open a grid/list area of all located duplicates, (where you could check/uncheck suspected duplicates) and ask the user for a next step (eg. delete/rename/move to a different location).

In a Nutshell

Will add an advanced duplicate finder.

Files Version

2.0.31.0

Windows Version

Windows 11, version 22H2

Comments

Could be both auto (when a file with the same name is added to a folder) and have an option to scan specific file areas.

Worgle123 commented 1 year ago

Maybe it could also check for file size similarities if the names were not just identical, but similar?

cinqmilleans commented 1 year ago

This is a very useful feature that brings real added value. Unfortunately, having tried several, it seems too complex to me to integrate. It takes too many options to handle. You have to display millions of results and keep a user-friendly interface. Stains can take several hours. It is software in itself, but unfortunately there are no good free ones.

Worgle123 commented 1 year ago

This is a very useful feature that brings real added value. Unfortunately, having tried several, it seems too complex to me to integrate. It takes too many options to handle. You have to display millions of results and keep a user-friendly interface. Stains can take several hours. It is software in itself, but unfortunately there are no good free ones.

Interesting that it would be so complex. Maybe further down the road though? It could be improved over time, and slowly built up, with more feautures. Any improvement would really be very welcome here! Remember, great things come from large amounts of time and effort. Thanks!

cinqmilleans commented 1 year ago

Of course, but you have to be aware of the work to be done. You will need a powerful database. You shouldn't interfere with the rest of the application either (in terms of performance and UX). Maybe it would be better to make it a spinoff app.

manfromarce commented 1 year ago

It may be reasonable in my opinion to implement only manual scans via a context menu item that opens a separate window (such as Properties) and limit to one simultaneous scan. In this way it would be helpful but would't impact performance and the app in general too much as automatic background scans would do. It would also be less complex, e.g. I don't think CCleaner's duplicates finder mantains a database.

Worgle123 commented 1 year ago

It may be reasonable in my opinion to implement only manual scans via a context menu item that opens a separate window (such as Properties) and limit to one simultaneous scan. In this way it would be helpful but would't impact performance and the app in general too much as automatic background scans would do. It would also be less complex, e.g. I don't think CCleaner's duplicates finder mantains a database.

Just saw this now (6 March) I agree that 1 simultaneous scan would be a better proposal. How would you propose it would scan for duplicates? I thought that It could have a size filter, and it could have varying levels of sensitivity. Maybe it could even set a computer proposed sensitivity.

Thanks!

Josh65-2201 commented 1 year ago

It could be possible to use a hash value to check. but that would only find exact duplicates in file data.

manfromarce commented 1 year ago

It could offer different toggeable options such as: ✅ exact duplicates ✅ same extension, similar size ✅ same extension, similar name and size but I don't know what would be a good default tolerance

Worgle123 commented 1 year ago

It could be possible to use a hash value to check. but that would only find exact duplicates in file data.

Could you explain this please? Excuse my ignorance, but what are hash values?

Worgle123 commented 1 year ago

It could offer different togglable options such as: ✅ exact duplicates ✅ same extension, similar size ✅ same extension, similar name and size but I don't know what would be a good default tolerance

I agree with you. As for tolerance, maybe it could have different sensitivities for separate extensions? .docx .txt and other such files would need more accurate scans, as their size is generally pretty similar, but with images, sizes tend to vary more. I believe that I already suggested something similar in the original suggestion.

Jay-o-Way commented 1 year ago

For those who need it: CCleaner has this function. And i bet a number of others will have it too.

yaira2 commented 1 year ago

It might be useful to notify the user of duplicates in the downloads folder.

Worgle123 commented 1 year ago

Yeah. I agree. That's where all my storage is :/

Sent via Superhuman ( @.*** )

On Wed, May 31, 2023 at 05:43:45, Yair < @.*** > wrote:

It might be useful to notify the user of duplicates in the downloads folder.

— Reply to this email directly, view it on GitHub ( https://github.com/files-community/Files/issues/11385#issuecomment-1569144507 ) , or unsubscribe ( https://github.com/notifications/unsubscribe-auth/A5NLPEIS6ICE3X6GRPAHQCTXIZSZDANCNFSM6AAAAAAVCOZ76Q ). You are receiving this because you authored the thread. Message ID: <files-community/Files/issues/11385/1569144507 @ github. com>

files-community / Files