cloud-py-api / mediadc

Nextcloud Media Duplicate Collector application
https://apps.nextcloud.com/apps/mediadc
GNU Affero General Public License v3.0
94 stars · 7 forks

Delete all duplicates #94

Open needsupport opened 1 year ago

needsupport commented 1 year ago

Description: I have 3000 duplicates. Deleting them one by one is going to take forever. Can you add a "delete all duplicates for this task" button? Example

2br-2b commented 1 year ago

The main thing that I (not a developer of this project) can see would need to be addressed is: how would you choose which file to keep and which to delete? Date modified, name, folder, etc.? I had been thinking about suggesting this, but I wanted to think through these options more before I said anything.

bigcat88 commented 1 year ago

We are working on it, we have one interesting idea =)

2br-2b commented 1 year ago

> We are working on it, we have one interesting idea =)

Sounds great! I'm looking forward to this! 😄

Also, just a note for future reference, this is probably a duplicate of #75

DrMurx commented 1 year ago

Before we think about a complex way to determine which of the duplicates are to be deleted automatically, it would already be helpful if:

  1. all duplicate groups on the current page could be opened with the click of a single button, and
  2. all duplicates with their checkboxes ticked, across all duplicate groups, could be deleted with a single click.

teemue commented 1 year ago

Also, an option to select "safe folder(s)": i.e. if you are comparing folders A, B, and C, marking A as safe won't let you delete any files from that folder or its subfolders.
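As a rough sketch of that rule (the function names and group format here are illustrative, not part of MediaDC):

```python
from pathlib import PurePosixPath


def is_protected(path: str, safe_folders: list[str]) -> bool:
    """Return True if `path` lies inside a safe folder or any of its subfolders."""
    p = PurePosixPath(path)
    return any(
        p == PurePosixPath(safe) or PurePosixPath(safe) in p.parents
        for safe in safe_folders
    )


def deletable(candidates: list[str], safe_folders: list[str]) -> list[str]:
    """Filter deletion candidates so protected files are never deleted."""
    return [c for c in candidates if not is_protected(c, safe_folders)]
```

With `["A"]` as the safe list, `deletable(["A/x.jpg", "B/x.jpg"], ["A"])` would only offer `B/x.jpg` for deletion.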

ksihota commented 1 year ago

This is a necessity for me. I like the idea of a safe directory or directories. I suggested an alternate layout in issue #75 for displaying and selecting duplicates for deletion.

forgi007 commented 1 year ago

A possible auto-delete option might be to delete all the smaller files from each group (and preserve only the one with the largest size). This helps especially if you have imported images from Google and also have the originals, which are bigger.

An interactive threshold slider would also be nice, to fine-tune before auto-deletion: if I see the highest differences in a few groups (the ones with the largest differences), I could tune the threshold to my taste before auto-deleting (again, the goal being to preserve just one image per group).

Thank you for the great work guys, you are great, keep on going!
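A minimal sketch of that keep-the-largest rule (the `{'path', 'size'}` group format is an assumption, not MediaDC's actual export schema):

```python
def keep_largest(group: list[dict]) -> tuple[dict, list[dict]]:
    """Given one duplicate group of {'path': str, 'size': int} entries,
    return (file_to_keep, files_to_delete), keeping the largest file."""
    keep = max(group, key=lambda f: f["size"])
    return keep, [f for f in group if f is not keep]
```

Applied per group, this leaves exactly one file (the biggest) and marks the rest for deletion.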

RedSackles commented 1 year ago

I was about to put in a similar request, as I indeed have thousands of match groups to work through, and with the threshold set to 99% there don't seem to be that many false matches. At least in my use case, keeping the largest of the files seems like a fine way to determine which files to delete. I am really looking forward to this functionality.

hunluki commented 1 year ago

Just registered to let you know: please make this happen. I'm currently looking at over 5000 matching groups. I don't want to :)

KaiOnGitHub commented 1 year ago

+1

mniami commented 1 year ago

@bigcat88 is any help needed?

bigcat88 commented 1 year ago

> @bigcat88 is any help needed?

We are currently working full-time (even more than full-time) on a project called "Application Ecosystem V2" for Nextcloud (its repositories are currently in this cloud-py-api org), and unfortunately we don't have time to implement specific feature requests at the moment. However, we would be more than happy to accept a pull request if someone from the community would like to contribute the feature themselves. Alternatively, if you can wait until we finish work on "Application Ecosystem V2" (at least one month, but probably two), we will be able to return to MediaDC and consider implementing your requested feature. If all goes well, we will start rewriting the MediaDC Python part in parallel once we finish the design stage and publish the docs for AEv2. Thank you for understanding.

mniami commented 1 year ago

@bigcat88 In that case there isn't much sense in us helping with workarounds; please just keep us posted on the progress. It's a really useful feature, thanks.

mniami commented 1 year ago

@bigcat88 how is Ecosystem V2 coming along? The feature of removing all duplicates at once is really interesting :) please come back to it.

bigcat88 commented 1 year ago

@mniami Fast, but slower than I expected. Regarding MediaDC specifically: I hope it will be easy to add App Store support for applications written with AppEcosystem, and in 2-3 weeks it should be possible to start migrating MediaDC to AppEcosystem.

That's still not certain; so far everything is going well, but there can always be some kind of obstacle.

mniami commented 1 year ago

@bigcat88 thanks for letting us know

jeliastam commented 8 months ago

Checking in!

tbarbette commented 7 months ago

In the meantime, I made a Python script that uses the JSON export:

https://github.com/tbarbette/mediadc-massdelete/tree/main

It first uses file size to keep the biggest file, then applies filename heuristics you can supply when sizes match (for instance, delete everything with "whatsapp" in the path, and prefer not to delete anything with "DCIM" in the path). I also found the --different-path-only option useful for avoiding deleting pictures from a burst: it will not delete files in the same folder. In general, you remove duplicates because a mess was created by different folders containing similar pictures, smaller versions created by WhatsApp that were imported, thumbnails, etc. Hope it helps.
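The ranking idea behind such heuristics can be sketched as follows (the keywords and ordering here are illustrative, not taken from the linked script's actual code):

```python
def rank(path: str, size: int) -> tuple:
    """Higher tuple sorts as 'more worth keeping': prefer DCIM paths,
    avoid WhatsApp imports, then fall back to file size."""
    lowered = path.lower()
    prefer = "dcim" in lowered      # likely camera-roll originals
    avoid = "whatsapp" in lowered   # likely recompressed imports
    return (prefer, not avoid, size)


def pick_deletions(group: list[dict]) -> list[str]:
    """Keep the best-ranked file in a duplicate group, delete the rest."""
    ordered = sorted(group, key=lambda f: rank(f["path"], f["size"]), reverse=True)
    return [f["path"] for f in ordered[1:]]
```

Here a smaller DCIM original would still win over a larger WhatsApp copy, because path heuristics rank ahead of raw size.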

intcreator commented 2 weeks ago

Another heuristic/option that would be nice: the ability to delete all duplicates only if the file sizes are the same. Sometimes even the 100% matching setting doesn't actually identify a 100% visual match, but an exactly identical file size, on top of the match, should be an obvious indicator of a duplicate.

I have 72,000 duplicates, so I will be eagerly anticipating this feature.

mniami commented 2 weeks ago

Hey guys, size and checksum would be appropriate for the comparison.
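That size-then-checksum test could look like this (a generic sketch; MediaDC does not expose such a helper):

```python
import hashlib
import os


def same_file_content(path_a: str, path_b: str) -> bool:
    """Cheap size comparison first; only hash the contents when sizes match."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False

    def sha256(path: str) -> str:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            # Read in 64 KiB chunks so large media files don't load into RAM.
            for chunk in iter(lambda: f.read(1 << 16), b""):
                h.update(chunk)
        return h.hexdigest()

    return sha256(path_a) == sha256(path_b)
```

Checking size first means the expensive hash only runs on plausible exact duplicates.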


mniami commented 2 weeks ago

You can always check the creation date as well.

— Reply to this email directly, view it on GitHub https://github.com/cloud-py-api/mediadc/issues/94#issuecomment-2403319271, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADM5ZXJDAV2LVWGHNKQRFMDZ2WC63AVCNFSM6AAAAABPVIWWQSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDIMBTGMYTSMRXGE . You are receiving this because you were mentioned.Message ID: @.***>