adrianlopezroche / fdupes

FDUPES is a program for identifying or deleting duplicate files residing within specified directories.

Feature Request: Option to select the files to delete #20

Open guettli opened 9 years ago

guettli commented 9 years ago

It would be very nice if you could specify which duplicates get deleted.

I want the newer files to get deleted and the oldest should be the one that "survives".

adrianlopezroche commented 9 years ago

Looks like it should be a matter of adding another option to --order / -o to order by time in reverse.

On May 26, 2015 6:05 PM, "Jody Bruchon" notifications@github.com wrote:

If you specify the -d option without the -N option, you are prompted for the files to delete or keep on a set-by-set basis. Do you want a modifier for -N that changes it to mean "no prompt before deletion AND delete all but the oldest in each set"?

— Reply to this email directly or view it on GitHub https://github.com/adrianlopezroche/fdupes/issues/20#issuecomment-105680688 .

adrianlopezroche commented 9 years ago

OK. I just ran fdupes and it looks like the default sort order would do what guettli wants: --order=time (which is the default) sorts from older to newer according to file modification time (mtime). When coupled with -d and -N, the result should be that the oldest file (by mtime) is preserved and the rest are deleted.
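The selection rule described above (sort ascending by mtime, keep the first entry) can be sketched as follows. This is a minimal illustration, not fdupes code: `dup_entry`, `oldest_first` and `pick_survivor` are hypothetical names, and the timestamps are supplied by the caller rather than read with stat().

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical record for one file in a duplicate set (not an fdupes type). */
struct dup_entry {
    const char *path;
    time_t mtime;   /* modification time, e.g. st_mtime from stat() */
};

/* qsort comparator: older mtime first, mirroring an ascending time order. */
static int oldest_first(const void *a, const void *b)
{
    const struct dup_entry *x = a;
    const struct dup_entry *y = b;
    return (x->mtime > y->mtime) - (x->mtime < y->mtime);
}

/* Sort the set so that index 0 is the file to preserve; the remaining
 * entries would be the deletion candidates. Nothing is deleted here. */
static const char *pick_survivor(struct dup_entry *set, size_t n)
{
    assert(set != NULL && n > 0);
    qsort(set, n, sizeof *set, oldest_first);
    return set[0].path;
}
```

Note that qsort() is not guaranteed to be stable, so entries with identical mtime values may come out in either order; the sketch only decides the survivor when the timestamps differ.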

adrianlopezroche commented 9 years ago

PS - This applies to the version of fdupes that is currently on git (master). Not all distros have those changes yet, so guettli: please check before attempting to run it on your computer.

guettli commented 9 years ago

Yes, you seem to be right.

I just discovered that the mtime of my files gets changed during import. I use gphoto2 and exiftool to sort images into directories.

Any chance to delete the file ending with -1.jpg?

max@ThinkPad-E520:~$ fdupes -rd ownCloud/Bilder/2015/05/02/
[1] ownCloud/Bilder/2015/05/02/20150502_202508-1.jpg
[2] ownCloud/Bilder/2015/05/02/20150502_202508-0.jpg

Set 1 of 3, preserve files [1 - 2, all]: 

The mtime is the same. I guess gphoto2 sets it according to the EXIF data.

max@ThinkPad-E520:~$ ls -l ownCloud/Bilder/2015/05/02/20150502_202508-*

-rwxrwxr-x 1 max max 694865 Mai  2 20:25 ownCloud/Bilder/2015/05/02/20150502_202508-0.jpg
-rwxrwxr-x 1 max max 694865 Mai  2 20:25 ownCloud/Bilder/2015/05/02/20150502_202508-1.jpg

Version: fdupes 1.51

adrianlopezroche commented 9 years ago

It looks like you want fdupes to sort by filename, in reverse. I could do it, but it would mean adding more sorting options (alphabetical by itself wouldn't work, as xxx-10.jpg would appear before xxx-2.jpg).

So, four ways of sorting by name: alphabetical ascending/descending, and length of name + alphabetical ascending/descending.

guettli commented 9 years ago

Yes this would be nice. Sorting by length of name is not important for me. But by alphabet would be nice to have. Thank you for listening :-)

adrianlopezroche commented 9 years ago

I see num1 and num2 are 1024 bytes long with no bounds checking. That's a buffer overflow waiting to happen.

The reference to filename length is not because I think it would be useful on its own, but rather because it's one way to handle numeric sorting. Specifically, "naive" numeric sorting can be achieved by sorting first by string length and then by string value, the drawback being that it doesn't handle leading zeroes: a file named, say, xxx-2 would precede a file named xxx-001. That doesn't really bother me, though, as having leading zeroes normally suggests a different naming convention than having no leading zeroes, in which case they're not really meant to be comparable in the first place.
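A minimal sketch of the "length first, then string value" ordering described above, assuming plain NUL-terminated names. `length_then_alpha` is a hypothetical helper, not an fdupes function.

```c
#include <assert.h>
#include <string.h>

/* "Naive" numeric ordering: shorter names first, ties broken by strcmp().
 * Hypothetical helper, not part of fdupes. */
static int length_then_alpha(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    size_t la = strlen(a), lb = strlen(b);
    if (la != lb)
        return (la > lb) - (la < lb);   /* shorter name sorts first */
    return strcmp(a, b);                /* same length: byte-wise order */
}
```

With this comparator xxx-2.jpg sorts before xxx-10.jpg, whereas plain strcmp() puts xxx-10.jpg first. It also places xxx-2 before xxx-001, which is the leading-zero drawback mentioned above.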

adrianlopezroche commented 9 years ago

I don't trust limits.h. For example, path names in Linux can be larger than PATH_MAX. I doubt the same is true of NAME_MAX, but I'd still want bounds checking.
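One way to get bounds checking that does not depend on NAME_MAX or PATH_MAX is to test every write against the actual size of the destination buffer. The sketch below is only an illustration of that idea; `copy_digit_run` is a hypothetical helper and is not taken from the patch under review.

```c
#include <assert.h>
#include <stddef.h>

/* Copy the leading run of ASCII digits from src into dst and NUL-terminate.
 * Returns the number of digits copied, or -1 if the run plus terminator
 * does not fit in dst_size bytes. Hypothetical helper. */
static long copy_digit_run(const char *src, char *dst, size_t dst_size)
{
    assert(src != NULL && dst != NULL);
    size_t n = 0;
    while (src[n] >= '0' && src[n] <= '9') {
        if (n + 1 >= dst_size)   /* no room for this digit plus '\0' */
            return -1;
        dst[n] = src[n];
        n++;
    }
    if (dst_size == 0)           /* cannot even store the terminator */
        return -1;
    dst[n] = '\0';
    return (long)n;
}
```

The caller passes `sizeof buf`, so the limit follows the buffer if its size is ever changed, and an overlong number is reported instead of silently overrunning the array.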

adrianlopezroche commented 9 years ago

Anyway, I can add it myself, but first I want to make sure I fully understand your numeric_sort function.

adrianlopezroche commented 9 years ago

Comments are fine, but I always look at what the code does rather than what the comments say the code does. That's what I mean by wanting to understand the code.

On Sat, May 30, 2015, 12:07 AM Jody Bruchon notifications@github.com wrote:

I've created a supplementary patch that enlarges the buffers (+6K isn't a big deal these days anyway) and checks for overflow upon each index increment. The revised code also checks an index against the string length before dereferencing a pointer. My comparison function basically works by stripping any sequences of zeroes that aren't part of a number, then performing various checks to decide whether one number is larger or smaller than the other. The code is heavily commented; an overview is as follows:

  • Loop over both names until the end of one is reached
  • For every char moved past at any point in the process, increment a length counter for later comparisons
  • If any zeroes are encountered in either string prior to a nonzero numeric char in both strings, skip them entirely (assume leading zeroes or a number of no value and don't sort on it at all)
  • If both chars at the current position in each string are numeric, start copying the strings into temporary strings for comparison (this can probably be optimized away)
  • After copying all numeric pairs to temporary strings, if either of the main string pointers is on a numeric char while the other is not, the one that is numeric is automatically a larger number and the sort terminates
  • If the prior check didn't terminate, the numbers are of equal character length, so compare them directly. This while(1) loop could be replaced with strncmp() if desired; I chose to write it out for inline optimization.
  • If the number strings were equal numbers after all, compare the next char pair as strcmp() would, accounting for length increases (for comparison later). Terminate comparison if there is a difference.
  • If a '\0' is hit and the counted length of either string so far is longer than the other, terminate comparison...
  • ...else if one string's pointer isn't on a '\0' terminate sort...
  • ...and fall through with return 0 because the strings are identical.
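For comparison, here is a compact sketch of a number-aware name comparison that follows the same broad idea (skip leading zeroes, let the longer digit run win, otherwise compare digit by digit) but works in place, without temporary buffers. It is an illustration only: `natural_cmp` is a hypothetical function, and its tie-breaking for numerically equal names (for example file1 versus file001) is a plain strcmp() fallback, which may differ from what the patch does.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Compare two names so that embedded numbers sort by value.
 * Illustrative only; not the numeric_sort() from the patch. */
static int natural_cmp(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    const char *a0 = a, *b0 = b;
    while (*a != '\0' && *b != '\0') {
        if (isdigit((unsigned char)*a) && isdigit((unsigned char)*b)) {
            /* Skip leading zeroes of both numbers. */
            while (*a == '0') a++;
            while (*b == '0') b++;
            /* Measure the remaining digit runs. */
            size_t la = 0, lb = 0;
            while (isdigit((unsigned char)a[la])) la++;
            while (isdigit((unsigned char)b[lb])) lb++;
            /* More significant digits means a larger number. */
            if (la != lb) return la > lb ? 1 : -1;
            /* Same digit count: the first differing digit decides. */
            int d = strncmp(a, b, la);
            if (d != 0) return d > 0 ? 1 : -1;
            a += la; b += lb;
        } else {
            if (*a != *b)
                return (unsigned char)*a > (unsigned char)*b ? 1 : -1;
            a++; b++;
        }
    }
    if (*a != '\0') return 1;    /* a has characters left over */
    if (*b != '\0') return -1;   /* b has characters left over */
    /* Numerically equal names: fall back to byte-wise order. */
    int d = strcmp(a0, b0);
    return (d > 0) - (d < 0);
}
```

Because no digits are copied, there is no fixed-size buffer to overflow; the trade-off is that each digit run is scanned twice, once to measure it and once to compare it.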

Patch code for v1->v2:

From 7cde92bc14f21fdcd8a97641b3c9e9576f8186d6 Mon Sep 17 00:00:00 2001
From: Jody Bruchon <jody@c02ware.com>
Date: Fri, 29 May 2015 14:28:09 -0400
Subject: [PATCH 2/2] Minor bug fixes for numeric sort

---
 fdupes.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/fdupes.c b/fdupes.c
index e82699b..7db2640 100644
--- a/fdupes.c
+++ b/fdupes.c
@@ -804,31 +804,39 @@ static inline int sort_pairs_by_mtime(file_t *f1, file_t *f2)
 #define IS_NUM(a) (((a >= '0') && (a <= '9')) ? 1 : 0)
 static inline int numeric_sort(char *c1, char *c2)
 {
-  char num1[1024], num2[1024];
+  char num1[4096], num2[4096];
   char *n1, *n2;
   int nlen1 = 0, nlen2 = 0;
   int i, len;

   /* Numerically correct sort */
   while (*c1 != '\0' && *c2 != '\0') {
+    /* Reset number length counters */
+    nlen1 = 0; nlen2 = 0;
+
     /* Skip all sequences of zeroes */
     while (*c1 == '0') {
       nlen1++;
       c1++;
+      if (nlen1 == 4096) goto strlen_overflow;
     }
     while (*c2 == '0') {
       nlen2++;
       c2++;
+      if (nlen2 == 4096) goto strlen_overflow;
     }
+
+    /* If both chars are numeric, do a numeric comparison */
     if (IS_NUM(*c1) && IS_NUM(*c2)) {
       n1 = num1; n2 = num2;
-      /* Terminate comparison on non-numeric */
+      /* Terminate comparison on non-numeric chars */
       while (IS_NUM(*c1) && IS_NUM(*c2)) {
         /* Copy numbers to strings */
         *n1 = *c1; *n2 = *c2;
         n1++; n2++;
         nlen1++; nlen2++;
+        if (nlen1 == 4096 || nlen2 == 4096) goto strlen_overflow;
         c1++; c2++;
       }
@@ -844,10 +852,10 @@ static inline int numeric_sort(char *c1, char *c2)
       i = 0;
       len = (uintptr_t)n1 - (uintptr_t)num1;
-      /* Compare the numbers */
+      /* Compare the number strings */
       while (1) {
         /* Skip runs of equal digits */
-        while ((*(num1 + i) == *(num2 + i)) && i < len) i++;
+        while (i < len && (*(num1 + i) == *(num2 + i))) i++;
         /* If we run out of digits, numbers are identical */
         if (i == len) break;
@@ -862,6 +870,7 @@ static inline int numeric_sort(char *c1, char *c2)
     if (*c1 == *c2) {
       c1++; c2++;
       nlen1++; nlen2++;
+      if (nlen1 == 4096 || nlen2 == 4096) goto strlen_overflow;
     } else if (*c1 > *c2) return 1;
     else return -1;
   }
@@ -871,7 +880,11 @@ static inline int numeric_sort(char *c1, char *c2)
   if (nlen1 > nlen2) return 1;
   if (*c1 == '\0' && *c2 != '\0') return -1;
   if (*c1 != '\0' && *c2 == '\0') return 1;

+  return 0;
+strlen_overflow:
+  /* If a buffer limit is reached, don't change order */
+  fprintf(stderr, "warning: a number was too long for numeric_sort()\n");
   return 0;
 }
--
2.2.1


adrianlopezroche commented 9 years ago

Yeah. What I'm saying is I have to review the code itself -- to run it in my head. :)

On Sat, May 30, 2015, 10:49 AM Jody Bruchon notifications@github.com wrote:

Essentially, files are sorted by numeric value first (with leading zeroes ignored), then by name length. If at any point a non-numeric character is introduced in one name but not the other, the numeric one is placed later in the sort. If numbers in names are equal in value, the fallback behavior is essentially the same as strcmp(). Perhaps an output difference will be helpful. This is the prior fdupes strcmp()-based --order=name output compared to the patched code behavior:

$ fdupes --order=name .
./file001
./file001a
./file002
./file020
./file021
./file030
./file1
./file10
./file100
./file10a
./file1a2
./file2
./file3

$ ../../fdupes --order=name .
./file1
./file001
./file1a2
./file001a
./file2
./file002
./file3
./file10
./file10a
./file020
./file021
./file030
./file100
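The first listing is what a byte-wise strcmp() ordering yields for these names, since '0' compares lower than '1' and every digit compares lower than a letter. A small sketch to reproduce that baseline, assuming the names are held in an array of C strings (`by_strcmp` and `sort_names_strcmp` are hypothetical helpers, not fdupes functions):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort adapter: elements are `const char *`, compared byte-wise. */
static int by_strcmp(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Sort an array of names in place in plain strcmp() order. */
static void sort_names_strcmp(const char **names, size_t n)
{
    assert(names != NULL);
    qsort(names, n, sizeof *names, by_strcmp);
}
```

Sorting file10, file2, file001, file1 and file100 this way gives file001, file1, file10, file100, file2, matching the relative order of those names in the unpatched output.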
