adrianlopezroche / fdupes

FDUPES is a program for identifying or deleting duplicate files residing within specified directories.

Feature Request -print0 option #82

Closed jasonblewis closed 7 years ago

jasonblewis commented 7 years ago

Some command-line tools take null-terminated file names as input. It would be useful if fdupes could emit the list of duplicate files null-terminated. For example, xargs could make use of this feature (see the xargs options --null or -0).

DEVoytas commented 7 years ago

How would that work? Do you want to have null-separated files in each set of matches, null-separated matching sets or null-separated everything, i.e. all current spaces and new lines replaced by '\0'?

jasonblewis commented 7 years ago

All new lines (between file names) replaced by \0. My recent use case was that I dumped the output of fdupes -f to a file for later examination and processing. It would have been useful if the file names were terminated with \0 for easy piping into xargs.

I think it would add even more versatility to fdupes.

DEVoytas commented 7 years ago

So basically you want the same behaviour as -1 --sameline, but with the \0 character used as the separator instead of spaces, and matching sets would still be separated by a new line. Correct?

jasonblewis commented 7 years ago

I think that all new lines should also be replaced with a \0. Imagine a scenario where file names have \n in them: with \0 you don't have to worry about that, and piping the output into other tools that understand null-terminated file names will just work.
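To make the newline concern concrete, here is a minimal Python sketch. It does not call fdupes; the file names are made up for illustration, and `encode` and `decode` are hypothetical helpers. It only shows that a newline-terminated list cannot be split back reliably when a name contains \n, while a \0-terminated list can.

```python
# Hypothetical file names; the second one contains an embedded newline.
names = [b"photo.jpg", b"weird\nname.txt", b"notes.txt"]

def encode(names, sep):
    """Join names so that each one is terminated by `sep`."""
    return b"".join(name + sep for name in names)

def decode(stream, sep):
    """Split a terminated stream back into names, dropping the empty tail."""
    parts = stream.split(sep)
    return parts[:-1] if parts and parts[-1] == b"" else parts

# Round-trip the same list through both separators.
newline_round_trip = decode(encode(names, b"\n"), b"\n")
nul_round_trip = decode(encode(names, b"\0"), b"\0")

# Number of names going in, then coming out of each round trip.
print(len(names), len(newline_round_trip), len(nul_round_trip))
```

The newline round trip yields four entries instead of three, because the embedded \n is indistinguishable from a separator. The \0 round trip returns the original list.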

DEVoytas commented 7 years ago

In that case, how would you know which file belongs to which set ? Usually you use fdupes to identify one "original" file and remove duplicates. If you separate everything with a \0, you get one unified list of files and have no way of knowing which file is the same as the other.
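One way to picture the problem, and a possible way around it, is the Python sketch below. The encoding is purely an assumption for illustration, not something fdupes or any patch in this thread is known to implement: every file name is terminated by \0, and one extra empty record closes each set. File names cannot be empty, so the empty record is unambiguous. `encode_sets` and `decode_sets` are hypothetical names.

```python
def encode_sets(sets):
    """Serialize a list of duplicate sets into one NUL-delimited byte string."""
    out = bytearray()
    for group in sets:
        for name in group:
            out += name + b"\0"
        out += b"\0"  # empty record: marks the end of this set
    return bytes(out)

def decode_sets(stream):
    """Rebuild the list of sets from the encoding above."""
    sets, current = [], []
    # The final split element is the empty tail after the last NUL.
    for record in stream.split(b"\0")[:-1]:
        if record:
            current.append(record)
        else:
            sets.append(current)
            current = []
    return sets

# Made-up example data: two sets, one name with an embedded newline.
example = [[b"a.txt", b"copy of\na.txt"], [b"b.txt", b"c.txt", b"d.txt"]]
restored = decode_sets(encode_sets(example))
```

A consumer that only wants a flat list would have to skip the empty records, so this shape is not directly usable with `xargs -0`; that trade-off is part of what makes the option hard to define.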

jasonblewis commented 7 years ago

Doesn't the -f option exclude the first one of each dupe? So there is no need to know which set is which if you are trying to delete the dupes but leave one original. I do see your point for other uses, though. Maybe the null option should only work where it makes sense that you'd be sending the entire list to some other tool?

DEVoytas commented 7 years ago

Right, combined with the -f option it would indeed allow you to keep one copy of the file, but you still do not know which one ('first' is just whichever file fdupes happens to find first). Most of the time users want to decide which of the files to keep, so even combined with -f the usefulness is rather limited. And regarding the standalone case, there still seems to be no valid use case IMO.
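The following Python sketch illustrates both halves of this point. It works on an already-parsed list of duplicate sets (made-up data, not real fdupes output). `omit_first` and `nul_terminated` are hypothetical helpers, and the `keep_key` parameter is an assumption added to show how a user-chosen rule could replace "whichever file was listed first".

```python
def omit_first(sets, keep_key=None):
    """Return every file except one 'keeper' per set.

    With keep_key=None the keeper is simply the first file listed.
    Otherwise the keeper is the file with the smallest keep_key value.
    """
    extras = []
    for group in sets:
        ordered = sorted(group, key=keep_key) if keep_key else list(group)
        extras.extend(ordered[1:])
    return extras

def nul_terminated(names):
    """Flatten names into a byte string suitable for a NUL-aware consumer."""
    return b"".join(name + b"\0" for name in names)

# Made-up duplicate sets.
sets = [
    [b"backup/a.txt", b"a.txt"],
    [b"tmp/b 1.txt", b"b.txt", b"old/b.txt"],
]

default_extras = omit_first(sets)               # keeps the first listed file
shortest_kept = omit_first(sets, keep_key=len)  # keeps the shortest path
payload = nul_terminated(default_extras)
```

Writing `payload` to standard output with `sys.stdout.buffer.write(payload)` would give a stream that `xargs -0` can consume. The two result lists differ, which is exactly the concern raised here: the flat list is only as good as the rule that picked the keeper.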

jasonblewis commented 7 years ago

@jbruchon the case where file names might have a \n in them is precisely the case that makes having a null-terminated list useful. Agreed, it's rare, but in the event of needing to de-dupe 100,000s of files it just means the user has one less thing to do: worrying about strange file names.

I agree this is not a common use case, but it is a use case, as I just used fdupes for precisely this problem: 480000 duplicate files due to a software issue. fdupes was very handy for that, but I did have to spend some time checking that the filenames would be safe to pipe into xargs.

llimeht commented 7 years ago

Having files with newlines in the names is a really bad idea anyway, usually only caused by bad shell scripting and programming.

Bug and security databases are littered with cases where incorrect assumptions were made about what was legal in a filename with space and newline being the most commonly missed legal characters. Yes, the shell will split on ' ' and '\n' and failure to handle this can lead to bugs. Yes it's best to avoid such names. At the same time, malicious users seem to have the ability to come up with very creative and interesting ways of exploiting this splitting; standard tools (find, grep, xargs, tar, ...) fix these problems with null separation since null is an illegal character in a POSIX path and it would be great if fdupes could fit better into that ecosystem.

pabs3 commented 7 years ago

I can't seem to convert this issue to a pull request, but please merge the commit I created to implement this feature:

https://github.com/pabs3/fdupes/commit/3dd14ad2ee30dd72ba3fbfc4779488b83a8e8232

DEVoytas commented 7 years ago

@jbruchon made many good points. I guess the main issue with this option is that it's not as easy to define how it should really behave as it is in the find case. This is because find returns a set of elements (files), while fdupes returns a set of sets, with possible extra data (the '-S' option).

jasonblewis commented 7 years ago

This turned into something far more challenging than I imagined. I totally see why it's difficult, while I still think it would be useful for some use cases. Thanks for everyone's input, and even if you don't implement this feature, it was good to hash out.