adrianlopezroche / fdupes

FDUPES is a program for identifying or deleting duplicate files residing within specified directories.
2.42k stars 186 forks source link

Questions about contributing #34

Open clara-j opened 8 years ago

clara-j commented 8 years ago

I am currently working on some changes to fdupes to make it work a bit better with the btrfs de-duplication process duperemove.

I have added some new arguments to limit fize considered based on file size and also skip the final byte to byte verification (btrfs will do this anyway as part of its de-duplication phase).

My question is if there is any specific guidelines I should follow to help with my pull requests being accepted, or things I should avoid?

I also made some minor tweaks to speed things up slightly and to make better use of the getfilestats function.

clara-j commented 8 years ago

Hi, what I built so far is three new options, but seems like you have done most of those changes.

-b limit files considered based on min size (file must be greater than value) -B limit files considered based on max size (file must be less than value) -e to skip byte to byte verification at end

Mine is made to be compatible since the output is passed directly to duperemove so changes to the output could great that flow.

adrianlopezroche commented 8 years ago

Hello Clara,

I like pull requests that deal with a single feature at a time. The reason is there are certain features that don't interest me, such as the option to skip byte-to-byte verification.

As Jody Bruchon has observed, it's been a while since I've accepted any non-trivial pull requests, mostly because I'm very careful about what to include in fdupes (trying to avoid feature bloat, trying to ensure safety, etc.) and can't quite get up the energy to work on it during my time off work.

Having said that, I haven't abandoned fdupes and I'll gladly take a look at your -b and -B options.

Thanks.

clara-j commented 8 years ago

Thank you for the reply. If there is anything you require that I adjust to have you accept the pull let me know.

I hope you can revisit the skipping of the byte verification since it is very useful as an optimal setting for cases where the results will be verified down stream as is the case with BTRFS

Other changes that I was going to add is an option to prevent the process from going into other devices when in recursive mode at the request of someone else in the BTRFS tools thread. That one should be very simple as most of the code is already there for it already to handle hard link checking.

PS thanks for the work on the tool as either way it is a great starting point for what I wanted.

vsiegel commented 6 years ago

Hello @adrianlopezroche, there is a comparison of the md5 hashes before the byte by byte comparison, right? I think that byte comparison is not necessary. Not in the sense "it is not really important", but in the sense "it adds no practical value" - based on the probability of md5 hash collisions. I do not have the numbers at hand (let me know in case you calculated the collision probability).

I could check with the encryption experts or statisticians on stackexchange if that would be helpful. Assuming I find convincing support for my assumption, I would propose to skip the byte to byte comparison entirely.

Does that make any sense to you, or are you already sure for whatever reason, that you - independent of the probability of a hash collision - do not want that change?

The relevant stackexchange sites are: https://crypto.stackexchange.com/ https://stats.stackexchange.com/ https://stackoverflow.com/ (My background is not statistics, but a master in computer science)

adrianlopezroche commented 6 years ago

MD5 is considered broken. Collisions have already been found. I will not get rid of the byte-for-byte comparison.

https://www.mscs.dal.ca/~selinger/md5collision/

On Mon, Feb 12, 2018, 2:04 AM vsiegel notifications@github.com wrote:

Hello @adrianlopezroche https://github.com/adrianlopezroche, there is a comparison of the md5 hashes before the byte by byte comparison, right? I think that byte comparison is not necessary. Not in the sense "it is not really important", but in the sense "it adds no practical value" - based on the probability of md5 hash collisions. I do not have the numbers at hand (let me know in case you calculated the collision probability).

I could check with the encryption experts or statisticians on stackexchange if that would be helpful. Assuming I find convincing support for my assumption, I would propose to skip the byte to byte comparison entirely.

Does that make any sense to you, or are you already sure for whatever reason, that you - independent of the probability of a hash collision - do not want that change?

The relevant stackexchange sites are: https://crypto.stackexchange.com/ https://stats.stackexchange.com/ https://stackoverflow.com/ (My background is not statistics, but a master in computer science)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/adrianlopezroche/fdupes/issues/34#issuecomment-364835671, or mute the thread https://github.com/notifications/unsubscribe-auth/AF8J_ZSGgxnuiKxckA3s1gYCp1qa-jklks5tT9RxgaJpZM4F-XKL .