adrianlopezroche / fdupes

FDUPES is a program for identifying or deleting duplicate files residing within specified directories.

fdupes gets stuck #166

Closed: forReason closed this issue 2 years ago

forReason commented 2 years ago

I have tried fdupes on 2 data storage machines. On one machine it gets stuck at 80%, on the other it gets stuck at 60%:

sudo fdupes -r /mnt -m
Progress [1396/2304] 60%
omkarium commented 2 years ago

@forReason I observed this same behaviour. Did you find any solution to this? Is this fixed?

forReason commented 2 years ago

nope. I use a different library now.

adrianlopezroche commented 2 years ago

Unfortunately, there's little I can do on my side of things unless I can find a way to reproduce the issue.

omkarium commented 2 years ago

@forReason @adrianlopezroche So, I tested this today with a bunch of files: 44 GiB in total, with some files up to 5 GiB. I think part of the reason fdupes seemed stuck for me is that it was processing a 5 GiB file, and that simply takes time. Unfortunately, fdupes' progress bar doesn't show how far along it is in the particular file it is working on at any given moment. So I used --maxsize with the limit set to 1 GiB and skipped all the big ones, and fdupes didn't get stuck anymore. At least in my case, I can no longer say that fdupes was stuck even on that 5 GiB file; I just should have waited a bit longer.

@forReason why don't you try recreating the issue once more, this time skipping any big files? Otherwise, I think it would be fine to close this issue.
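
For reference, the size-limited run looked roughly like this (reconstructing the flag from memory, so double-check fdupes --help for the exact spelling and units on your version):

# sketch of the workaround described above: summarize duplicates under /mnt,
# skipping files larger than ~1 GiB (--maxsize appears to take a size in bytes)
sudo fdupes -r -m --maxsize=1073741824 /mnt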

forReason commented 2 years ago

@leogitpub, it seems reproducible to me in the same way. I am trying to dedupe some 200TB filled with 100GB files, so there is no chance to skip the large files: they are all in the range of 95-105GB each. I wouldn't even need to compare their contents; it would be enough to check whether any given file exists in several locations, such as on disk 2 and on disk 17 (matched by file name, maybe also by file size).
My file structure:

Disk 1
{
    file 1
    file 2 (!!!)
    file 3
}
Disk 2
{
    file 4 
    file 5
    file 6
}
Disk 3
{
    file 2 (!!!)
    file 7
    file 8
}
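
Something as simple as the following would already cover my case (a rough sketch, assuming GNU find; it lists name-and-size combinations that occur more than once anywhere under /mnt, without reading any file contents):

# rough sketch: print "basename size" for every file under /mnt,
# then report the combinations that appear more than once
find /mnt -type f -printf '%f %s\n' | sort | uniq -d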
omkarium commented 2 years ago

Hmm, bummer. It would have been great if fdupes had something like --partial-hash, just to find duplicates based on the first few bytes. jdupes seems to offer this feature. Yesterday, I tested 907GiB with both fdupes and jdupes, in both cases limiting the file size to 1GiB. It took 3 hours for fdupes to finish, with disk usage averaging 15MB/s, whereas with jdupes I used the -TT option and it took barely 10 minutes.
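
The jdupes run was roughly the following (flag spelling from memory, and with the size limit omitted here; -T has to be given twice to enable partial-hash matching, and note the maintainer's warning about it further down):

# sketch of the jdupes run mentioned above: -TT compares only the first
# block of each file, so its matches are NOT safe to delete blindly
jdupes -r -TT /path/to/data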

But still, I strongly suspect fdupes only looks like it is stuck because you are testing it with such huge files. If you check your disk usage, it will probably be reading at around 40MB/s. I don't think fdupes or jdupes is really meant for files this large.

@adrianlopezroche Hi, any plan on adding a --partial-hash feature?

jbruchon commented 2 years ago

@leogitpub Partial-hash-only matching is extremely dangerous. Don't do it unless you like losing your data. FYI, I maintain jdupes, the program referenced, and I put big fat warnings on that option because it is not safe.

@adrianlopezroche One of the things I added to jdupes was a percentage complete indicator for the current hashing or comparison operation so that huge files don't look like a hang/freeze to the user. Feel free to steal my code.

adrianlopezroche commented 2 years ago

"Hi, any plan on adding a --partial-hash feature?"

I have no intention of adding such a feature, but I do think I should be updating the overall progress indicator not just between files but also as files are read (e.g. as a percentage of total bytes processed).

forReason commented 2 years ago

Is it a possibility to add comparing files by name only? There is no way I can scan all my files with fdupes, and there is no reason (for me) to compare them at the byte level. My library is slowly but steadily approaching 1PB. My archive files are named in a predictable and unique manner ("Archive-2022-06-27-3:27:22PM"). The danger of duplicates arises when I add further disks and need to copy data around in order to optimize disk space.

adrianlopezroche commented 2 years ago

@forReason, a filename-only comparison is beyond the scope of what fdupes is designed to do.

forReason commented 2 years ago

Closing this issue, as fdupes is not the right tool for my use case.