Closed: forReason closed this issue 2 years ago.
@forReason I observed this same behaviour. Did you find any solution to this? Is this fixed?
nope. I use a different library now.
Unfortunately, there's little I can do on my side of things unless I can find a way to reproduce the issue.
@forReason @adrianlopezroche So, I tested this today with a bunch of files: 44 GiB in total, with some files up to 5 GiB. I think part of the reason fdupes seemed stuck for me is that it was processing a 5 GiB file, and that simply takes time. Unfortunately, fdupes' progress bar doesn't show the percentage for the individual file it is working on at that moment. So I used --maxsize with a limit of 1 GiB and skipped all the big files, and fdupes didn't get stuck anymore. At least in my case, I can no longer say that fdupes was stuck even on that 5 GiB file; I probably just needed to wait a bit longer.
@forReason why don't you try reproducing the issue once more, this time skipping any big files? If it no longer hangs, I think it would be fine to close this issue.
@leogitpub, it seems reproducible for me in the same way.
I am trying to dedupe some 200 TB filled with 100 GB files. There is no way to skip large files, as they are all in the range of 95-105 GB each. I wouldn't even need to compare their contents: it would be enough to check whether any given file exists in multiple locations, such as on disk 2 and on disk 17 (matched by file name, maybe also by file size).
My file structure:

```
Disk 1
{
    file 1
    file 2 (!!!)
    file 3
}
Disk 2
{
    file 4
    file 5
    file 6
}
Disk 3
{
    file 2 (!!!)
    file 7
    file 8
}
```
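A name-only scan like the one described above doesn't need fdupes at all; it can be done with a short standalone script. A minimal sketch in Python, where the mount points under `__main__` are hypothetical placeholders:

```python
#!/usr/bin/env python3
"""Report files whose *name* appears in more than one location.
Filename-only matching, as discussed above; no file contents are read."""
import os
from collections import defaultdict

def duplicate_names(roots):
    """Walk each root and collect paths grouped by bare filename."""
    seen = defaultdict(list)  # filename -> list of full paths
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                seen[name].append(os.path.join(dirpath, name))
    # keep only names that occur in more than one place
    return {name: paths for name, paths in seen.items() if len(paths) > 1}

if __name__ == "__main__":
    # hypothetical mount points; replace with the real disk paths
    dupes = duplicate_names(["/mnt/disk1", "/mnt/disk2", "/mnt/disk17"])
    for name, paths in sorted(dupes.items()):
        print(name)
        for p in paths:
            print("   ", p)
```

Because only directory metadata is touched, this runs in seconds even on hundreds of terabytes; extending the grouping key to `(name, size)` via `os.path.getsize` would add the size check mentioned above.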
Hmm, bummer. It would have been great if fdupes had something like --partial-hash to find duplicates based only on the first few bytes. jdupes seems to offer this. Yesterday I tested 907 GiB with both fdupes and jdupes, in both cases limiting the file size to 1 GiB. It took 3 hours for fdupes to finish, with disk usage averaging 15 MB/s, whereas with jdupes I used the -TT option and it took barely 10 minutes.
But I still strongly suspect that fdupes only looks stuck because you are testing it with such HUGE files. If you check your disk usage, it will be doing some 40 MB/s. I don't think fdupes or jdupes is really meant for files this large.
@adrianlopezroche Hi, any plan on adding a --partial-hash feature?
@leogitpub Partial hash only matching is extremely dangerous. Don't do it unless you like losing your data. FYI I maintain jdupes, the program referenced, and I put big fat warnings on that option because it's not safe.
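The danger is easy to see in a toy example: two files that share a long common prefix produce identical partial hashes even though their full contents differ, so prefix-only matching would wrongly flag them as duplicates. A minimal illustration (not tied to fdupes or jdupes internals; the 4 KiB prefix length is an assumption):

```python
"""Why first-bytes-only matching is unsafe: identical prefixes,
different files. Purely illustrative."""
import hashlib

PREFIX = 4096  # bytes a hypothetical --partial-hash would compare

# two files with the same 4 KiB header but different payloads
file_a = b"\x00" * PREFIX + b"payload A"
file_b = b"\x00" * PREFIX + b"payload B"

partial_a = hashlib.md5(file_a[:PREFIX]).hexdigest()
partial_b = hashlib.md5(file_b[:PREFIX]).hexdigest()
full_a = hashlib.md5(file_a).hexdigest()
full_b = hashlib.md5(file_b).hexdigest()

assert partial_a == partial_b   # partial hashes collide...
assert full_a != full_b         # ...but the files are not duplicates
```

Container formats with large fixed headers (disk images, VM snapshots, media files) make this scenario common in practice, which is why deleting based on a partial hash alone risks data loss.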
@adrianlopezroche One of the things I added to jdupes was a percentage complete indicator for the current hashing or comparison operation so that huge files don't look like a hang/freeze to the user. Feel free to steal my code.
> Hi, any plan on adding a --partial-hash feature?
I have no intention of adding such a feature, but I do think I should update the overall progress indicator not just between files but also as files are read (e.g. as a percentage of total bytes processed).
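A byte-weighted indicator along those lines could look roughly like the sketch below: progress advances as bytes are read rather than when a file finishes, so one huge file no longer looks like a hang. This is an assumption about the approach, not fdupes' actual implementation; the chunk size and callback are illustrative:

```python
"""Sketch of a byte-weighted progress indicator for bulk file reads."""
import os

CHUNK = 1 << 20  # read in 1 MiB chunks

def process_with_progress(paths, handle_chunk):
    """Feed every chunk of every file to handle_chunk, printing overall
    progress as a percentage of total bytes processed."""
    total = sum(os.path.getsize(p) for p in paths)
    done = 0
    for path in paths:
        with open(path, "rb") as f:
            while True:
                chunk = f.read(CHUNK)
                if not chunk:
                    break
                handle_chunk(path, chunk)  # e.g. update a hash state
                done += len(chunk)
                pct = 100.0 * done / total if total else 100.0
                print(f"\rProgress [{pct:5.1f}%]", end="", flush=True)
    print()
```

With a 5 GiB file among small ones, the percentage keeps moving throughout the large read instead of stalling at one value, which addresses the "looks stuck" reports above.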
Is it a possibility to add comparing files by name only? There is no way I can scan all my files with fdupes, and there is no reason (for me) to scan at the byte level. My library slowly but steadily approaches 1 PB. My archive files are named in a predictable and unique manner ("Archive-2022-06-27-3:27:22PM"). The danger of file duplicates arises when I add further disks and need to copy data around in order to optimize disk space.
@forReason, a filename-only comparison is beyond the scope of what fdupes is designed to do.
Closing this issue, as my application is not the right use case for fdupes.
I have tried fdupes on 2 data storage machines; on one machine it gets stuck at 80%, on the other it gets stuck at 60%: