arsenetar / dupeguru

Find duplicate files
https://dupeguru.voltaicideas.net
GNU General Public License v3.0
5.42k stars 415 forks

Multi-core enable plain duplicate check (SSD is underutilized) #285

Open shoffmeister opened 9 years ago

shoffmeister commented 9 years ago

It might be an interesting idea to enable scanning on multiple threads on SSDs.

A scan of my "Documents" directory on Windows 8.1 used only about 250 MB/s of I/O bandwidth, while the SSD itself can deliver up to 500 MB/s on reads. At the same time, CPU utilization was only 18%.

This suggests that the algorithm used for gathering does not take a divide-and-conquer approach. It would be possible to have, for instance, a "file to read" (hash?) queue, and then have multiple CPU threads consume the entries, scaling up to .

As SSDs have effectively no seek cost, this will max out either the CPUs or the I/O subsystem (most likely the I/O subsystem).
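To make the queue idea concrete, here is a minimal sketch in Python. It is not dupeGuru's actual code: the names `hash_file` and `find_duplicates` are hypothetical, SHA-256 is an arbitrary choice of digest, and defaulting the worker count to `os.cpu_count()` is an assumption. The thread pool's internal work queue plays the role of the "file to read" queue. Threads can help here because file reads and `hashlib` updates on large buffers release the GIL.

```python
import hashlib
import os
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def hash_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fp:
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths, workers=None):
    """Hash files on a thread pool and group paths with identical digests.

    `workers` defaults to the CPU count, which is an assumption of this
    sketch and not a tuned value.
    """
    paths = list(paths)  # materialize so paths can be paired with results
    workers = workers or os.cpu_count() or 1
    groups = defaultdict(list)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields digests in the same order as the input paths.
        for path, digest in zip(paths, pool.map(hash_file, paths)):
            groups[digest].append(path)
    # Only digests shared by more than one file are duplicates.
    return [group for group in groups.values() if len(group) > 1]
```

Calling `find_duplicates(list_of_paths, workers=4)` returns a list of groups, each holding the paths whose contents hash to the same digest. Whether this actually saturates the SSD depends on the drive, the file sizes, and the worker count, so it would need measuring.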

Doing this on rotating media might not be a good idea. But what about a massive RAID configuration, or something going over GigE or fiber to some remote SAN / NAS ...?

SEVENID commented 4 years ago

Duplicate of #240 ?