birkenfeld / fddf

Fast data dupe finder
Apache License 2.0
109 stars 9 forks

Low throughput #6

Open StyMaar opened 6 years ago

StyMaar commented 6 years ago

I'm running fddf on Debian Jessie, and the I/O read rate (shown by iotop) never goes above 3 MB/s. The task isn't CPU-bound either: ~25% on both cores. By comparison, ls -R reads between 10 and 15 MB per second, as does rsync on the same workload.

The directory I'm running fddf on contains a lot of small files (text files), a large number of medium-sized files (pictures or MP3s), and a decent number of big files (movies or .iso images).

I have no idea how file I/O works on Linux, so I don't know how to speed this up.

StyMaar commented 6 years ago

I'm running fdupes right now to have an "apples to apples" comparison.

StyMaar commented 6 years ago

Oh, I forgot to mention which version I was running: master (b2da1856bb407339f2f8737f19bed42954d33286), built with Rust 1.19 (cargo build --release).

StyMaar commented 6 years ago

fdupes is pretty irregular, but faster (1-10 MB/s).

StyMaar commented 6 years ago

The raw figures aren't really interesting (it's a RAID array with encryption, which slows things down), but I think the difference relative to other tools is relevant.

StyMaar commented 6 years ago

Increasing the number of threads in the thread pool (I arbitrarily chose 20) helped me reach 10 MB/s during the first part of the process (walking the directories and hashing files); during the second part (exact file comparison) I'm currently around 40 MB/s. For the second part, I don't really know whether increasing the thread count changed anything.
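For reference, the effect described above can be sketched with a plain std-only worker pool (this is an illustration, not fddf's actual implementation; the function names, the FNV-1a stand-in hash, and the `n_threads` parameter are all hypothetical). The idea is that each worker pulls the next job from a shared queue, so a larger `n_threads` keeps more reads in flight on a disk that benefits from deeper I/O queues:

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// Stand-in for hashing a file's contents (fddf uses a real file hash;
// FNV-1a over an in-memory buffer is just illustrative).
fn fnv1a(data: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in data {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

// Hash all items using `n_threads` workers pulling jobs from a shared
// queue. Raising n_threads (e.g. to 20, as tried above) keeps more
// work in flight while individual reads stall on the disk.
fn hash_parallel(items: Vec<Vec<u8>>, n_threads: usize) -> Vec<u64> {
    let (job_tx, job_rx) = mpsc::channel();
    let job_rx = Arc::new(Mutex::new(job_rx));
    let (out_tx, out_rx) = mpsc::channel();

    // Pre-fill the job queue, then close it so workers can terminate.
    for (i, item) in items.into_iter().enumerate() {
        job_tx.send((i, item)).unwrap();
    }
    drop(job_tx);

    let mut handles = Vec::new();
    for _ in 0..n_threads {
        let job_rx = Arc::clone(&job_rx);
        let out_tx = out_tx.clone();
        handles.push(thread::spawn(move || loop {
            // recv() returns Err once the queue is drained and closed.
            let job = job_rx.lock().unwrap().recv();
            match job {
                Ok((i, data)) => out_tx.send((i, fnv1a(&data))).unwrap(),
                Err(_) => break,
            }
        }));
    }
    drop(out_tx); // keep only the workers' clones alive

    // Collect (index, hash) pairs and restore input order.
    let mut results: Vec<(usize, u64)> = out_rx.iter().collect();
    for h in handles {
        h.join().unwrap();
    }
    results.sort();
    results.into_iter().map(|(_, h)| h).collect()
}

fn main() {
    let items: Vec<Vec<u8>> = (0..8u8).map(|i| vec![i; 1024]).collect();
    let hashes = hash_parallel(items, 20);
    println!("{} hashes computed", hashes.len());
}
```

With real files, each job would be a path and the worker would read and hash the file; whether more threads help the second (byte-by-byte comparison) phase likely depends on how the disk handles concurrent sequential reads.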

Boscop commented 6 years ago

According to this benchmark, my HDD (Seagate HDD 1TB, ST1000LM014) performs best when the number of outstanding I/O operations is 32. Does that mean I should use a thread pool of 32+1 threads?