Very slow file collection - Githubissues

arsenetar / dupeguru

Find duplicate files

https://dupeguru.voltaicideas.net

GNU General Public License v3.0


Very slow file collection #962

Open fotico opened 2 years ago

fotico commented 2 years ago

Describe the bug

File collection is very slow.

To Reproduce

Add an entire drive (>1M files)
Start a normal scan
"Collecting files to scan" step takes 3 hours

Expected behavior

Scanning the drive should finish in under 20 minutes. A full scan of the same drive using WinDirStat, which also collects file sizes, takes about 10 minutes in total.

Desktop

OS: Windows 8.1

chchia commented 2 years ago

I second that.

I deliberately ran three programs at the same time on the same folder, hosted on a Synology share with 2 million files:

WizTree, running on Windows, completed in about 8 minutes.
CZKawka, running on a Linux box, completed in about 10 minutes.
dupeGuru, running on the same Linux box, completed in about 18 minutes.

Somehow dupeGuru is slower at collecting files for the scan.

Dobatymo commented 2 years ago

os.walk should be replaced with os.scandir, and the file size collected at scan time, to save one syscall per file on Windows.
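A minimal sketch of that suggestion follows. It is not dupeGuru's actual code; the function name collect_files is hypothetical. It walks a tree with os.scandir and reads each file's size from the os.DirEntry object. On Windows, DirEntry.stat() for a non-symlink is answered from the directory listing without an extra system call; on POSIX systems it still costs one stat call per file, with the result cached on the entry.

```python
import os
from typing import Iterator, Tuple


def collect_files(root: str) -> Iterator[Tuple[str, int]]:
    """Yield (path, size_in_bytes) for every regular file under root.

    Hypothetical helper for illustration only, not part of dupeGuru.
    """
    stack = [root]  # directories still to visit (iterative, so no recursion limit)
    while stack:
        current = stack.pop()
        try:
            with os.scandir(current) as entries:
                for entry in entries:
                    try:
                        if entry.is_dir(follow_symlinks=False):
                            stack.append(entry.path)
                        elif entry.is_file(follow_symlinks=False):
                            # The size comes from the DirEntry, not a separate os.path.getsize call.
                            yield entry.path, entry.stat(follow_symlinks=False).st_size
                    except OSError:
                        # Entry vanished or is unreadable: skip it.
                        continue
        except OSError:
            # Directory is unreadable or was removed: skip it.
            continue
```

Symlinks are deliberately not followed here, which avoids cycles; a real scanner might make that configurable.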

arsenetar commented 2 years ago

Multiple items affect performance here. os.listdir is still used in some places, and the file and folder classes implement functionality that may now be better left to Python's base classes and methods, so a bit of rework would be needed to really improve performance. Updating the one os.walk call to use os.scandir (which os.walk uses internally) does yield an improvement; however, from some local testing, other parts of the file and folder collection need additional updates before any drastic improvement shows up.

arsenetar commented 2 years ago

I have confirmed with some initial testing that significant performance improvements are possible by rewriting the underlying file and folder classes around os.scandir and the resulting os.DirEntry objects. Depending on the particular operation, I saw a 4x to 10x improvement in speed. It will take a while to pull these sorts of changes in, as other updates are needed to make sure all existing functionality remains (my testing focused on the collection and scan portion only).

chchia commented 2 years ago

I am available to test if you need.

Dobatymo commented 2 years ago

Just linking this related feature request https://github.com/arsenetar/dupeguru/issues/959

arsenetar commented 2 years ago

@fotico, @Dobatymo and @chchia if you are interested in building from source, the latest commit https://github.com/arsenetar/dupeguru/commit/efd500ecc1eb604918da3fc01512c502912771d8 has several improvements to the file collection. In testing I am seeing some good improvements in speed and it still seems to work as expected. There is still a bit more that could be done but this seems to be much better.

chchia commented 2 years ago

I confirm the latest source has a much faster scan speed! Thanks! It is a huge improvement.

arsenetar commented 2 years ago

@fotico, @Dobatymo, and @chchia, I pushed another update in https://github.com/arsenetar/dupeguru/commit/c5818b1d1f78be9201c5e3164177361fea0bf629 that adds a preference for profiling scan operations. It logs the number of calls and the time spent within functions when running a scan, and these logs can be used to determine where time is being spent. Based on my testing, I don't think there is a lot left to speed up beyond going to multiple threads (which I am going to put off for now), so I added the ability to get these logs to see what users are experiencing and guide further optimization.
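For readers unfamiliar with this kind of profiling, the sketch below shows the general technique using Python's standard cProfile module. It is an illustration only and does not reproduce the linked commit; profile_to_file and count_entries are hypothetical names, and count_entries merely stands in for a real scan.

```python
import cProfile
import os


def count_entries(root: str) -> int:
    """Stand-in workload (hypothetical): count directory entries under root."""
    total = 0
    for _dirpath, dirnames, filenames in os.walk(root):
        total += len(dirnames) + len(filenames)
    return total


def profile_to_file(func, *args, output_path: str):
    """Run func(*args) under cProfile and write the raw stats to output_path."""
    profiler = cProfile.Profile()
    # runcall records call counts and time spent for every function reached from func.
    result = profiler.runcall(func, *args)
    # dump_stats writes a binary file that pstats or a viewer can load later.
    profiler.dump_stats(output_path)
    return result
```

Calling profile_to_file(count_entries, ".", output_path="scan.profile") would return the entry count and leave a scan.profile file behind for later inspection.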

chchia commented 2 years ago

Thank you. How do I read the contents of the .profile file?

arsenetar commented 2 years ago

@chchia, sorry, I probably should have provided some information on that. The logs are created by Python's cProfile profiler, so there are probably several ways to read them. I normally use https://jiffyclub.github.io/snakeviz/ to view them. I'll also note that the profile captures two top-level function calls: get_dupe_groups() from scanner.py, and then either get_files() or get_folders() from directories.py, depending on the scan type.
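One of those other ways is the standard library's pstats module, which loads a cProfile dump and prints a sorted text report. The sketch below assumes the .profile file is a regular cProfile stats dump, as described above; print_top_functions is a hypothetical helper name.

```python
import pstats
import sys
from typing import TextIO


def print_top_functions(profile_path: str, limit: int = 20, stream: TextIO = sys.stdout) -> None:
    """Print the `limit` most expensive functions by cumulative time."""
    stats = pstats.Stats(profile_path, stream=stream)
    # strip_dirs shortens file paths; "cumulative" sorts by time spent in a
    # function including everything it calls, which surfaces the slow entry points.
    stats.strip_dirs().sort_stats("cumulative").print_stats(limit)
```

Sorting by "tottime" instead would highlight functions that are slow in their own body rather than through their callees. Running python -m pstats on the file opens an interactive browser over the same data.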

arsenetar commented 2 years ago

@chchia @Dobatymo @fotico the latest version should be faster under most circumstances. Let me know if you find otherwise.

chchia commented 2 years ago

I confirm the latest version is much faster than the previous one.