fotico opened 2 years ago
Second that.
I intentionally ran 3 programs at the same time on the same Synology shared folder containing 2 million files:
Somehow dupeGuru is slower at collecting files for the scan.
`os.walk` should be replaced with `os.scandir`, and the file size should be collected at scan time to save a syscall per file on Windows.
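The suggestion above can be sketched as follows (a generic example, not dupeGuru's actual code): `os.scandir` yields `os.DirEntry` objects whose `stat()` results on Windows are filled from the directory listing itself, so the size can be collected without an extra per-file syscall.

```python
import os

def collect_files(root):
    """Recursively collect (path, size) pairs using os.scandir.

    On Windows, entry.stat() reuses data already fetched by the
    directory listing, so no extra stat syscall is made per file.
    """
    stack = [root]
    results = []
    while stack:
        path = stack.pop()
        with os.scandir(path) as it:
            for entry in it:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    results.append((entry.path, entry.stat().st_size))
    return results
```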
There are multiple items impacting the performance here: `os.listdir` is still used in some places, and the file and folder classes implement functionality that may now be better left to other Python base classes and methods. Really, there would need to be a bit of rework to improve performance substantially. Updating the one `os.walk` call to `os.scandir` (which `os.walk` uses internally) does yield an improvement; however, from some local testing, it seems other parts of the file and folder collection need additional updates to see any drastic improvement.
I have confirmed with some initial testing that it is possible to see significant performance improvements by rewriting the underlying file and folder classes around `os.scandir` and the resulting `os.DirEntry` objects; depending on the particular operation, I saw a 4x to 10x improvement in speed. It will take a bit to pull these sorts of changes in, as there are other updates needed to make sure all existing functionality remains (my testing focused on the collection and scan portions only).
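One way such a rewrite could look (a hypothetical sketch, not the actual dupeGuru classes; `FileRecord` and `iter_records` are invented names): a lightweight file record built directly from each `os.DirEntry`, so size and mtime come from the cached stat result rather than a fresh `os.stat()` per path.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class FileRecord:
    """Minimal file record built from an os.DirEntry (hypothetical sketch)."""
    path: str
    size: int
    mtime: float

    @classmethod
    def from_entry(cls, entry: os.DirEntry) -> "FileRecord":
        # entry.stat() reuses data cached by os.scandir where possible,
        # avoiding a separate stat syscall per file on Windows.
        st = entry.stat()
        return cls(path=entry.path, size=st.st_size, mtime=st.st_mtime)

def iter_records(root):
    """Yield FileRecord objects for all regular files under root."""
    with os.scandir(root) as it:
        for entry in it:
            if entry.is_dir(follow_symlinks=False):
                yield from iter_records(entry.path)
            elif entry.is_file(follow_symlinks=False):
                yield FileRecord.from_entry(entry)
```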
I will be available to test if you need.
Just linking this related feature request https://github.com/arsenetar/dupeguru/issues/959
@fotico, @Dobatymo and @chchia if you are interested in building from source, the latest commit https://github.com/arsenetar/dupeguru/commit/efd500ecc1eb604918da3fc01512c502912771d8 has several improvements to the file collection. In testing I am seeing some good improvements in speed and it still seems to work as expected. There is still a bit more that could be done but this seems to be much better.
I confirm the latest source has a much faster scan speed! Thanks! It is a huge improvement.
@fotico, @Dobatymo, and @chchia: I pushed another update in https://github.com/arsenetar/dupeguru/commit/c5818b1d1f78be9201c5e3164177361fea0bf629 that adds a preference for profiling scan operations. This logs the number of calls and the time spent within functions when running a scan, which can be used to determine where time is going. Based on my testing, I don't think there is a lot left to speed up beyond going to multiple threads (which I am going to put off for now), so I added the ability to get these logs to see what users are experiencing for further optimization.
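For reference, this is roughly how a scan can be profiled and dumped to a file with Python's `cProfile` (a generic sketch; `run_scan` and the `scan.profile` filename are placeholders, not dupeGuru's API):

```python
import cProfile

def run_scan():
    # Placeholder for the actual scan work being profiled.
    total = 0
    for i in range(100_000):
        total += i
    return total

profiler = cProfile.Profile()
profiler.enable()
run_scan()
profiler.disable()
# Write the stats to a file that tools like snakeviz or pstats can read.
profiler.dump_stats("scan.profile")
```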
Thank you. How do I read the content of the .profile file?
@chchia, sorry, I probably should have provided some information on that. The logs are created by Python's cProfile profiler, so there are several ways to read them. I normally use https://jiffyclub.github.io/snakeviz/ to view them. I'll also note that there are two top-level functions captured by the profile: `get_dupe_groups()` from scanner.py, and then either `get_files()` or `get_folders()` from directories.py, depending on the scan type.
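Besides snakeviz, the standard-library `pstats` module can read these files directly (a generic example; `scan.profile` here is a stand-in filename created on the spot, not an actual dupeGuru log):

```python
import cProfile
import pstats

# Create a small example profile file to read (stand-in for the real log).
cProfile.run("sum(range(100000))", "scan.profile")

# Load the profile and print the 10 entries with the most cumulative time.
stats = pstats.Stats("scan.profile")
stats.sort_stats("cumulative").print_stats(10)
```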
@chchia @Dobatymo @fotico the latest version should be faster under most circumstances. Let me know if you find otherwise.
I confirm the latest version is much faster than the previous one.
Describe the bug
File collection is very slow.
To Reproduce
Expected behavior
Scanning the drive should be done in under 20 minutes. A full scan using WinDirStat on the same drive, which also collects file sizes, takes about 10 minutes in total.