Byron / dua-cli

View disk space usage and delete unwanted data, fast.
https://lib.rs/crates/dua-cli
MIT License

performance: dua-cli full scan takes way longer than gdu #223

Closed glowinthedark closed 5 months ago

glowinthedark commented 5 months ago

Directory scanning with dua i /some/folder takes orders of magnitude longer compared to gdu even when setting -t <some-number-bigger-than-number-of-cpu-cores>.

I didn't run any proper benchmarks, but as an example: by the time dua's progress info shows around 64k scanned files, gdu has reached around 300k+ files on the same folder. dua-cli takes minutes longer to complete a full scan.

The huge speed difference has been observed with APFS (macOS), HFS+ (macOS), exFAT (macOS, Linux), and ext4 (Linux with both armv7/arm64 and Intel CPUs).

Note: dua-cli is still as fast as or faster than ncdu, so apparently it's gdu that does some serious optimizations to speed up the scan. On macOS APFS, a full gdu scan takes less time than calling Apple's out-of-the-box Finder Get Info on the same folder.

Byron commented 5 months ago

Thanks for reporting!

Can you try to use hyperfine and see the impact of the thread-count on performance? Note that I threw in pdu as well as it usually is the fastest way to iterate.

root=<path-to-measure>
hyperfine -N -w1 -M2 "gdu $root" "dua -t1 $root" "dua -t2 $root" "dua -t4 $root" "dua -t8 $root" "pdu $root"

The theory is that dua uses too many threads, which can actually hurt performance on macOS; I noticed that 3 to 4 threads usually give the best performance. Maybe there is a number that brings it closer to gdu. Lastly, pdu is typically faster than dua and I'd expect it to be as fast as gdu or faster. Please note that it has flags for thread counts as well, in case you want to dive deeper if the results are interesting. Also note that this uses the non-interactive version of dua, which uses the same traversal engine under the hood.
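The thread-count sweep above could also be written with hyperfine's parameter list instead of one quoted command per value. This is only a minimal sketch, assuming hyperfine and dua are on PATH and the measured path contains no whitespace; the function name `dua_thread_sweep` and the `DRY_RUN` switch are hypothetical helpers, not part of either tool:

```bash
# Sweep dua's -t value using hyperfine's -L/--parameter-list option.
# $1: directory to measure (assumed to contain no whitespace)
# $2: comma-separated thread counts (defaults to 1,2,4,8)
# With DRY_RUN=1 the command is printed instead of executed.
dua_thread_sweep() {
  root=$1
  threads=${2:-1,2,4,8}
  # hyperfine substitutes {threads} with each value from the list.
  set -- hyperfine -N -w1 -M2 -L threads "$threads" "dua -t{threads} $root"
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}
```

Calling `dua_thread_sweep <path-to-measure> 1,2,3,4,6,8` would then benchmark each thread count in turn; gdu and pdu can still be measured with the original command for comparison.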

glowinthedark commented 5 months ago

@Byron

hyperfine results

linux arm64 ext4 (772.98 GiB total, HDD)

System details

RAM: 4 GB

```bash
$ uname -a
Linux iq 6.1.0-rpi7-rpi-2712 #1 SMP PREEMPT Debian 1:6.1.63-1+rpt1 (2023-11-24) aarch64 GNU/Linux
$ lscpu
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Vendor ID: ARM
Model name: Cortex-A76
Model: 1
Thread(s) per core: 1
Core(s) per cluster: 4
Socket(s): -
Cluster(s): 1
Stepping: r4p1
CPU(s) scaling MHz: 100%
CPU max MHz: 2400.0000
CPU min MHz: 1000.0000
BogoMIPS: 108.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
```
Summary
  'gdu /media/t12/Music' ran
    1.07 ± 0.01 times faster than 'dua -t2 /media/t12/Music'
    1.13 ± 0.00 times faster than 'dua -t4 /media/t12/Music'
    1.31 ± 0.02 times faster than 'dua -t8 /media/t12/Music'
    1.49 ± 0.01 times faster than 'dua -t1 /media/t12/Music'

macos APFS (78.48 GiB total, built-in SSD)

System details

```bash
#uname -a
Darwin NCM38333.local 22.6.0 Darwin Kernel Version 22.6.0: Wed Jul 5 22:21:53 PDT 2023; root:xnu-8796.141.3~6/RELEASE_ARM64_T6020 arm64
Chip: Apple M2 Pro
Total Number of Cores: 12 (8 performance and 4 efficiency)
Memory: 32 GB
```
Summary
  dua -t8 ~/projects ran
    1.08 ± 0.00 times faster than pdu ~/projects
    1.30 ± 0.00 times faster than dua -t4 ~/projects
    1.50 ± 0.01 times faster than gdu ~/projects
    2.16 ± 0.00 times faster than dua -t2 ~/projects
    3.94 ± 0.02 times faster than dua -t1 ~/projects

The non-interactive dua mode is performing great, i.e. dua -t8 ~/projects is very fast on APFS.

The slowness is observed with interactive mode with e.g. dua -t8 i ~/projects which takes almost forever. Not sure what would be the hyperfine command for testing interactive mode as I suppose it probably cannot handle tty mode (?)

Byron commented 5 months ago

Thanks for the measurements, very interesting results!

It's very interesting that gdu manages to be this much faster on Linux, and thread-scaling doesn't seem to do dua much good with -t2 being the best value on a 4-core machine.

On MacOS it scales much better, but the question remains why it's slow in interactive mode.

I have a hunch and implemented a fix in #225, which you are invited to try out. If you'd say that the ~/projects folder has a lot of top-level entries, then my hunch might be true.

Something you could also check is how many threads gdu uses by default - it's entirely unclear to me why it's so much faster on Linux except that maybe it's related to internal inefficiencies during traversal which weigh dua down (see #224). Edit: Maybe it's also related to the HDD being less receptive to the order of traversal or something related to it due to generally higher latencies. Whatever it is that makes it faster on SSD might be what makes it slower on HDD.
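One way to check the thread count of a running scanner on Linux is to count the entries under `/proc/<pid>/task`. A minimal sketch, assuming a Linux system with procfs mounted (it will not work on macOS); `count_threads` is a hypothetical helper name:

```bash
# Print the number of OS threads of a running process (Linux only).
# $1: process ID. Each thread has one entry in /proc/<pid>/task.
count_threads() {
  task_dir="/proc/$1/task"
  [ -d "$task_dir" ] || { echo "no such process: $1" >&2; return 1; }
  ls "$task_dir" | wc -l | tr -d ' '
}
```

With a scan running in another terminal, `count_threads "$(pgrep -n gdu)"` would print the current count. Since gdu is written in Go, this reflects the OS threads of the Go runtime rather than the number of goroutines, so it is only a rough indicator of the parallelism in use.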

PS: I have made a new release with the fix, and would hope it will improve the situation as this is the only guess I had: https://github.com/Byron/dua-cli/releases/tag/v2.27.2 . Should it still not release the handbrakes, you'd probably need to instrument a run, but we'll get there when we get there.

glowinthedark commented 5 months ago

compiling for apple silicon on macos m2 throws an error while running cargo install dua-cli

error[E0446]: crate-private type `FilesystemScan` in public interface
  --> ~/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dua-cli-2.27.2/src/interactive/app/state.rs:42:5
   |
27 | pub(crate) struct FilesystemScan {
   | -------------------------------- `FilesystemScan` declared as crate-private
...
42 |     pub scan: Option<FilesystemScan>,
   |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ can't leak crate-private type

For more information about this error, try `rustc --explain E0446`.
error: could not compile `dua-cli` (bin "dua") due to previous error
error: failed to compile `dua-cli v2.27.2`, intermediate artifacts can be found at `/var/folders/py/73sb2fsj37xbmtkgw111l07w0000gp/T/cargo-installoMoXeN`.
To reuse those artifacts with a future compilation, set the environment variable `CARGO_TARGET_DIR` to that path.

same error when explicitly checking out the tag (both on macos m2 and linux arm64):

git clone https://github.com/Byron/dua-cli.git && cd dua-cli
git checkout tags/v2.27.2
cargo build --release

Tried the Intel x86 binary from the releases; it completes OK:

/tmp/dua-v2.27.2-x86_64-apple-darwin/dua i ~/projects
Sort mode: size descending  Total disk usage: 149.07 GB  
Processed 1743246 entries in 9.81s 

the original m2-binary (v2.20.1 arm64) still shows scanning apparently even after scanning finished (although the number of entries is not identical) 🤔

Entries: 1 in 0s (472/s)  -> scanning <- 149.07 GB  
Entries: 1743248 in 8.99s

Byron commented 5 months ago

> compiling for apple silicon on macos m2 throws an error while running cargo install dua-cli

This is fixed now in main, see #226 .

> the original m2-binary (v2.20.1 arm64) still shows scanning apparently even after scanning finished (although the number of entries is not identical) 🤔

This typically means that it is indeed still scanning, but presumably all threads are stalled. I recommend trying again with a build of the latest version. Let's see.

glowinthedark commented 5 months ago

Pulling, building, and running the latest main now makes dua -t8 i .. finish scanning in about the same time as gdu, with just a ~2..3 second difference on macOS M2 (1744024 entries in 22.25s). On a Linux RPi 5 (arm64, 8 GB RAM), scanning a 765 GB file system tree on an NVMe M.2 drive takes roughly the same time as gdu (723.05 GiB, processed 640603 entries in 5.25s); it's hard to tell the difference.

thank you so much for taking the time to look into this — much appreciated! 🙏

Byron commented 5 months ago

Thanks so much for letting me know, it's much appreciated, too :).

It's great to hear that the fix did indeed work, and that gdu isn't unconditionally faster anymore :).

Closing, as it sounds like this issue is no more.