inga-lovinde opened 10 months ago
Thanks a lot for reporting, a very interesting case!
I agree that more threads at merely 500 entries/s shouldn't be able to hog a whole CPU. Profiling would be needed to see whether the time is spent in the kernel or truly in user space, but for now it's fine to assume it's indeed something in `dua`. The reason I believe this could be the case is `jwalk`, which I don't fully understand and whose parallelization uses `rayon` in a way that I found consumes quite a bit of CPU, to the point where overall performance is impacted.

The next-generation traversal engine fixes the CPU problem, that much I could already validate, even though it's unclear when the underlying `moonwalk` crate will be ready.
I tried to run `dua` several more times with different concurrency, and the interesting thing is that when I launch it with `-t 1`, `top` reports CPU consumption around 1% (of a single core); but when I launch it with `-t 2`, `top` reports CPU consumption around 100%, and the scanning speed is about half of what it is with `-t 1`.

And then the scanning speed grows roughly linearly from there: with 4 threads it's about twice as fast as with 2 threads, and with 200 threads about twice as fast as with 100 threads. CPU consumption is around 110% with 100 threads and around 120% with 200 threads, indicating that the scanning threads themselves are not CPU-bound; it's something in the main thread that is, whenever the `threads` option is not set to 1.
So probably that's the same problem you know about, related to `jwalk` (https://github.com/Byron/dua-cli/blob/v2.27.2/src/common.rs#L220-L229).
I should probably also mention that I observe this behavior in non-interactive mode.
For posterity, here is a profile of a `dua -t2` run. Most of the time is spent in the kernel, and if the profile is to be believed, mostly in `cthread_yield`. Maybe there is a spinlock somewhere that gets hot because most of the time there isn't much to do, or maybe the thread is yielded too often in that case.
This is probably related to how `jwalk` uses `rayon`, but fixing it would require a deeper understanding of both, which I don't have.
To me the next-gen traversal engine will be the solution, and even if not at first, @pascalkuthe will make sure it will be :).
Yeah, the moonwalk traversal engine avoids almost all locking (it's lock-free), but it's very, very hard to get right. Eventually we will solve this :)
I couldn't find any issue describing this scenario, so I'll just leave it here, so that at least it's known.
I'm trying to run `dua` against a remote share (a Hetzner storage box mounted over SMB on a non-Hetzner Linux-based system, to be precise). The file structure is: several million files, ~10 layers deep, with directory sizes ranging from 1 to hundreds of subdirectories, and files always sitting in a leaf directory of 1-2 files (which means there are several million directories as well).

The problem is that I/O for this remote share has really high latency, so with `ncdu` scanning progresses at around 5-10 files per second, and with `dust` at maybe 15-30 files per second.

With `dua`, it seems to scale linearly with the number of threads... at first. So even at 100 threads I'm initially getting better results than at 50 threads, at least in the first seconds of the scan. However, very quickly I get constrained by CPU for some reason, with `dua` consuming a full single CPU core (more specifically, `dua`'s CPU usage in `top` seems to hover around 110%, on a modern AMD CPU with eight physical cores). The performance stabilizes at ~1-1.5 million files per hour (or 300-500 files per second). Which is incredibly better than `ncdu` or `dust`, but it still feels like `dua` is not supposed to be single-core-bound in this scenario, not at 500 files (or subdirectories) per second at least? I understand that non-concurrent parts of the code can be a bottleneck in some scenarios, but maxing out a fast CPU at just 500 files/subdirectories per second does not seem right, so maybe there is something else going on.

(That's on `dua` v2.26.0 from the Alpine package repository.)