Byron / dua-cli

View disk space usage and delete unwanted data, fast.
https://lib.rs/crates/dua-cli
MIT License

Performance is constrained by single-core CPU speed when I/O does not depend on CPU #224

Open · inga-lovinde opened this issue 10 months ago

inga-lovinde commented 10 months ago

I couldn't find any existing issue describing this scenario, so I'll leave it here so that it is at least known.

I'm trying to run dua against a remote share (a Hetzner storage box mounted over SMB on a non-Hetzner Linux-based system, to be precise). The file structure is: several million files, ~10 layers deep, with directories containing anywhere from 1 to hundreds of subdirectories, and files always sitting in leaf directories that hold 1-2 files each (which means there are several million directories as well).

The problem is that I/O for this remote share has really high latency, so with ncdu scanning progresses at around 5-10 files per second, and with dust at maybe 15-30 files per second.

With dua, scanning speed seems to scale linearly with the number of threads... at first. Even at 100 threads I initially get better results than at 50 threads, at least in the first seconds of the scan. However, very quickly I become constrained by CPU for some reason, with dua consuming a full single CPU core (more specifically, dua's CPU usage in top hovers around 110%, on a modern AMD CPU with eight physical cores). The performance stabilizes at ~1-1.5 million files per hour (300-500 files per second). That is vastly better than ncdu or dust, but it still feels like dua shouldn't be single-core-bound in this scenario, at least not at only 500 files (or subdirectories) per second. I understand that non-concurrent parts of the code can become a bottleneck in some scenarios, but maxing out a fast CPU core at just 500 entries per second does not seem right, so maybe something else is going on.

(That's on dua v2.26.0 from Alpine package repository.)
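
As a rough sanity check on these numbers (a back-of-envelope sketch, not a measurement from this thread): if each remote metadata round-trip takes on the order of 100-200 ms, which matches ncdu's 5-10 entries/s on a single connection, then N blocking workers should reach roughly N divided by that latency per second, so ~300-500 entries/s at 100 threads is plausible on the I/O side; the surprising part is the full CPU core burned to get there.

```rust
// Back-of-envelope model for a latency-bound scan: with an average
// per-entry latency of `latency_ms` and `threads` blocking workers,
// ideal throughput is roughly threads * 1000 / latency_ms entries/s.
fn expected_entries_per_sec(threads: u32, latency_ms: f64) -> f64 {
    threads as f64 * 1000.0 / latency_ms
}

fn main() {
    // ~150 ms per entry is an assumed figure, roughly consistent with
    // ncdu's observed 5-10 entries/s on a single connection.
    for threads in [1u32, 50, 100, 200] {
        println!(
            "{threads:>3} threads -> ~{:.0} entries/s",
            expected_entries_per_sec(threads, 150.0)
        );
    }
}
```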

Byron commented 10 months ago

Thanks a lot for reporting, a very interesting case!

I agree that more threads at merely 500 entries/s shouldn't be able to hog a whole CPU core. Profiling would be needed to see whether the time is spent in the kernel or truly in user space, but for now it's fine to assume it's indeed something in dua. The reason I believe this could be the case is jwalk, whose internals I don't fully understand and whose parallelization uses rayon in a way that I found consumes quite a bit of CPU, to the point where overall performance is impacted.

The next generation traversal engine fixes the CPU problem, that much I could validate already, even though it's unclear when the underlying moonwalk crate will be ready.

inga-lovinde commented 10 months ago

I tried running dua several more times with different concurrency settings, and the interesting thing is that when I launch it with -t 1, top reports CPU consumption of around 1% (of a single core); but when I launch it with -t 2, top reports around 100%, and the scanning speed is roughly half of what it is with -t 1.

From there the scanning speed grows roughly linearly: with 4 threads it's about twice as fast as with 2 threads, and with 200 threads about twice as fast as with 100 threads. CPU consumption with 100 threads is around 110%, and with 200 threads around 120%, which indicates the scanning threads themselves are not CPU-bound; it's something in the main thread that is, whenever the threads option is not set to 1.

So this is probably the same problem you already know about, related to jwalk (https://github.com/Byron/dua-cli/blob/v2.27.2/src/common.rs#L220-L229).
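
As a hedged illustration of what that linked code is about (a sketch of the general pattern, not dua's actual common.rs), the thread count essentially chooses between jwalk's serial walk and a rayon-backed pool, which would line up with -t 1 behaving so differently from -t 2:

```rust
// Sketch only: mapping a `-t N`-style option onto jwalk's parallelism
// modes. Any setting other than 1 hands the walk to a rayon thread pool.
use std::path::Path;

use jwalk::{Parallelism, WalkDir};

fn count_entries(root: &Path, threads: usize) -> usize {
    let parallelism = match threads {
        // Walk on the calling thread; no rayon pool is involved.
        1 => Parallelism::Serial,
        // Spin up a dedicated rayon pool with `n` threads.
        n => Parallelism::RayonNewPool(n),
    };

    WalkDir::new(root)
        .parallelism(parallelism)
        .skip_hidden(false)
        .into_iter()
        .filter_map(Result::ok)
        .count()
}
```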

I should probably also mention that I observe this behavior in non-interactive mode.

Byron commented 10 months ago

For posterity, here is a profile of a dua -t2 run. Most of the time is spent in the kernel, and if the profile is to be believed, mostly in cthread_yield. Maybe there is a spinlock somewhere that gets hot because most of the time there isn't much to do, or maybe the thread is told to yield far too often in that case.
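
To make that hypothesis concrete, here is a tiny standalone example (purely illustrative, not jwalk's or rayon's actual code) of how a busy-yield loop produces exactly this signature: a core pinned near 100%, with the time attributed to the kernel's yield call rather than to useful work.

```rust
// Standalone illustration of the suspected pattern: a worker that
// busy-polls for work and yields on every miss keeps a core busy,
// mostly in kernel time, even though it accomplishes almost nothing.
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::time::Duration;

fn main() {
    let done = Arc::new(AtomicBool::new(false));
    let spinner = {
        let done = Arc::clone(&done);
        std::thread::spawn(move || {
            // Every iteration is a yield syscall, so `top` shows ~100% CPU
            // for this thread while no useful work happens.
            while !done.load(Ordering::Relaxed) {
                std::thread::yield_now();
            }
        })
    };

    // Stand-in for a long, latency-bound scan with little work to hand out.
    std::thread::sleep(Duration::from_secs(5));
    done.store(true, Ordering::Relaxed);
    spinner.join().unwrap();
}
```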

This is probably related to how jwalk uses rayon, but fixing it would require a deeper understanding of both, which I don't have.

To me, the next-gen traversal engine will be the solution, and even if it isn't at first, @pascalkuthe will make sure it becomes one :).

pascalkuthe commented 10 months ago

Yeah, the moonwalk traversal engine avoids almost all locking (it's lock-free), but it's very, very hard to get right. Eventually we will solve this :)
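
For anyone curious what "lock-free" buys here, below is a toy sketch of one common building block for such engines (an assumption about the general technique, not moonwalk's actual design): completion is tracked with a single atomic counter, so idle workers never contend on a mutex while waiting for the walk to finish.

```rust
// Toy sketch of lock-free completion tracking for a parallel traversal:
// each discovered directory bumps a shared atomic counter, each finished
// one decrements it, and the walk is complete when it drops to zero.
use std::sync::atomic::{AtomicUsize, Ordering};

pub struct Pending(AtomicUsize);

impl Pending {
    pub fn new() -> Self {
        // Start at 1 for the root directory that is about to be scanned.
        Pending(AtomicUsize::new(1))
    }

    /// A worker found another directory to scan. Callers must register
    /// subdirectories here before marking their parent as finished.
    pub fn discovered(&self) {
        self.0.fetch_add(1, Ordering::AcqRel);
    }

    /// A worker finished reading a directory; returns true when the
    /// entire traversal is done, with no mutex or condvar involved.
    pub fn finished(&self) -> bool {
        self.0.fetch_sub(1, Ordering::AcqRel) == 1
    }
}
```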