Interactive navigation for tree with millions of files is really slow

inga-lovinde commented 10 months ago

I observe some weird behavior with dua on large trees.

The scanned directory is 13M files total (across all subdirectories). It took dua ~7-8 hours to load it at 200 threads (it's a network mount, as described in #224); memory usage when in interactive mode is ~1.5GB.

I'm in interactive mode, in a directory with 1000 entries (all are subdirectories), with subtree containing 2M entries total. I try to enter a subdirectory, it has under 193 entries (all are subdirectories), with its subtree containing just 2k entries. It takes ~10 seconds. I try to enter its subdirectory, it has 8 entries (all are subdirectories), subtree containing 48 entries. It takes around a second. I try to enter its subdirectory, it has 1 entry (subdirectory), subtree containing 5 entries, 3 levels deep inside of it there are 2 files. Navigating inside of it only takes a (noticeable) fraction of second for every action. Navigating (using the u key) from the subdirectory with subtree containing 5 entries to subtree containing 48 entries takes around a second. Pressing u key there (to get to the subdirectory with subtree containing 2k entries) takes ~10 seconds. Pressing u key again (to get to the subdirectory with subtree containing 2M entries) takes around a minute. Pressing u key again to get to the subdirectory with 8 entries and subtree containing 9M entries takes under a second, so this seems to be related to the number of entries in a directory, not to the size of its subtree.

By "takes XX seconds" I mean that the UI is stuck in the previous state and is completely unresponsive all this time; yet, oddly enough, dua CPU usage hovers around zero all this time.

(For reference, doing ls on the same directories takes under a second.)

That's on dua v2.26.0 from Alpine package repository (Alpine uses musl instead of libc, not sure if that's relevant here).

Byron commented 10 months ago

Thanks for reporting.

What matters here is that lstat calls are very slow, which is run nearly unconditionally whenever an entry is entered. This is merely to detect presence, as it will highlight entries that aren't present anymore in red. This also means it won't detect type changes, i.e. dir to flie or vice-versa.

Maybe there is a fee to be paid when the lstat call goes though longer paths, which is probably why there is a difference between ls and the way this works here.

To fix it, one could probably just add a flag that prevents these lstat checks entirely. Independently of that, one could try to to speed up the procedure by doing a directory listing instead through which metadata might come back faster. But that's to be validated.

I will implement the command-line flag to disable this behaviour, but also encourage you to see if read_dir + DirEntry::metadata() is faster than a bunch of symlink_metadata() calls on the known entries in a directory. If that's the case, a contribution of that algorithm would make it faster for everyone even if that flag isn't used.

Byron commented 10 months ago

When thinking about this use-case I can also think of another improvement: How useful would it be to be able to dump the graph in some format, say JSON, so it could be made accessible by other tooling? Waiting 8 hours for something that is lost when hitting q seems like waste to be avoided.

From there I could absolutely imagine to add additional sub-commands that can handle these snapshot files, to compute changes between them for example. The simplest case would probably be to be able to use the UI to visualise such a snapshot, instead of re-scanning the underlying tree.

Maybe @piotrwach is interested in this as well :).

Byron commented 10 months ago

Mitigated in https://github.com/Byron/dua-cli/releases/tag/v2.28.0 .

gosuwachu commented 10 months ago

@Byron Yes, I may be interested :) Depending on how it is implemented I think dua could then also be used for analysing directory trees from other sources, like for example S3,. E.g. all that would be required is to dump the object listing in the right format:

<name> <size> <mtime>

Such file then could be loaded in dua in interactive mode. What this would potentially also allow is interactively analysing file systems on other machines without actually running dua there: ssh remote-server find /directory/to/scan -t f | dua i --from-traversal

Byron commented 10 months ago

I absolutely love your thoughts! With this feature, dua i would become a front-end for viewing and organizing a graph data structure, which of course can also be created by other tools then.

ssh find… | dua i --from-traversal seems like a separate feature though as that new data structure wouldn't be involved directly. I also don't see how a simple find would produce all the data that's needed per entry, which I find critical to make this 'remote crawler that isn't dua' scenario work (after all, if dua was on the remote one could just use that). However, if that could be made to work, that would definitely be a nice feature to have as well.

For the terminal application input now is just a stream of events, and it doesn't have to care at all about their source anymore.

Byron / dua-cli

Interactive navigation for tree with millions of files is really slow #227