darkred opened this issue 3 years ago
This is maybe a little too fancy for what `b3sum` wants to be (mostly a drop-in replacement for `md5sum` and other Coreutils tools), but that aside, there are also some architectural issues. In most cases, all the hashing work is done by this one library call. If we wanted progress info, we'd need to get it from that `blake3` library call; it's not something `b3sum` can do by itself in this common codepath. But putting some sort of progress channel into that low-level API doesn't feel right to me.
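For context, here's a minimal sketch of roughly what that common codepath boils down to. This is simplified, not the actual b3sum source, and `hash_file` is an invented name; `update_mmap_rayon` requires the blake3 crate's `mmap` and `rayon` features:

```rust
use std::path::Path;

// Roughly the shape of b3sum's common codepath (simplified): the entire
// file is hashed inside the single update_mmap_rayon call, so there's no
// natural place for b3sum itself to hook in progress reporting.
fn hash_file(path: &Path) -> std::io::Result<blake3::Hash> {
    Ok(blake3::Hasher::new().update_mmap_rayon(path)?.finalize())
}
```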
For what it's worth, are you seeing very poor performance? If hashing a file takes multiple seconds, you might be running into these issues:

[issue links]

We plan to fix those with a different multithreading strategy, and it might be that once that fix ships, the need for a progress bar will be a lot lower?
Thanks for your response.
> We plan to fix those with a different multithreading strategy, and it might be that once that fix ships, the need for a progress bar will be a lot lower?
My goal is to use `b3sum` to calculate BLAKE3 hashes of large files (always many GB, up to 100 GB), which are almost always located on a spinning disk, rarely on an SSD.
Regarding performance: for testing, I tried hashing various small files on a spinning disk, and the calculation completes quickly. But every time I try to hash a large file (e.g. 30 GB) on a spinning disk, I quickly hit the 99% memory usage issue, and that's why I don't even let `b3sum` complete; I immediately cancel it (Ctrl+C).
So, even if you fix all the performance issues and the 99% memory usage, I believe that hashing a large file, e.g. a 30 GB file, will always take several minutes on a spinning disk. In that case, showing a progress percentage would be very useful for estimating the ETA of the procedure.
You might want to experiment with `--no-mmap`, or with hashing stdin, which currently has the same effect. That will avoid disk thrashing, and as a side effect you won't see the high apparent memory usage. The downside is that you lose the benefits of multithreading (at least for now), but if you know you're reading from disk and not from cached RAM, then the CPU probably won't be your bottleneck anyway.

Of course, telling users that they need to know about this effect and these flags is a pretty bad look. We've considered disabling mmap on Windows by default for this reason, though it's not exactly clear why Windows seems to suffer more from spinning disks than Linux does.
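For example (`bigfile.bin` is just a placeholder name), either of these avoids the memory-mapped path:

```
$ b3sum --no-mmap bigfile.bin
$ b3sum < bigfile.bin
```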
Thank you, with `--no-mmap` there's no 99% memory usage.

But with a 27 GB file, it took 3 min 20 sec to complete. That is a rather long time to wait.
In that case, a progress percentage would be very useful to have.
Currently, the only way to see how far along the calculation is, is to use OpenedFilesView's "% Position" column on the parent folder:

[screenshot]

Also, please note that my spinning disk is not a slow one: during hash calculation the file is read at 120-150 MB/s. Below is a screenshot from Task Manager:

[screenshot]
I'm not sure what the options are on Windows, but on Linux there's the `pv` tool, which might work for this purpose. Here's an example from my terminal:
```
$ pv /var/tmp/bigfile | b3sum
11.7GiB 0:00:07 [1.62GiB/s] [========>      ] 62% ETA 0:00:04
$ pv /var/tmp/bigfile | b3sum
18.6GiB 0:00:11 [1.64GiB/s] [=============>] 100%
8277aa1fd06b7d0f06a8b3c28569b954a5f598503916cd8dc1b457a67f4dc389 -
$
```
(Note that hashing stdin like this has roughly the same effect as using `--no-mmap`.)
Indeed, `pv` is an excellent Linux tool, thank you for suggesting it!

Too bad I can't find an equivalent for Windows.
I wanted to +1 this. Or, at least, +1 the idea of adding the ability to do progress callbacks when hashing a large amount of data. I'm working on a file hashing and verification tool, similar to `shasum`, `b3sum`, and friends, which has other features like signature creation and verification. I use the `blake3` crate, and I currently display a progress bar that only gets updated after each file is hashed. It would be nice if I could update it continuously as data is being hashed.

All that being said, I feel like progress callbacks really don't belong in a low-level library like `blake3`, so the impact would have to be minimal, and probably behind a feature flag, for it to be worth it.
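For what it's worth, roughly this callback behavior can be approximated today on the caller's side, at least when you feed data in chunks yourself. A minimal sketch; the wrapper type and its `on_progress` callback are invented names here, not blake3 APIs:

```rust
/// Hypothetical caller-side wrapper, not a blake3 API: reports the
/// cumulative number of bytes hashed to a user-supplied callback
/// after every chunk.
struct HasherWithProgress<F: FnMut(u64)> {
    inner: blake3::Hasher,
    bytes_hashed: u64,
    on_progress: F,
}

impl<F: FnMut(u64)> HasherWithProgress<F> {
    fn new(on_progress: F) -> Self {
        Self { inner: blake3::Hasher::new(), bytes_hashed: 0, on_progress }
    }

    fn update(&mut self, chunk: &[u8]) {
        self.inner.update(chunk);
        self.bytes_hashed += chunk.len() as u64;
        (self.on_progress)(self.bytes_hashed); // report progress per chunk
    }

    fn finalize(&self) -> blake3::Hash {
        self.inner.finalize()
    }
}
```

The obvious limitation is that plain `update` is single-threaded; under the same assumptions, `update_rayon` (behind the crate's `rayon` feature) could be swapped in for the multithreaded case.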
This isn't a real answer, but more of a discussion prompt. Here's a quick and dirty demo of how you could use `indicatif::ProgressBar` together with `blake3::Hasher` as it exists today:
```rust
use indicatif::ProgressBar;
use std::env::args;
use std::fs::File;
use std::io::{self, Read}; // `Read` must be in scope to call `self.file.read`
use std::path::Path;

/// A `File` wrapper that advances a progress bar as bytes are read.
struct ProgressFile {
    file: File,
    progress: ProgressBar,
}

impl ProgressFile {
    fn open(path: impl AsRef<Path>) -> io::Result<Self> {
        let file = File::open(path.as_ref())?;
        // Size the progress bar to the file's total length.
        let len = file.metadata()?.len();
        let progress = ProgressBar::new(len);
        Ok(Self { file, progress })
    }
}

impl Read for ProgressFile {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.file.read(buf)?;
        self.progress.inc(n as u64);
        Ok(n)
    }
}

fn main() -> anyhow::Result<()> {
    let path = args().nth(1).unwrap();
    let file = ProgressFile::open(path)?;
    // update_reader drives our Read impl, so the bar advances as data is hashed.
    let hash = blake3::Hasher::new().update_reader(file)?.finalize();
    println!("{hash}");
    Ok(())
}
```
With a big temp file on my machine:
```
$ cargo run --release /tmp/f
    Finished `release` profile [optimized] target(s) in 0.02s
     Running `target/release/scratch /tmp/f`
███████████████████████████████░░░░░░░░░░░░░░░ 3432251392/5000000000
```
Of course the big downside of doing things this way is that `Hasher::update_reader` is single threaded. In a lot of cases, this will be a lot slower than `Hasher::update_mmap_rayon`, the multithreaded approach that `b3sum` uses by default. But a few thoughts about that:

- `b3sum` and `update_mmap_rayon` will usually hit something like 10 GB/s of throughput. That's fast enough that you might not need the progress bar?
- `Hasher::update_reader` will either be ~equally fast (on an SSD) or actually much faster (on an HDD where seeking/thrashing are costly).
- This is a hard tradeoff for `b3sum`, because we have to pick a default behavior, and whatever we pick will be bad for some users. However, if you know you're hashing lots of files at once, you might be able to get the best of both worlds by restricting each file to a single thread and trusting that multi-file parallelism is enough to occupy all your cores; see the sketch below. (I haven't benchmarked this, but hopefully the OSs are better at avoiding thrashing among many files than within a single file?)

Thoughts?
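Here's a rough sketch of that one-thread-per-file idea (invented code, not taken from b3sum; error handling kept minimal):

```rust
use rayon::prelude::*;
use std::fs::File;
use std::path::PathBuf;

// Hash many files concurrently, but keep each individual file on a single
// thread: update_reader is single-threaded, so the parallelism comes
// entirely from rayon spreading whole files across cores.
fn hash_files(paths: &[PathBuf]) -> anyhow::Result<Vec<(PathBuf, blake3::Hash)>> {
    paths
        .par_iter()
        .map(|path| {
            let file = File::open(path)?;
            let hash = blake3::Hasher::new().update_reader(file)?.finalize();
            Ok((path.clone(), hash))
        })
        .collect()
}
```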
I haven't tried this myself either, but since `Hasher` uses Rayon-based multithreading, I wonder if a Rayon-based many-file/directory-tree walk might naturally do the "right" thing: fan out the worker threads to different files as much as possible, but let them work-steal parts of big files if there are no more files to conquer...
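An untested sketch of that work-stealing idea (invented code; `hash_tree` is not a real API): both `par_iter` and `update_mmap_rayon` run on Rayon's global thread pool, so the hope is that workers fan out across files first and only subdivide a single big file when otherwise idle:

```rust
use rayon::prelude::*;
use std::path::PathBuf;

// Each file is hashed with the Rayon-based update_mmap_rayon, while the
// file-level loop is also Rayon-parallel; work-stealing decides how the
// threads split between many small files and chunks of big ones.
fn hash_tree(paths: &[PathBuf]) -> Vec<std::io::Result<blake3::Hash>> {
    paths
        .par_iter()
        .map(|path| Ok(blake3::Hasher::new().update_mmap_rayon(path)?.finalize()))
        .collect()
}
```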
A custom reader is super clever!
> I haven't tried this myself either, but since `Hasher` uses Rayon-based multithreading, I wonder if a Rayon-based many-file/directory-tree walk might naturally do the "right" thing: fan out the worker threads to different files as much as possible, but let them work-steal parts of big files if there are no more files to conquer...
That's an interesting idea. I hadn't considered multi-file parallelism, but it sounds like it might be a good approach when you don't know what level of single-file parallelism is appropriate.
This is a follow-up to my comment here.
My suggestion is to add a progress percentage to `b3sum` when calculating hashes. Currently it doesn't show anything until the hash calculation is finished.

Something like RHash's `--percents`:

[screenshot]

Thank you