darkred opened this issue 3 years ago
This is maybe a little too fancy for what `b3sum` wants to be (mostly a drop-in replacement for `md5sum` and other Coreutils tools), but that aside, there are also some architectural issues. In most cases, all the hashing work is done by this one library call. If we wanted progress info, we'd need to get it from that `blake3` library call; it's not something `b3sum` can do by itself in this common codepath. But putting some sort of progress channel into that low-level API doesn't feel right to me.
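For context, here's a minimal sketch of roughly what that common codepath boils down to. This is simplified, not the actual b3sum source, and `hash_file` is an invented name; `update_mmap_rayon` requires the blake3 crate's `mmap` and `rayon` features:

```rust
use std::path::Path;

// Roughly the shape of b3sum's common codepath (simplified): the entire
// file is hashed inside the single update_mmap_rayon call, so there's no
// natural place for b3sum itself to hook in progress reporting.
fn hash_file(path: &Path) -> std::io::Result<blake3::Hash> {
    Ok(blake3::Hasher::new().update_mmap_rayon(path)?.finalize())
}
```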
For what it's worth, are you seeing very poor performance? If hashing a file takes multiple seconds, you might be running into these issues:

[issue links]

We plan to fix those with a different multithreading strategy, and it might be that once that fix ships, the need for a progress bar will be a lot lower?
Thanks for your response.
> We plan to fix those with a different multithreading strategy, and it might be that once that fix ships, the need for a progress bar will be a lot lower?
My goal is to use `b3sum` to calculate BLAKE3 hashes of large files (always many GB, up to 100 GB), which are almost always located on a spinning disk, rarely on an SSD.
Regarding performance: for testing, I tried hashing various small files on a spinning disk, and the calculation completes quickly. But every time I try to hash a large file (e.g. 30 GB) on a spinning disk, I quickly hit the 99% memory usage issue, and that's why I don't even let `b3sum` complete; I immediately cancel it (Ctrl+C).
So, even if you fix all the performance issues and the 99% memory usage, I believe that hashing a large file, e.g. a 30 GB file, will always take several minutes on a spinning disk. In that case, showing a progress percentage would be very useful for estimating the ETA of the procedure.
You might want to experiment with `--no-mmap`, or with hashing stdin, which currently has the same effect. That will avoid disk thrashing, and as a side effect you won't see the high apparent memory usage. The downside is that you lose the benefits of multithreading (at least for now), but if you know you're reading from disk and not from cached RAM, then the CPU probably won't be your bottleneck anyway.

Of course, telling users that they need to know about this effect and these flags is a pretty bad look. We've considered disabling mmap on Windows by default for this reason, though it's not exactly clear why Windows seems to suffer more from spinning disks than Linux does.
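For example (`bigfile.bin` is just a placeholder name), either of these avoids the memory-mapped path:

```
$ b3sum --no-mmap bigfile.bin
$ b3sum < bigfile.bin
```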
Thank you, with `--no-mmap` there's no 99% memory usage.

But with a 27 GB file, it took 3 min 20 sec to complete. That is a rather long time to wait.
In that case, a progress percentage would be very useful to have.
Currently, the only way to see how far along the calculation is, is to use OpenedFilesView's "% Position" column on the parent folder:

[screenshot]

Also, please note that my spinning disk is not a slow one: during hash calculation the file is read at 120-150 MB/s. Below is a screenshot from Task Manager:

[screenshot]
I'm not sure what the options are on Windows, but on Linux there's the `pv` tool, which might work for this purpose. Here's an example from my terminal:
```
$ pv /var/tmp/bigfile | b3sum
11.7GiB 0:00:07 [1.62GiB/s] [========>      ] 62% ETA 0:00:04
$ pv /var/tmp/bigfile | b3sum
18.6GiB 0:00:11 [1.64GiB/s] [=============>] 100%
8277aa1fd06b7d0f06a8b3c28569b954a5f598503916cd8dc1b457a67f4dc389 -
$
```
(Note that hashing stdin like this has roughly the same effect as using `--no-mmap`.)
Indeed, `pv` is an excellent Linux tool, thank you for suggesting it!

Too bad I can't find an equivalent for Windows.
I wanted to +1 this. Or, at least, +1 the idea of adding the ability to do progress callbacks when hashing a large amount of data. I'm working on a file hashing and verification tool, similar to `shasum`, `b3sum`, and friends, which has other features like signature creation and verification. I use the `blake3` crate, and I currently display a progress bar that only gets updated after each file is hashed. It would be nice if I could update it continuously as data is being hashed.

All that being said, I feel like progress callbacks really don't belong in a low-level library like `blake3`, so the impact would have to be minimal, and probably behind a feature flag, for it to be worth it.
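For what it's worth, roughly this callback behavior can be approximated today on the caller's side, at least when you feed data in chunks yourself. A minimal sketch; the wrapper type and its `on_progress` callback are invented names here, not blake3 APIs:

```rust
/// Hypothetical caller-side wrapper, not a blake3 API: reports the
/// cumulative number of bytes hashed to a user-supplied callback
/// after every chunk.
struct HasherWithProgress<F: FnMut(u64)> {
    inner: blake3::Hasher,
    bytes_hashed: u64,
    on_progress: F,
}

impl<F: FnMut(u64)> HasherWithProgress<F> {
    fn new(on_progress: F) -> Self {
        Self { inner: blake3::Hasher::new(), bytes_hashed: 0, on_progress }
    }

    fn update(&mut self, chunk: &[u8]) {
        self.inner.update(chunk);
        self.bytes_hashed += chunk.len() as u64;
        (self.on_progress)(self.bytes_hashed); // report progress per chunk
    }

    fn finalize(&self) -> blake3::Hash {
        self.inner.finalize()
    }
}
```

The obvious limitation is that plain `update` is single-threaded; under the same assumptions, `update_rayon` (behind the crate's `rayon` feature) could be swapped in for the multithreaded case.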
This isn't a real answer, but more of a discussion prompt. Here's a quick and dirty demo of how you could use `indicatif::ProgressBar` together with `blake3::Hasher` as it exists today:
```rust
use indicatif::ProgressBar;
use std::env::args;
use std::fs::File;
use std::io::{self, Read}; // `Read` must be in scope to call `self.file.read`
use std::path::Path;

/// A `File` wrapper that advances a progress bar as bytes are read.
struct ProgressFile {
    file: File,
    progress: ProgressBar,
}

impl ProgressFile {
    fn open(path: impl AsRef<Path>) -> io::Result<Self> {
        let file = File::open(path.as_ref())?;
        // Size the progress bar to the file's total length.
        let len = file.metadata()?.len();
        let progress = ProgressBar::new(len);
        Ok(Self { file, progress })
    }
}

impl Read for ProgressFile {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.file.read(buf)?;
        self.progress.inc(n as u64);
        Ok(n)
    }
}

fn main() -> anyhow::Result<()> {
    let path = args().nth(1).unwrap();
    let file = ProgressFile::open(path)?;
    // update_reader drives our Read impl, so the bar advances as data is hashed.
    let hash = blake3::Hasher::new().update_reader(file)?.finalize();
    println!("{hash}");
    Ok(())
}
```
With a big temp file on my machine:
```
$ cargo run --release /tmp/f
    Finished `release` profile [optimized] target(s) in 0.02s
     Running `target/release/scratch /tmp/f`
███████████████████████████████░░░░░░░░░░░░░░░ 3432251392/5000000000
```
Of course the big downside of doing things this way is that `Hasher::update_reader` is single threaded. In a lot of cases, this will be a lot slower than `Hasher::update_mmap_rayon`, the multithreaded approach that `b3sum` uses by default. But a few thoughts about that:

- `b3sum` and `update_mmap_rayon` will usually hit something like 10 GB/s of throughput. That's fast enough that you might not need the progress bar?
- `Hasher::update_reader` will either be ~equally fast (on an SSD) or actually much faster (on an HDD where seeking/thrashing are costly).
- This is a hard tradeoff for `b3sum`, because we have to pick a default behavior, and whatever we pick will be bad for some users. However, if you know you're hashing lots of files at once, you might be able to get the best of both worlds by restricting each file to a single thread and trusting that multi-file parallelism is enough to occupy all your cores; see the sketch below. (I haven't benchmarked this, but hopefully the OSs are better at avoiding thrashing among many files than within a single file?)

Thoughts?
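Here's a rough sketch of that one-thread-per-file idea (invented code, not taken from b3sum; error handling kept minimal):

```rust
use rayon::prelude::*;
use std::fs::File;
use std::path::PathBuf;

// Hash many files concurrently, but keep each individual file on a single
// thread: update_reader is single-threaded, so the parallelism comes
// entirely from rayon spreading whole files across cores.
fn hash_files(paths: &[PathBuf]) -> anyhow::Result<Vec<(PathBuf, blake3::Hash)>> {
    paths
        .par_iter()
        .map(|path| {
            let file = File::open(path)?;
            let hash = blake3::Hasher::new().update_reader(file)?.finalize();
            Ok((path.clone(), hash))
        })
        .collect()
}
```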
I haven't tried this myself either, but since `Hasher` uses Rayon-based multithreading, I wonder if a Rayon-based many-file/directory-tree walk might naturally do the "right" thing: fan out the worker threads to different files as much as possible, but let them work-steal parts of big files if there are no more files to conquer...
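An untested sketch of that work-stealing idea (invented code; `hash_tree` is not a real API): both `par_iter` and `update_mmap_rayon` run on Rayon's global thread pool, so the hope is that workers fan out across files first and only subdivide a single big file when otherwise idle:

```rust
use rayon::prelude::*;
use std::path::PathBuf;

// Each file is hashed with the Rayon-based update_mmap_rayon, while the
// file-level loop is also Rayon-parallel; work-stealing decides how the
// threads split between many small files and chunks of big ones.
fn hash_tree(paths: &[PathBuf]) -> Vec<std::io::Result<blake3::Hash>> {
    paths
        .par_iter()
        .map(|path| Ok(blake3::Hasher::new().update_mmap_rayon(path)?.finalize()))
        .collect()
}
```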
A custom reader is super clever!
> I haven't tried this myself either, but since `Hasher` uses Rayon-based multithreading, I wonder if a Rayon-based many-file/directory-tree walk might naturally do the "right" thing: fan out the worker threads to different files as much as possible, but let them work-steal parts of big files if there are no more files to conquer...
That's an interesting idea. I hadn't considered multi-file parallelism, but it sounds like it might be a good approach when you don't know what level of single-file parallelism is appropriate.
This is a follow-up to my comment here.
My suggestion is to add a progress percentage to `b3sum` when calculating hashes. Currently it doesn't show anything until the hash calculation is finished.

Something like RHash's `--percents`:

[screenshot]

Thank you