BLAKE3-team / BLAKE3

the official Rust and C implementations of the BLAKE3 cryptographic hash function
Apache License 2.0

b3sum: digest checksum for multiple files #171

Open daviessm opened 3 years ago

daviessm commented 3 years ago

I have a project with a requirement to output a hash for each file and an overall hash of the whole set of files at the end. This could be achieved with `b3sum file1 file2 file3 > filesums; b3sum filesums`, but a built-in option would be nice too.

I started work on this in https://github.com/daviessm/BLAKE3/commit/d6be56658ec8c0f3075bb3254bc248026aa38a60 but as a Rust newbie I don't know if I'm going down the right path with this.

oconnor663 commented 3 years ago

Because `OutputReader` implements `std::io::Read` and `Hasher` implements `std::io::Write`, you can replace your block loop here with a combination of `Read::take` and `std::io::copy`. (`copy` does use a relatively small 8 KiB internal buffer, and `Hasher` benefits from a larger one if your processor supports AVX-512, but I don't think this is an important optimization here.)
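A minimal sketch of that replacement (the function and variable names here are illustrative, not part of b3sum):

```rust
use std::io::{self, Read};

// Absorb `len` bytes of one hash's extended output into an accumulating
// hasher. `OutputReader` implements `Read` and `Hasher` implements `Write`,
// so `Read::take` plus `io::copy` replaces a manual block loop.
fn absorb_xof(xof: blake3::OutputReader, total: &mut blake3::Hasher, len: u64) -> io::Result<u64> {
    io::copy(&mut xof.take(len), total)
}
```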

That said, there might not be any particular reason to use extended output internally here. It probably makes more sense to just use the default 32-byte output from each individual file, and to only apply the `--length` parameter for the final hash at the end. (Abstracting this away from b3sum and imagining a library API, we'd also need to take this approach if we wanted to expose the final output as an XOF stream, where the number of bytes that'll be read from the stream isn't known up front.)
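A sketch of that simpler shape, assuming a plain list of paths and ignoring filenames for the moment (the next point gets to those):

```rust
use std::{fs::File, io, path::PathBuf};

// Hash each file to its default 32-byte output, feed those into a final
// hasher, and apply any --length-style extended output only at the end.
fn hash_file_set(paths: &[PathBuf], final_len: usize) -> io::Result<Vec<u8>> {
    let mut total = blake3::Hasher::new();
    for path in paths {
        let mut per_file = blake3::Hasher::new();
        io::copy(&mut File::open(path)?, &mut per_file)?;
        total.update(per_file.finalize().as_bytes()); // always 32 bytes per file
    }
    let mut out = vec![0u8; final_len];
    total.finalize_xof().fill(&mut out); // extended output only at the end
    Ok(out)
}
```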

A higher-level design question: Should this sort of thing include the file names in the hash? I think it probably has to. For one thing, the order of the hashes depends on the filenames, so at least some filename changes will change the final output. Also users of this feature presumably would want to use it to prove that an entire directory tree is as it should be, but being able to change (some) filenames without changing the hash would seriously break that use case.

Thinking about how to unambiguously represent a tree structure as a stream of bytes brings up some interesting questions. There are simple ways to do it, JSON for example. But with hashing in the picture, we often want to be sure that we're going to get a stable, canonical byte stream, which isn't something that JSON provides. We've had a couple of threads before about this "structured hashing" question (or whatever I should be calling it):

* #87

* #100

Overall I think it's a very interesting and useful problem to solve, but that it's much trickier than it seems at first glance. It brings up security questions, as well as questions of backwards compatibility and protocol design. I'd like to try to tackle it at some point, but I think that'll end up being a nontrivial project in its own right.

So, coming back down to earth a bit, we probably don't want to solve that general problem in b3sum. To be fair, our file hashes are effectively (string, string) pairs, and hashing those can be simpler than solving the general question. We could probably just concatenate them all, using length suffixes or something like that to disambiguate the boundaries. However, we'd need to think about all the same `Path` <-> `String` invalid Unicode issues that come up in the `--check` feature.
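For example, one simple framing along those lines (length prefixes rather than suffixes, but the same idea; this is just an illustration, not a proposed format):

```rust
// Feed a (filename, hash) pair into the combined hasher with each field
// preceded by its length as a little-endian u64, so pair boundaries stay
// unambiguous no matter what bytes the fields contain.
fn frame_pair(total: &mut blake3::Hasher, name: &[u8], hash: &[u8]) {
    for field in [name, hash] {
        total.update(&(field.len() as u64).to_le_bytes());
        total.update(field);
    }
}
```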

Thoughts?

daviessm commented 3 years ago

> Because `OutputReader` implements `std::io::Read` and `Hasher` implements `std::io::Write`, you can replace your block loop here with a combination of `Read::take` and `std::io::copy`. (`copy` does use a relatively small 8 KiB internal buffer, and `Hasher` benefits from a larger one if your processor supports AVX-512, but I don't think this is an important optimization here.)
>
> That said, there might not be any particular reason to use extended output internally here. It probably makes more sense to just use the default 32-byte output from each individual file, and to only apply the `--length` parameter for the final hash at the end. (Abstracting this away from b3sum and imagining a library API, we'd also need to take this approach if we wanted to expose the final output as an XOF stream, where the number of bytes that'll be read from the stream isn't known up front.)

I can see this making sense both ways: on one hand, a 32-byte output is probably sufficient, but there will undoubtedly be times when project requirements dictate that a larger length be used (I'm imagining government security-related projects here). Of course, there would be no way to know what the internal length of the digest is, so perhaps the point is moot.

> A higher-level design question: Should this sort of thing include the file names in the hash? I think it probably has to. For one thing, the order of the hashes depends on the filenames, so at least some filename changes will change the final output. Also users of this feature presumably would want to use it to prove that an entire directory tree is as it should be, but being able to change (some) filenames without changing the hash would seriously break that use case.

Again, I can see it both ways. In most cases the file names will be important and therefore should be included in the digest, but what if b3sum sometimes needs to run on a case-insensitive filesystem that always returns filenames in lower/upper case? Is that even a thing? Perhaps there should be an option for whether to include file names or not.

> Thinking about how to unambiguously represent a tree structure as a stream of bytes brings up some interesting questions. There are simple ways to do it, JSON for example. But with hashing in the picture, we often want to be sure that we're going to get a stable, canonical byte stream, which isn't something that JSON provides. We've had a couple of threads before about this "structured hashing" question (or whatever I should be calling it):
>
> * #87
>
> * #100
>
> Overall I think it's a very interesting and useful problem to solve, but that it's much trickier than it seems at first glance. It brings up security questions, as well as questions of backwards compatibility and protocol design. I'd like to try to tackle it at some point, but I think that'll end up being a nontrivial project in its own right.

Why do we need to consider it as a tree structure? My thought was to simply add the files to the digest in the order they're generated. Perhaps, with the parallelism issue from #170, the hashes would internally be stored in a tree map as they're generated, to preserve the original ordering; then the tree would be walked and the file hashes appended to the digest sequentially after all files had been individually hashed.
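A rough sketch of that ordering idea (the `rayon` crate is assumed here for the parallel part; all names are illustrative):

```rust
use rayon::prelude::*;
use std::{collections::BTreeMap, fs::File, io, path::PathBuf};

// Hash files in parallel, key each result by its original position, then
// walk the tree map in key order so the combined digest is deterministic.
fn combined_hash(paths: &[PathBuf]) -> io::Result<blake3::Hash> {
    let hashes: BTreeMap<usize, blake3::Hash> = paths
        .par_iter()
        .enumerate()
        .map(|(i, path)| {
            let mut h = blake3::Hasher::new();
            io::copy(&mut File::open(path)?, &mut h)?;
            Ok((i, h.finalize()))
        })
        .collect::<io::Result<_>>()?;
    let mut total = blake3::Hasher::new();
    for hash in hashes.values() {
        total.update(hash.as_bytes()); // BTreeMap iterates in key order
    }
    Ok(total.finalize())
}
```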

> So, coming back down to earth a bit, we probably don't want to solve that general problem in b3sum. To be fair, our file hashes are effectively (string, string) pairs, and hashing those can be simpler than solving the general question. We could probably just concatenate them all, using length suffixes or something like that to disambiguate the boundaries. However, we'd need to think about all the same `Path` <-> `String` invalid Unicode issues that come up in the `--check` feature.

My use-case would be satisfied by being able to pipe the output of `b3sum file1 file2 file3` (or, from #170, `b3sum -r dir_to_hash`) into another b3sum invocation to generate a hash of that string. Of course it would look nicer (to me) if it was all possible in one command 🙂

oconnor663 commented 3 years ago

> there will undoubtedly be times when project requirements dictate that a larger length be used (I'm imagining government security-related projects here)

This is a common point of confusion, and maybe not documented well enough. Output lengths larger than the 32-byte default actually don't make the hash more secure. The underlying reason is that output size is one of the caps on security (so making it shorter does reduce security), but it's not the only cap. Security is also capped by the amount of internal state carried forward between invocations of the compression function, which is always 32 bytes, regardless of the target output length (which might not even be known at compression time).
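A quick way to see this: extended outputs are extensions of the same hash, not a different, "stronger" one. The first 32 bytes of the XOF stream equal the default output:

```rust
fn main() {
    let mut hasher = blake3::Hasher::new();
    hasher.update(b"some input");
    let default_hash = hasher.finalize();

    // Read 64 bytes of extended output; the first 32 match the default hash.
    let mut extended = [0u8; 64];
    hasher.finalize_xof().fill(&mut extended);
    assert_eq!(default_hash.as_bytes()[..], extended[..32]);
}
```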

Rudxain commented 2 years ago

Why not use tar? Or am I missing something? Do different implementations of tar have differences that cause tarballs to differ slightly even when generated from the same directory?

daviessm commented 2 years ago

For my original use-case, it's to prove that files have not been modified during a transfer between systems. Adding them to an archive obviously changes what gets hashed; we need to generate the checksums against the original file structure.

Rudxain commented 2 years ago

> Adding them to an archive obviously changes what gets hashed; we need to generate the checksums against the original file structure

I see now. Thank you for clarifying.

polarathene commented 2 years ago

Just in case it interests anyone landing here: to get a checksum over a range of files' individual checksums, you can use `find` and pair it with `sort` to ensure a deterministic list of checksums to pass to the 2nd b3sum as input from stdin (the 1st b3sum call is provided with N filepaths).

```bash
find ./target-folder -type f -exec b3sum {} + | sort | b3sum --no-names
```

That will recursively search for all files under the relative path `./target-folder` and append them to the `-exec` command `b3sum`.

The `+` after `{}` is to ensure all file paths are provided to a single b3sum call, instead of multiple calls to b3sum for each individual file path returned by `find`.

Original version (more verbose):

```bash
find ./target-folder -type f -exec b3sum file1 another/file {} + \
  | LC_ALL=C sort \
  | b3sum \
  | awk '{ print $1 }'
```

`awk` just extracts the digest value from the final output (_alternatively use `b3sum --no-names`_). This example also demonstrates inclusion of two static filepaths (`file1`, `another/file`) from locations outside of `./target-folder`, if relevant (_I'm using it for deriving a cache key for CI, based on partial contents of a git checkout_). The rest of the functionality is described well in [this article](https://www.baeldung.com/linux/directory-md5-checksum) that I referenced, especially the use of `LC_ALL=C` with `sort`. I'm not sure if `LC_ALL=C` is necessary here, since sorting would be on hexadecimal chars only, the digest being first. I know that it's [more relevant when sorting by filepaths](https://github.com/strimzi/strimzi-kafka-operator/issues/6899), but perhaps some locales have a different sort order between numeric and alpha characters?

gregl83 commented 1 year ago

I had a similar requirement but only needed the final hash.

Wrote paq using the `blake3` crate.

Supports Linux, MacOS, and Windows.

xczh commented 1 year ago

> Just in case it interests anyone landing here: to get a checksum over a range of files' individual checksums, you can use `find` and pair it with `sort` to ensure a deterministic list of checksums to pass to the 2nd b3sum as input from stdin (the 1st b3sum call is provided with N filepaths).
>
> ```bash
> find ./target-folder -type f -exec b3sum {} + | sort | b3sum --no-names
> ```
>
> That will recursively search for all files under the relative path `./target-folder` and append them to the `-exec` command `b3sum`.
>
> The `+` after `{}` is to ensure all file paths are provided to a single b3sum call, instead of multiple calls to b3sum for each individual file path returned by `find`.
>
> Original version (more verbose)

Very practical usage, thank you! But is there a similar command on Windows? @polarathene

polarathene commented 1 year ago

> But is there a similar command on Windows?

I don't use Windows, but you can probably find a similar command to do the same:

* You could maybe try it with WSL?

* Or the paq tool linked above should accomplish roughly the same? (Just try downloading and running the Windows x64 release file.) The paq README describes basically the same steps that the command I shared was performing.

TilKenneth commented 1 month ago

> Just in case it interests anyone landing here: to get a checksum over a range of files' individual checksums, you can use `find` and pair it with `sort` to ensure a deterministic list of checksums to pass to the 2nd b3sum as input from stdin (the 1st b3sum call is provided with N filepaths).
>
> ```bash
> find ./target-folder -type f -exec b3sum {} + | sort | b3sum --no-names
> ```
>
> That will recursively search for all files under the relative path `./target-folder` and append them to the `-exec` command `b3sum`.
>
> The `+` after `{}` is to ensure all file paths are provided to a single b3sum call, instead of multiple calls to b3sum for each individual file path returned by `find`.
>
> Original version (more verbose)

From the future here, to the next future.

The double-space requirement between checksum and filename got me here. I have BLAKE3 checksums from TeraCopy, and the files were transferred from Windows to Linux over the internet. In this case I used Syncthing: no rsync, no robocopy, and no Samba/Windows share.

It is inconvenient to be between two tools that just don't work with the same data, so some glue has to be constructed.

| what | command | output |
| --- | --- | --- |
| `mydir` | `find mydir -type f -exec b3sum {} + > mydir.b3` | `3de3577a7c26681bdae91dbd9b8dcfebaef16fd2b3b82d3369b230feadf51961  mydir/myfile.txt` |
| `./mydir` | `find ./mydir -type f -exec b3sum {} + > mydir.b3` | `3de3577a7c26681bdae91dbd9b8dcfebaef16fd2b3b82d3369b230feadf51961  ./mydir/myfile.txt` |

From pwsh, working with Windows-generated .blake3 files etc. (TeraCopy):

Unexpected input:

**On Windows a filesystem name entry containing `*` or `/` `\` should not be legal.**

Syntax note: `| % { $_ }` is short-hand for `| ForEach-Object { $_ }`, where `$_` is the equivalent of `$item` in `foreach ($item in $collection) { }`. `|` is the expected pipe.

```powershell
$blakefile = "windows.teracopy.blake3"

# Convert; for each line, send to file.
Get-Content $blakefile | % { $_.Replace('\','/').Replace(' *','  ') } | Out-File -Append -FilePath "$blakefile.b3"

# Convert; for each line, pipe to b3sum.
Get-Content $blakefile | % { $_.Replace('\','/').Replace(' *','  ') } | b3sum --check

# Pipe direct to b3sum, reading the entire $blakefile using -Raw.
Get-Content $blakefile -Raw | % { $_.Replace('\','/').Replace(' *','  ') } | b3sum --check

# If you want, there is Tee-Object: send to file and to b3sum.
Get-Content $blakefile -Raw | % { $_.Replace('\','/').Replace(' *','  ') } | Tee-Object -FilePath "$blakefile.b3" | b3sum --check --quiet
```

In case you need to tinker with checksum files that aren't in the right location for b3sum to find:

There is `Get-ChildItem -File -Recurse | % { $_.FullName }` with the `$_.FullName` property, and there is a way to resolve relative paths:

```powershell
Resolve-Path -Path "/home/username/mydir/myfile" -RelativeBasePath "/home/username/" -Relative

# For convenience:
$cwd = "/home/username/"
$fullpath = "/home/username/mydir/myfile"
$remove = $cwd
Resolve-Path -Path "$fullpath" -RelativeBasePath "$remove" -Relative

# Output: ./mydir/myfile
```

polarathene commented 1 month ago

> From the future here, to the next future.
>
> The double-space requirement between checksum and filename got me here.

Sorry, could you be a bit more clear about what your comment is about? I can only infer the Windows adaptation?

I mean, I understand the path separator difference with Windows: `"\x20\x2A"` as above, with the wrong (`\`) directory separator.

> On Windows a filesystem name entry containing `*` or `/` `\` should not be legal.

But was your concern that the checksums you provided are the same whilst having different paths?

I haven't gone back through the prior history of this issue, but IIRC you have a content hash of the file, and a reference to the associated file in the 2nd column.

So I don't see an issue with paths differing while keeping the same digest; could you clarify the specific problem there? Are you trying to use those paths afterwards?

Path separators differing between Windows and Linux aren't always a concern; at least with Rust, I recall it having good support for processing/converting both. On Windows you can have WSL2 and access content on Windows or WSL2 from either end, where the path separator obviously differs despite filesystem boundaries being crossed.

oconnor663 commented 1 month ago

This is a tangent from recent comments, but I want to point out to folks who might be following along that v1.5.0 of the blake3 Rust crate added the Hasher::update_mmap_rayon method, which is the same as what b3sum does internally. So if you want to use e.g. walkdir to code a directory hash, it's easier than it used to be.