Open daviessm opened 3 years ago
Because `OutputReader` implements `std::io::Read` and `Hasher` implements `std::io::Write`, you can replace your block loop here with a combination of `Read::take` and `std::io::copy`. (`copy` does use a relatively small 8 KiB internal buffer, and `Hasher` benefits from a larger one if your processor supports AVX-512, but I don't think this is an important optimization here.)
That said, there might not be any particular reason to use extended output internally here. It probably makes more sense to just use the default 32-byte output from each individual file, and to only apply the `--length` parameter for the final hash at the end. (Abstracting this away from `b3sum` and imagining a library API, we'd also need to take this approach if we wanted to expose the final output as an XOF stream, where the number of bytes that'll be read from the stream isn't known up front.)
A higher-level design question: Should this sort of thing include the file names in the hash? I think it probably has to. For one thing, the order of the hashes depends on the filenames, so at least some filename changes will change the final output. Also users of this feature presumably would want to use it to prove that an entire directory tree is as it should be, but being able to change (some) filenames without changing the hash would seriously break that use case.
Thinking about how to unambiguously represent a tree structure as a stream of bytes brings up some interesting questions. There are simple ways to do it, JSON for example. But with hashing in the picture, we often want to be sure that we're going to get a stable, canonical byte stream, which isn't something that JSON provides. We've had a couple of threads before about this "structured hashing" question (or whatever I should be calling it):

* #87
* #100
Overall I think it's a very interesting and useful problem to solve, but that it's much trickier than it seems at first glance. It brings up security questions, as well as questions of backwards compatibility and protocol design. I'd like to try to tackle it at some point, but I think that'll end up being a nontrivial project in its own right.
So, coming back down to earth a bit, we probably don't want to solve that general problem in `b3sum`. To be fair, our file hashes are effectively (string, string) pairs, and hashing those can be simpler than solving the general question. We could probably just concatenate them all, using length suffixes or something like that to disambiguate the boundaries. However, we'd need to think about all the same Path <---> String invalid Unicode issues that come up in the `--check` feature.
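A minimal sketch of that length-prefix idea (a hypothetical framing for illustration, not anything b3sum actually does): by writing each field's length before its bytes, differently split inputs can never produce the same concatenation.

```rust
// Hypothetical length-prefixed framing for (filename, hash) pairs, so
// that field boundaries are unambiguous before hashing the concatenation.
fn frame(pairs: &[(&str, &[u8])]) -> Vec<u8> {
    let mut out = Vec::new();
    for (name, hash) in pairs {
        // 8-byte little-endian length prefix before each field.
        out.extend_from_slice(&(name.len() as u64).to_le_bytes());
        out.extend_from_slice(name.as_bytes());
        out.extend_from_slice(&(hash.len() as u64).to_le_bytes());
        out.extend_from_slice(hash);
    }
    out
}

fn main() {
    // Without framing, ("ab", "c") and ("a", "bc") would concatenate
    // to the same bytes; with length prefixes they cannot collide.
    let a = frame(&[("ab", b"c")]);
    let b = frame(&[("a", b"bc")]);
    assert_ne!(a, b);
    println!("distinct: {}", a != b);
}
```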
Thoughts?
> Because `OutputReader` implements `std::io::Read` and `Hasher` implements `std::io::Write`, you can replace your block loop here with a combination of `Read::take` and `std::io::copy`. […] It probably makes more sense to just use the default 32-byte output from each individual file, and to only apply the `--length` parameter for the final hash at the end. […]
I can see this making sense both ways: on one hand a 32-byte output is probably sufficient but there will undoubtedly be times where project requirements dictate that a larger length is used (I'm imagining governmental security related projects here). Of course, there would be no way to know what the internal length of the digest is so perhaps the point is moot.
> A higher-level design question: Should this sort of thing include the file names in the hash? I think it probably has to. […] being able to change (some) filenames without changing the hash would seriously break that use case.
Again, I can see it both ways. In most cases the file names will be important and therefore should be included in the digest, but what if one of the times `b3sum` needs to be run is on a case-insensitive filesystem that always returns filenames in lower/upper case? Is that even a thing? Perhaps there should be an option whether to include file names or not.
> Thinking about how to unambiguously represent a tree structure as a stream of bytes brings up some interesting questions. […] We've had a couple of threads before about this "structured hashing" question:
>
> * #87
> * #100
>
> Overall I think it's a very interesting and useful problem to solve, but that it's much trickier than it seems at first glance. […]
Why do we need to consider it as a tree structure? My thought was to simply add the files to the digest in the order they're generated. Perhaps with the parallelism issue from #170 the hashes would internally be stored in a tree map as they're generated in order to preserve the original ordering, then the tree walked and file hashes appended to the digest sequentially after all files had individually been hashed.
> So, coming back down to earth a bit, we probably don't want to solve that general problem in `b3sum`. […] We could probably just concatenate them all, using length suffixes or something like that to disambiguate the boundaries. However, we'd need to think about all the same Path <---> String invalid Unicode issues that come up in the `--check` feature.
My use-case would be satisfied with being able to pipe the output of `b3sum file1 file2 file3` (or, from #170, `b3sum -r dir_to_hash`) to the input of another `b3sum` as a string to generate a hash for. Of course it would look nicer (to me) if it was all possible in one command 🙂
> there will undoubtedly be times where project requirements dictate that a larger length is used (I'm imagining governmental security related projects here)
This is a common point of confusion, and maybe not documented well enough. Output lengths larger than the 32-byte default actually don't make the hash more secure. The underlying reason for this is that output size is one of the caps on security (thus making it shorter will reduce security), but it's not the only cap. Security is also capped by the amount of internal state we carry forward between invocations of the compression function, which is always 32 bytes, regardless of the target output length (which might not even be known at compression time).
Why not use `tar`? Or am I missing something? Do different implementations of `tar` have implementation diffs that cause tarballs to be slightly different even if they were generated from the same dir?
For my original use-case it's to prove that files have not been modified during a transfer between systems. Adding them to an archive obviously does change the files; as such we need to generate the checksums in the original file structure.
> Adding them to an archive obviously does change the files; as such we need to generate the checksums in the original file structure
I see now. Thank you for clarifying.
Just in case it interests anyone landing here, to get a checksum over a range of files' individual checksums you can use `find` and pair it with `sort` to ensure a deterministic list of checksums to pass to the 2nd `b3sum` as input from stdin (the 1st `b3sum` call is provided with N filepaths).

```
find ./target-folder -type f -exec b3sum {} + | sort | b3sum --no-names
```
That will recursively search for all files in the relative path `./target-folder` and append them to the `-exec` command `b3sum`. The `+` after `{}` ensures all file paths are provided to a single `b3sum` call, instead of multiple calls to `b3sum` for each individual file path returned by `find`.
I had a similar requirement but only needed the final hash. Wrote paq using the `blake3` crate. Supports Linux, macOS, and Windows.
> Just in case it interests anyone landing here, to get a checksum over a range of files' individual checksums you can use `find` and pair it with `sort` to ensure a deterministic list of checksums to pass to the 2nd `b3sum` as input from stdin. […]
>
> `find ./target-folder -type f -exec b3sum {} + | sort | b3sum --no-names`
Very practical usage, thank you! But is there a similar command on Windows? @polarathene
But is there a similar command on Windows?
I don't use Windows, but you can probably find a similar command to do the same.

You could maybe try with WSL? Or that `paq` tool linked above should accomplish roughly the same (just try downloading and running the windows x64 release file)? The `paq` README describes basically the same steps that the command I shared was performing.
> Just in case it interests anyone landing here, to get a checksum over a range of files' individual checksums you can use `find` and pair it with `sort` to ensure a deterministic list of checksums to pass to the 2nd `b3sum` as input from stdin. […]
>
> `find ./target-folder -type f -exec b3sum {} + | sort | b3sum --no-names`
From the future here, to the next future.
The double space requirement between checksum and file got me here. I have blake3 checksums from TeraCopy, and the files were transferred from Windows to Linux over the internet. In this case I used SyncThing: no rsync, no robocopy, and no Samba/Windows share.

It is inconvenient to be in between two tools that just don't work with the same data, so some glue has to be constructed.
| what | command | output |
|---|---|---|
| `mydir` | `find mydir -type f -exec b3sum {} + > mydir.b3` | `3de3577a7c26681bdae91dbd9b8dcfebaef16fd2b3b82d3369b230feadf51961 mydir/myfile.txt` |
| `./mydir` | `find ./mydir -type f -exec b3sum {} + > mydir.b3` | `3de3577a7c26681bdae91dbd9b8dcfebaef16fd2b3b82d3369b230feadf51961 ./mydir/myfile.txt` |
From pwsh and Windows-generated `.blake3` files etc. (TeraCopy):

Unexpected input:

```
3de3577a7c26681bdae91dbd9b8dcfebaef16fd2b3b82d3369b230feadf51961 *myfile.txt
```

That is `\x20\x2A`, aka space followed by star (`*`).

```
3de3577a7c26681bdae91dbd9b8dcfebaef16fd2b3b82d3369b230feadf51961 *mydir\myfile.txt
```

As above, with the wrong (`\`) directory separator. On Windows a filesystem name entry containing `*` or `/` `\` should not be legal.

Then there are paths like `\\.\c:\very-long-long-path` (>MAX_PATH), `\\machine\share`, and `Microsoft.PowerShell.Core\FileSystem::/home/username/mydir/myfile` from `(Get-Item $HOME).PSPath`.
Syntax note: `| % { $_ }` is short-hand for `| ForEach-Object { $_ }`, where `$_` is the equivalent of `$item` in `foreach ($item in $collection) { }`. `|` is the expected pipe.
```powershell
$blakefile = "windows.teracopy.blake3"

# Convert; for each line, send to file.
Get-Content $blakefile | % { $_.Replace('\','/').Replace(' *',' ') } | Out-File -Append -File "$blakefile.b3"

# Convert; for each line, pipe to b3sum.
Get-Content $blakefile | % { $_.Replace('\','/').Replace(' *',' ') } |

# Pipe direct to b3sum, read entire $blakefile using -Raw.
Get-Content $blakefile -Raw | % { $_.Replace('\','/').Replace(' *',' ') } | b3sum --check

# If you want, there is Tee-Object: send to file and to b3sum.
Get-Content $blakefile -Raw | % { $_.Replace('\','/').Replace(' *',' ') } | Tee-Object -FilePath "$blakefile.b3" | b3sum --check --quiet
```
In case you need to tinker with checksum files that's not the right location for b3sum to find: there is `Get-ChildItem -File -Recurse | % { $_.FullName }` with the `$_.FullName` property, and I'm sure there is a way to resolve a relative path:

```powershell
Resolve-Path -Path "/home/username/mydir/myfile" -RelativeBasePath "/home/username/" -Relative

# For convenience.
$cwd = "/home/username/"
$fullpath = "/home/username/mydir/myfile"
$remove = $cwd
Resolve-Path -Path "$fullpath" -RelativeBasePath "$remove" -Relative
# ./mydir/myfile
```
> From the future here, to the next future.
> The double space requirement between checksum and file got me here.
Sorry, could you be a bit more clear on what your comment was about? I can only infer the Windows adaptation?
I mean I understand the path separator difference with Windows:
> `\x20\x2A` as above, with wrong (`\`) directory separator. On Windows a filesystem name entry containing `*` or `/` `\` should not be legal.
But was your concern that the checksums you provided are the same whilst having different paths?
I haven't gone back through the prior history of this issue, but IIRC you have a content hash of the file, and a reference to the associated file in the 2nd column.
So I don't see an issue with paths differing while keeping the same digest, could you clarify the specific problem there? Are you trying to use those paths afterwards?
Path separators differing between windows and linux isn't always a concern, at least with Rust I recall it having good support to process / convert both. On Windows you can have WSL2 and access content on Windows or WSL2 from either end where obviously the path separator differs despite filesystem boundaries being crossed.
This is a tangent from recent comments, but I want to point out to folks who might be following along that v1.5.0 of the `blake3` Rust crate added the `Hasher::update_mmap_rayon` method, which is the same as what `b3sum` does internally. So if you want to use e.g. `walkdir` to code a directory hash, it's easier than it used to be.
I have a project with a requirement to output a hash for each file and an overall hash of the whole set of files at the end. This could be achieved with `b3sum file1 file2 file3 > filesums; b3sum filesums` but a built-in option would be nice too.

I started work on this in https://github.com/daviessm/BLAKE3/commit/d6be56658ec8c0f3075bb3254bc248026aa38a60 but as a Rust newbie I don't know if I'm going down the right path with this.