BLAKE3-team / BLAKE3

the official Rust and C implementations of the BLAKE3 cryptographic hash function
Apache License 2.0
5.14k stars 350 forks source link

Enhancement: b3sum add -c --check flag for verification #33

Closed tERyceNzAchE closed 4 years ago

tERyceNzAchE commented 4 years ago

Adding -c, --check would make b3sum a drop in replacement for b2sum in some scripts.

oconnor663 commented 4 years ago

Agreed, this would be a good feature. It's not high on my personal priority list, because I don't use --check myself, but I'd be happy to review a PR.

phayes commented 4 years ago

I would be happy to create a PR for this.

But first I think I would like @oconnor663's decision on #24 since that will determine if the hex parsing is in b3sum or in blake3

oconnor663 commented 4 years ago

Apologies for dragging my feet on these. I'm hoping I'll get to them by the end of the week.

phayes commented 4 years ago

@oconnor663 , no worries! It's your project - take your time. :)

oconnor663 commented 4 years ago

I decided against #24, so let's go with putting the hex parsing in b3sum. (Or for what it's worth, we could do constant-time hex comparisons instead of parsing.)

phayes commented 4 years ago

OK great. I'll have this done in the next day or two.

oconnor663 commented 4 years ago

Exciting! I know the https://github.com/jtdowney/b2sum-rs project has a --check implementation, in case that helps. I'm not sure whether everyone does this the same way, or whether we're going to have to make choices.

phayes commented 4 years ago

A PR is ready for review here: #44

llowrey commented 4 years ago

Is there interest in completing this feature? It looks like there are a few unresolved issues with PR #44 and those issues appear to be more policy (requirements) related than technical/implementation/defects.

I am a heavy user of "b2sum --check" and would like to use b3sum, for performance reasons. In one trivial use case I am able to check a dir of files in 42s with b3sum that takes 5m37s with b2sum. That's a big difference but atm I have to do the check out-of-band which is neither convenient or reliable (in my case).

I'd like to help get this feature released but not sure how to contribute.

oconnor663 commented 4 years ago

I'm definitely interested in the abstract. It just hasn't been a priority of mine. There were some policy questions that came up in that PR, which I think it might be a good idea to for us to try to answer before we write much more code. Let me see if I can list them here:

  1. Weird filepath support. Can filepaths contain e.g. newlines? Can they contain invalid UTF-8? "Do whatever Coreutils does here" could be a reasonable answer, but Coreutils actually does some very non-trivial things, and I'd want to make sure we really understand what those are.
  2. Windows support. I don't know whether "do whatever Coreutils does" will work for us here. Can filepaths contain Windows path separators (\)? Should they do so by default on Windows? If so, is an output produced on Windows verifiable on Unix? What happens to paths from Unix systems that contain \ as a filename character? (Such paths aren't even representable on Windows as far as I know, so tools like Zip and Git have to use hacky escaping workarounds. I don't think we have to make everything magically work, but I'd like to have a good idea of what will work and what won't.)
  3. Nonstandard hash parameters. Can we verify hashes that aren't 32 bytes? If so, should we auto-detect the hash length? Can we verify keyed hashes? My biggest concern here is that supporting short hashes could be a security problem.

For all of these things, I strongly lean towards "support as few things as possible now, such that we can choose to add features later as needed". Once we start encouraging people to start saving check files, it'll be painful to make any changes that break existing users.

oconnor663 commented 4 years ago

Maybe some of these questions can get swept under the rug as "we don't construct paths ourselves, we just get them from the command line"? I'm not totally sure. In Rust terms, command line args come to us as OsString, and there's no "unopinionated" conversion from OsString to String or Vec<u8>.

"Paths must be valid Unicode strings and may not contain newlines" could be a pretty good starting point, but even there I have a question: What does b3sum * do when the current directory contains a filename with a newline in it? Does it fail, or does it produce some output which --check cannot consume? If the latter, what error does --check report, and how does it detect this condition?

llowrey commented 4 years ago

Thank you for responding so quickly!

I suspect (hope) that the coreutils folks will adopt blake3 and create their own b3sum, with all the fancy rules they want. I liked you comment about aiming for a 90% solution at 10% the effort.

IMO, the contract must be that --check shall consume whatever b3sum produces as output. The output requirements should have priority and --check should do the best it can with whatever that is.

So...

  1. Weird filepath support. b3sum must accept valid filepaths, for the OS it's running on. I haven't looked into what b3sum does now on output but I would expect it would backslash escape CR, LF and backslashes. I expect the output to be OsString since the output must be readable to the user. If we force output to UTF-8 the user won't be able to read it correctly if they're not using UTF-8.
  2. Windows support. I believe it's the user's responsibility to deal with this. As you point out, a backslash could be present in a unix filename so the presence of a backslash cannot universally signal a windows path. This, to me, is part of the 90% solution approach.
  3. Nonstandard hash parameters. The secret key is not output by b3sum (good thing) and therefore will not be available to --check. The consequence of this is that the input to --check will not have keys therefore either no hash uses a secret key or all use the same key. The tricky part is that b3sum must consume hashes via stdin which is where the secret key comes from. I'm inclined to require the user to prepend the hash stream with the secret key raw bytes. Not trivial, but possible. So, when --check is invoked with --keyed, the first 32 bytes are assumed to be the secret key and the remainder of the input is the hash stream. The secret key would be applied to all hashes. Variable lengths would need to be supported. I understand the security concern but, imo, --check should consume whatever b3sum produces. I'd leave it to the user not to harm themselves.

Re hash length. I would prefer that the hash length be part of the hash such that a 16 byte hash would produce a different value from the first 16 bytes of a 32 byte hash. IOW, truncating the hash to 1 byte would not produce the same hash as when computing a 1 byte hash. This would address your security concern, which exists irrespective of whether --check is doing the verification or users are doing that verification out-of-band.

Would it help if I documented coreutils' behavior wrt filepathing?

oconnor663 commented 4 years ago

Ok I understand the md5sum (and other Coreutils [HASH]sum utilities) behavior better now, based on this StackExchange question and the docs that it links to:

Without --zero, if file contains a backslash or newline, the line is started with a backslash, and each problematic character in the file name is escaped with a backslash, making the output unambiguous even in the presence of arbitrary file names.

For example, a filename with a newline in the middle looks like this:

\d41d8cd98f00b204e9800998ecf8427e  abc\ndef

And a filename with a backslash in the middle looks like this:

\d41d8cd98f00b204e9800998ecf8427e  abc\\def

Every other character (or I should say every other byte, since invalid Unicode is allowed) is copied literally to the output.

A subtle point here is that this is backwards-compatible with older versions of md5sum that did not do this escaping. That means an old checkfile with a literal, unescaped backslash in one of the filenames will still work today, even if it's coincidentally followed by an n character, because unescaping is only done when the line starts with a backslash. (Though I assume unescaped newlines are still hopeless.)

Another subtle point here is that, if --zero isn't used, it's possible to have a null character in the middle of a filepath. I don't think it's possible to make md5sum emit such a filepath, but it's certainly possible to insert a null character in the output manually, and then we have to ask what happens if md5sum consumes that doctored output. As far as I can tell, the answer is that it ignores all characters after the null, probably because it passes the string directly to libc, which interprets the interior null as terminating. I think we should error out in this case, by explicitly checking parsed strings for interior null, rather than reproducing this behavior.


Summing all that up, I think "do exactly what md5sum does, but be more aggressive about panicking on interior nulls" could be a good policy on Unix. However, I think it's going to give us trouble on Windows.

The first problem is that Windows filepaths aren't "arbitrary bytes", but rather "arbitrary byte pairs" (that is, arbitrary 16-bit characters). If we converted byte pairs to bytes in the naive, literal way, then "abc" on Unix would correspond to "a\0b\0c\0" on Windows. That wouldn't be acceptable; even in the most trivial cases, check files wouldn't be portable between Unix and Windows. (And I'm not even going to guess at how a Windows terminal would display those bytes to the user.)

The more reasonable strategy is to parse Windows paths as UTF-16 and convert them to UTF-8, however this runs into the second problem: Windows paths aren't guaranteed to be well-formed UTF-16. They might contain unpaired "surrogate characters" which can't be represented in well-formed UTF-8. This is why "WTF-8" was defined, and it's essentially the entire reason the OsString type exists in Rust. (If the whole world was Unix, we could probably just use &[u8] and Vec<u8> everywhere we use OsStr and OsString today.)

It's tempting to say "ok, just use WTF-8 in the output," but the spec is quite adamant about this:

WTF-8 must not be used to represent text in a file format or for transmission over the Internet.

And actually, "just use WTF-8" wouldn't even solve the problem, because while it can represent all Windows filepaths, it can't represent all Unix filepaths.


Given all that, here's what I propose:

(Edit: More important bullet points below regarding Windows.)

oconnor663 commented 4 years ago

Major oversight above: We obviously shouldn't escape the backslashes in C:\foo\bar.txt on Windows. Does anyone know what md5sum does on Windows? (I see on Google that there is such a thing as "Coreutils on Windows", but I've never used it.) Maybe we can solve the problem with two additional rules:

The goal there is that files with backslashes in their names can be hashed and checked on Unix, but attempting to check such a file on Windows is always an error. (The file cannot exist anyway, and we don't want the name fragments to unintentionally match existing directories.)

sneves commented 4 years ago

Backslashes are escaped on Windows:

C:\Users\John>md5sum.exe \\.\C:\Users\John\Desktop\b3sum_windows_x64_bin.exe
\1d54334dd99612a139fbb51736290e85 *\\\\.\\C:\\Users\\John\\Desktop\\b3sum_windows_x64_bin.exe
oconnor663 commented 4 years ago

Thanks, that makes sense. Looks like they don't have any special handling for Windows. I think we can do better, tentatively leaning towards the two rules I suggested above. Here's the scenario I'm imagining:

  1. A Windows user clones this repo.
  2. That user runs something like

    cd BLAKE3 ; b3sum README.md Cargo.toml src\lib.rs

    (note the backslash) and captures the output as a checkfile.

  3. They send the checkfile to a Unix user.
  4. The Unix user clones this repo too, and then runs b3sum --check on the checkfile.

I don't think we should go crazy with cross-platform compatibility, since non-Unicode paths tend to be totally unrepresentable on other platforms anyway. But "unzip an archive and hash the files in it" is the core use case of --check, and if all the filenames in there are well-behaved, it seems reasonable to make it work. If I understand correctly, it'll work today with md5sum if you go from Unix to Windows, but not if you go from Windows (with standard backslash-separated paths) to Unix.

Also, it would be nice if our default output on Windows wasn't super ugly :)

oconnor663 commented 4 years ago

By the way, in @sneves output above, I see a * character at the start of the path. What's that doing there? Is that some special feature of md5sum that I don't know about yet?

sneves commented 4 years ago

putchar (file_is_binary ? '*' : ' ');

Indicates whether the file is opened as fopen(name, "rb") or fopen(name, "r"). It should always be binary, since "r" mode will mess with newline characters and such across operating systems.

oconnor663 commented 4 years ago

Ok, yeah, I think it's safe to say b3sum will just always use binary mode, and never print the *.

sneves commented 4 years ago

To be clear, * indicates binary mode.

oconnor663 commented 4 years ago

Apparently it's "binary mode, if the difference is significant", which I assume means "only on Windows" in practice? Anyway, we'll always read raw bytes regardless of the OS, so I don't think we need to worry about it.

oconnor663 commented 4 years ago

Ooo, credit where credit is due, it looks like macOS does not allow invalid Unicode filenames!

oconnor663 commented 4 years ago

I'm working on this in the https://github.com/BLAKE3-team/BLAKE3/tree/check branch.

Looks like the Clap argument parsing library might have a bug on Windows that'll get in the way of testing this: https://github.com/clap-rs/clap/issues/1905. I have a PR up to fix that, and I might like to get it released first.

Update: Clap v2.33.1 has been released with a fix.

oconnor663 commented 4 years ago

I've finished a --check implementation and pushed it to master. Its exact behavior is documented here: https://github.com/BLAKE3-team/BLAKE3/blob/master/b3sum/what_does_check_do.md. I'd really appreciate it if folks could test it out, especially folks on Windows. The intention is that producing a checkfile on Unix and checking it on Windows should work, and vice versa.

oconnor663 commented 4 years ago

Basically as soon as someone tells me that they've tried this and it seems to work for them, I'm gonna release it :)

llowrey commented 4 years ago

This is awesome! Thank you very much. This is going to speed up several of my workflows substantially.

nice to have (from b2sum): --quiet don't print OK for each successfully verified file

oconnor663 commented 4 years ago

Thanks for kicking the tires!

Windows: Globbing does not work

T is expected. Globbing is done by the shell on Unix, and my understanding is that the Windows shell has never supported it, and every different program does its own thing (or nothing). That said, I don't know which tools go through the trouble of supporting globbing on Windows. It seems like it would create an awkward situation involving filepaths with literal * characters in their names...

Windows: Unicode text "Funky näme.txt" in looks like "Funky n├ñme.txt" in the output

Is there any right thing to do here? Presumably the current codepage can't even represent most characters?

nice to have (from b2sum): --quiet

That does sound nice to have, and it should be very quick to implement. Let me take a shot at it.

Relatedly, I'm curious if anyone actually uses --zero.

llowrey commented 4 years ago

Thanks for kicking the tires!

Windows: Globbing does not work

T is expected. Globbing is done by the shell on Unix, and my understanding is that the Windows shell has never supported it, and every different program does its own thing (or nothing). That said, I don't know which tools go through the trouble of supporting globbing on Windows. It seems like it would create an awkward situation involving filepaths with literal * characters in their names...

Windows doesn't allow literal * or ? in filenames (or \, /,:,", <, > or | ). So, it should be safe to use the following when run on Windows (assuming it does what I think it does).

https://docs.rs/glob/0.3.0/glob/

Windows: Unicode text "Funky näme.txt" in looks like "Funky n├ñme.txt" in the output

Is there any right thing to do here? Presumably the current codepage can't even represent most characters?

I just noticed that the output of b3sum looks right unless it's directed to something that isn't console. I also verified that notepad and notepad++ show the saved output correctly so are detecting UTF-8. It's really only the type command that doesn't 'do the right thing'. So, I'm a-ok with the behavior as-is.

nice to have (from b2sum): --quiet

That does sound nice to have, and it should be very quick to implement. Let me take a shot at it.

Relatedly, I'm curious if anyone actually uses --zero.

There's probably a tool out there that execs b2sum and doesn't want to be bothered with parsing the output. I don't know what it might be but I don't see a need to implement this option unless someone shows up asking for it.

sneves commented 4 years ago

The MSVC C runtime is already perfectly able to handle globbing on its own, however this is not enabled by default. By linking against setargv.obj, or by defining

extern "C" _crt_argv_mode __CRTDECL _get_startup_argv_mode()
{
    return _crt_argv_expanded_arguments;
}

oneself, globbing becomes transparently enabled.

So with some special linker incantations, it could be possible to enable globbing on Windows without any extra code.

EDIT: But it isn't, because Rust ignores the argc and argv from the C runtime, and calls GetCommandLineW itself, along with all the parsing.

oconnor663 commented 4 years ago

Interesting. Reading through this thread, I found a link to the wild crate, which looks to be for exactly this problem. I'll try it out.

Edit: Success! The wild crate makes this a one-line change. https://github.com/BLAKE3-team/BLAKE3/commit/48512ec4f06cffaa3b3fba7af344f4924c1e6b10

oconnor663 commented 4 years ago

Shipped as part of version 0.3.4! Globbing and --quiet support included.

vans163 commented 2 years ago

Not sure if this is a bug or a feature but it makes it unable to be a drop-in replacement for sha256sum.

using --check --quiet requires 2 spaces (\n\n) between the filename and hash otherwise we get Invalid space error.

root@node0:/tmp# b3sum --check --quiet <<< "0b8b60248fad7ac6dfac221b7e01a8b91c772421a15b387dd1fb2d6a94aee438 hi"
b3sum: Invalid space

root@node0:/tmp# b3sum --check --quiet <<< "0b8b60248fad7ac6dfac221b7e01a8b91c772421a15b387dd1fb2d6a94aee438 
 hi"