Closed pedrocr closed 7 years ago
I'd consider this unwise based on previous discussions.
Right now, std::hash::Hasher
does not address endianness, making it non-portable. It's okay for HashMap
because you can only serialize HashMap
as a series of tuples, not as the raw table. It's already not okay for more lossy probabilistic data structures, like a Cuckoo table or Bloom filter, that you likely serialize as the table itself. It's flat wrong for cryptographic hash functions that one uses almost exclusively in protocols or files.
Conversely, all these cryptographic hash functions would destroy the performance of HashMap
if used there, so avoid doing that too. SipHasher
is already designed to be a sweet spot between being fast and providing the sort of cryptographic assurances HashMap
needs when under DoS attacks.
That said, one could provide a version using something like the endian-hasher crate, unless it's horribly broken.
```rust
use std::hash::{Hash, Hasher};
use endian_hasher::*;

struct DigestHasher<'a, T: 'a + ?Sized>(&'a mut T);

impl<'a, T: ?Sized + Input> Hasher for DigestHasher<'a, T> {
    fn finish(&self) -> u64 { panic!(); }
    fn write(&mut self, bytes: &[u8]) { self.0.process(bytes); }
}

pub trait Input {
    /// Digest input data. This method can be called repeatedly
    /// for use with streaming messages.
    fn process(&mut self, input: &[u8]);

    /// Provide data from anything that implements `Hash`, attempting to
    /// convert embedded primitive numeric types to little endian.
    fn input_le<H: Hash>(&mut self, hashable: &H) {
        let mut s = HasherToLE(DigestHasher(self));
        hashable.hash(&mut s);
    }

    /// Provide data from anything that implements `Hash`, attempting to
    /// convert embedded primitive numeric types to big endian.
    fn input_be<H: Hash>(&mut self, hashable: &H) {
        let mut s = HasherToBE(DigestHasher(self));
        hashable.hash(&mut s);
    }

    /// Provide data from anything that implements `Hash`, leaving primitive
    /// numeric types in native endianness.
    ///
    /// Warning: Non-portable, do not use without first correcting the
    /// endianness of the input.
    fn input_native<H: Hash>(&mut self, hashable: &H) {
        let mut s = DigestHasher(self);
        hashable.hash(&mut s);
    }
}
```
As previously, we'd keep DigestHasher
private to the digest crate so that it cannot be used with HashMap. And we'd make the finish
method panic for good measure.
There is still an issue here in that we trust H: Hash
to handle primitive numeric types by calling Hash
instances recursively, which it might fail to do. We could perhaps lay the blame for such portability bugs on whoever wrote the H: Hash impl, but they sound tricky to find.
I could hack up a pull request for this if @newpavlov thinks it wise. I think the first question is: Should the endian-hasher
crate live or die?
I'm tired right now, but I think the argument against my alternative goes roughly: You should document your protocols and file formats better by using a more explicit serialization. That could just be a warning somewhere.
Sorry to be this blunt, but this is a terrible idea.
> The standard interface only allows 64-bit outputs but there's nothing stopping extra outputs tailored to specific hashes

What? The trait definition stops this, because types. Nor is there any standard notion of how to map a cryptographic hash function's digest onto a 64-bit integer.
But beyond that, cryptographic hash functions utilized by std::hash::Hasher
should be PRFs (i.e. keyed by a secret unguessable by an attacker), because hashDoS. There is no point in using a hash function which is both slow (because crypto) and not a PRF (because security).
The fundamental tradeoff is:
1) A hash function that is as fast as possible but doesn't hash attacker-controlled data. This omits every cryptographic hash function, because they are slow because they act as random oracles.
2) A cryptographic PRF keyed by unguessable data to prevent hashDoS.
To me, if your goal is #2, anything slower than SipHash is unacceptable, because you still want a std::hash::Hasher
to be as fast as possible given other constraints.
Do you disagree? Can you explain the use case where you want something that's simultaneously slow and insecure?
@burdges the endianness issues are indeed a problem. If there's no way to fix those it's unfortunate. I'm not sure I understand why this happens. Are primitive types hashed by feeding the hasher one u8 at a time? Because the Hasher
interface includes everything from u8 to u128 so it would seem that as long as Hash implementations use the full type it should be fine. Why doesn't that work?
I'll have a look if your suggestion works for my use case.
@tarcieri

> What? The trait definition stops this, because types. Nor is there any standard notion of how to map a cryptographic hash function's digest onto a 64-bit integer.
Nothing stops me from doing a .finish_256bit() and having that give me the SHA256 of my struct. It's not part of the trait but I don't care, I just want to be able to use the Hash
API as the plumbing to get all the values to the Hasher.
You're assuming I want to use the normal Hasher API with SHA256 in a HashMap or something like that. That would be a bad idea indeed, SipHash is strictly better in that application. What I need is a longer hash that uniquely identifies a certain struct over time, for caching and serialization. Think of how git uses hashes (content addresses), not how hash tables use hashes. Here's my current code:
I generate hashes that represent the state of the image pipeline up to that point as a function of its settings. I'm currently using MetroHash but I want at least 128-bit and a crypto hash for this. I'm only using the Hash trait for the convenience of #[derive(Hash)] and friends. But maybe I just need a #[derive(CryptoHash)].
It works so long as people use Hasher
recursively, but if any impl
of Hash
casts a &[u64]
to a &[u8]
then it'll break. And someone might wind up doing this for performance if they mistakenly examine performance in unoptimized builds.
You can always locally employ a version of the DigestHasher
wrapper that makes a cryptographic hash function look like a Hasher
, either with or without my endianness fix. I think the question is if my input_le
and input_be
functions make sense.
In fact, I doubt #[derive(Hash)]
has stabilized field order, which provides a much better argument against adding my input_le
and input_be
functions to the Input
trait. You could warn people that they should implement Hash
themselves, not derive it.
I think those are the arguments against input_le and input_be: Bad impls for Hash break them. #[derive(Hash)] might not have stabilized field order. And you can document your protocol better by using a more explicit serialization. Iffy but not necessarily fatal. I donno.
@burdges It seems the standard implementation does just that:
So it seems Hash is just too fundamentally broken for the serialization use case. Is there any derivable trait somewhere that can help with that? Something that just recursively hashes the underlying bytes of all types with a fixed endianness would be enough.
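To make the non-portability concrete, here is a minimal std-only sketch (the Recorder type is hypothetical, written just for this demo) showing that an integer reaches Hasher::write as native-endian bytes:

```rust
use std::hash::{Hash, Hasher};

// A toy Hasher that records every byte it is fed, so we can inspect
// exactly what `Hash` implementations emit.
struct Recorder(Vec<u8>);

impl Hasher for Recorder {
    fn finish(&self) -> u64 { 0 }
    fn write(&mut self, bytes: &[u8]) { self.0.extend_from_slice(bytes); }
}

fn main() {
    let mut r = Recorder(Vec::new());
    0x0102030405060708u64.hash(&mut r);
    // The default `Hasher::write_u64` forwards the integer's native-endian
    // bytes, so this stream differs between little- and big-endian machines.
    assert_eq!(r.0, 0x0102030405060708u64.to_ne_bytes());
}
```

So two machines of different endianness hashing the same u64 feed different byte streams to the underlying hash function.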
Yikes, I completely missed that! I'm tempted to raise that as a libs issue. In any case, I suppose serde should be considered the right way to do this for now.
@burdges That's very interesting, how would you use serde for this? All my structs already implement Serialize
and Deserialize. Is there a standard way to just feed that to SHA256 without first going to YAML or something like that?
@pedrocr perhaps something like this is what you want:
https://github.com/cryptosphere/objecthash-rs
Note that I never wrote a procedural macro for it. I also intend on doing something slightly different and a bit more fleshed out soon.
@tarcieri yes, this is precisely my use case, thanks. Wouldn't it be easier to just write a serde Serializer
that feeds a Digest? That way #[derive(Serialize)]
is all that's needed and any crypto hash can be used?
@pedrocr it may be possible to finagle it through serde visitors.
I would advise against using it for now though. I'll have a better library out soon.
@tarcieri it seems objecthash solves a bigger problem of having redactable structs. For me that's not an issue so maybe just writing a Serializer
is simpler.
Yes, even using the bincode
crate still involves an allocation, but one could write a Serializer
for digests to avoid allocation, maybe even one that agreed with the results of bincode.
As an aside, I'm pretty happy doing all my cryptographic hashing manually so far, because frequently if I need a cryptographic hash then it actually needs data from multiple sources, or maybe should not cover all the data present in the struct provided. YMMV
@pedrocr as is, the Rust implementation doesn't support redaction, but yes, redaction/Merkle inclusion proofs are one of the nice features you get out of a hashing scheme like that.
@burdges cool, I'll have a look at this then. Would a PR with something like an adapter between Serializer
and Digest
make sense?
As for doing it manually I was hoping to do it automatically as these structs are specifically designed to be all the settings for the image operation and nothing more. We'll see :)
I found an easy way: You impl
io::Write
for a local wrapper struct Inputer<'a, I: 'a + ?Sized>(&'a mut I) where I: Input;
and provide a method with a Serialize
argument that calls bincode::serialize_into with the Inputer wrapper.
It'll look vaguely like what I wrote above, but using io::Write
instead of Hasher
and serde::Serialize
instead of Hash
. Now serialize_into
should avoid the Vec
allocation in bincode::serialize.
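A rough sketch of that idea, assuming a local stand-in for the digest crate's Input trait (ByteCollector and the direct write_all call are just for the demo; with bincode you'd call bincode::serialize_into instead):

```rust
use std::io;

// Stand-in for the digest crate's `Input` trait (an assumption for this sketch).
pub trait Input {
    fn process(&mut self, input: &[u8]);
}

// Toy digest that just accumulates bytes, so the sketch is self-contained.
struct ByteCollector(Vec<u8>);
impl Input for ByteCollector {
    fn process(&mut self, input: &[u8]) { self.0.extend_from_slice(input); }
}

// The wrapper: anything implementing `Input` becomes an `io::Write` sink.
struct Inputer<'a, I: 'a + ?Sized>(&'a mut I);

impl<'a, I: ?Sized + Input> io::Write for Inputer<'a, I> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.0.process(buf);
        Ok(buf.len())
    }
    fn flush(&mut self) -> io::Result<()> { Ok(()) }
}

fn main() -> io::Result<()> {
    use std::io::Write;
    let mut digest = ByteCollector(Vec::new());
    // With bincode this would be:
    //   bincode::serialize_into(Inputer(&mut digest), &value)?;
    // Here we just feed bytes directly to show the plumbing works.
    Inputer(&mut digest).write_all(b"hello")?;
    assert_eq!(digest.0, b"hello");
    Ok(())
}
```

Since serialize_into streams straight into the writer, no intermediate Vec is allocated.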
Also, your original question could now be: Why not implement io::Write
? Or could Input
be replaced with io::Write
even?
@burdges sounds good, avoids having to write a full Serializer. It does add more indirection though.
@burdges also, is bincode guaranteed to always produce the same bytes? Or can it do things like reorder fields between versions? It may be safer to just implement the Serializer
to guarantee no funny business happens that invalidates the hash.
@pedrocr you seem to be running afoul of the canonicalization problem. One of the nice things about objecthash-like schemes is you don't need to canonicalize prior to hashing. Structurally similar objects will always have the same hash.
@tarcieri see this for my conclusions on the topic:
https://internals.rust-lang.org/t/f32-f64-should-implement-hash/5436/33
My worry in this case is only that the same exact struct will hash differently because bincode changes. Making sure that reordering fields doesn't change the hash for example isn't as important. Since my structs already implement Serialize
using that would mean one less derive to worry about.
Having a crate that does derive and allows me to use any hash would be nice though so please do ping me once you have something to test. The other features can't hurt either.
Yes, serde explicitly supports non-self-describing formats, naming bincode as the example.
There are various instructions for impl
s of Serialize
and Deserialize
to support non-self-describing formats, so those can break bincode of course.
Also bincode never emits anything about the fields besides lengths and contents. I'd imagine bincode would use a feature or a major version update if they broke binary compatibility, but who knows.
@burdges so hashing bincode does seem fairly reasonable. Will have a look at that then.
@pedrocr
Sorry I couldn't join the discussion earlier. As I understand, your problem can be solved by a separate crate (bincode + some glue code, or using tarcieri's future crate) and you do not currently require any changes in digest? In that case I think this issue can be closed.
There is a lingering question as to whether the Input
trait should be replaced by std::io::Write
or something like that. I haven't thought about it really myself.
@burdges
There are two problems with using std::io::Write
. First: there is no core::io
, so it conflicts with the goal to support no_std
whenever possible. Second: its methods return Result
while hashes can process any data without an error, so using Write
will result in a lot of useless unwraps.
I think the current digest_reader
is more than enough for convenient interfacing with the Read
implementors.
It would also be nice if a Serializer
was written so that one could hash straight from serde
without using bincode. But that would just be a convenience that should probably be behind a feature flag.
Should this issue be closed? I think it's semantically incorrect to implement Hasher
for this purpose, and there are plenty of other APIs more appropriate for hashing structured data in a cryptographically meaningful way.
@tarcieri the core of the issue is still unresolved. I've changed the title to reflect that. Right now the best way to use crypto hashes on structs is serde->bincode->Digest
which is far from ideal.
@pedrocr
Am I correct that you want to add the following method to the Digest
trait or something similar?
```rust
fn digest_serialized<S: Serialize, F: Serializer>(s: S, f: F) -> Result<Output<Self::OutputSize>, F::Error>;
```
UPD: edited code a bit, removed bincode method
@newpavlov Don't know why you'd want the bincode one, the idea would be to not need bincode at all to be able to hash something that implements Serialize
. I also don't think the first one is the correct one. From what I've seen what's needed is some sort of DigestSerializer
that implements the Serializer
trait from serde by feeding all the inputs into a Digest.
Also I'm not fully convinced the correct solution for this problem is with serde at all. There are very likely cases where you want to serialize a field that you don't want to include in the hash, and vice versa.
@pedrocr
To hash data you first need to convert it to a stream of bytes, i.e. serialize it. I don't think it would be right to choose or develop a serializer inside the digest crate.
So I think it's better to start with a separate crate which will utilize the Digest trait (or just one hash) and some serialization strategy (be it bincode or something hand-written) to implement a digest version of Serializer.
I don't mind having this crate as part of the RustCrypto project, but I'll not be able to work on it as I currently have other priorities for the project.
P.S.: After reading some docs it seems that the code I wrote in the previous comment will not work, as it's impossible to extract a Writer from a serializer and there are currently no traits describing what we need.
As someone working specifically on structured hashing schemes, I'd also suggest trying to collaborate on ones which are reusable across multiple languages instead of making yet another one which is proprietary to Rust.
@newpavlov

> To hash data you first need to convert it to the stream of bytes, i.e. serialize it
Not exactly, a serialization requires the possibility to deserialize and thus includes more bytes than strictly needed. A Serializer
that just feeds a Digest
is by definition one-way and thus the only thing it needs to do is convert all fields to bytes in order and feed the hash, it doesn't have to do much serialization. But I agree that Serialize
is the wrong abstraction for this as you will not always want to serialize the same fields as you want to hash and vice versa.
> So I think it's better to start from the separate crate which will utilize Digest (or just one hash) and some serialization strategy (be it bincode or something hand-written) to implement digest version of Serializer.
Hashing the bincode
serialization is trivial to do, that's what I've already done in my project. Publishing that as a mini-crate would be easy if there's interest.
@tarcieri

> As someone working specifically on structured hashing schemes, I'd also suggest trying to collaborate on ones which are reusable across multiple languages instead of making yet another one which is proprietary to Rust.
Structured hashing schemes are not the same thing as hashing structs. It would be nice if Hash
was usable for hashing structs independently of structured hashing schemes. Since that is broken in multiple ways it would probably be nice if there was a usable #[derive(CryptoHash)]
somewhere. But since serde->bincode->Digest
works well enough it's probably what we'll end up with.
> Structured hashing schemes are not the same thing as hashing structs.
"Recursively hashing structs" is pretty much the definition of a structured hashing scheme. However, you seem to be using it to mean "a content hash for serialized bincode" which, to me, is something very different from "recursively hashing structs"
> It would be nice if Hash was usable for hashing structs independently of structured hashing schemes.
As covered earlier, this is simply the wrong tool for the job.
> Since that is broken in multiple ways it would probably be nice if there was a usable #[derive(CryptoHash)] somewhere. But since serde->bincode->Digest works well enough it's probably what we'll end up with.
So you want to invent your own scheme, which is what I was recommending against. At least think about taking one of these two paths as you invent your own scheme:
1) Canonicalization: how do you deal with hashing str and String? (i.e. do you apply Unicode canonicalization first?) Someone will want to hash types like HashMap eventually. Make sure there's a canonical representation.
2) Structured hashing: nicely sidesteps the canonicalization problem while providing a bunch of added value at the cost of relatively little complexity. These schemes can be as fast or faster than canonicalization schemes, because they avoid serializing the data as an intermediate step.
Done correctly you can solve all of the problems I just mentioned and have a format that works across languages. Or you can invent your own thing that's specific to Rust, ala bincode, but please think through the issues that have been addressed in all of the content hashing schemes that have been created in the past.
You are walking well-worn ground here, where many mistakes have been made by past naive attempts and much hard-won knowledge has been accrued. Before you go invent something, please make sure you study the past.
@tarcieri

> "Recursively hashing structs" is pretty much the definition of a structured hashing scheme. However, you seem to be using it to mean "a content hash for serialized bincode" which, to me, is something very different from "recursively hashing structs"
I've stated multiple times I don't want to use bincode for anything, it's just a hack to get it to work right now. So no, I don't want to hash bincode; all I want is a recursive crypto hash of arbitrary Rust structs. I don't want to make any canonical representations of data. I don't want to serialize and then hash. I don't want to invent a hashing scheme, just have a hashing trait that isn't broken like the default Hash.
> all I want is a recursive crypto hash of arbitrary rust structs

> I don't want to serialize and then hash
Then, by definition, you need an algorithm that knows how to compute cryptographic hashes of structured data.
Better make sure it isn't vulnerable to second-preimage attacks.
@tarcieri as far as I can tell that only applies if you are hashing individual structs and then hashing the concatenated hashes. I don't want that at all. I just want to initialize the hash, pass in all the values recursively, and finalize the hash. This has the same properties as serializing and then hashing without the performance penalty.
You need to domain separate the values somehow, either by length or by structure, and in either case you'll want to encode the types. The former pretty much involves a bincode-like scheme, and the latter an objecthash one.
Otherwise, you'll have ambiguities where different structures compute the same hash.
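A small std-only illustration of that ambiguity (naive and length_prefixed are hypothetical helpers, standing in for two framing strategies a hashing scheme might use):

```rust
// Naive framing: just concatenate the raw bytes of each field.
fn naive(fields: &[&str]) -> Vec<u8> {
    let mut out = Vec::new();
    for f in fields {
        out.extend_from_slice(f.as_bytes());
    }
    out
}

// Domain-separated framing: prefix each field with its length.
fn length_prefixed(fields: &[&str]) -> Vec<u8> {
    let mut out = Vec::new();
    for f in fields {
        out.extend_from_slice(&(f.len() as u64).to_le_bytes());
        out.extend_from_slice(f.as_bytes());
    }
    out
}

fn main() {
    // Different structures, same bytes: whatever hash you apply afterwards,
    // these two collide by construction.
    assert_eq!(naive(&["ab", "c"]), naive(&["a", "bc"]));
    // Length prefixes disambiguate the field boundaries.
    assert_ne!(length_prefixed(&["ab", "c"]), length_prefixed(&["a", "bc"]));
}
```

Structure tags (as in objecthash) achieve the same disambiguation by encoding types instead of lengths.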
> You need to domain separate the values somehow, either by length or by structure, and in either case you'll want to encode the types.
None of these are an issue for me.
> Otherwise, you'll have ambiguities where different structures compute the same hash.
That's how Hash
already works and is fine for me.
Ok, so it sounds like you want to abandon security... in which case why are you using a cryptographic hash?
@tarcieri none of these abandon security, but even if they did, I'm using hashes for caching. All I need is a statistical guarantee of non-collision absent an attacker.
If you're using hashes like this in any sort of security context and have ambiguities in a structured hashing format, an attacker can exploit them to trick you into accepting documents containing e.g. nested content and have it verify as if it was the same as a non-nested message. Second preimage attacks on naive Merkle Trees are an example of this, and it's similar in concept to other attacks regarding confusion of message structure such as packet injection attacks.
If you want to build a naive hashing scheme using cryptographically secure hashes for your own purposes, that's fine, but I would be strongly opposed to the inclusion of naive schemes in cryptography libraries, because people generally expect cryptography libraries to be built for security purposes.
@tarcieri I suggest you bring this up with the Rust lib developers then. All I want is to get a hash calculated the exact same way Hash already works. I only want a crypto hash because 64 bits are not enough collision protection for the application. If there's a way to create collisions easily with that kind of interface, Rust in general is vulnerable to them, and that can be exploited to DoS systems at the very least. You haven't given an example of how that would happen though. All the cases you've pointed out are hashes-of-hashes situations.
> I suggest you bring this up with the rust lib developers then.
Hash
isn't intended to be a cryptographic hash for identifying content. It's for use in hash-like data structures, where collisions are easily tolerated by the underlying algorithms. The main threat to those is hashDoS, where an attacker can deliberately exploit repeated collisions to the same value. The Rust core developers addressed those by using SipHash, a PRF with a random key.
You're talking about creating a CryptoHash
trait, which to me would ostensibly be for the purposes of secure content authentication. If you don't want it to be about security, I would suggest taking "crypto" out of the name.
> You haven't given an example of how that would happen though.
These same attacks can happen without hashes-of-hashes. As I pointed out the general attack is one of failure to disambiguate message structures. Packet injection is the same basic idea, without any hashing involved whatsoever. That's a parsing ambiguity, but the same concept applies: the security system is confused about the message structure, allowing an attacker to trick it into doing something it shouldn't.
In my opinion if any messages with different structures can realistically produce the same content hash, the scheme is broken from a security perspective.
> The Rust core developers addressed those by using SipHash, a PRF with a random key.
From what I've seen the current implementation doesn't actually randomize the key. But even if it did that's not relevant if it's trivial to generate a collision by just manipulating the structure.
> These same attacks can happen without hashes-of-hashes. (...) In my opinion if any messages with different structures can realistically produce the same content hash, the scheme is broken from a security perspective.
I'd agree, I just haven't found a case where that can happen with a setup like that of Hash. I'm curious to see one, as I believe it would be a bug in Hash as well (or in some implementations of the trait).
> From what I've seen the current implementation doesn't actually randomize the key.
Hash
is just a trait. Some implementations defend against hashDoS and some don't.
According to the documentation, the implementation used by HashMap
specifically implements the randomly-keyed cryptographic hash strategy (and uses std::hash::SipHasher
I believe):
https://doc.rust-lang.org/std/collections/struct.HashMap.html
> By default, HashMap uses a hashing algorithm selected to provide resistance against HashDoS attacks. The algorithm is randomly seeded, and a reasonable best-effort is made to generate this seed from a high quality, secure source of randomness provided by the host without blocking the program
At least on my machine the hash is not seeded. This always returns the same value:
https://gist.github.com/pedrocr/8d0283bed3f56e5cb6fb9fe0a785f947
But that's not relevant to the point, as I specifically mentioned. If you can generate structs that have the same hash because you manipulate the structure, it doesn't matter that your hash function is randomly keyed. Within the same execution of the program those structs will collide, just with different values. And that's what would be the bug in Hash.
> At least on my machine the hash is not seeded.
Use a HashMap
and I expect you'll see different results. Hash
itself is NOT designed to be a security primitive.
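For what it's worth, the distinction can be sketched with std alone (hash_with is a hypothetical helper; the fixed-key behavior of DefaultHasher::new() matches current std but is not guaranteed across versions):

```rust
use std::collections::hash_map::{DefaultHasher, RandomState};
use std::hash::{BuildHasher, Hash, Hasher};

// Hash one value with a fresh hasher and return the 64-bit result.
fn hash_with<H: Hasher, T: Hash>(mut h: H, value: &T) -> u64 {
    value.hash(&mut h);
    h.finish()
}

fn main() {
    // All `DefaultHasher::new()` instances behave identically, which is why
    // hashing directly with it prints the same value every time.
    let a = hash_with(DefaultHasher::new(), &"foo");
    let b = hash_with(DefaultHasher::new(), &"foo");
    assert_eq!(a, b);

    // `HashMap` instead goes through `RandomState`, which is seeded per
    // instance. One state is self-consistent...
    let s1 = RandomState::new();
    assert_eq!(
        hash_with(s1.build_hasher(), &"foo"),
        hash_with(s1.build_hasher(), &"foo"),
    );
    // ...but two states will almost surely disagree (not asserted here,
    // since a collision, while astronomically unlikely, is possible).
    let s2 = RandomState::new();
    let _ = hash_with(s2.build_hasher(), &"foo");
}
```

So the gist's output is consistent with DefaultHasher being unseeded even though HashMap itself is randomly seeded.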
And to reiterate yet again, in the context where Hash
security matters (i.e. HashMap
and SipHasher), we're operating under a very different threat model where collisions are tolerated. Hash-like data structures naturally accommodate collisions, but are susceptible to an algorithmic attack if they rebalance to the same value repeatedly.
You're conflating threat models here: Hash
and CryptoHash
operate under different threat models.
In the one I would find acceptable for CryptoHash, collisions for attacker-controlled documents are not acceptable.
I'm not sure this is continuing to be a productive conversation: I would like the threat of ambiguous messages to be solved strategically for anything named CryptoHash. This is a well-studied problem in cryptography with a limited number of solutions.
I would find a hand-rolled naive and ambiguous scheme unacceptable. You haven't made a concrete proposal, but you're downplaying everything you need to do to solve these problems from a security perspective, and that's why I'm worried.
> Hash-like data structures naturally accommodate collisions, but are susceptible to an algorithmic attack if they rebalance to the same value repeatedly.
And that's the attack that, if it works against CryptoHash, will also break Hash. HashMap tolerates collisions if they are fully random. After all, the hash is only 64 bits. But if it's possible to generate a collision just by manipulating structure, Hash and CryptoHash are equally broken.
> I would find a hand-rolled naive and ambiguous scheme unacceptable. You haven't made a concrete proposal, but you're downplaying everything you need to do to solve these problems from a security perspective, and that's why I'm worried.
My proposal is very simple. Do exactly what Hash
does but feed it to a crypto hash. I've yet to see an attack that breaks that but it may exist. If it does Hash
should probably be fixed as well. So far the only case I've seen that's strange is that hashing a zero-sized struct is a no-op. Not even that is exploitable in Rust as far as I can tell. Other than that, things like ("foo","bar")
and ("foobar","")
hash to different values already. Do you have an example where this breaks down?
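That last claim is easy to check with a std-only recording Hasher (Recorder and stream are hypothetical helpers): the byte streams Hash feeds the hasher differ for the two tuples, because str hashing delimits each string:

```rust
use std::hash::{Hash, Hasher};

// Toy Hasher that records the exact byte stream `Hash` feeds it.
struct Recorder(Vec<u8>);

impl Hasher for Recorder {
    fn finish(&self) -> u64 { 0 }
    fn write(&mut self, bytes: &[u8]) { self.0.extend_from_slice(bytes); }
}

// Capture the byte stream emitted by a value's `Hash` implementation.
fn stream<T: Hash>(value: &T) -> Vec<u8> {
    let mut r = Recorder(Vec::new());
    value.hash(&mut r);
    r.0
}

fn main() {
    // str's Hash impl delimits each string, so shifting bytes between
    // fields changes the stream and therefore the final hash.
    assert_ne!(stream(&("foo", "bar")), stream(&("foobar", "")));
}
```

Note this only shows that std's str hashing happens to delimit fields; it is not a general guarantee for every Hash impl.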
std::hash::Hasher can be derived for structs and is a standard hashing interface in Rust. The standard interface only allows 64-bit outputs but there's nothing stopping extra outputs tailored to specific hashes. So for ergonomic purposes wouldn't it make sense to have an adapter to allow using the Hasher API?